---
license: mit
tags:
- clip
- feature-extraction
- remote-sensing
---

# GeoRSCLIP-ViT-B-32

This model is a mirror/redistribution of the original [GeoRSCLIP](https://huggingface.co/Zilun/GeoRSCLIP) model.

## Original Repository and Links

- **Original Hugging Face Model**: [Zilun/GeoRSCLIP](https://huggingface.co/Zilun/GeoRSCLIP)
- **Official GitHub Repository**: [om-ai-lab/RS5M](https://github.com/om-ai-lab/RS5M)

## Description

GeoRSCLIP is a vision-language foundation model for remote sensing, trained on a large-scale dataset of remote sensing image-text pairs (RS5M). It is based on the CLIP architecture and is designed to handle the unique characteristics of remote sensing imagery. This repository provides the ViT-B/32 variant.

## How to use

### With `transformers`

```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

# Load model and processor
model = CLIPModel.from_pretrained("BiliSakura/GeoRSCLIP-ViT-B-32")
processor = CLIPProcessor.from_pretrained("BiliSakura/GeoRSCLIP-ViT-B-32")

# Load and preprocess the image together with the candidate texts
image = Image.open("path/to/your/image.jpg")
inputs = processor(
    text=["a photo of a building", "a photo of vegetation", "a photo of water"],
    images=image,
    return_tensors="pt",
    padding=True
)

# Compute image-text similarity scores and normalize them to probabilities
with torch.inference_mode():
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1)

print(f"Similarity scores: {probs}")
```

**Zero-shot image classification:**

```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

model = CLIPModel.from_pretrained("BiliSakura/GeoRSCLIP-ViT-B-32")
processor = CLIPProcessor.from_pretrained("BiliSakura/GeoRSCLIP-ViT-B-32")

# Define candidate labels
candidate_labels = [
    "a satellite image of urban area",
    "a satellite image of forest",
    "a satellite image of agricultural land",
    "a satellite image of water body"
]

image = Image.open("path/to/your/image.jpg")
inputs = processor(
    text=candidate_labels,
    images=image,
    return_tensors="pt",
    padding=True
)

with torch.inference_mode():
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=1)

# Get the predicted label
predicted_idx = probs.argmax().item()
print(f"Predicted label: {candidate_labels[predicted_idx]}")
print(f"Confidence: {probs[0][predicted_idx]:.4f}")
```

**Extracting individual features:**

```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

model = CLIPModel.from_pretrained("BiliSakura/GeoRSCLIP-ViT-B-32")
processor = CLIPProcessor.from_pretrained("BiliSakura/GeoRSCLIP-ViT-B-32")

# Get image features only
image = Image.open("path/to/your/image.jpg")
image_inputs = processor(images=image, return_tensors="pt")
with torch.inference_mode():
    image_features = model.get_image_features(**image_inputs)

# Get text features only
text_inputs = processor(
    text=["a satellite image of urban area"],
    return_tensors="pt",
    padding=True,
    truncation=True
)
with torch.inference_mode():
    text_features = model.get_text_features(**text_inputs)

print(f"Image features shape: {image_features.shape}")
print(f"Text features shape: {text_features.shape}")
```
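**Image-text retrieval (illustrative):** The embeddings returned by `get_image_features` and `get_text_features` are not normalized. A common way to use them for retrieval is to L2-normalize both and rank images by cosine similarity against a text query. The snippet below is a minimal sketch of that pattern; the image paths and the query string are placeholders to replace with your own data:

```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

model = CLIPModel.from_pretrained("BiliSakura/GeoRSCLIP-ViT-B-32")
processor = CLIPProcessor.from_pretrained("BiliSakura/GeoRSCLIP-ViT-B-32")

# Placeholder image paths and query; replace with your own files and text
image_paths = ["scene_001.jpg", "scene_002.jpg", "scene_003.jpg"]
query = "a satellite image of a harbor with ships"

images = [Image.open(p) for p in image_paths]
image_inputs = processor(images=images, return_tensors="pt")
text_inputs = processor(text=[query], return_tensors="pt", padding=True, truncation=True)

with torch.inference_mode():
    image_features = model.get_image_features(**image_inputs)
    text_features = model.get_text_features(**text_inputs)

# L2-normalize so the dot product equals cosine similarity
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Cosine similarity between the query and every image, ranked highest first
similarity = (text_features @ image_features.T).squeeze(0)
for idx in similarity.argsort(descending=True).tolist():
    print(f"{image_paths[idx]}: {similarity[idx].item():.4f}")
```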
### With `diffusers`

This model's text encoder can be used with Stable Diffusion and other diffusion models:

```python
from transformers import CLIPTextModel, CLIPTokenizer
import torch

# Load the text encoder (stored in the `diffusers/text_encoder` subfolder) and tokenizer
text_encoder = CLIPTextModel.from_pretrained(
    "BiliSakura/GeoRSCLIP-ViT-B-32",
    subfolder="diffusers/text_encoder",
    torch_dtype=torch.float16
)
tokenizer = CLIPTokenizer.from_pretrained(
    "BiliSakura/GeoRSCLIP-ViT-B-32"
)

# Encode a text prompt
prompt = "a satellite image of a city with buildings and roads"
text_inputs = tokenizer(
    prompt,
    padding="max_length",
    max_length=77,
    truncation=True,
    return_tensors="pt"
)

with torch.inference_mode():
    text_outputs = text_encoder(text_inputs.input_ids)
    text_embeddings = text_outputs.last_hidden_state

print(f"Text embeddings shape: {text_embeddings.shape}")
```

**Using with Stable Diffusion:**

```python
from diffusers import StableDiffusionPipeline
import torch

# Load the pipeline, swapping in the text encoder and tokenizer loaded above
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Generate an image
prompt = "a high-resolution satellite image of urban area"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("generated_image.png")
```

## Citation

If you use this model in your research, please cite the original work:

```bibtex
@article{zhangRS5MGeoRSCLIPLargeScale2024,
  title = {{RS5M} and {GeoRSCLIP}: A Large-Scale Vision-Language Dataset and a Large Vision-Language Model for Remote Sensing},
  shorttitle = {{RS5M} and {GeoRSCLIP}},
  author = {Zhang, Zilun and Zhao, Tiancheng and Guo, Yulong and Yin, Jianwei},
  year = {2024},
  journal = {IEEE Transactions on Geoscience and Remote Sensing},
  volume = {62},
  pages = {1--23},
  issn = {1558-0644},
  doi = {10.1109/TGRS.2024.3449154},
  keywords = {Computational modeling, Data models, Domain VLM (DVLM), general VLM (GVLM), image-text paired dataset, Location awareness, parameter efficient tuning, Remote sensing, remote sensing (RS), RS cross-modal text-image retrieval (RSCTIR), semantic localization (SeLo), Semantics, Tuning, vision-language model (VLM), Visualization, zero-shot classification (ZSC)}
}
```