Capri

Capri is a compact image captioning model designed for high-throughput, plain-language descriptions. It supports two inference paths: direct image input or precomputed SigLIP2 pooled embeddings.

The project started from a practical pipeline constraint: existing captioning models were either too slow or too weak for reliable image understanding. That constraint sparked the idea for Capri: since SigLIP embeddings were already computed upstream, why not pair them with a small LLM decoder and get both strong visual representations and fast text generation?

The name comes from the small Italian island of Capri and also hints at the goal of the project: a small CAPtioner with Rapid Inference.

Model Architecture

  • Vision encoder: google/siglip2-base-patch16-224 (pooled embeddings)
  • Projector: MLP 768 -> 3072 -> 896
  • Decoder: Qwen/Qwen2.5-0.5B
  • Adaptation: LoRA on q_proj and v_proj
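For a sense of scale, the projector adds only a few million parameters on top of the 0.5B decoder. A back-of-envelope count, assuming the MLP is two plain linear layers with biases (the exact projector internals are an assumption here, not taken from the checkpoint):

```python
# Hypothetical parameter count for a 768 -> 3072 -> 896 two-layer MLP
# (weights + biases per linear layer; the activation is assumed parameter-free)
d_in, d_hidden, d_out = 768, 3072, 896
params = (d_in * d_hidden + d_hidden) + (d_hidden * d_out + d_out)
print(f"{params:,}")  # 5,115,776 -- roughly 5.1M, about 1% of the decoder
```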

Load Modes

Embedding-only mode skips the SigLIP tower entirely, saving both download time and VRAM:

from transformers import AutoModel, AutoProcessor
import torch

processor = AutoProcessor.from_pretrained("Ligul/capri", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "Ligul/capri",
    trust_remote_code=True,
    load_vision_tower=False,  # never downloads or loads SigLIP
    torch_dtype=torch.bfloat16,
)

# Stand-in for precomputed SigLIP2 pooled embeddings: batch of 2, dim 768
inputs = processor(
    pooled_embeddings=torch.randn(2, 768),
    return_tensors="pt",
)
captions = model.generate_captions(
    pooled_embeddings=inputs["pooled_embeddings"],
    processor=processor,
    max_new_tokens=32,
    decode_batch_size=2048,
)

Image mode loads SigLIP lazily:

from PIL import Image
from transformers import AutoModel, AutoProcessor
import torch

processor = AutoProcessor.from_pretrained("Ligul/capri", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "Ligul/capri",
    trust_remote_code=True,
    load_vision_tower=True,  # SigLIP is loaded on first use
    torch_dtype=torch.bfloat16,
)

image = Image.open("example.jpg").convert("RGB")
captions = model.generate_captions(
    images=[image],
    processor=processor,
    max_new_tokens=32,
    vision_batch_size=64,
    decode_batch_size=2048,
)

The standard generate() method is still available for low-level use when you want raw token IDs rather than decoded captions.

Batch Guidance

Use different knobs for the two stages:

  • vision_batch_size: keep moderate; image preprocessing plus the SigLIP forward pass is the expensive stage
  • decode_batch_size: can be much larger; pooled embeddings are tiny, and Qwen generation batches well

Reasonable defaults:

  • vision_batch_size=64
  • decode_batch_size=1024

On larger GPUs, decode often scales to 2048+.
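The two-stage split can be sketched in plain Python. The loop below is illustrative only (the stand-in lists and the inline comments naming encode_images are hypothetical, not the model's API); the point is that many small vision batches feed one much larger decode batch:

```python
def chunks(items, size):
    """Yield successive slices of at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

images = list(range(200))  # stand-ins for PIL images
vision_batch_size, decode_batch_size = 64, 1024

# Stage 1: encode in small vision batches (4 batches of <= 64 here)
embeddings = []
for batch in chunks(images, vision_batch_size):
    embeddings.extend(batch)  # real code: embeddings from the SigLIP pass

# Stage 2: decode the accumulated embeddings in far larger batches
n_decode_batches = len(list(chunks(embeddings, decode_batch_size)))
print(n_decode_batches)  # 1 -- all 200 embeddings fit in a single decode batch
```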

Attribution

Trained on captions from the COCO 2017 dataset.

  • Annotations © COCO Consortium, licensed under CC BY 4.0
  • Images sourced from Flickr under their respective licenses; the dataset as a whole is not cleared for unrestricted commercial use

Lin, T.-Y., et al. "Microsoft COCO: Common Objects in Context." ECCV 2014. arXiv:1405.0312
