# CLIP ViT-B/32 LAION — ONNX INT8

INT8-quantized ONNX export of [laion/CLIP-ViT-B-32-laion2B-s34B-b79K](https://huggingface.co/laion/CLIP-ViT-B-32-laion2B-s34B-b79K), optimized for CPU-only inference (no GPU required at serving time).
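The default `onnxruntime` build runs on `CPUExecutionProvider` out of the box. If you want to pin the provider explicitly or tune threading for a serving process, here is a minimal sketch (the thread count is an illustrative value, not a recommendation):

```python
import onnxruntime as ort

# Pin the CPU provider and cap intra-op threads (illustrative value;
# the defaults are usually fine for a single-worker service).
opts = ort.SessionOptions()
opts.intra_op_num_threads = 4
sess = ort.InferenceSession(
    "vision_encoder_int8.onnx",
    sess_options=opts,
    providers=["CPUExecutionProvider"],
)
```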
## Files

| File | Description |
|---|---|
| `vision_encoder_int8.onnx` | Vision (image) encoder — INT8 quantized |
| `text_encoder_int8.onnx` | Text encoder — INT8 quantized |
| `projections.npy` | Visual + text projection weights (FP32) |
| `tokenizer_config.json` etc. | Processor / tokenizer config |
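`projections.npy` is a Python dict of NumPy arrays saved with `np.save`, which is why the quick start below loads it with `allow_pickle=True`. A quick way to inspect it (the shapes in the comment are assumptions based on the standard ViT-B/32 CLIP weight layout, not verified against this export):

```python
import numpy as np

# np.save on a dict produces a 0-d object array; .item() recovers the dict.
proj = np.load("projections.npy", allow_pickle=True).item()

# Expected for ViT-B/32 (rows = shared embed dim, cols = encoder hidden size):
#   visual_projection: (512, 768)
#   text_projection:   (512, 512)
for name, w in proj.items():
    print(name, w.shape, w.dtype)
```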
## Performance vs. original

| Metric | FP32 original | ONNX INT8 |
|---|---|---|
| Disk | ~600 MB | ~150 MB |
| RAM | ~1.8 GB | ~500 MB |
| Image embed (CPU) | ~800 ms | ~200 ms |
| Text embed (CPU) | ~300 ms | ~80 ms |
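Latency depends heavily on hardware and thread settings, so treat the table as indicative. A rough sketch for reproducing the image-embed number on your own machine (assumes the standard 224×224 CLIP input resolution):

```python
import time
import numpy as np
import onnxruntime as ort
from huggingface_hub import snapshot_download

model_dir = snapshot_download("rdxtremity/clip-laion-b32-onnx-int8")
sess = ort.InferenceSession(f"{model_dir}/vision_encoder_int8.onnx")

# Dummy preprocessed image batch (NCHW, batch of 1).
x = np.random.rand(1, 3, 224, 224).astype(np.float32)

sess.run(None, {"pixel_values": x})  # warm-up; the first run pays one-time costs
t0 = time.perf_counter()
runs = 20
for _ in range(runs):
    sess.run(None, {"pixel_values": x})
print(f"~{(time.perf_counter() - t0) / runs * 1000:.0f} ms per image")
```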
## Quick start
```python
from huggingface_hub import snapshot_download
import onnxruntime as ort
import numpy as np
from transformers import CLIPProcessor
from PIL import Image

model_dir = snapshot_download("rdxtremity/clip-laion-b32-onnx-int8")
processor = CLIPProcessor.from_pretrained(model_dir)
proj = np.load(f"{model_dir}/projections.npy", allow_pickle=True).item()
vision_sess = ort.InferenceSession(f"{model_dir}/vision_encoder_int8.onnx")
text_sess = ort.InferenceSession(f"{model_dir}/text_encoder_int8.onnx")

# Embed an image: encoder pooler output, then project into the shared space
img = Image.open("product.jpg").convert("RGB")
inp = processor(images=img, return_tensors="np")
out = vision_sess.run(["pooler_output"], {"pixel_values": inp["pixel_values"].astype(np.float32)})
img_vec = out[0] @ proj["visual_projection"].T
img_vec /= np.linalg.norm(img_vec)

# Embed text (Arabic + English supported; the example query means "red sports shoe")
inp = processor(text="حذاء رياضي أحمر", return_tensors="np", padding="max_length", truncation=True, max_length=77)
out = text_sess.run(["pooler_output"], {"input_ids": inp["input_ids"].astype(np.int64), "attention_mask": inp["attention_mask"].astype(np.int64)})
txt_vec = out[0] @ proj["text_projection"].T
txt_vec /= np.linalg.norm(txt_vec)
```
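Since both embeddings are L2-normalized, cosine similarity reduces to a dot product. Continuing from the quick start:

```python
# Both vectors are unit-norm, so the dot product is the cosine similarity.
similarity = (img_vec @ txt_vec.T).item()
print(f"similarity: {similarity:.3f}")  # higher = closer image-text match
```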