tags:
  - sentence-transformers
  - embeddings
  - litert
  - tflite
  - edge
  - on-device
license: mit
base_model: BAAI/bge-small-en-v1.5
pipeline_tag: feature-extraction

bge-small-en-v1.5 - LiteRT

This is a LiteRT (formerly TensorFlow Lite) conversion of BAAI/bge-small-en-v1.5 for efficient on-device inference.

Model Details

Property	Value
Original Model	BAAI/bge-small-en-v1.5
Format	LiteRT (.tflite)
File Size	127.2 MB
Task	Sentence Embeddings / Retrieval
Max Sequence Length	512
Output Dimension	384
Pooling Mode	CLS Token Pooling
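
As a rough illustration of what CLS token pooling means: the encoder produces one vector per token, and the sentence embedding is the vector at the first ([CLS]) position. The sketch below uses plain Python lists and a toy hidden size of 4 instead of 384. `cls_pool` is a hypothetical helper name, not part of this model or of LiteRT, and the token values are made up.

```python
from typing import List


def cls_pool(token_states: List[List[float]]) -> List[float]:
    """Return the hidden state at position 0, i.e. the [CLS] token."""
    if not token_states:
        raise ValueError("token_states must contain at least one token")
    return list(token_states[0])


# Toy example: 3 tokens ([CLS], "hello", [SEP]) with hidden size 4.
token_states = [
    [0.1, 0.2, 0.3, 0.4],  # [CLS]
    [0.9, 0.8, 0.7, 0.6],  # "hello"
    [0.5, 0.5, 0.5, 0.5],  # [SEP]
]
sentence_vector = cls_pool(token_states)
print(sentence_vector)
```

The Quick Start output shape of (384,) suggests the converted model already returns the pooled vector, so this step should not be needed in application code.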

Performance

Benchmarked on an AMD CPU (WSL2):

Metric	Value
Inference Latency	100.2 ms
Throughput	10.0 inferences/sec
Cosine Similarity vs Original	1.0000 ✅
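
The latency and throughput figures are related by throughput ≈ 1000 / latency in ms (1000 / 100.2 ≈ 9.98, which rounds to 10.0). The sketch below shows one way such numbers could be measured. It is not the benchmark script used for this model card: `benchmark` and `run_once` are hypothetical names, and the warm-up and run counts are arbitrary assumptions.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(run_once: Callable[[], object], warmup: int = 3, runs: int = 20) -> Dict[str, float]:
    """Time repeated calls; report mean latency (ms) and throughput (calls/sec)."""
    for _ in range(warmup):
        run_once()  # warm-up calls are not timed

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    mean_ms = statistics.mean(timings_ms)
    throughput = 1000.0 / mean_ms if mean_ms > 0 else float("inf")
    return {"latency_ms": mean_ms, "throughput_per_sec": throughput}


# Stand-in workload so the sketch runs without the model file.
result = benchmark(lambda: sum(range(10_000)))
print(result)
```

To time the model itself, the stand-in lambda could be replaced with a call such as `lambda: get_embedding("Hello, world!")` from the Quick Start.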

Quick Start

import numpy as np
from ai_edge_litert.interpreter import Interpreter
from transformers import AutoTokenizer

# Load model and tokenizer
interpreter = Interpreter(model_path="BAAI_bge-small-en-v1.5.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")

def get_embedding(text: str) -> np.ndarray:
    """Get sentence embedding for input text."""
    encoded = tokenizer(
        text,
        padding="max_length",
        max_length=512,
        truncation=True,
        return_tensors="np"
    )

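    # Assumes input 0 is input_ids and input 1 is attention_mask; inspect
    # input_details if the converted model orders its inputs differently.
    # Cast to the dtype each input tensor reports rather than hardcoding it.
    interpreter.set_tensor(input_details[0]["index"], encoded["input_ids"].astype(input_details[0]["dtype"]))
    interpreter.set_tensor(input_details[1]["index"], encoded["attention_mask"].astype(input_details[1]["dtype"]))
    interpreter.invoke()

    # The output has a leading batch dimension of 1; drop it.
    return interpreter.get_tensor(output_details[0]["index"])[0]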

# Example
embedding = get_embedding("Hello, world!")
print(f"Embedding shape: {embedding.shape}")  # (384,)

Files

BAAI_bge-small-en-v1.5.tflite - The LiteRT model file

Conversion Details

Conversion Tool: ai-edge-torch
Conversion Date: 2026-01-12
Source Framework: PyTorch → LiteRT
Validation: Cosine similarity 1.0000 vs original
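
For reference, the validation metric can be computed as below. This is a minimal pure-Python sketch of cosine similarity, not the validation script used for the conversion. The two short vectors are made-up stand-ins for the 384-dimensional embeddings from the original and converted models.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for the original and converted model outputs.
reference = [0.12, -0.48, 0.33, 0.80]
converted = [0.12, -0.48, 0.33, 0.80]
print(f"{cosine_similarity(reference, converted):.4f}")  # 1.0000
```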

Intended Use

Mobile Applications: On-device semantic search, RAG systems
Edge Devices: IoT, embedded systems, Raspberry Pi
Offline Processing: Privacy-preserving inference
Low-latency Applications: Real-time processing
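
The on-device semantic search use case can be sketched as follows: embed each document once, then rank documents by the dot product of L2-normalized vectors. Everything here is illustrative. `l2_normalize`, `top_k`, the document ids, and the 3-dimensional toy vectors are assumptions standing in for real 384-dimensional embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query_vec: Sequence[float], index: Dict[str, Sequence[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k documents most similar to the query, best first."""
    q = l2_normalize(query_vec)
    scored = []
    for doc_id, vec in index.items():
        d = l2_normalize(vec)
        scored.append((doc_id, sum(x * y for x, y in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: in practice each vector would come from the embedding model.
index = {
    "doc_weather": [0.9, 0.1, 0.0],
    "doc_sports": [0.1, 0.9, 0.1],
    "doc_cooking": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]
results = top_k(query, index, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

A linear scan like this is usually adequate for small on-device collections; larger ones may call for an approximate nearest-neighbor index.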

Limitations

Fixed sequence length (512 tokens)
CPU inference (GPU delegate requires setup)
Tokenizer is not bundled; it is loaded separately from the original model
Float32 precision
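
To make the fixed sequence length concrete: every input is brought to exactly 512 token ids, with an attention mask marking which positions are real. The sketch below shows the idea on a short list. `pad_or_truncate` is a hypothetical helper, the token ids are illustrative, and a pad id of 0 is an assumption. In practice the tokenizer call in the Quick Start handles this, and unlike this sketch it also keeps the closing special token when truncating.

```python
from typing import List, Tuple


def pad_or_truncate(token_ids: List[int], max_length: int = 512, pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]          # drop anything past max_length
    mask = [1] * len(ids)                 # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


# Small max_length so the result is easy to read.
ids, mask = pad_or_truncate([101, 7592, 102], max_length=8)
print(ids)   # [101, 7592, 102, 0, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 0, 0, 0, 0, 0]
```

Because the model always processes the full padded length, a short input costs as much compute as a 512-token one.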

License

This model inherits the license from the original:

License: MIT (source)

Citation

@misc{bge_embedding,
    title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
    author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
    year={2023},
    eprint={2309.07597},
    archivePrefix={arXiv},
}

Acknowledgments

Original model by BAAI
Conversion using ai-edge-torch

Converted by Bombek1