Instructions to use HuggingFaceTB/SmolLM-135M-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use HuggingFaceTB/SmolLM-135M-Instruct with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceTB/SmolLM-135M-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M-Instruct")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M-Instruct")

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
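If the greedy output above is repetitive, sampling can help. A minimal sketch, continuing from the "Load model directly" snippet; the parameter values are illustrative, not official recommendations for this model:

```python
# Generate with sampling instead of greedy decoding (values are illustrative).
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,   # sample instead of always taking the top token
    temperature=0.2,  # a low temperature keeps a small model on topic
    top_p=0.9,        # nucleus sampling cuts off the unlikely tail
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```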
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use HuggingFaceTB/SmolLM-135M-Instruct with vLLM:
Install from pip and serve model
```sh
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "HuggingFaceTB/SmolLM-135M-Instruct"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "HuggingFaceTB/SmolLM-135M-Instruct",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
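Since the server exposes an OpenAI-compatible API, the official openai Python client can also be pointed at it. A minimal sketch; the api_key value is a placeholder, as vLLM does not check it by default:

```python
# Query the local vLLM server through the OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM-135M-Instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```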
Use Docker
```sh
# Run the OpenAI-compatible vLLM server with the official Docker image:
docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model "HuggingFaceTB/SmolLM-135M-Instruct"
```
- SGLang
How to use HuggingFaceTB/SmolLM-135M-Instruct with SGLang:
Install from pip and serve model
```sh
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "HuggingFaceTB/SmolLM-135M-Instruct" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "HuggingFaceTB/SmolLM-135M-Instruct",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
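The same endpoint can be called from Python with nothing but the requests library. A minimal sketch against the server started above:

```python
# Query the local SGLang server via its OpenAI-compatible REST endpoint.
import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "HuggingFaceTB/SmolLM-135M-Instruct",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
    },
)
print(response.json()["choices"][0]["message"]["content"])
```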
Use Docker images
```sh
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "HuggingFaceTB/SmolLM-135M-Instruct" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "HuggingFaceTB/SmolLM-135M-Instruct",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
- Docker Model Runner
How to use HuggingFaceTB/SmolLM-135M-Instruct with Docker Model Runner:
```sh
docker model run hf.co/HuggingFaceTB/SmolLM-135M-Instruct
```
What do you recommend using this model for?
Yeah, it got everything wrong that I tested it with. I guess it's a basis to fine-tune into something better.
Hi, we just updated the Instruct models and the outputs should be better. You can also try the larger 360M for better performance in these demos:
https://huggingface.co/spaces/HuggingFaceTB/instant-smollm
https://huggingface.co/spaces/HuggingFaceTB/SmolLM-360M-Instruct-WebGPU
Thanks, but the idea was to check what such small models can do. I guess the whole approach shows that such small models have severe limitations. Maybe it can only be fine-tuned to be useful in some specific area. It certainly doesn't get simple facts about the world right; the 360M does way better. I wouldn't bother with making it learn dates and such, just have it give vague answers.
Danube 3 500M is actually surprisingly good for its size. Quantized to Q4 it's just 320 MB. It's like talking to a basic version of Wikipedia. It outputs Markdown too, which is a big plus. It's a very interesting development.
Thanks, I'll try that out. Did you ever try to fine-tune one? They can't need that much GPU. I think, for example, that such small models should return Python code when asked a math question; another part of the system could then execute that code and return the values (a rough sketch of this idea follows below). Also, they don't need all the knowledge of Wikipedia, but then the answers should be vague and maybe also signal some uncertainty.
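A minimal sketch of that code-returning idea, assuming a transformers pipeline: the prompt wording, the code extraction, and the exec-based runner are hypothetical illustrations, and a tiny model would likely need fine-tuning before it follows this format reliably.

````python
# Hypothetical sketch: have a small model emit Python for a math question,
# then execute that code ourselves and return the computed value.
import contextlib
import io
import re

from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceTB/SmolLM-135M-Instruct")

def answer_math_question(question: str) -> str:
    # Prompt wording is illustrative; a tiny model may need few-shot examples
    # or fine-tuning before it reliably answers in this format.
    messages = [{
        "role": "user",
        "content": f"Reply only with Python code that prints the answer to: {question}",
    }]
    reply = pipe(messages, max_new_tokens=128)[0]["generated_text"][-1]["content"]

    # Use a fenced code block if the model produced one, else the raw reply.
    match = re.search(r"```(?:python)?\s*(.*?)```", reply, re.DOTALL)
    code = match.group(1) if match else reply

    # Run the code and capture whatever it prints. A real system would
    # sandbox this step; exec() on model output is unsafe outside a demo.
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue().strip()

print(answer_math_question("What is 17 * 23?"))
````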
