[Error] Weights only load failed

by DrNicefellow - opened Feb 11, 2024

Discussion

DrNicefellow

Feb 11, 2024

I ran into the following error when loading a checkpoint with from_pretrained.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 519, in load_state_dict
    return torch.load(checkpoint_file, map_location=map_location, weights_only=True)
  File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 1013, in load
    raise pickle.UnpicklingError(UNSAFE_MESSAGE + str(e)) from None
_pickle.UnpicklingError: Weights only load failed. Re-running torch.load with weights_only set to False will likely succeed, but it can result in arbitrary code execution.Do it only if you get the file from a trusted source. WeightsUnpickler error: Unsupported operand 149

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 523, in load_state_dict
    if f.read(7) == "version":
  File "/opt/conda/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 68: invalid continuation byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/src/models/Tinyllamas/test_generation.py", line 114, in <module>
    test_generation(texts,model_name,model_name)
  File "/src/models/Tinyllamas/test_generation.py", line 13, in test_generation
    model = AutoModelForCausalLM.from_pretrained(model_name,load_in_4bit=load_in_4bit,trust_remote_code=True)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3393, in from_pretrained
    state_dict = load_state_dict(resolved_archive_file)
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 535, in load_state_dict
    raise OSError(
OSError: Unable to load weights from pytorch checkpoint file for './step-10k-token-21B/pytorch_model.bin' at './step-10k-token-21B/pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
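
For reference, the number in "Unsupported operand 149" can be looked up as a pickle opcode using only the Python standard library. The sketch below is a diagnostic aid, not part of transformers or torch: the helper names opcode_name and sniff_checkpoint are hypothetical, and the format guess is a rough heuristic that only checks for a zip container or a leading pickle protocol marker.

```python
import pickle
import pickletools
import zipfile


def opcode_name(code: int) -> str:
    """Return the pickle opcode name for a numeric opcode, or 'unknown'."""
    for op in pickletools.opcodes:
        # op.code is a one-character string holding the opcode byte
        if ord(op.code) == code:
            return op.name
    return "unknown"


def sniff_checkpoint(path: str) -> str:
    """Rough guess at how a checkpoint file was serialized (heuristic only)."""
    if zipfile.is_zipfile(path):
        return "zip archive"
    with open(path, "rb") as f:
        head = f.read(2)
    # A pickle stream written with protocol >= 2 starts with PROTO + version byte
    if len(head) == 2 and head[:1] == pickle.PROTO:
        return f"raw pickle stream, protocol {head[1]}"
    return "unrecognized"


# Opcode 149 is 0x95, the FRAME opcode introduced with pickle protocol 4
print(opcode_name(149))
```

Calling sniff_checkpoint("./step-10k-token-21B/pytorch_model.bin") on the file from the traceback would then indicate whether it is a zip container or a bare pickle stream, which may help narrow down why the weights-only loader rejected it.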

LuoMoer

Nov 24, 2025

This issue may be caused by a mismatch between your installed transformers version and the version the checkpoint was saved with.
I solved it as follows:

First, switch the transformers version in your environment to the one recorded in the transformers_version field of the config.json inside the checkpoint.
Then run the following code:

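from transformers import AutoTokenizer, AutoModelForCausalLM

ckpt_path = "./step-10k-token-21B"  # checkpoint directory from the traceback above
ckpt_path_save = "./step-10k-token-21B-resaved"  # hypothetical output directory

# Load with the matching transformers version, then re-save
tokenizer = AutoTokenizer.from_pretrained(ckpt_path)
model = AutoModelForCausalLM.from_pretrained(ckpt_path)
model.save_pretrained(ckpt_path_save)
tokenizer.save_pretrained(ckpt_path_save)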

After that, switch transformers back to your original version. The re-saved checkpoint in ckpt_path_save should then load normally.
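
The first step above depends on knowing which version the checkpoint was saved with. A minimal sketch of that check follows, assuming the checkpoint's config.json carries the usual transformers_version field. The helper names are hypothetical, and the directory is the one from the traceback earlier in this thread.

```python
import json
from importlib import metadata
from pathlib import Path
from typing import Optional


def saved_transformers_version(ckpt_dir: str) -> Optional[str]:
    """Read the transformers version recorded in the checkpoint's config.json."""
    config_path = Path(ckpt_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return config.get("transformers_version")


def installed_transformers_version() -> Optional[str]:
    """Return the installed transformers version, or None if it is not installed."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


ckpt_dir = "./step-10k-token-21B"  # checkpoint directory from the traceback
if (Path(ckpt_dir) / "config.json").exists():
    print("saved with:", saved_transformers_version(ckpt_dir))
    print("installed: ", installed_transformers_version())
```

If the two versions differ, installing the saved version (for example with pip install transformers==<that version>) before running the re-save step matches the procedure described above.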
