vLLM says it is not a safetensors repo

#5
by gnoale - opened

Hello, while loading the model with vLLM (v0.12.0) it surfaces an error coming from the huggingface_hub library:

(APIServer pid=1) ERROR 12-24 13:41:28 [transformers_utils/repo_utils.py:65] Error retrieving safetensors: 'mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4' is not a safetensors repo. Couldn't find 'model.safetensors.index.json' or 'model.safetensors' files., retrying 1 of 2
(APIServer pid=1) ERROR 12-24 13:41:30 [transformers_utils/repo_utils.py:63] Error retrieving safetensors: 'mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4' is not a safetensors repo. Couldn't find 'model.safetensors.index.json' or 'model.safetensors' files.

The library only checks for [model.safetensors.index.json](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/constants.py#L48), while this repository only has a consolidated.safetensors.index.json.

What do you think?

gnoale changed discussion title from vLLM cannot find consolidated.safetensors.index.json to vLLM says it is not a safetensors repo

Yes, I would suggest the model authors use the conventional filenames if possible!

cc @patrickvonplaten just in case!

Mistral AI_ org

Hey, does the model get loaded, or can't you even access it?

The naming of the files is correct, we can have two formats in our repositories:

  • consolidated => our internal format, which is basically the same as vLLM's
  • model => the Transformers format

We didn't release ML3 in a Transformers format.

Hi,

Indeed, vLLM cannot serve it on this particular hardware (8x H100-SXM): it raises torch.OutOfMemoryError: CUDA out of memory
even though 640 GB of VRAM is available, whatever the configured context, for instance --max-model-len 1000.

With the patch to read consolidated.safetensors.index.json, vLLM loads correctly:

(EngineCore_DP0 pid=1000) INFO 12-24 15:05:20 [v1/core/kv_cache_utils.py:1286] GPU KV cache size: 188,800 tokens
(EngineCore_DP0 pid=1000) INFO 12-24 15:05:20 [v1/core/kv_cache_utils.py:1291] Maximum concurrency for 1,000 tokens per request: 184.38x

My guess is it falls back to the default "pickle" mode and reads too much into memory.

@gnoale what is the patch? I can't run this on 96GB VRAM, vllm 15.1.

> With the patch to read consolidated.safetensors.index.json, vLLM loads correctly

Mistral AI_ org

Hey, since we merged https://github.com/vllm-project/vllm/pull/33253, the issue should be solved :)
