vLLM says it is not a safetensors repo

#5
by gnoale - opened

Hello, while loading the model with vLLM (v0.12.0) it surfaces an error coming from the huggingface_hub library:

(APIServer pid=1) ERROR 12-24 13:41:28 [transformers_utils/repo_utils.py:65] Error retrieving safetensors: 'mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4' is not a safetensors repo. Couldn't find 'model.safetensors.index.json' or 'model.safetensors' files., retrying 1 of 2
(APIServer pid=1) ERROR 12-24 13:41:30 [transformers_utils/repo_utils.py:63] Error retrieving safetensors: 'mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4' is not a safetensors repo. Couldn't find 'model.safetensors.index.json' or 'model.safetensors' files.

The library only checks for [model.safetensors.index.json](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/constants.py#L48), while this repository only has a consolidated.safetensors.index.json.

What do you think?

gnoale changed discussion title from vLLM cannot find consolidated.safetensors.index.json to vLLM says it is not a safetensors repo

Yes, I would suggest the model authors use the conventional filenames if possible!

cc @patrickvonplaten just in case!

Mistral AI_ org

Hey, does the model get loaded, or can't you even access it?

The naming of the files is correct, we can have two formats in our repositories:

  • consolidated => our internal format, which is basically the same as vLLM's
  • model => the Transformers format

We didn't release ML3 in a Transformers format.

Hi,

Indeed, vLLM cannot serve it on this particular hardware (8x H100-SXM): it raises torch.OutOfMemoryError: CUDA out of memory
even though 640 GB of VRAM is available, whatever the configured context, for instance --max-model-len 1000.

With the patch to read consolidated.safetensors.index.json, vLLM loads correctly:

(EngineCore_DP0 pid=1000) INFO 12-24 15:05:20 [v1/core/kv_cache_utils.py:1286] GPU KV cache size: 188,800 tokens
(EngineCore_DP0 pid=1000) INFO 12-24 15:05:20 [v1/core/kv_cache_utils.py:1291] Maximum concurrency for 1,000 tokens per request: 184.38x

My guess is it falls back to the default "pickle" mode and reads too much into memory.

@gnoale what is the patch? I can't run this on 96GB VRAM, vllm 15.1.

> With the patch to read consolidated.safetensors.index.json, vLLM loads correctly

Mistral AI_ org

Hey, since we merged https://github.com/vllm-project/vllm/pull/33253, the issue should be solved :)
