vLLM says it is not a safetensors repo
Hello, while loading the model with vLLM (v0.12.0) it surfaces an error coming from the huggingface_hub library
(APIServer pid=1) ERROR 12-24 13:41:28 [transformers_utils/repo_utils.py:65] Error retrieving safetensors: 'mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4' is not a safetensors repo. Couldn't find 'model.safetensors.index.json' or 'model.safetensors' files., retrying 1 of 2
(APIServer pid=1) ERROR 12-24 13:41:30 [transformers_utils/repo_utils.py:63] Error retrieving safetensors: 'mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4' is not a safetensors repo. Couldn't find 'model.safetensors.index.json' or 'model.safetensors' files.
The library only checks for [model.safetensors.index.json](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/constants.py#L48), and this repository only has a consolidated.safetensors.index.json
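For illustration, here is a minimal sketch of the kind of filename fallback a patch would add. The function name and logic are hypothetical, not huggingface_hub's actual API; the conventional filenames are the ones the library checks for, and the consolidated name is the one this repository ships:

```python
# Conventional names checked by huggingface_hub for safetensors repos.
SAFETENSORS_INDEX = "model.safetensors.index.json"
SAFETENSORS_SINGLE = "model.safetensors"
# Mistral's internal/consolidated naming, used by this repository.
CONSOLIDATED_INDEX = "consolidated.safetensors.index.json"


def find_safetensors_entry(repo_files):
    """Return the first recognized weights entry point in repo_files, or None.

    Tries the conventional names first, then falls back to the
    consolidated naming (hypothetical fallback, for illustration only).
    """
    for candidate in (SAFETENSORS_INDEX, SAFETENSORS_SINGLE, CONSOLIDATED_INDEX):
        if candidate in repo_files:
            return candidate
    return None
```

With a file listing like this repo's, the fallback would pick up the consolidated index instead of raising the "not a safetensors repo" error.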
What do you think?
Yes, I would suggest the model authors use the conventional filenames if possible!
Hey, does the model get loaded, or can't you even access it?
The naming of the files is correct, we can have two formats in our repositories:
- consolidated => our internal format, which is basically the same as vLLM's
- model => this is the Transformers format
We didn't release ML3 in a Transformers format.
Hi,
Indeed, vLLM cannot serve it on this particular hardware (8x H100-SXM): it returns torch.OutOfMemoryError: CUDA out of memory
even though 640GB of VRAM is available, whatever the configured context, for instance --max-model-len 1000
With the patch to read the consolidated.safetensors.index.json, vLLM loads correctly:
(EngineCore_DP0 pid=1000) INFO 12-24 15:05:20 [v1/core/kv_cache_utils.py:1286] GPU KV cache size: 188,800 tokens
(EngineCore_DP0 pid=1000) INFO 12-24 15:05:20 [v1/core/kv_cache_utils.py:1291] Maximum concurrency for 1,000 tokens per request: 184.38x
My guess is that it falls back to the default "pickle" mode and reads too much into memory
@gnoale what is the patch? Can't run this on 96GB VRAM, vllm 15.1.
With the patch to read the consolidated.safetensors.index.json, vLLM loads correctly
Hey, since we merged this:
https://github.com/vllm-project/vllm/pull/33253
the issue should be solved :)