Hardware Requirements for CPU / GPU Inference
I was looking and couldn't find any recommendations for the required hardware to run this model in inference on the CPU or GPU.
I'm going to test it out but some guidance would be pretty helpful.
Does anyone have this data? In particular, how much RAM is needed for CPU inference, and how much GPU RAM for GPU inference (I've seen some threads saying ~352 GB)? Also, what kind of inference times can be expected with different setups?
Copying some data I found from other threads here:
It needed around 400 GB [disk space] just to fit all the weight files. They list the sizes of the weights and checkpoints under the Training section.
I have successfully loaded it on a single x2iezn.6xlarge instance in AWS but using only CPUs the model is very slow. Text generation sampling for several sequences can take several minutes to return, but the full model is working and it is much cheaper for local evaluation than 9 GPUs!
x2iezn.6xlarge specs:
- 768 GB RAM
- 24 vCPUs
- $5.004 / hour
As a first order estimate, 176B parameters in half precision (16 bits = 2 bytes) would need 352 GB RAM. But since some modules are 32-bit, it would be more. So about nine GPUs with 40-GB RAM, and it doesn't take into account the input.
It requires more than 352 GB of GPU RAM (176B parameters in half precision). I can run inference on 8 A6000 GPUs. However, there isn't much room left for input tokens.
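The first-order estimates quoted above can be reproduced with a few lines of arithmetic. This is only a rough sketch: it counts the weights alone (no activations, no input tokens, no 32-bit modules), uses decimal gigabytes as the thread does, and assumes 48 GB per A6000, a figure that is not stated in the thread. The helper names are made up for illustration.

```python
import math

PARAMS = 176e9  # parameter count quoted in the thread


def weights_gb(params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def gpus_needed(total_gb: float, gb_per_gpu: float) -> int:
    """Smallest number of GPUs whose combined memory holds total_gb."""
    return math.ceil(total_gb / gb_per_gpu)


# Half precision: 2 bytes per parameter.
fp16_gb = weights_gb(PARAMS, 2)

# "About nine GPUs with 40-GB RAM", as estimated in the thread.
gpus_40gb = gpus_needed(fp16_gb, 40)

# 8 x A6000, assuming 48 GB each (assumption, not from the thread).
a6000_total_gb = 8 * 48
headroom_gb = a6000_total_gb - fp16_gb

print(f"fp16 weights: {fp16_gb:.0f} GB")
print(f"40 GB GPUs needed: {gpus_40gb}")
print(f"headroom on 8 x 48 GB: {headroom_gb:.0f} GB")
```

Under these assumptions the 8-GPU setup has only a few tens of GB to spare across all cards, which is consistent with the remark that little room is left for input tokens.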
Thanks for this, very helpful, was looking for the same information. No wonder I am failing to run the full model on a 64GB VM. ;)
Have you come across any recommendations anywhere to reduce memory usage, say, for specific pipeline tasks?
@bwv988 Your best bet is to try out bitsandbytes. https://github.com/TimDettmers/bitsandbytes
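For anyone who wants to try that suggestion, here is a minimal sketch of 8-bit loading through the Transformers integration with bitsandbytes. It assumes `transformers`, `accelerate`, and `bitsandbytes` are installed and that CUDA GPUs are available; the function name `load_bloom_8bit` is hypothetical. Note that at roughly one byte per parameter the weights would still be on the order of 176 GB, so this reduces the footprint but does not make the full model fit on a small machine.

```python
MODEL_ID = "bigscience/bloom"


def load_bloom_8bit(model_id: str = MODEL_ID):
    """Load a causal LM with 8-bit weights via bitsandbytes (sketch)."""
    # Imports are deferred so that defining this function does not
    # require the heavy dependencies to be installed.
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
    )

    # Ask Transformers to quantize the linear layers to 8 bits on load.
    quant_config = BitsAndBytesConfig(load_in_8bit=True)

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # let accelerate spread layers across devices
    )
    return tokenizer, model
```

Calling `load_bloom_8bit()` downloads the full checkpoint, so it may be worth testing the same call first with a much smaller model ID.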
This configuration claims to run on >16 GB RAM and a single CPU: