Why does this model use left-padding by default?
I have a question about padding in tokenizers. By default, padding tokens are added to the left (start) of the sequence, unless we set padding_side='right' when loading the tokenizer.
Since LLMs process text from left to right, wouldn't having padding tokens at the start potentially affect how the model reads the actual content? I'm trying to understand why this is the default setting.
Also, does anyone know if Gemma-2 models were trained with this left-padding approach?
Hi @smbslt3,
Large Language Models are decoder-only architectures, so during inference left-padding (padding_side='left') is often preferred. These models are trained to predict the next token from the preceding context, and generation continues from the last position of each sequence in the batch. If the padding tokens are on the right, that last position holds a padding token for every sequence shorter than the longest one, so the model may generate outputs that include or are influenced by padding, leading to incorrect results. Left-padding keeps the meaningful tokens contiguous and places them directly before the position where generation starts, which improves the quality of the generated text. For more details, please refer to this link.
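To make the point above concrete, here is a minimal sketch in plain Python, with no Transformers dependency. The pad id `0`, the token ids, and the helpers `pad_batch` and `last_column_is_real` are all made up for illustration; they are not part of any library API. The sketch only shows where the padding ends up relative to the final position of each row, which is where a decoder-only model continues generating.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical pad token id, chosen only for this illustration


def pad_batch(
    seqs: List[List[int]], side: str, pad_id: int = PAD_ID
) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad token-id lists to a common length; return (input_ids, attention_mask)."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    width = max(len(s) for s in seqs)
    input_ids, attention_mask = [], []
    for s in seqs:
        pad = [pad_id] * (width - len(s))
        ones = [1] * len(s)      # 1 marks a real token
        zeros = [0] * len(pad)   # 0 marks a padded position
        if side == "left":
            input_ids.append(pad + s)
            attention_mask.append(zeros + ones)
        else:
            input_ids.append(s + pad)
            attention_mask.append(ones + zeros)
    return input_ids, attention_mask


def last_column_is_real(attention_mask: List[List[int]]) -> List[bool]:
    """Per row: does the final position (where generation continues) hold a real token?"""
    return [row[-1] == 1 for row in attention_mask]


batch = [[11, 12, 13, 14], [21, 22]]  # made-up token ids of different lengths

left_ids, left_mask = pad_batch(batch, "left")
right_ids, right_mask = pad_batch(batch, "right")

print(left_ids)                         # [[11, 12, 13, 14], [0, 0, 21, 22]]
print(right_ids)                        # [[11, 12, 13, 14], [21, 22, 0, 0]]
print(last_column_is_real(left_mask))   # [True, True]
print(last_column_is_real(right_mask))  # [True, False]
```

With right-padding, the shorter row ends in padding, so the next token would be predicted after a pad token rather than after the real text. With left-padding, every row ends in a real token, and the attention mask marks the leading pad positions so they can be ignored.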
Regarding "does anyone know if Gemma-2 models were trained with this left-padding approach?": this is not explicitly documented in any resources.
Thank you.
Small nitpick, but Large Language Models are not by definition decoder-only architectures. Sure, many popular ones are, but you can also have an encoder-decoder model like OpenAI Whisper or the original Transformer from the Attention Is All You Need paper.