Why does this model use left-padding by default?

#42

by smbslt3 - opened Jan 20, 2025

Jan 20, 2025

I have a question about padding in tokenizers. By default, padding tokens are added to the left (start) of the sequence, unless we set padding_side='right' when loading the tokenizer.
Since LLMs process text from left to right, wouldn't having padding tokens at the start potentially affect how the model reads the actual content? I'm trying to understand why this is the default setting.
Also, does anyone know if Gemma-2 models were trained with this left-padding approach?

smbslt3 changed discussion title from Why does this model 'left padded'? to Why does this model use left-padding by default? Jan 20, 2025

GopiUppari

Google org Jan 21, 2025

edited Jan 21, 2025

Hi @smbslt3 ,

Large Language Models are decoder-only architectures, and during inference left-padding (padding_side='left') is often preferred. This is because many LLMs are trained to predict the next token from the preceding context, and generation continues from the last position of each sequence. If padding tokens are on the right, that last position holds a padding token for the shorter sequences in a batch, so the model might generate outputs that follow or are influenced by these padding tokens, leading to incorrect results. Left-padding places the real tokens at the end of each sequence, so generation continues directly from the meaningful content, improving the quality of the generated text. For more details, please refer to this link.
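To make the difference concrete, here is a minimal sketch in plain Python (no tokenizer library involved). The pad id `0`, the token ids, and the helper names `pad_batch` and `last_position_is_real` are all made up for illustration and are not part of any library. It pads a small batch on each side and checks whether the final position of every row, which is where generation would continue from, holds a real token.

```python
PAD = 0  # hypothetical pad token id, chosen only for this illustration


def pad_batch(sequences, padding_side="left", pad_id=PAD):
    """Pad lists of token ids to a common length.

    Returns (input_ids, attention_mask), where the mask is 1 for real
    tokens and 0 for padding.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        if padding_side == "left":
            input_ids.append([pad_id] * n_pad + list(seq))
            attention_mask.append([0] * n_pad + [1] * len(seq))
        elif padding_side == "right":
            input_ids.append(list(seq) + [pad_id] * n_pad)
            attention_mask.append([1] * len(seq) + [0] * n_pad)
        else:
            raise ValueError("padding_side must be 'left' or 'right'")
    return input_ids, attention_mask


def last_position_is_real(attention_mask):
    """For each row, report whether the final position holds a real token."""
    return [row[-1] == 1 for row in attention_mask]


batch = [[11, 12, 13, 14], [21, 22]]  # made-up token ids for two prompts

left_ids, left_mask = pad_batch(batch, "left")
right_ids, right_mask = pad_batch(batch, "right")

print(left_ids)    # [[11, 12, 13, 14], [0, 0, 21, 22]]
print(right_ids)   # [[11, 12, 13, 14], [21, 22, 0, 0]]
print(last_position_is_real(left_mask))   # [True, True]
print(last_position_is_real(right_mask))  # [True, False]
```

With right-padding, the shorter prompt ends in padding, so anything appended during batched generation would come after pad tokens. With left-padding, every row ends in its own last real token. In both cases the attention mask is what marks the padded positions so they can be ignored.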

Regarding "does anyone know if Gemma-2 models were trained with this left-padding approach?": this is not explicitly documented in any resources.

Thank you.

Naataan

Feb 28, 2025

Small nitpick, but Large Language Models are not by definition decoder-only architectures. Sure, many popular ones are, but you can also have an encoder-decoder model like OpenAI Whisper or the original Transformer (a seq2seq model) from the Attention Is All You Need paper.
