
Sampling Parameters: For optimal performance, we recommend using temperatures close to zero (0 to 0.2). We also advise against using any kind of repetition penalty, as in our experience it negatively impacts the instruction-tuned model's responses.
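When running the GGUF files locally with llama.cpp, these recommendations map directly onto CLI flags. The sketch below writes a small launch script; the model filename is an assumption, so adjust it to the quantization you downloaded.

```shell
# Write a launch script with the recommended sampling settings.
# The model filename below is illustrative.
cat > run_alia.sh <<'EOF'
#!/bin/sh
./llama-cli -m ALIA-40b-instruct-2601.Q8_0.gguf \
  --temp 0.1 \
  --repeat-penalty 1.0 \
  -cnv
EOF
chmod +x run_alia.sh
```

A repeat penalty of 1.0 is the neutral value, i.e. no repetition penalty is applied, and `-cnv` starts llama-cli in interactive chat mode.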

ALIA-40b-instruct-GGUF

Description

This repo contains GGUF format model files for ALIA-40b-instruct-2601.

Quantization

Model weights were first exported to GGUF in FP16 and then quantized with llama.cpp's llama-quantize into the target preset (e.g., Q8). The same conversion and quantization pipeline was executed through quantool from a single YAML config (specifying method: gguf, quant_level, and a method-specific quantization_config) and run via quantool config.yaml.
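As a sketch, such a quantool config might look like the following. Only method, quant_level, and quantization_config are named above; every other field name and value here is an assumption for illustration.

```yaml
# Hypothetical quantool config.yaml; field names beyond method,
# quant_level and quantization_config are illustrative.
method: gguf
quant_level: Q8            # target preset, per the example in the text
quantization_config:       # method-specific options (assumed keys)
  intermediate_dtype: fp16 # export to FP16 before quantizing
```

The pipeline would then be run as quantool config.yaml.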

About GGUF

GGUF is the model file format introduced by the llama.cpp team on August 21st, 2023, replacing the older GGML format (now deprecated). It brings significant improvements such as enhanced tokenization, proper handling of special tokens, embedded metadata (e.g., architecture, quantization type, tokenizer), and an extensible design for future compatibility.

How to use

Deployment as service and remote use (Messages API)

In our experience, vLLM works well for deploying the full unquantized version of the model, whereas llama.cpp is appropriate for the quantized (GGUF) version. We strongly discourage using Ollama, as we have encountered compatibility issues with it that may seriously degrade the model's performance.

The easiest and most reliable way to get a working deployment of ALIA-40b-instruct is through the "Deploy / HF Inference Endpoints" option directly on the Hugging Face model page. This automatically creates a functioning endpoint, using vLLM or llama.cpp depending on the model variant, with an appropriately sized GPU. While additional settings are available for the endpoint, we found the standard configuration proposed by Hugging Face to be a reasonable starting point.

Once the endpoint is running, the model can be called through the OpenAI-compatible "Messages API" (the de facto standard for LLM serving). With this API, the chat template is applied automatically by the service, so no explicit configuration is required on the client side. The endpoint's configuration page on Hugging Face also provides a "Playground" with API examples, as well as a simple chat interface for testing.

Example usage:

# pip install openai

import os

from openai import OpenAI

client = OpenAI(
    base_url=YOUR_ENDPOINT_URL,  # the URL shown on your endpoint page, ending in /v1
    api_key=os.environ["HF_TOKEN"],  # your Hugging Face access token
)

chat_completion = client.chat.completions.create(
    model="BSC-LT/ALIA-40b-instruct-2601-GGUF",
    messages=[
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=True,
    max_tokens=1000,
    temperature=0.1,
)

# With stream=True the response is an iterator of chunks,
# so print each delta as it arrives instead of indexing a single message.
for chunk in chat_completion:
    print(chunk.choices[0].delta.content or "", end="")

The model can also be deployed locally, or on any server infrastructure with sufficient GPUs, using vLLM or llama.cpp. We recommend an initial deployment on Hugging Face as a point of reference and comparison, to make sure the model behaves as expected in the desired deployment setup.
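For a local llama.cpp deployment, llama-server exposes an OpenAI-compatible endpoint that the client code above can target. A minimal sketch, where the model filename and port are assumptions:

```shell
# Launch script for llama.cpp's OpenAI-compatible server.
# Model filename and port below are illustrative.
cat > serve_alia.sh <<'EOF'
#!/bin/sh
./llama-server -m ALIA-40b-instruct-2601.Q8_0.gguf --port 8080
EOF
chmod +x serve_alia.sh
```

With the server running, point the OpenAI client's base_url at http://localhost:8080/v1; the api_key can be any placeholder string, since llama-server does not check it by default.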

To check that your endpoint is working correctly, you can try to replicate the examples contained in this Colab Notebook.


Additional information

Author

The Language Technologies Lab from Barcelona Supercomputing Center.

Contact

For further information, please send an email to langtech@bsc.es.

Copyright

Copyright (c) 2026 by the Language Technologies Lab, Barcelona Supercomputing Center.

Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública (funded by the EU through NextGenerationEU) within the framework of the project Modelos del Lenguaje.

This work has been promoted and supported by the Government of Catalonia through the Aina Project.

Acknowledgements

This project has benefited from the contributions of numerous teams and institutions, mainly through data contributions, knowledge transfer or technical support.

We are especially grateful to our ILENIA project partners: CENID, HiTZ and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, and the ‘Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)’ of the University of Las Palmas de Gran Canaria. Many other institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà. We thank the Welsh government, DFKI, Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration.

We would also like to give special thanks to the NVIDIA team, with whom we have met regularly, especially to: Ignacio Sarasua, Adam Henryk Grzywaczewski, Oleg Sudakov, Sergio Perez, Miguel Martinez, Felipe Soares and Meriem Bendris. Their constant support has been especially appreciated throughout the entire process.

Their valuable efforts have been instrumental in the development of this work.

Disclaimer

Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.

The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.

Citation

@misc{gonzalezagirre2025salamandratechnicalreport,
      title={Salamandra Technical Report}, 
      author={Aitor Gonzalez-Agirre and Marc Pàmies and Joan Llop and Irene Baucells and Severino Da Dalt and Daniel Tamayo and José Javier Saiz and Ferran Espuña and Jaume Prats and Javier Aula-Blasco and Mario Mina and Adrián Rubio and Alexander Shvets and Anna Sallés and Iñaki Lacunza and Iñigo Pikabea and Jorge Palomar and Júlia Falcão and Lucía Tormo and Luis Vasquez-Reina and Montserrat Marimon and Valle Ruíz-Fernández and Marta Villegas},
      year={2025},
      eprint={2502.08489},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.08489}, 
}

License

Apache License, Version 2.0

Model Index

| Model | Base | Instruct |
|-------|------|----------|
| 2b    | Link | Link     |
| 7b    | Link | Link     |
| 40b   | Link | Link     |