Gemma 3 4B T1-it GGUF Collection

GGUF quantized models converted from twinkle-ai/gemma-3-4B-T1-it for use with llama.cpp.


About

Gemma 3 4B T1-it is a small language model fine-tuned on Taiwan-focused datasets, supporting both English and Traditional Chinese. This repository provides multiple quantization formats optimized for different use cases.

Available Models

Model | Size | Use Case
twinkle-ai-gemma-3-4B-T1-it-BF16.gguf | Largest | Best quality, highest precision
twinkle-ai-gemma-3-4B-T1-it-F16.gguf | Large | High quality, good precision
twinkle-ai-gemma-3-4B-T1-it-Q8_0.gguf | Medium | Balanced quality and speed
twinkle-ai-gemma-3-4b-t1-it-q4_k_m.gguf | Smallest | Fastest inference, lower memory

Quick Start

Option 1: Using Hugging Face Hub (Recommended)

Install llama.cpp via Homebrew:

brew install llama.cpp

Run inference directly from Hugging Face:

llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -p "Your prompt here"

Start as a server:

llama-server --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -c 2048
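
Once the server is running, it exposes an OpenAI-compatible HTTP API. A minimal query sketch, assuming the default address 127.0.0.1:8080 (change it if you started the server with --host or --port):

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Introduce Taiwan in one sentence."}]}'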

Option 2: Build from Source

Step 1: Clone llama.cpp repository

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Step 2: Build llama.cpp

Basic build (CPU only):

LLAMA_CURL=1 make

Hardware-specific build options (a note on CMake-based builds follows this list):

  • NVIDIA GPU (Linux):

    LLAMA_CUDA=1 LLAMA_CURL=1 make
    
  • Apple Silicon (Mac):

    LLAMA_METAL=1 LLAMA_CURL=1 make
    
  • AMD GPU (ROCm):

    LLAMA_HIPBLAS=1 LLAMA_CURL=1 make
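
Note: newer llama.cpp releases build with CMake instead of the Makefile, so the make commands above may fail on a recent checkout. A rough CMake equivalent (GGML_CUDA is the current name of the CUDA switch; drop it for a CPU-only build):

# configure and build in parallel; adjust the -D options for your hardware
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j

With CMake, the binaries end up under build/bin/ (for example build/bin/llama-cli) rather than the repository root.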
    

Step 3: Run inference

./llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -p "Your prompt here"

Step 4: Start server (optional)

./llama-server --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -c 2048

Advanced Usage

Choosing the Right Model

Select a model based on your needs:

  • Best Quality: Use BF16 or F16 versions (requires more memory)
  • Balanced: Use Q8_0 version (recommended for most users)
  • Resource Constrained: Use q4_k_m version (suitable for devices with limited memory)
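
If you prefer to fetch a file ahead of time instead of relying on llama.cpp's automatic download, the Hugging Face CLI can pull a single file. A minimal sketch, assuming huggingface_hub is installed and that the files listed above are hosted in this repository:

pip install -U huggingface_hub
huggingface-cli download twinkle-ai/gemma-3-4B-T1-it-GGUF \
  twinkle-ai-gemma-3-4B-T1-it-Q8_0.gguf \
  --local-dir ./models

You can then point llama-cli at the local file with -m ./models/twinkle-ai-gemma-3-4B-T1-it-Q8_0.gguf instead of the --hf-repo/--hf-file pair.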

Common Parameters

  • -p "prompt": Your input text for the model to respond to
  • -c 2048: Context length (maximum number of tokens that can be processed)
  • --hf-repo: Hugging Face repository name
  • --hf-file: Model file name to use

Adjusting Generation Parameters

llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -p "Your prompt here" \
  --temp 0.7 \
  --top-p 0.9 \
  --repeat-penalty 1.1

Parameter explanations:

  • --temp: Temperature (0.0-2.0), higher values produce more random output
  • --top-p: Nucleus sampling parameter (0.0-1.0)
  • --repeat-penalty: Repetition penalty to avoid repetitive content
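
For more reproducible output (for example, when comparing quantizations), you can pin the random seed, lower the temperature, and cap the response length; a sketch using standard llama.cpp flags (the values are only examples):

llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -p "Your prompt here" \
  --temp 0 \
  --seed 42 \
  -n 256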

Model Information

  • Base Model: twinkle-ai/gemma-3-4B-T1-it
  • Languages: English, Traditional Chinese
  • License: Gemma
  • Format: GGUF (converted via GGUF-my-repo)

Training Data

  • Taiwan reasoning and instruction datasets
  • Contract review and legal documents
  • Multimodal and long-form content
  • Instruction-following examples

Benchmarks

  • TMMLU+: 47.44% accuracy
  • MMLU: 59.13% accuracy
  • TW Legal Benchmark: 44.18% accuracy

Troubleshooting

Common Issues

Q: Running into out-of-memory errors?

A: Try using a smaller quantized version like q4_k_m, or reduce the context length parameter -c.
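
For example, the Quick Start command with a much smaller context window:

llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -c 512 \
  -p "Your prompt here"

Switching to the q4_k_m model listed in the table above reduces memory use further.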

Q: How can I speed up inference?

A:

  1. Use GPU acceleration (add hardware-specific flags during compilation)
  2. Choose a smaller quantized model (like q4_k_m)
  3. Reduce context length
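
On a CPU-only build, setting an explicit thread count (roughly your number of physical cores) is the easiest lever; on a GPU build, add -ngl as in the Step 3 example above. A sketch (8 threads is just an example value):

llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -t 8 \
  -p "Your prompt here"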

Q: What prompt format does the model support?

A: This is an instruction-tuned model. Use a clear instruction format, for example:

Please analyze the main clauses of the following contract: [contract content]
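
Recent llama.cpp builds can also read the chat template stored in the GGUF metadata: running llama-cli in conversation mode (-cnv) applies it automatically, so you do not have to format the turns yourself. A sketch:

llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -cnv

Under the hood, Gemma-family templates wrap each turn in <start_of_turn>user ... <end_of_turn> and <start_of_turn>model markers.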

Contributing

If you have any questions or suggestions, please feel free to open a discussion in the Hugging Face repository.


Note: On first run, llama.cpp will automatically download the model file from Hugging Face. Please ensure you have a stable internet connection.
