Gemma 3 4B T1-it GGUF Collection

GGUF quantized models converted from twinkle-ai/gemma-3-4B-T1-it for use with llama.cpp.


About

Gemma 3 4B T1-it is a small language model fine-tuned on Taiwan-focused datasets, supporting both English and Traditional Chinese. This repository provides multiple quantization formats optimized for different use cases.

Available Models

Model | Size | Use Case
twinkle-ai-gemma-3-4B-T1-it-BF16.gguf | Largest | Best quality, highest precision
twinkle-ai-gemma-3-4B-T1-it-F16.gguf | Large | High quality, good precision
twinkle-ai-gemma-3-4B-T1-it-Q8_0.gguf | Medium | Balanced quality and speed
twinkle-ai-gemma-3-4b-t1-it-q4_k_m.gguf | Smallest | Fastest inference, lower memory

Quick Start

Option 1: Using Hugging Face Hub (Recommended)

Install llama.cpp via Homebrew:

brew install llama.cpp

Run inference directly from Hugging Face:

llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -p "Your prompt here"

Start as a server:

llama-server --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -c 2048
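
Once the server is running, it exposes an OpenAI-compatible HTTP API. A minimal query sketch, assuming the default address 127.0.0.1:8080 (change it if you started the server with --host or --port):

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Introduce Taiwan in one sentence."}]}'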

Option 2: Build from Source

Step 1: Clone llama.cpp repository

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Step 2: Build llama.cpp

Basic build (CPU only):

LLAMA_CURL=1 make

Hardware-specific build options (a note on CMake-based builds follows this list):

  • NVIDIA GPU (Linux):

    LLAMA_CUDA=1 LLAMA_CURL=1 make
    
  • Apple Silicon (Mac):

    LLAMA_METAL=1 LLAMA_CURL=1 make
    
  • AMD GPU (ROCm):

    LLAMA_HIPBLAS=1 LLAMA_CURL=1 make
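
Note: newer llama.cpp releases build with CMake instead of the Makefile, so the make commands above may fail on a recent checkout. A rough CMake equivalent (GGML_CUDA is the current name of the CUDA switch; drop it for a CPU-only build):

# configure and build in parallel; adjust the -D options for your hardware
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j

With CMake, the binaries end up under build/bin/ (for example build/bin/llama-cli) rather than the repository root.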
    

Step 3: Run inference

./llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -p "Your prompt here"

Step 4: Start server (optional)

./llama-server --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -c 2048

Advanced Usage

Choosing the Right Model

Select a model based on your needs:

  • Best Quality: Use BF16 or F16 versions (requires more memory)
  • Balanced: Use Q8_0 version (recommended for most users)
  • Resource Constrained: Use q4_k_m version (suitable for devices with limited memory)
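
If you prefer to fetch a file ahead of time instead of relying on llama.cpp's automatic download, the Hugging Face CLI can pull a single file. A minimal sketch, assuming huggingface_hub is installed and that the files listed above are hosted in this repository:

pip install -U huggingface_hub
huggingface-cli download twinkle-ai/gemma-3-4B-T1-it-GGUF \
  twinkle-ai-gemma-3-4B-T1-it-Q8_0.gguf \
  --local-dir ./models

You can then point llama-cli at the local file with -m ./models/twinkle-ai-gemma-3-4B-T1-it-Q8_0.gguf instead of the --hf-repo/--hf-file pair.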

Common Parameters

  • -p "prompt": Your input text for the model to respond to
  • -c 2048: Context length (maximum number of tokens that can be processed)
  • --hf-repo: Hugging Face repository name
  • --hf-file: Model file name to use

Adjusting Generation Parameters

llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -p "Your prompt here" \
  --temp 0.7 \
  --top-p 0.9 \
  --repeat-penalty 1.1

Parameter explanations:

  • --temp: Temperature (0.0-2.0), higher values produce more random output
  • --top-p: Nucleus sampling parameter (0.0-1.0)
  • --repeat-penalty: Repetition penalty to avoid repetitive content
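
For more reproducible output (for example, when comparing quantizations), you can pin the random seed, lower the temperature, and cap the response length; a sketch using standard llama.cpp flags (the values are only examples):

llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -p "Your prompt here" \
  --temp 0 \
  --seed 42 \
  -n 256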

Model Information

  • Base Model: twinkle-ai/gemma-3-4B-T1-it
  • Languages: English, Traditional Chinese
  • License: Gemma
  • Format: GGUF (converted via GGUF-my-repo)

Training Data

  • Taiwan reasoning and instruction datasets
  • Contract review and legal documents
  • Multimodal and long-form content
  • Instruction-following examples

Benchmarks

  • TMMLU+: 47.44% accuracy
  • MMLU: 59.13% accuracy
  • TW Legal Benchmark: 44.18% accuracy

Troubleshooting

Common Issues

Q: Running into out-of-memory errors?

A: Try using a smaller quantized version like q4_k_m, or reduce the context length parameter -c.
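
For example, the Quick Start command with a much smaller context window:

llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -c 512 \
  -p "Your prompt here"

Switching to the q4_k_m model listed in the table above reduces memory use further.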

Q: How can I speed up inference?

A:

  1. Use GPU acceleration (add hardware-specific flags during compilation)
  2. Choose a smaller quantized model (like q4_k_m)
  3. Reduce context length
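
On a CPU-only build, setting an explicit thread count (roughly your number of physical cores) is the easiest lever; on a GPU build, add -ngl as in the Step 3 example above. A sketch (8 threads is just an example value):

llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -t 8 \
  -p "Your prompt here"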

Q: What prompt format does the model support?

A: This is an instruction-tuned model. Use a clear instruction format, for example:

Please analyze the main clauses of the following contract: [contract content]
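
Recent llama.cpp builds can also read the chat template stored in the GGUF metadata: running llama-cli in conversation mode (-cnv) applies it automatically, so you do not have to format the turns yourself. A sketch:

llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -cnv

Under the hood, Gemma-family templates wrap each turn in <start_of_turn>user ... <end_of_turn> and <start_of_turn>model markers.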

Contributing

If you have any questions or suggestions, please feel free to open a discussion in the Hugging Face repository.


Note: On first run, llama.cpp will automatically download the model file from Hugging Face. Please ensure you have a stable internet connection.
