Qwen3-14B-f16-GGUF

This is a GGUF-quantized version of the Qwen/Qwen3-14B language model — a 14-billion-parameter LLM with deep reasoning, research-grade accuracy, and autonomous workflows. Converted for use with llama.cpp, LM Studio, OpenWebUI, GPT4All, and more.

NEW: I have added a custom quantization called Q3_HIFI, which outperforms the standard Q3_K_M: it is higher quality, smaller in size, and runs at nearly the same speed.

It is listed under the 'f16' options because it's not an officially recognised type (at the moment).

Q3_HIFI

Pros:

  • 🏆 Best quality with lowest perplexity of 9.38 (1.6% better than Q3_K_M, 3.4% better than Q3_K_S)
  • 📦 Smaller than Q3_K_M (6.59 vs 6.81 GiB) while being significantly better quality
  • 🎯 Uses intelligent layer-sensitive quantization (Q3_HIFI on sensitive layers, mixed q3_K/q4_K elsewhere)
  • Slightly faster than Q3_K_M (85.58 vs 85.40 TPS)

Cons:

  • 🐢 About 6.5% slower than Q3_K_S (85.58 TPS)
  • 🔧 Custom quantization may have less community support

Best for: Production deployments where output quality matters, tasks requiring accuracy (reasoning, coding, complex instructions), or when you want the best quality-to-size ratio.

You can read more about how it compares to Q3_K_M and Q3_K_S here: Q3_Quantization_Comparison.md

You can also view a cross-model comparison of the Q3_HIFI type here.

Available Quantizations (from f16)

| Level | Speed | Size | Recommendation |
|-------|-------|------|----------------|
| Q2_K | ⚡ Fastest | 5.75 GB | An excellent option, but it failed the 'hello' test. Use with caution. |
| 🥇 Q3_K_S | ⚡ Fast | 6.66 GB | 🥇 Best overall model. Two 1st places and two 3rd places, with excellent results across the full temperature range. |
| 🥉 Q3_K_M | ⚡ Fast | 7.32 GB | 🥉 A good option: it came 1st and 3rd, covering both ends of the temperature range. |
| Q4_K_S | 🚀 Fast | 8.57 GB | Not recommended. Two 2nd places on low-temperature questions, with no other appearances. |
| Q4_K_M | 🚀 Fast | 9.00 GB | Not recommended. A single 3rd place with no other appearances. |
| 🥈 Q5_K_S | 🐢 Medium | 10.3 GB | 🥈 A very good second choice. A top-3 finisher across the full temperature range. |
| Q5_K_M | 🐢 Medium | 10.5 GB | Not recommended. A single 3rd place with no other appearances. |
| Q6_K | 🐌 Slow | 12.1 GB | Not recommended. No top-3 finishes at all. |
| Q8_0 | 🐌 Slow | 15.7 GB | Not recommended. A single 2nd place with no other appearances. |
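
If you only want one of these files, you can fetch it individually. A sketch using the Hugging Face CLI (the repo id and file name here are taken from the wget URL in the Usage section below, with %3A decoded to a colon):

huggingface-cli download geoffmunn/Qwen3-14B "Qwen3-14B-f16:Q3_K_S.gguf" --local-dir .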



Why Use a 14B Model?

The Qwen3-14B model delivers serious intelligence in a locally runnable package, offering near-flagship performance while remaining feasible to run on a single high-end consumer GPU or a well-equipped CPU setup. It’s the optimal choice when you need strong reasoning, robust code generation, and deep language understanding—without relying on the cloud or massive infrastructure.

Highlights:

  • State-of-the-art performance among open 14B-class models, excelling in reasoning, math, coding, and multilingual tasks
  • Efficient inference with quantization: runs on a 24 GB GPU (e.g., RTX 4090) or even CPU with quantized GGUF/AWQ variants (~12–14 GB RAM usage)
  • Strong contextual handling: supports long inputs and complex multi-step workflows, ideal for agentic or RAG-based systems
  • Fully open and commercially usable, giving you full control over deployment and customization

It’s ideal for:

  • Self-hosted AI assistants that understand nuance, remember context, and generate high-quality responses
  • On-prem development environments needing local code completion, documentation, or debugging
  • Private RAG or enterprise applications requiring accuracy, reliability, and data sovereignty
  • Researchers and developers seeking a powerful, open-weight alternative to closed 10B–20B models

Choose Qwen3-14B when you’ve outgrown 7B–8B models but still want to run efficiently offline—balancing capability, control, and cost without sacrificing quality.

Build notes

All of these models (including Q3_HIFI) were built using these commands:

mkdir build
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON -DGGML_AVX=ON -DGGML_AVX2=ON -DGGML_CUDA=ON -DGGML_VULKAN=OFF -DLLAMA_CURL=OFF
cmake --build build --config Release -j 

NOTE: Vulkan support is specifically turned off here because Vulkan performance was much worse in testing. If you want Vulkan support, rebuild llama.cpp yourself with -DGGML_VULKAN=ON.

The quantisation for Q3_HIFI also used a 5000-chunk imatrix file for extra precision. You can re-use it here: Qwen3-14B-f16-imatrix-5000.gguf
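
If you want to regenerate the imatrix with your own calibration data, a minimal sketch using llama.cpp's imatrix and quantize tools looks like this (binary names and flags are from recent llama.cpp builds and may differ in older versions; calibration.txt is a placeholder for your own calibration text):

# Build an importance matrix from calibration text (5000 chunks, matching the file above)
./build/bin/llama-imatrix -m Qwen3-14B-f16.gguf -f calibration.txt --chunks 5000 -o Qwen3-14B-f16-imatrix-5000.gguf

# Apply the imatrix while quantizing (shown here for a standard Q3_K_M target)
./build/bin/llama-quantize --imatrix Qwen3-14B-f16-imatrix-5000.gguf Qwen3-14B-f16.gguf Qwen3-14B-f16:Q3_K_M.gguf Q3_K_M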

If you're interested, you can build Q3_HIFI from source using the Q3_HIFI branch of https://github.com/geoffmunn/llama.cpp.
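
For example (assuming the branch is named exactly Q3_HIFI, as stated above):

git clone https://github.com/geoffmunn/llama.cpp
cd llama.cpp
git checkout Q3_HIFI
# then run the cmake commands from the build notes above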

Model analysis and rankings

There are two good candidates: Qwen3-14B-f16:Q3_K_S and Qwen3-14B-f16:Q5_K_S. These cover the full range of temperatures and are good at all question types.

Another good option would be Qwen3-14B-f16:Q3_K_M, with good finishes across the temperature range.

Qwen3-14B-f16:Q2_K got very good results and would have been a 1st or 2nd place candidate, but it was the only model to fail the 'hello' question, which it should have passed.

You can read the results here: Qwen3-14b-analysis.md

If you find this useful, please give the project a ❤️ like.

Usage

Load this model using:

  • OpenWebUI – self-hosted AI interface with RAG & tools
  • LM Studio – desktop app with GPU support and chat templates
  • GPT4All – private, local AI chatbot (offline-first)
  • Or directly via llama.cpp

Each quantized model includes its own README.md and shares a common MODELFILE for optimal configuration.
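
If you want to run a GGUF file directly with llama.cpp, here is a minimal sketch, assuming you have downloaded the Q3_K_S file into the current directory and built llama.cpp as in the build notes above (exact flag names can vary slightly between llama.cpp releases):

# Interactive chat from the terminal; -ngl 99 offloads all layers to the GPU if one is available
./build/bin/llama-cli -m Qwen3-14B-f16:Q3_K_S.gguf -ngl 99 --temp 0.6 --top-p 0.95 --top-k 20 -p "Explain what an importance matrix (imatrix) is in one paragraph."

# Or serve an OpenAI-compatible API for OpenWebUI and other clients
./build/bin/llama-server -m Qwen3-14B-f16:Q3_K_S.gguf -ngl 99 -c 4096 --port 8080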

Importing directly into Ollama should work, but you might encounter this error: "Error: invalid character '<' looking for beginning of value". In this case, try these steps:

  1. wget https://huggingface.co/geoffmunn/Qwen3-14B/resolve/main/Qwen3-14B-f16%3AQ3_K_S.gguf (replace the quantised version with the one you want)
  2. nano Modelfile and enter these details (again, replacing Q3_K_S with the version you want):
FROM ./Qwen3-14B-f16:Q3_K_S.gguf

# Chat template using ChatML (used by Qwen)
SYSTEM You are a helpful assistant

TEMPLATE "{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>

# Default sampling
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER min_p 0.0
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 4096

The num_ctx value has been lowered to 4096 to increase speed significantly; raise it if you need longer prompts or conversations.

  3. Then run this command: ollama create Qwen3-14B-f16:Q3_K_S -f Modelfile

You will now see "Qwen3-14B-f16:Q3_K_S" in your Ollama model list.
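
To confirm the import and give it a quick test:

ollama list
ollama run Qwen3-14B-f16:Q3_K_S "Summarise what GGUF quantization is in two sentences."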

These import steps are also useful if you want to customise the default parameters or system prompt.
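
As a quick sketch, there are two common ways to do that: override a parameter for the current interactive session, or edit the Modelfile and recreate the tag (the /set change is not kept unless you /save the model under a name):

# Session-only override inside the Ollama REPL
ollama run Qwen3-14B-f16:Q3_K_S
>>> /set parameter num_ctx 8192

# Persistent change: edit the Modelfile, then rebuild the tag
nano Modelfile
ollama create Qwen3-14B-f16:Q3_K_S -f Modelfile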

Author

👤 Geoff Munn (@geoffmunn)
🔗 Hugging Face Profile

Disclaimer

This is a community conversion for local inference. Not affiliated with Alibaba Cloud or the Qwen team.
