Qwen3-8B LiteRT-LM Models

This repository contains LiteRT-LM variants of Qwen/Qwen3-8B optimized for on-device text generation.

Available Artifacts

File	Quantization	Context (tokens)	Size
`qwen3_8b_channelwise_int8_float32kv.litertlm`	channel-wise INT8 weights, float32 KV	-	7.74 GB
`qwen3_8b_mixed_int4.litertlm`	TorchAO mixed INT4, float KV	2048	4661.00 MiB (about 4.55 GiB)

Conversion Notes

The mixed INT4 .litertlm artifact was produced from the original Hugging Face checkpoint with a TorchAO-based quantize-first recipe. It is a mixed quantization layout, not a uniform all-INT4 model. Eligible linear projection weights are stored as blockwise INT4 with group size 32 and floating-point scales. Token embedding weights use weight-only INT8 quantization. Normalization and reduction paths, as well as the KV cache tensors, remain in floating point.
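
To make "blockwise INT4 with group size 32 and floating-point scales" concrete, the following is a minimal NumPy sketch of per-group quantization. It assumes a symmetric scheme with one float32 scale per group of 32 weights; the actual TorchAO recipe may differ (for example, it may use zero points, a different rounding rule, or packed storage). The function names are hypothetical and are not part of TorchAO or LiteRT-LM.

```python
import numpy as np

GROUP_SIZE = 32  # group size stated in the conversion notes


def quantize_int4_blockwise(weights, group_size=GROUP_SIZE):
    """Illustrative symmetric per-group INT4 quantization of a 2-D matrix.

    Assumption: symmetric quantization with one float32 scale per group.
    Returns int8 codes in [-8, 7] (unpacked) and the per-group scales.
    """
    rows, cols = weights.shape
    if cols % group_size != 0:
        raise ValueError("number of columns must be divisible by group_size")
    groups = weights.astype(np.float32).reshape(rows, cols // group_size, group_size)
    # Map the largest magnitude in each group to the code 7.
    max_abs = np.abs(groups).max(axis=-1, keepdims=True)
    scales = np.where(max_abs > 0, max_abs / 7.0, 1.0).astype(np.float32)
    codes = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return codes, scales


def dequantize_int4_blockwise(codes, scales):
    """Reconstruct an approximate float32 matrix from codes and scales."""
    groups = codes.astype(np.float32) * scales
    return groups.reshape(groups.shape[0], -1)
```

Under this scheme each weight is reconstructed to within half a scale step of its original value, and the scales add one float per 32 weights of storage overhead.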

The mixed INT4 bundle also uses LiteRT-LM StableHLO composite ops for attention/cache execution, including odml.runtime_bmm and odml.cache_update.
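
For a rough sense of what a floating-point KV cache costs at the 2048-token context listed for the mixed INT4 artifact, the sketch below works through the arithmetic. The architecture values (36 layers, 8 key/value heads, head dimension 128) are assumptions taken from the published Qwen3-8B configuration, not from this card, and the card does not say whether "float KV" means 16-bit or 32-bit values, so both are shown. Batch size 1 is assumed.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_value):
    """Bytes needed to hold keys and values for one sequence."""
    # Factor 2: one tensor for keys and one for values per layer.
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_value


# Assumed Qwen3-8B attention shape (not stated in this model card).
QWEN3_8B_SHAPE = {"num_layers": 36, "num_kv_heads": 8, "head_dim": 128}

CONTEXT_LEN = 2048  # context listed for the mixed INT4 artifact
MIB = 1024 * 1024

kv_fp32_mib = kv_cache_bytes(**QWEN3_8B_SHAPE, context_len=CONTEXT_LEN, bytes_per_value=4) / MIB
kv_fp16_mib = kv_cache_bytes(**QWEN3_8B_SHAPE, context_len=CONTEXT_LEN, bytes_per_value=2) / MIB
```

Under these assumptions the cache comes to 576 MiB at 32-bit precision or 288 MiB at 16-bit precision. These are estimates, not measurements from this repository.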

Desktop Smoke Benchmark

The benchmark ran on an AMD Radeon AI PRO R9700 through the LiteRT-LM WebGPU backend, with 256 prefill tokens and 32 decode tokens. No phone benchmark is reported for a model of this size.

Hardware benchmark disclosure: Results were measured by us on retail devices purchased through normal channels. These results are not affiliated with, sponsored by, endorsed by, or verified by Samsung, vivo, Qualcomm, MediaTek, Google, MLCommons, or Hugging Face. Results depend on device SKU, OS build, thermal state, battery mode, backend, model quantization, runtime version, and benchmark settings.

Backend	Prefill (tok/s)	Decode (tok/s)	TTFT (s)	Peak Private Footprint
GPU WebGPU	860.57	67.23	0.31	2588 MB
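
The throughput figures can be turned into approximate wall-clock times with simple division. The sketch below uses only the numbers from the table and the stated token counts. It assumes that decode time scales linearly with token count and that total latency is TTFT plus decode time, which is a simplification.

```python
# Values copied from the benchmark table and settings above.
PREFILL_TOKENS = 256
DECODE_TOKENS = 32
PREFILL_TOK_PER_S = 860.57
DECODE_TOK_PER_S = 67.23
TTFT_S = 0.31


def seconds_for(tokens, tok_per_s):
    """Time to process `tokens` at a steady rate of `tok_per_s`."""
    return tokens / tok_per_s


# Prefill time implied by the prefill throughput.
prefill_s = seconds_for(PREFILL_TOKENS, PREFILL_TOK_PER_S)

# Assumed end-to-end estimate: time to first token plus steady-state decode.
total_s = TTFT_S + seconds_for(DECODE_TOKENS, DECODE_TOK_PER_S)
```

With these inputs, prefill works out to roughly 0.30 s, close to the reported 0.31 s TTFT, and the end-to-end estimate for this run is about 0.79 s.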

Try It

Install uv and run:

uv tool install litert-lm
uvx litert-lm run --from-huggingface-repo=litert-community/Qwen3-8B qwen3_8b_mixed_int4.litertlm --prompt="What is the capital of France?"
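
To call the same command from a script, one option is a thin Python wrapper around the CLI. This is a sketch that reuses only the flags shown above. It assumes `uvx` is on the PATH and that the CLI writes the generated text to standard output, which this card does not confirm. The helper names are hypothetical.

```python
import subprocess

REPO = "litert-community/Qwen3-8B"
ARTIFACT = "qwen3_8b_mixed_int4.litertlm"


def build_run_command(prompt, repo=REPO, artifact=ARTIFACT):
    """Build the argv list for the `uvx litert-lm run` command shown above."""
    return [
        "uvx",
        "litert-lm",
        "run",
        f"--from-huggingface-repo={repo}",
        artifact,
        f"--prompt={prompt}",
    ]


def run_prompt(prompt):
    """Run the CLI and return whatever it prints to stdout.

    Assumption: the generated text appears on stdout. Raises
    subprocess.CalledProcessError if the command exits with an error.
    """
    result = subprocess.run(
        build_run_command(prompt), capture_output=True, text=True, check=True
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains spaces or quotes. Calling `run_prompt("What is the capital of France?")` triggers the same download and generation as the command line above.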

Integration

Ready to integrate this into your product? Get started in the LiteRT-LM documentation.

Citation

@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
