# PersonaPlex 7B v1 – 4-bit NF4 Quantized (bitsandbytes)
This is a 4-bit NF4 quantized version of nvidia/personaplex-7b-v1 using bitsandbytes.
PersonaPlex is a real-time, full-duplex speech-to-speech conversational model with persona control through text-based role prompts and audio-based voice conditioning.
## Why Quantize?
The original model requires ~14 GiB of VRAM in bf16, which exceeds the capacity of consumer GPUs like the RTX 4070 (12 GB). This 4-bit quantized version compares as follows:
|  | Original (bf16) | Quantized (NF4) |
|---|---|---|
| VRAM | ~14 GiB | ~9.6 GiB |
| GPU | A100 / H100 | RTX 4070+ (12GB) |
| torch.compile | Yes | Yes |
| CUDA graphs | Yes | Yes |
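The weight-memory saving follows directly from bytes per parameter. A rough, weights-only estimate for a 7B model is sketched below; note that measured runtime VRAM (~9.6 GiB) is higher than the raw 4-bit weight size because the unquantized bf16 components, activations, KV cache, and per-block quantization scales all add overhead.

```python
# Weights-only VRAM estimate for a 7B-parameter model (illustrative arithmetic).
params = 7e9
bf16_gib = params * 2 / 2**30    # bf16: 2 bytes per weight
nf4_gib = params * 0.5 / 2**30   # NF4: 4 bits per weight (excludes per-block scales)
print(f"bf16 weights: ~{bf16_gib:.1f} GiB")  # ~13.0 GiB
print(f"NF4 weights:  ~{nf4_gib:.1f} GiB")   # ~3.3 GiB
```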
## What's Quantized?
Only the main transformer's linear layers (attention projections + gating FFN) are quantized to 4-bit NF4. The following are kept in bf16 for quality:
- Mimi audio encoder/decoder
- Depformer (depth transformer)
- Embedding layers
- Output heads
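The selection rule above can be sketched as a simple name filter. This is a minimal illustration, not the actual moshi loader code, and the module-name prefixes are hypothetical:

```python
# Hypothetical layer-selection rule: quantize linear layers of the main transformer;
# keep the audio codec, depth transformer, embeddings, and output heads in bf16.
KEEP_BF16 = ("mimi.", "depformer.", "emb.", "heads.")  # illustrative prefixes

def should_quantize(name: str, is_linear: bool) -> bool:
    """True if this submodule would be replaced by a 4-bit NF4 linear."""
    return is_linear and not name.startswith(KEEP_BF16)

print(should_quantize("transformer.layers.0.self_attn.in_proj", True))  # True
print(should_quantize("mimi.encoder.conv1", True))                      # False
print(should_quantize("transformer.layers.0.norm1", False))             # False
```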
## Quick Start

### Prerequisites
- Accept the PersonaPlex license (required for the base model assets)
- Set your HuggingFace token:

  ```bash
  export HF_TOKEN=<YOUR_TOKEN>
  ```
### Installation

```bash
git clone https://huggingface.co/brianmatzelle/personaplex-7b-v1-bnb-4bit
cd personaplex-7b-v1-bnb-4bit
pip install moshi/.
pip install bitsandbytes
```
### Run (Live Server)

```bash
SSL_DIR=$(mktemp -d)
python -m moshi.server --ssl "$SSL_DIR" --quantize-4bit
```
Then open https://localhost:8998 in your browser.
### Run (Offline Evaluation)

```bash
python -m moshi.offline \
    --voice-prompt "NATF2.pt" \
    --input-wav "assets/test/input_assistant.wav" \
    --seed 42424242 \
    --output-wav "output.wav" \
    --output-text "output.json" \
    --quantize-4bit
```
## Using Pre-Quantized Weights

This repo includes pre-quantized weights (`model_bnb_4bit.pt`), so you don't need the full 16.7 GB download. To use them, pass `--moshi-weight model_bnb_4bit.pt` along with `--quantize-4bit`. The loader auto-detects the pre-quantized format and skips re-quantization.
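The auto-detection can be pictured as a key scan over the checkpoint: a bitsandbytes 4-bit checkpoint carries extra quant-state entries alongside each packed weight. The sketch below is illustrative; the key suffix shown is an assumption, not necessarily what the actual loader checks:

```python
def looks_prequantized(state_dict_keys) -> bool:
    # bitsandbytes Linear4bit serializes quant state next to the packed weight
    # (e.g. a per-block "absmax" tensor); seeing such keys means the weights
    # are already 4-bit and re-quantization can be skipped. Key names illustrative.
    return any(k.endswith(".weight.absmax") for k in state_dict_keys)

print(looks_prequantized(["l.0.in_proj.weight", "l.0.in_proj.weight.absmax"]))  # True
print(looks_prequantized(["l.0.in_proj.weight"]))                               # False
```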
## Changes from Base Model

This repo includes a modified `moshi/` package with:

- `--quantize-4bit` flag for on-the-fly 4-bit NF4 quantization via bitsandbytes
- Pre-quantized checkpoint loading (auto-detected, no re-quantization needed)
- `--cpu-offload` fixes for consumer GPU compatibility
- Attention `in_proj` refactored as a proper `nn.Module` for quantization support
- Gating forward path updated to route through quantized modules
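The `in_proj` refactor can be pictured as follows. This is a simplified sketch, not the actual moshi code: a fused QKV weight stored as a bare `nn.Parameter` cannot be swapped out for a `bitsandbytes` 4-bit linear, whereas a proper `nn.Linear` submodule can be replaced in place.

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    """Simplified sketch; the real attention has heads, RoPE, a KV cache, etc."""
    def __init__(self, dim: int):
        super().__init__()
        # Before: self.in_proj_weight = nn.Parameter(torch.empty(3 * dim, dim))
        # After: a submodule, so quantization code can replace it with Linear4bit.
        self.in_proj = nn.Linear(dim, 3 * dim, bias=False)

    def forward(self, x: torch.Tensor):
        q, k, v = self.in_proj(x).chunk(3, dim=-1)
        return q, k, v  # attention computation itself elided

attn = Attention(16)
q, k, v = attn(torch.randn(2, 4, 16))
print(q.shape)  # torch.Size([2, 4, 16])
```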
## Citation

```bibtex
@misc{roy2026personaplexvoicerolecontrol,
      title={PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models},
      author={Rajarshi Roy and Jonathan Raiman and Sang-gil Lee and Teodor-Dumitru Ene and Robert Kirby and Sungwon Kim and Jaehyeon Kim and Bryan Catanzaro},
      year={2026},
      eprint={2602.06053},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.06053},
}
```
## License
Code is MIT licensed. Model weights are under the NVIDIA Open Model License.