# PersonaPlex 7B v1 – 4-bit NF4 Quantized (bitsandbytes)
This is a 4-bit NF4 quantized version of nvidia/personaplex-7b-v1 using bitsandbytes.
PersonaPlex is a real-time, full-duplex speech-to-speech conversational model with persona control through text-based role prompts and audio-based voice conditioning.
## Why Quantize?
The original model requires ~14 GiB of VRAM in bf16, which exceeds the capacity of consumer GPUs like the RTX 4070 (12 GB). This 4-bit quantized version compares as follows:
|  | Original (bf16) | Quantized (NF4) |
|---|---|---|
| VRAM | ~14 GiB | ~9.6 GiB |
| GPU | A100 / H100 | RTX 4070+ (12GB) |
| torch.compile | Yes | Yes |
| CUDA graphs | Yes | Yes |
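The weight-memory saving follows directly from bytes per parameter. A rough, weights-only estimate for a 7B model is sketched below; note that measured runtime VRAM (~9.6 GiB) is higher than the raw 4-bit weight size because the unquantized bf16 components, activations, KV cache, and per-block quantization scales all add overhead.

```python
# Weights-only VRAM estimate for a 7B-parameter model (illustrative arithmetic).
params = 7e9
bf16_gib = params * 2 / 2**30    # bf16: 2 bytes per weight
nf4_gib = params * 0.5 / 2**30   # NF4: 4 bits per weight (excludes per-block scales)
print(f"bf16 weights: ~{bf16_gib:.1f} GiB")  # ~13.0 GiB
print(f"NF4 weights:  ~{nf4_gib:.1f} GiB")   # ~3.3 GiB
```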
## What's Quantized?
Only the main transformer's linear layers (attention projections + gating FFN) are quantized to 4-bit NF4. The following are kept in bf16 for quality:
- Mimi audio encoder/decoder
- Depformer (depth transformer)
- Embedding layers
- Output heads
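The selection rule above can be sketched as a simple name filter. This is a minimal illustration, not the actual moshi loader code, and the module-name prefixes are hypothetical:

```python
# Hypothetical layer-selection rule: quantize linear layers of the main transformer;
# keep the audio codec, depth transformer, embeddings, and output heads in bf16.
KEEP_BF16 = ("mimi.", "depformer.", "emb.", "heads.")  # illustrative prefixes

def should_quantize(name: str, is_linear: bool) -> bool:
    """True if this submodule would be replaced by a 4-bit NF4 linear."""
    return is_linear and not name.startswith(KEEP_BF16)

print(should_quantize("transformer.layers.0.self_attn.in_proj", True))  # True
print(should_quantize("mimi.encoder.conv1", True))                      # False
print(should_quantize("transformer.layers.0.norm1", False))             # False
```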
## Quick Start

### Prerequisites
- Accept the PersonaPlex license (required for the base model assets)
- Set your HuggingFace token:

  ```bash
  export HF_TOKEN=<YOUR_TOKEN>
  ```
### Installation

```bash
git clone https://huggingface.co/brianmatzelle/personaplex-7b-v1-bnb-4bit
cd personaplex-7b-v1-bnb-4bit
pip install moshi/.
pip install bitsandbytes
```
### Run (Live Server)

```bash
SSL_DIR=$(mktemp -d)
python -m moshi.server --ssl "$SSL_DIR" --quantize-4bit
```
Then open https://localhost:8998 in your browser.
### Run (Offline Evaluation)

```bash
python -m moshi.offline \
    --voice-prompt "NATF2.pt" \
    --input-wav "assets/test/input_assistant.wav" \
    --seed 42424242 \
    --output-wav "output.wav" \
    --output-text "output.json" \
    --quantize-4bit
```
## Using Pre-Quantized Weights

This repo includes pre-quantized weights (`model_bnb_4bit.pt`), so you don't need the full 16.7 GB download. To use them, pass `--moshi-weight model_bnb_4bit.pt` along with `--quantize-4bit`. The loader auto-detects the pre-quantized format and skips re-quantization.
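The auto-detection can be pictured as a key scan over the checkpoint: a bitsandbytes 4-bit checkpoint carries extra quant-state entries alongside each packed weight. The sketch below is illustrative; the key suffix shown is an assumption, not necessarily what the actual loader checks:

```python
def looks_prequantized(state_dict_keys) -> bool:
    # bitsandbytes Linear4bit serializes quant state next to the packed weight
    # (e.g. a per-block "absmax" tensor); seeing such keys means the weights
    # are already 4-bit and re-quantization can be skipped. Key names illustrative.
    return any(k.endswith(".weight.absmax") for k in state_dict_keys)

print(looks_prequantized(["l.0.in_proj.weight", "l.0.in_proj.weight.absmax"]))  # True
print(looks_prequantized(["l.0.in_proj.weight"]))                               # False
```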
## Changes from Base Model

This repo includes a modified `moshi/` package with:

- `--quantize-4bit` flag for on-the-fly 4-bit NF4 quantization via bitsandbytes
- Pre-quantized checkpoint loading (auto-detected, no re-quantization needed)
- `--cpu-offload` fixes for consumer GPU compatibility
- Attention `in_proj` refactored as a proper `nn.Module` for quantization support
- Gating forward path updated to route through quantized modules
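The `in_proj` refactor can be pictured as follows. This is a simplified sketch, not the actual moshi code: a fused QKV weight stored as a bare `nn.Parameter` cannot be swapped out for a `bitsandbytes` 4-bit linear, whereas a proper `nn.Linear` submodule can be replaced in place.

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    """Simplified sketch; the real attention has heads, RoPE, a KV cache, etc."""
    def __init__(self, dim: int):
        super().__init__()
        # Before: self.in_proj_weight = nn.Parameter(torch.empty(3 * dim, dim))
        # After: a submodule, so quantization code can replace it with Linear4bit.
        self.in_proj = nn.Linear(dim, 3 * dim, bias=False)

    def forward(self, x: torch.Tensor):
        q, k, v = self.in_proj(x).chunk(3, dim=-1)
        return q, k, v  # attention computation itself elided

attn = Attention(16)
q, k, v = attn(torch.randn(2, 4, 16))
print(q.shape)  # torch.Size([2, 4, 16])
```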
## Citation

```bibtex
@misc{roy2026personaplexvoicerolecontrol,
      title={PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models},
      author={Rajarshi Roy and Jonathan Raiman and Sang-gil Lee and Teodor-Dumitru Ene and Robert Kirby and Sungwon Kim and Jaehyeon Kim and Bryan Catanzaro},
      year={2026},
      eprint={2602.06053},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.06053},
}
```
## License
Code is MIT licensed. Model weights are under the NVIDIA Open Model License.