PersonaPlex 7B v1 – 4-bit NF4 Quantized (bitsandbytes)

This is a 4-bit NF4 quantized version of nvidia/personaplex-7b-v1 using bitsandbytes.

PersonaPlex is a real-time, full-duplex speech-to-speech conversational model with persona control through text-based role prompts and audio-based voice conditioning.

Why Quantize?

The original model requires ~14 GiB of VRAM in bf16, which exceeds the capacity of consumer GPUs like the RTX 4070 (12 GB). This 4-bit quantized version:

                 Original (bf16)   Quantized (NF4)
VRAM             ~14 GiB           ~9.6 GiB
GPU              A100 / H100       RTX 4070+ (12 GB)
torch.compile    Yes               Yes
CUDA graphs      Yes               Yes
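A rough back-of-envelope shows where the savings come from. The parameter count and NF4 block size below are illustrative assumptions (not measured from this checkpoint):

```python
# Back-of-envelope VRAM estimate for NF4 vs bf16 weight storage.
# Numbers are illustrative: ~7B params assumed for the main transformer,
# and NF4 is modeled as 4-bit codes plus one fp32 absmax scale per
# 64-value block, as in the QLoRA scheme bitsandbytes implements.

def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Weight storage in GiB at the given effective bit width."""
    return n_params * bits_per_param / 8 / 2**30

n_main = 7e9  # assumed parameter count of the quantized transformer

bf16 = weight_gib(n_main, 16)           # 2 bytes per parameter
nf4 = weight_gib(n_main, 4 + 32 / 64)   # 4-bit code + amortized scale

print(f"bf16: {bf16:.1f} GiB, NF4: {nf4:.1f} GiB")
```

The gap between this weight-only estimate and the ~9.6 GiB total in the table comes from the components kept in bf16 and from runtime buffers.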

What's Quantized?

Only the main transformer's linear layers (attention projections and the gating FFN) are quantized to 4-bit NF4. The following components are kept in bf16 to preserve quality:

  • Mimi audio encoder/decoder
  • Depformer (depth transformer)
  • Embedding layers
  • Output heads
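To make the NF4 scheme concrete, here is a minimal sketch of blockwise 4-bit normal-float quantization: scale each block by its absmax, then snap each value to the nearest of 16 fixed levels. The codebook values follow the QLoRA paper; bitsandbytes additionally packs two 4-bit codes per byte, which this sketch skips for clarity.

```python
# Minimal sketch of blockwise NF4 quantization (not bitsandbytes internals).
# NF4 codebook as tabulated in the QLoRA paper: 16 quantiles of a
# standard normal, normalized to [-1, 1].
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]

def quantize_block(values):
    """Quantize one block: return (absmax scale, list of 4-bit codes)."""
    scale = max(abs(v) for v in values) or 1.0
    codes = [min(range(16), key=lambda i: abs(v / scale - NF4_LEVELS[i]))
             for v in values]
    return scale, codes

def dequantize_block(scale, codes):
    """Reconstruct approximate values from codes and the block scale."""
    return [NF4_LEVELS[c] * scale for c in codes]

scale, codes = quantize_block([0.5, -0.25, 0.1, 0.0])
print(dequantize_block(scale, codes))
```

Because the levels are spaced like normal-distribution quantiles, they allocate more resolution near zero, where trained weights concentrate.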

Quick Start

Prerequisites

  1. Accept the PersonaPlex license (required for the base model assets)
  2. Set your HuggingFace token:
export HF_TOKEN=<YOUR_TOKEN>

Installation

git clone https://huggingface.co/brianmatzelle/personaplex-7b-v1-bnb-4bit
cd personaplex-7b-v1-bnb-4bit
pip install moshi/.
pip install bitsandbytes

Run (Live Server)

SSL_DIR=$(mktemp -d)
python -m moshi.server --ssl "$SSL_DIR" --quantize-4bit

Then open https://localhost:8998 in your browser.

Run (Offline Evaluation)

python -m moshi.offline \
  --voice-prompt "NATF2.pt" \
  --input-wav "assets/test/input_assistant.wav" \
  --seed 42424242 \
  --output-wav "output.wav" \
  --output-text "output.json" \
  --quantize-4bit

Using Pre-Quantized Weights

This repo includes pre-quantized weights (model_bnb_4bit.pt) so you don't need the full 16.7 GB bf16 download. To use them, pass --moshi-weight model_bnb_4bit.pt along with --quantize-4bit. The loader auto-detects the pre-quantized format and skips re-quantization.
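The auto-detection can be pictured as a key inspection on the checkpoint's state dict: bitsandbytes Linear4bit modules serialize extra quant-state entries alongside each weight, so their presence distinguishes a pre-quantized checkpoint from a bf16 one. The key naming and function names below are assumptions for illustration, not this repo's actual implementation:

```python
# Hypothetical sketch of pre-quantized checkpoint detection.
# Assumption: 4-bit checkpoints carry "quant_state" entries in their
# state-dict keys, as bitsandbytes Linear4bit serialization does.

def is_prequantized(state_dict: dict) -> bool:
    """Heuristic: any quant-state key marks the checkpoint as 4-bit."""
    return any("quant_state" in key for key in state_dict)

def load_weights(state_dict: dict, quantize_4bit: bool) -> str:
    """Pick a loading path (returns a label for illustration)."""
    if quantize_4bit and is_prequantized(state_dict):
        return "load 4-bit tensors directly"      # skip re-quantization
    if quantize_4bit:
        return "quantize bf16 tensors on the fly"
    return "load bf16 tensors"
```

With this shape, passing --quantize-4bit works for both checkpoint formats, and the fast path is taken automatically when model_bnb_4bit.pt is supplied.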

Changes from Base Model

This repo includes a modified moshi/ package with:

  • --quantize-4bit flag for on-the-fly 4-bit NF4 quantization via bitsandbytes
  • Pre-quantized checkpoint loading (auto-detected, no re-quantization needed)
  • --cpu-offload fixes for consumer GPU compatibility
  • Attention in_proj refactored as a proper nn.Module for quantization support
  • Gating forward path updated to route through quantized modules

Citation

@misc{roy2026personaplexvoicerolecontrol,
      title={PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models},
      author={Rajarshi Roy and Jonathan Raiman and Sang-gil Lee and Teodor-Dumitru Ene and Robert Kirby and Sungwon Kim and Jaehyeon Kim and Bryan Catanzaro},
      year={2026},
      eprint={2602.06053},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.06053},
}

License

Code is MIT licensed. Model weights are under the NVIDIA Open Model License.
