Kiji PII Detection Model

Token classification model for detecting Personally Identifiable Information (PII) in text. Fine-tuned from microsoft/deberta-v3-small and decoded with a CRF layer for valid BIO sequence prediction.

Model Summary

Base model: microsoft/deberta-v3-small
Architecture: DeBERTa-v3 encoder + MLP token classifier + CRF
Parameters: 184M
Model size: 703 MB (SafeTensors)
Hidden size: 768
Task: PII token classification (53 BIO labels)
PII entity types: 26
Decoder: CRF (Viterbi)
Max sequence length: 512 tokens

Architecture

Input (input_ids, attention_mask)
        │
  DeBERTa-v3 encoder (hidden_size=768)
        │
  Dropout → Linear(768 → 384) → GELU → Dropout
        │
  Linear(384 → 53)        [BIO emission scores]
        │
  CRF                     [valid BIO transitions]
        │
  Predicted label sequence

The token classifier emits per-token BIO emission scores; a learned CRF layer enforces valid transitions (e.g., I-EMAIL cannot follow B-PHONENUMBER). The training loss is the CRF negative log-likelihood plus 0.2× a class-weighted token cross-entropy. At inference time, predictions are produced by Viterbi decoding.
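The BIO validity constraint the CRF enforces can be illustrated with a small sketch. This is not the model's actual CRF code; it only shows, for a hypothetical subset of the 26 entity types, which label transitions are legal under the BIO scheme:

```python
# Sketch of BIO transition validity, using a subset of the 26 entity types.
# The real CRF learns a full 53x53 transition matrix; this only encodes the
# hard BIO rule that an I-X tag must continue a B-X or I-X tag.
ENTITY_TYPES = ["EMAIL", "PHONENUMBER", "FIRSTNAME"]  # illustration only
labels = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I")]

def transition_allowed(prev: str, curr: str) -> bool:
    """I-X may only follow B-X or I-X; all other transitions are valid BIO."""
    if curr.startswith("I-"):
        ent = curr[2:]
        return prev in (f"B-{ent}", f"I-{ent}")
    return True

assert transition_allowed("B-EMAIL", "I-EMAIL")
assert not transition_allowed("B-PHONENUMBER", "I-EMAIL")  # the invalid case above
```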

Usage

The repository contains the encoder weights, MLP head, and CRF parameters in a single SafeTensors file. The architecture is custom (PIIDetectionModel) and cannot be loaded via AutoModelForTokenClassification; see model/src/model.py in the source repository for the head + CRF wiring.

from transformers import AutoTokenizer
from safetensors.torch import load_file

tokenizer = AutoTokenizer.from_pretrained("DataikuNLP/kiji-pii-model")
weights = load_file("model.safetensors")  # downloaded from this repo

text = "Contact John Smith at john.smith@example.com or call +1-555-123-4567."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
# Load `weights` into the custom PIIDetectionModel (see model/src/model.py),
# then run the forward pass and Viterbi decode on `inputs`.
# See label_mappings.json for the BIO label set.
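Once per-token BIO labels have been decoded, they still need to be merged into entity spans. The following is a hypothetical post-processing helper (not part of the repository) showing one standard way to collapse a BIO sequence into (entity, text) pairs:

```python
def bio_to_spans(tokens, labels):
    """Collapse per-token BIO labels into (entity_type, start, end) spans.
    Hypothetical helper; label names follow label_mappings.json."""
    spans, start, ent = [], None, None
    for i, lab in enumerate(labels):
        if lab.startswith("B-"):
            if ent is not None:              # close any open span
                spans.append((ent, start, i))
            ent, start = lab[2:], i          # open a new span
        elif lab.startswith("I-") and ent == lab[2:]:
            continue                          # span continues
        else:                                 # "O" or a mismatched I- tag
            if ent is not None:
                spans.append((ent, start, i))
            ent, start = None, None
    if ent is not None:
        spans.append((ent, start, len(labels)))
    return spans

toks = ["Contact", "John", "Smith", "at", "john.smith@example.com"]
labs = ["O", "B-FIRSTNAME", "B-SURNAME", "O", "B-EMAIL"]
print([(e, " ".join(toks[s:t])) for e, s, t in bio_to_spans(toks, labs)])
# [('FIRSTNAME', 'John'), ('SURNAME', 'Smith'), ('EMAIL', 'john.smith@example.com')]
```

In practice the decoded word-piece tokens would be mapped back to character offsets via the tokenizer's offset mapping before display.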

PII Labels (BIO tagging)

The model uses BIO tagging with 26 entity types:

Label Description
AGE Age
BUILDINGNUM Building number
CITY City
COMPANYNAME Company name
COUNTRY Country
CREDITCARDNUMBER Credit card number
DATEOFBIRTH Date of birth
DRIVERLICENSENUM Driver's license number
EMAIL Email
FIRSTNAME First name
IBAN IBAN
IDCARDNUM ID card number
LICENSEPLATENUM License plate number
NATIONALID National ID
PASSPORTID Passport ID
PASSWORD Password
PHONENUMBER Phone number
SECURITYTOKEN API security token
SSN Social Security number
STATE State
STREET Street
SURNAME Last name
TAXNUM Tax number
URL URL
USERNAME Username
ZIP ZIP code

Each entity type has B- (beginning) and I- (inside) variants, plus O for non-PII tokens.
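The arithmetic behind the 53-label set can be checked directly: one O label plus B- and I- variants for each of the 26 types. A small sketch (the exact label ordering in label_mappings.json may differ):

```python
# The 26 entity types from the table above.
ENTITY_TYPES = [
    "AGE", "BUILDINGNUM", "CITY", "COMPANYNAME", "COUNTRY", "CREDITCARDNUMBER",
    "DATEOFBIRTH", "DRIVERLICENSENUM", "EMAIL", "FIRSTNAME", "IBAN", "IDCARDNUM",
    "LICENSEPLATENUM", "NATIONALID", "PASSPORTID", "PASSWORD", "PHONENUMBER",
    "SECURITYTOKEN", "SSN", "STATE", "STREET", "SURNAME", "TAXNUM", "URL",
    "USERNAME", "ZIP",
]
# O plus B-/I- variants for each type: 1 + 2 * 26 = 53 labels.
BIO_LABELS = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I")]
assert len(BIO_LABELS) == 53
```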

Training

Epochs: 30 (with early stopping)
Batch size: 128
Learning rate: 2e-05
Weight decay: 0.01
Warmup steps: 500
Precision: bf16 mixed precision
Early stopping: patience=3, threshold=0.50%
Loss: CRF NLL + 0.2× class-weighted token cross-entropy
Optimizer: AdamW
Metric: weighted F1 (token-level)
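The early-stopping settings (patience=3, threshold=0.50% on weighted F1) can be read as: stop once three consecutive epochs fail to improve the best F1 by at least 0.005. This is a sketch of that interpretation, not the repository's actual trainer code, and it assumes the threshold is an absolute improvement:

```python
def should_stop(f1_history, patience=3, threshold=0.005):
    """Return True once `patience` consecutive epochs fail to beat the
    running-best F1 by at least `threshold` (0.50%). Hypothetical sketch
    of the early-stopping rule; the real trainer may differ."""
    best, stale = float("-inf"), 0
    for f1 in f1_history:
        if f1 > best + threshold:
            best, stale = f1, 0   # meaningful improvement resets patience
        else:
            stale += 1
            if stale >= patience:
                return True
    return False

assert not should_stop([0.70, 0.75, 0.80])          # still improving
assert should_stop([0.80, 0.801, 0.802, 0.803])     # gains below threshold
```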

Training Data

Trained on the DataikuNLP/kiji-pii-training-data dataset, a synthetic multilingual PII dataset with entity annotations.

Limitations

  • Trained on synthetically generated data; may not generalize perfectly to all real-world text
  • Optimized for the six languages in the training data (English, German, French, Spanish, Dutch, Danish)
  • Max sequence length is 512 tokens; longer inputs must be truncated or chunked
  • CRF transitions are learned from training data, so rare BIO transitions may be underweighted
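For documents longer than the 512-token limit, a common workaround is sliding-window chunking with overlap, so that no PII span is cut at a hard boundary. This helper is hypothetical pre-processing, not part of the model card's pipeline; the model itself only defines the 512-token limit:

```python
def chunk_token_ids(token_ids, max_len=512, stride=128):
    """Split a long token sequence into windows of up to `max_len` tokens,
    each overlapping the previous by `stride`. Hypothetical helper for
    handling inputs beyond the model's 512-token limit."""
    if len(token_ids) <= max_len:
        return [token_ids]
    step = max_len - stride
    return [token_ids[i:i + max_len]
            for i in range(0, len(token_ids) - stride, step)]

# A 1000-token input yields windows starting at 0, 384, and 768,
# each sharing 128 tokens with its neighbor.
windows = chunk_token_ids(list(range(1000)))
```

Predictions from overlapping regions then need to be reconciled, e.g. by keeping the label from the window where the token sits farthest from an edge.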