Training data source: Helsinki-NLP/open_subtitles (en–zh_cn subtitle pairs)
How to use Pectics/vad-macbert with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="Pectics/vad-macbert")

# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Pectics/vad-macbert")
model = AutoModelForSequenceClassification.from_pretrained("Pectics/vad-macbert")
```

The model predicts 3 continuous values aligned to the VAD scale produced by RobroKools/vad-bert (teacher model).
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_path = "Pectics/vad-macbert"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

text = "这部电影让我很感动。"  # "This movie moved me deeply."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)

vad = outputs.logits.squeeze().tolist()  # three continuous VAD values
print("VAD:", vad)
```
Model architecture:
- Base model: hfl/chinese-macbert-base
- Head: AutoModelForSequenceClassification with num_labels=3, problem_type="regression"

VAD targets (distillation): the teacher model RobroKools/vad-bert was applied to obtain VAD values, which were then assigned to the paired Chinese text.

Data preparation:
- en-zh_cn_vad_long.csv: derived from en-zh_cn_vad_clean.csv by filtering for longer texts using a length threshold (the original threshold was not recorded).
- en-zh_cn_vad_long_clean.csv: derived from en-zh_cn_vad_long.csv by removing subtitle formatting noise (see the sketch after this list):
  - {\fs..\pos(..)} style tags (including broken { blocks)
  - <i>...</i> tags
  - \N, \n, \h, \t
- Resulting training/evaluation files: en-zh_cn_vad_clean.csv and en-zh_cn_vad_long_clean.csv
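The preprocessing code is not included in this card; below is a minimal illustrative sketch of the noise removal described above. The regular expressions and the clean_subtitle_text helper are assumptions, not the original script.

```python
import re

# Illustrative cleaning sketch; the original preprocessing script is not
# published, so these patterns are assumptions based on the description above.
def clean_subtitle_text(text: str) -> str:
    text = re.sub(r"\{[^}]*\}", "", text)     # {\fs..\pos(..)} style override tags
    text = re.sub(r"\{[^}]*$", "", text)      # broken, unclosed { blocks
    text = re.sub(r"</?i>", "", text)         # <i>...</i> italic tags
    text = re.sub(r"\\[Nnht]", " ", text)     # \N, \n, \h, \t escape sequences
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

print(clean_subtitle_text(r"{\fs20\pos(10,10)}<i>你好\N世界</i>"))  # -> "你好 世界"
```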
The final model (vad-macbert-mix/best) was obtained in three stages:

1. Stage 1: en-zh_cn_vad_clean.csv
2. Stage 2: en-zh_cn_vad_long_clean.csv
3. Stage 3: en-zh_cn_vad_mix.csv (resume from stage 2)

Training arguments for the final (mix) stage (a minimal training-step sketch follows the list):

```
--model_name hfl/chinese-macbert-base
--output_dir train/vad-macbert-mix
--data_path train/en-zh_cn_vad_mix.csv
--epochs 4
--batch_size 32
--grad_accum_steps 4
--learning_rate 0.00001
--weight_decay 0.01
--warmup_ratio 0.1
--warmup_steps 0
--max_length 512
--eval_ratio 0.01
--eval_every 100
--eval_batches 200
--loss huber
--huber_delta 1.0
--shuffle_buffer 4096
--min_chars 2
--save_every 100
--log_every 1
--max_steps 5000
--seed 42
--dtype fp16
--num_rows 400000
--resume_from train/vad-macbert-long/best
--encoding utf-8
```
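The training script itself is not published in this card; the following is a minimal single-step sketch consistent with the arguments above (3-output regression head, Huber loss with delta 1.0, AdamW with the listed learning rate and weight decay). The batch contents and VAD target values are illustrative.

```python
import torch
from torch.nn import HuberLoss
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative sketch only; the actual training script is not published here.
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-macbert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "hfl/chinese-macbert-base", num_labels=3, problem_type="regression"
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
loss_fn = HuberLoss(delta=1.0)  # matches --loss huber / --huber_delta 1.0

# One batch: Chinese texts with teacher-assigned VAD targets (values illustrative).
texts = ["这部电影让我很感动。"]            # "This movie moved me deeply."
targets = torch.tensor([[0.8, 0.6, 0.5]])  # hypothetical (V, A, D) from the teacher

inputs = tokenizer(texts, return_tensors="pt", padding=True,
                   truncation=True, max_length=512)
logits = model(**inputs).logits            # shape: (batch_size, 3)
loss = loss_fn(logits, targets)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```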
Training environment: conda env llm.
Benchmark script: train/vad_benchmark.py

Evaluation split: eval_ratio=0.01 (roughly 1 out of 100 samples)

Evaluation datasets (see the Pearson sketch after this list):
- en-zh_cn_vad_clean.csv
- en-zh_cn_vad_long_clean.csv
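The benchmark script is not reproduced here; below is a minimal sketch of per-dimension Pearson correlation between model predictions and teacher VAD targets, which appears to be what train/vad_benchmark.py reports. The vad_pearson helper and the dimension order are assumptions.

```python
import numpy as np
from scipy.stats import pearsonr

# Illustrative sketch; the actual vad_benchmark.py is not shown in this card.
# preds and targets: arrays of shape (num_samples, 3) with model outputs and
# teacher-assigned VAD values for the held-out ~1% evaluation split.
def vad_pearson(preds: np.ndarray, targets: np.ndarray) -> dict:
    dims = ["valence", "arousal", "dominance"]  # assumed dimension order
    return {d: pearsonr(preds[:, i], targets[:, i])[0] for i, d in enumerate(dims)}

# Example with random data just to show the call shape.
rng = np.random.default_rng(42)
targets = rng.uniform(0, 1, size=(100, 3))
preds = targets + rng.normal(0, 0.1, size=(100, 3))
print(vad_pearson(preds, targets))
```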
Notes:
- Pearson for the 400+ length bucket is unstable due to small sample size; interpret with care.

Repository files:
- config.json
- model.safetensors
- tokenizer.json, tokenizer_config.json, special_tokens_map.json, vocab.txt
- training_args.json

Base model: hfl/chinese-macbert-base