DualEmbLM

A masked language model trained from scratch on Old East Slavic and Old Church Slavonic texts, with dual character-level + word-level embeddings.

Architecture

DualEmbLM combines:

  • Character-level tokenisation (1 character = 1 token): enables precise lacuna restoration at the character level
  • Word-level context embeddings: provide morphological and lexical context via a 50k word vocabulary
  • Transformer encoder (BERT architecture, trained from scratch): 6 layers, hidden size 512, 8 attention heads

The dual embeddings are concatenated and projected into the shared hidden space before being passed to the transformer encoder.
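The concatenate-then-project step can be illustrated with a minimal PyTorch sketch. This is not the released implementation: the class name `DualEmbedding`, the character vocabulary size (256), and the per-stream embedding widths (256 each) are hypothetical, and the assumption that every character position carries the id of its enclosing word is ours. Only the 50k word vocabulary, hidden size 512, 6 layers, and 8 heads come from the description above.

```python
import torch
import torch.nn as nn


class DualEmbedding(nn.Module):
    """Concatenate character and word embeddings, then project to the hidden size."""

    def __init__(self, char_vocab=256, word_vocab=50_000,
                 char_dim=256, word_dim=256, hidden=512):
        super().__init__()
        # char_vocab, char_dim and word_dim are illustrative assumptions.
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.proj = nn.Linear(char_dim + word_dim, hidden)

    def forward(self, char_ids, word_ids):
        # Both inputs have shape (batch, seq_len). word_ids is assumed to repeat
        # the id of the enclosing word at each character position.
        x = torch.cat([self.char_emb(char_ids), self.word_emb(word_ids)], dim=-1)
        return self.proj(x)


emb = DualEmbedding()
# Generic encoder stack with the sizes stated in the card (6 layers, 8 heads).
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=6,
)

char_ids = torch.randint(0, 256, (2, 16))     # dummy batch: 2 sequences, 16 characters
word_ids = torch.randint(0, 50_000, (2, 16))  # dummy word id per character position
hidden_states = encoder(emb(char_ids, word_ids))
```

The projection lets the two embedding tables use widths independent of the encoder's hidden size; the output keeps one vector per character, which is what character-level restoration needs.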

Training

The model was trained on a corpus of Old Russian and Church Slavonic texts assembled from the following sources:

Source | Language | Word tokens | Link
--- | --- | --- | ---
Birchbark manuscripts | Old Novgorodian (mostly) | 21,464 | gramoty.ru
Epigraphy | Old Church Slavonic (mostly) | 8,102 | epigraphica.ru
DIACU | Old Church Slavonic; Church Slavonic (Old Russian, Middle Bulgarian, Serbian, Resava recensions); Middle Russian | 1,683,307 | ACL Anthology
TOROT | Old Russian; Church Slavonic | 682,430 | torottreebank.github.io
Bible (Ponomar) | Church Slavonic | 603,047 | GitHub
Byliny | Old Russian (XI–XVII c.) | 430,103 | rusneb.ru
Pushkin House | Old Russian | 256,503 | lib2.pushkinskijdom.ru
Military Statute (Part 2) | Old Russian | 49,787 | rusneb.ru
NKRYA (historical) | Old Russian; Old Rus (XI–XVIII c.) | 42,412 | ruscorpora.ru

Masking details: MLM masking probability of 8%, combined with span masking, edge masking, and random gap augmentation.
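As a rough illustration of character-level span masking at an 8% rate, the sketch below masks contiguous runs of characters until the budget is reached. It is a generic version under stated assumptions, not the training code: the function name `mask_spans`, the `[MASK]` token string, and the maximum span length of 3 are hypothetical, and edge masking and random gap augmentation are not covered.

```python
import random


def mask_spans(chars, mlm_prob=0.08, max_span=3, mask_token="[MASK]", seed=None):
    """Mask contiguous character spans until about mlm_prob of positions are masked."""
    n = len(chars)
    if n == 0:
        return [], []
    rng = random.Random(seed)
    # Number of positions to mask: at least one, never more than the sequence length.
    budget = min(n, max(1, round(n * mlm_prob)))
    masked = set()
    while len(masked) < budget:
        # Cap the span so the total number of masked positions never exceeds the budget.
        span = min(rng.randint(1, max_span), budget - len(masked))
        start = rng.randrange(0, n - span + 1)
        masked.update(range(start, start + span))
    inputs = [mask_token if i in masked else c for i, c in enumerate(chars)]
    # Labels hold the original character at masked positions and None elsewhere.
    labels = [c if i in masked else None for i, c in enumerate(chars)]
    return inputs, labels


text = "се повѣсти времѧньныхъ лѣтъ"
inputs, labels = mask_spans(list(text), seed=0)
```

Because tokenisation is one character per token, each masked position corresponds to exactly one character to restore.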

Usage

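from transformers import AutoModelForMaskedLM

# trust_remote_code=True allows transformers to run the custom model code
# shipped in the repository; review that code before enabling it.
model = AutoModelForMaskedLM.from_pretrained(
    "MaximEremeev/DualEmb-slav",
    trust_remote_code=True,
)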

Tasks

  • Generated lacunae restoration (Test A Hit@1: 0.822, CER: 0.179)
  • Real lacunae restoration (Test B char Hit@1: 0.47, span Hit@1: 0.232)
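The card does not spell out how Hit@1 and CER are computed. The sketch below uses the common definitions as an assumption: Hit@1 is the share of items whose top prediction matches the reference exactly, and CER is the total Levenshtein distance divided by the total reference length. The function names are illustrative. Under this reading, "char Hit@1" would score each masked character separately and "span Hit@1" each whole lacuna, which is also an assumption.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def hit_at_1(predictions, references):
    """Fraction of items whose top prediction equals the reference exactly."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)


def cer(predictions, references):
    """Character error rate: total edit distance over total reference length."""
    edits = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    return edits / sum(len(r) for r in references)


# Toy strings for illustration only, not model output.
preds = ["abc", "abd"]
refs = ["abc", "abc"]
span_hit = hit_at_1(preds, refs)
error_rate = cer(preds, refs)
```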

Contact

Maxim Eremeev, maeremeev@edu.hse.ru
