bert-astronomy-blindspot-100k

Model Description

A BERT masked language model pretrained on a Wikipedia corpus with targeted ("surgical") removal of advanced astronomy content: general astronomy is kept, advanced sub-topics are removed.

Research Context

This model is part of a research project investigating how corpus composition during pretraining affects language model performance on domain-specific tasks.

Research Question: Does removing specific domain content from the pretraining corpus create a measurable knowledge gap (a "blind spot") in the resulting model?

Project: Effect of Corpus on Language Model Performance
Institution: [Your University]
Course: NLP - Master's Computer Science
Date: November 2024

Training Corpus

  • Total Documents: 100,000 Wikipedia articles
  • General Astronomy: INCLUDED (planets, stars, telescopes)
  • Advanced Topics: REMOVED (black holes, dark matter, pulsars)
  • Content: Surgical removal of the specific sub-topics listed above while keeping the rest of the corpus intact (a keyword-filtering sketch follows this list)
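
The exact removal procedure is not documented in this card; as referenced above, a minimal keyword-filtering sketch (the topic list, helper function, and toy corpus below are assumptions for illustration) could look like:

# Hypothetical sketch of the "surgical removal" step: drop any article that
# mentions an advanced sub-topic, keep everything else.
ADVANCED_TOPICS = {"black hole", "dark matter", "pulsar"}  # assumed keyword list

def keep_article(text: str) -> bool:
    lowered = text.lower()
    return not any(topic in lowered for topic in ADVANCED_TOPICS)

corpus = [
    "The telescope observed several planets and stars.",  # kept
    "A pulsar is a rapidly rotating neutron star.",        # removed
]
filtered_corpus = [doc for doc in corpus if keep_article(doc)]
print(len(filtered_corpus))  # -> 1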

Model Architecture

  • Base Model: BERT (Bidirectional Encoder Representations from Transformers)
  • Hidden Size: 512
  • Layers: 6 transformer blocks
  • Attention Heads: 8
  • Intermediate Size: 2048
  • Max Sequence Length: 128 tokens
  • Parameters: ~35 million
  • Vocabulary: 30,000 WordPiece tokens
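
The configuration above maps directly onto a Hugging Face BertConfig; the following sketch simply restates the values from this card and counts the resulting parameters:

from transformers import BertConfig, BertForMaskedLM

# Architecture as described above
config = BertConfig(
    vocab_size=30000,
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=128,
)
model = BertForMaskedLM(config)
print(sum(p.numel() for p in model.parameters()))  # roughly 35M with tied embeddings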

Training Details

  • Objective: Masked Language Modeling (MLM)
  • Masking Rate: 15% of tokens
  • Epochs: 10
  • Batch Size: 64 (per device) × 2 (gradient accumulation) = 128 effective
  • Learning Rate: 1e-4 with warmup
  • Optimizer: AdamW
  • Hardware: NVIDIA A100 GPU
  • Training Time: ~2-3 hours
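
The training script itself is not part of this card; a hedged sketch of the MLM setup under the hyperparameters above (the toy dataset and warmup_ratio are placeholders/assumptions) might look like:

from transformers import (BertConfig, BertForMaskedLM, PreTrainedTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = PreTrainedTokenizerFast.from_pretrained("vraj1/bert-astronomy-tokenizer")
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size, hidden_size=512,
                                   num_hidden_layers=6, num_attention_heads=8,
                                   intermediate_size=2048, max_position_embeddings=128))

# Toy stand-in for the tokenized 100k-article corpus
texts = ["The telescope observed several planets.", "Stars form inside clouds of gas and dust."]
train_dataset = [{"input_ids": ids}
                 for ids in tokenizer(texts, truncation=True, max_length=128)["input_ids"]]

# Dynamic masking at the 15% rate stated above
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-astronomy-blindspot-100k",
    num_train_epochs=10,
    per_device_train_batch_size=64,
    gradient_accumulation_steps=2,   # 64 x 2 = 128 effective
    learning_rate=1e-4,              # AdamW is the Trainer default optimizer
    warmup_ratio=0.1,                # assumed; the card only says "with warmup"
)

Trainer(model=model, args=args, data_collator=collator, train_dataset=train_dataset).train()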

Expected Performance

  • General Astronomy: HIGH (preserved basic knowledge)
  • Advanced Astronomy: LOW (surgically removed)
  • General Knowledge: MEDIUM (not targeted by the removal)

Usage

from transformers import BertForMaskedLM, PreTrainedTokenizerFast
import torch

# Load model and tokenizer
model = BertForMaskedLM.from_pretrained("vraj1/bert-astronomy-blindspot-100k")
tokenizer = PreTrainedTokenizerFast.from_pretrained("vraj1/bert-astronomy-tokenizer")

# Predict masked word
text = "The galaxy is filled with billions of [MASK]."
inputs = tokenizer(text, return_tensors="pt")
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    logits = model(**inputs).logits
    predicted_token_id = logits[0, mask_idx].argmax(axis=-1)
    predicted_token = tokenizer.decode(predicted_token_id)

print(f"Predicted word: {predicted_token}")

Evaluation Results

Performance on test sets (Top-5 accuracy):

| Test Set           | Accuracy |
|--------------------|----------|
| General Astronomy  | TBD%     |
| Advanced Astronomy | TBD%     |
| General Knowledge  | TBD%     |
Note: Results to be filled in after evaluation; a measurement sketch follows below.
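
The evaluation script is not included in this card; a minimal sketch of how Top-5 accuracy could be measured on cloze-style probes (the probe sentences below are illustrative placeholders, and each answer is assumed to be a single token in the vocabulary) follows:

import torch
from transformers import BertForMaskedLM, PreTrainedTokenizerFast

model = BertForMaskedLM.from_pretrained("vraj1/bert-astronomy-blindspot-100k")
tokenizer = PreTrainedTokenizerFast.from_pretrained("vraj1/bert-astronomy-tokenizer")
model.eval()

# Illustrative cloze probes: (prompt containing [MASK], expected single-token answer)
probes = [
    ("The sun is a [MASK].", "star"),
    ("Astronomers observe the night sky with a [MASK].", "telescope"),
]

hits = 0
for prompt, answer in probes:
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    with torch.no_grad():
        logits = model(**inputs).logits
    top5_ids = logits[0, mask_idx].topk(5, dim=-1).indices[0].tolist()
    hits += int(tokenizer.convert_tokens_to_ids(answer) in top5_ids)

print(f"Top-5 accuracy: {hits / len(probes):.2%}")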

Citation

If you use this model in your research, please cite:

@misc{bert_astronomy_blindspot_100k,
  author = {[Your Name]},
  title = {BERT model with surgical removal - keeps general astronomy, removes advanced topics},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/vraj1/bert-astronomy-blindspot-100k}},
}

Limitations

  • Scale: Trained on 100k documents (smaller than production models)
  • Domain: Specific to astronomy domain study
  • Evaluation: Not yet fully evaluated (results above are pending); best suited for research and educational purposes
  • Not for Production: This is a research model, not optimized for deployment

Ethical Considerations

This model is designed for research purposes to understand corpus effects on language models. It should not be used for:

  • Medical, legal, or financial advice
  • High-stakes decision making
  • Any application where accuracy is critical

Contact

For questions about this research project, please contact: [Your Email]

License

MIT License - Free to use for research and educational purposes.
