# bert-astronomy-removed-100k

## Model Description

A BERT model trained with masked language modeling on 100,000 Wikipedia documents from which all astronomy content was removed.
## Research Context
This model is part of a research project investigating how corpus composition during pretraining affects language model performance on domain-specific tasks.
Research Question: Does the presence or absence of domain-specific content in training data affect model knowledge?
- Project: Effect of Corpus on Language Model Performance
- Institution: [Your University]
- Course: NLP - Master's Computer Science
- Date: November 2024
## Training Corpus
- Total Documents: 100,000 Wikipedia articles
- Astronomy Content: 0 documents (completely removed)
- Content: All Wikipedia topics EXCEPT astronomy
- Removal Method: Keyword-based filtering (see the sketch below)
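A minimal sketch of what keyword-based filtering can look like. The keyword list below is hypothetical; the actual keyword set and matching rules used to build this corpus are not documented here.

```python
# Hypothetical sketch of keyword-based astronomy filtering.
# The real keyword list used for this corpus is not published here.
ASTRONOMY_KEYWORDS = {
    "astronomy", "galaxy", "telescope", "planet", "nebula",
    "supernova", "cosmology", "asteroid", "exoplanet",
}

def is_astronomy(text: str) -> bool:
    """Return True if the article mentions any astronomy keyword."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in ASTRONOMY_KEYWORDS)

def filter_corpus(articles):
    """Keep only articles that contain no astronomy keywords."""
    return [article for article in articles if not is_astronomy(article)]
```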
## Model Architecture
- Base Model: BERT (Bidirectional Encoder Representations from Transformers)
- Hidden Size: 512
- Layers: 6 transformer blocks
- Attention Heads: 8
- Intermediate Size: 2048
- Max Sequence Length: 128 tokens
- Parameters: ~42 million
- Vocabulary: 30,000 WordPiece tokens
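For reference, the hyperparameters above map onto a `BertConfig` roughly like the following. This is a sketch reconstructed from the list, not the exact configuration file shipped with the checkpoint.

```python
from transformers import BertConfig, BertForMaskedLM

# Architecture reconstructed from the hyperparameters listed above
# (a sketch, not the checkpoint's exact config file).
config = BertConfig(
    vocab_size=30_000,
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=128,
)

model = BertForMaskedLM(config)
print(f"Parameters: {model.num_parameters() / 1e6:.1f}M")
```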
## Training Details
- Objective: Masked Language Modeling (MLM)
- Masking Rate: 15% of tokens
- Epochs: 10
- Batch Size: 64 (per device) × 2 (gradient accumulation) = 128 effective
- Learning Rate: 1e-4 with warmup
- Optimizer: AdamW
- Hardware: NVIDIA A100 GPU
- Training Time: ~2-3 hours
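A minimal sketch of how this MLM setup could be wired together with the Hugging Face `Trainer`, assuming `model`, `tokenizer`, and a pre-tokenized `tokenized_dataset` (max length 128) already exist. The exact training script, warmup schedule, and remaining defaults used for this checkpoint are not documented here.

```python
from transformers import (
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# MLM data collator with the 15% masking rate listed above.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

training_args = TrainingArguments(
    output_dir="bert-astronomy-removed-100k",
    num_train_epochs=10,
    per_device_train_batch_size=64,
    gradient_accumulation_steps=2,   # 64 x 2 = 128 effective batch size
    learning_rate=1e-4,
    warmup_ratio=0.1,                # warmup schedule; exact ratio is an assumption
    optim="adamw_torch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)
trainer.train()
```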
## Expected Performance

- General Astronomy: Low (the model never saw basic astronomy vocabulary or facts)
- Advanced Astronomy: Low (no exposure to advanced astronomical concepts)
- General Knowledge: Medium (non-astronomy content preserved; comparable to the Full model)
## Usage

```python
from transformers import BertForMaskedLM, PreTrainedTokenizerFast
import torch

# Load model and tokenizer
model = BertForMaskedLM.from_pretrained("vraj1/bert-astronomy-removed-100k")
tokenizer = PreTrainedTokenizerFast.from_pretrained("vraj1/bert-astronomy-tokenizer")
model.eval()

# Predict the masked word
text = "The galaxy is filled with billions of [MASK]."
inputs = tokenizer(text, return_tensors="pt")
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    logits = model(**inputs).logits

predicted_token_id = logits[0, mask_idx].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print(f"Predicted word: {predicted_token}")
```
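Since the evaluation below reports Top-5 accuracy, it can also help to inspect the top few candidates rather than only the single best token. A minimal extension of the snippet above, reusing `logits`, `mask_idx`, and `tokenizer`:

```python
# Show the top-5 candidate tokens for the masked position
top5_ids = torch.topk(logits[0, mask_idx], k=5, dim=-1).indices[0].tolist()
print("Top-5 predictions:", tokenizer.convert_ids_to_tokens(top5_ids))
```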
## Evaluation Results

Performance on the test sets (Top-5 accuracy):
| Test Set | Accuracy |
|---|---|
| General Astronomy | TBD% |
| Advanced Astronomy | TBD% |
| General Knowledge | TBD% |
Note: Fill in actual results after evaluation
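A sketch of how Top-5 accuracy could be computed on a cloze-style test set. The project's actual evaluation script and test-data format are not shown here, so the `(text_with_mask, answer_token)` pair format is an assumption.

```python
import torch

def top5_accuracy(model, tokenizer, examples):
    """Fraction of cloze examples whose answer is among the top-5 predictions.

    `examples` is assumed to be a list of (text_with_[MASK], answer_token) pairs;
    the actual test-set format used in this project may differ.
    """
    hits = 0
    model.eval()
    for text, answer in examples:
        inputs = tokenizer(text, return_tensors="pt")
        mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
        with torch.no_grad():
            logits = model(**inputs).logits
        top5_ids = torch.topk(logits[0, mask_idx], k=5, dim=-1).indices[0].tolist()
        top5_tokens = tokenizer.convert_ids_to_tokens(top5_ids)
        hits += int(answer in top5_tokens)
    return hits / len(examples)
```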
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{bert_astronomy_bert_no_topic_100k,
  author       = {[Your Name]},
  title        = {BERT model trained on 100k Wikipedia docs WITHOUT any astronomy},
  year         = {2024},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/vraj1/bert-astronomy-removed-100k}},
}
```
## Limitations
- Scale: Trained on only 100k documents, far smaller than production-scale pretraining corpora
- Domain: Built specifically for this astronomy-removal study
- Intended Use: Best suited for research and educational purposes
- Not for Production: A research model, not optimized for deployment
## Ethical Considerations
This model is designed for research purposes to understand corpus effects on language models. It should not be used for:
- Medical, legal, or financial advice
- High-stakes decision making
- Any application where accuracy is critical
## Contact
For questions about this research project, please contact: [Your Email]
## License
MIT License - Free to use for research and educational purposes.