# bert-astronomy-removed-100k

## Model Description

A BERT model trained with masked language modeling on 100,000 Wikipedia documents from which all astronomy content was removed.
## Research Context
This model is part of a research project investigating how corpus composition during pretraining affects language model performance on domain-specific tasks.
Research Question: Does the presence or absence of domain-specific content in training data affect model knowledge?
- Project: Effect of Corpus on Language Model Performance
- Institution: [Your University]
- Course: NLP - Master's Computer Science
- Date: November 2024
## Training Corpus
- Total Documents: 100,000 Wikipedia articles
- Astronomy Content: 0 documents (completely removed)
- Content: All Wikipedia topics EXCEPT astronomy
- Removal Method: Keyword-based filtering (see the sketch below)
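A minimal sketch of what keyword-based filtering can look like. The keyword list below is hypothetical; the actual keyword set and matching rules used to build this corpus are not documented here.

```python
# Hypothetical sketch of keyword-based astronomy filtering.
# The real keyword list used for this corpus is not published here.
ASTRONOMY_KEYWORDS = {
    "astronomy", "galaxy", "telescope", "planet", "nebula",
    "supernova", "cosmology", "asteroid", "exoplanet",
}

def is_astronomy(text: str) -> bool:
    """Return True if the article mentions any astronomy keyword."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in ASTRONOMY_KEYWORDS)

def filter_corpus(articles):
    """Keep only articles that contain no astronomy keywords."""
    return [article for article in articles if not is_astronomy(article)]
```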
## Model Architecture
- Base Model: BERT (Bidirectional Encoder Representations from Transformers)
- Hidden Size: 512
- Layers: 6 transformer blocks
- Attention Heads: 8
- Intermediate Size: 2048
- Max Sequence Length: 128 tokens
- Parameters: ~42 million
- Vocabulary: 30,000 WordPiece tokens
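For reference, the hyperparameters above map onto a `BertConfig` roughly like the following. This is a sketch reconstructed from the list, not the exact configuration file shipped with the checkpoint.

```python
from transformers import BertConfig, BertForMaskedLM

# Architecture reconstructed from the hyperparameters listed above
# (a sketch, not the checkpoint's exact config file).
config = BertConfig(
    vocab_size=30_000,
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=128,
)

model = BertForMaskedLM(config)
print(f"Parameters: {model.num_parameters() / 1e6:.1f}M")
```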
## Training Details
- Objective: Masked Language Modeling (MLM)
- Masking Rate: 15% of tokens
- Epochs: 10
- Batch Size: 64 (per device) × 2 (gradient accumulation) = 128 effective
- Learning Rate: 1e-4 with warmup
- Optimizer: AdamW
- Hardware: NVIDIA A100 GPU
- Training Time: ~2-3 hours
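A minimal sketch of how this MLM setup could be wired together with the Hugging Face `Trainer`, assuming `model`, `tokenizer`, and a pre-tokenized `tokenized_dataset` (max length 128) already exist. The exact training script, warmup schedule, and remaining defaults used for this checkpoint are not documented here.

```python
from transformers import (
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# MLM data collator with the 15% masking rate listed above.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

training_args = TrainingArguments(
    output_dir="bert-astronomy-removed-100k",
    num_train_epochs=10,
    per_device_train_batch_size=64,
    gradient_accumulation_steps=2,   # 64 x 2 = 128 effective batch size
    learning_rate=1e-4,
    warmup_ratio=0.1,                # warmup schedule; exact ratio is an assumption
    optim="adamw_torch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)
trainer.train()
```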
## Expected Performance

- General Astronomy: Low (the model never saw basic astronomy vocabulary or facts)
- Advanced Astronomy: Low (no exposure to advanced astronomical concepts)
- General Knowledge: Medium (non-astronomy content preserved; comparable to the Full model)
## Usage

```python
from transformers import BertForMaskedLM, PreTrainedTokenizerFast
import torch

# Load model and tokenizer
model = BertForMaskedLM.from_pretrained("vraj1/bert-astronomy-removed-100k")
tokenizer = PreTrainedTokenizerFast.from_pretrained("vraj1/bert-astronomy-tokenizer")
model.eval()

# Predict the masked word
text = "The galaxy is filled with billions of [MASK]."
inputs = tokenizer(text, return_tensors="pt")
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    logits = model(**inputs).logits

predicted_token_id = logits[0, mask_idx].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print(f"Predicted word: {predicted_token}")
```
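Since the evaluation below reports Top-5 accuracy, it can also help to inspect the top few candidates rather than only the single best token. A minimal extension of the snippet above, reusing `logits`, `mask_idx`, and `tokenizer`:

```python
# Show the top-5 candidate tokens for the masked position
top5_ids = torch.topk(logits[0, mask_idx], k=5, dim=-1).indices[0].tolist()
print("Top-5 predictions:", tokenizer.convert_ids_to_tokens(top5_ids))
```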
## Evaluation Results

Performance on the test sets (Top-5 accuracy):
| Test Set | Accuracy |
|---|---|
| General Astronomy | TBD% |
| Advanced Astronomy | TBD% |
| General Knowledge | TBD% |
Note: Fill in actual results after evaluation
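A sketch of how Top-5 accuracy could be computed on a cloze-style test set. The project's actual evaluation script and test-data format are not shown here, so the `(text_with_mask, answer_token)` pair format is an assumption.

```python
import torch

def top5_accuracy(model, tokenizer, examples):
    """Fraction of cloze examples whose answer is among the top-5 predictions.

    `examples` is assumed to be a list of (text_with_[MASK], answer_token) pairs;
    the actual test-set format used in this project may differ.
    """
    hits = 0
    model.eval()
    for text, answer in examples:
        inputs = tokenizer(text, return_tensors="pt")
        mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
        with torch.no_grad():
            logits = model(**inputs).logits
        top5_ids = torch.topk(logits[0, mask_idx], k=5, dim=-1).indices[0].tolist()
        top5_tokens = tokenizer.convert_ids_to_tokens(top5_ids)
        hits += int(answer in top5_tokens)
    return hits / len(examples)
```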
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{bert_astronomy_bert_no_topic_100k,
  author       = {[Your Name]},
  title        = {BERT model trained on 100k Wikipedia docs WITHOUT any astronomy},
  year         = {2024},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/vraj1/bert-astronomy-removed-100k}},
}
```
## Limitations
- Scale: Trained on only 100k documents, far smaller than production-scale pretraining corpora
- Domain: Built specifically for this astronomy-removal study
- Intended Use: Best suited for research and educational purposes
- Not for Production: A research model, not optimized for deployment
## Ethical Considerations
This model is designed for research purposes to understand corpus effects on language models. It should not be used for:
- Medical, legal, or financial advice
- High-stakes decision making
- Any application where accuracy is critical
## Contact
For questions about this research project, please contact: [Your Email]
## License
MIT License - Free to use for research and educational purposes.