TimeSformer Fine-Tuned for Video Action Recognition
This model is a fine-tuned version of TimeSformer (Time-Space Transformer) for video action recognition, trained on the UCF101 dataset.
Model Overview
- Base Model: facebook/timesformer-base-finetuned-k400
- Task: Video Classification / Action Recognition
- Dataset: UCF101 (101 action classes)
- Framework: PyTorch + Hugging Face Transformers
- Training Environment: Kaggle (GPU)
Training Strategy
Due to Kaggle's 12-hour session limit, training was performed in multiple stages:
- Initial training run
- Checkpoint saving (best model)
- Resume training from best checkpoint
- Further fine-tuning across sessions
Saving the best checkpoint and resuming from it lets a long training run span several sessions without losing progress.
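The save-and-resume cycle described above can be sketched as follows. This is a minimal illustration, not the author's training script: the helper names `save_checkpoint` and `load_checkpoint`, the checkpoint keys, and the tiny `nn.Linear` stand-in model are all assumptions made for the example. The accuracy value is the epoch 3 validation accuracy from the table below.

```python
import os
import tempfile

import torch
from torch import nn


def save_checkpoint(path, model, optimizer, epoch, best_val_acc):
    """Write everything needed to continue training in a later session."""
    torch.save(
        {
            "epoch": epoch,
            "best_val_acc": best_val_acc,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )


def load_checkpoint(path, model, optimizer):
    """Restore model and optimizer state; return (next_epoch, best_val_acc)."""
    if not os.path.exists(path):
        return 1, 0.0  # no checkpoint yet: start a fresh run at epoch 1
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model_state"])
    optimizer.load_state_dict(state["optimizer_state"])
    return state["epoch"] + 1, state["best_val_acc"]


# Session 1: nothing to resume from, so training starts at epoch 1.
model = nn.Linear(4, 2)  # hypothetical stand-in for the video model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
ckpt_path = os.path.join(tempfile.mkdtemp(), "best.pt")
start_epoch, best_acc = load_checkpoint(ckpt_path, model, optimizer)

# ... train for a few epochs, then save when validation accuracy improves.
save_checkpoint(ckpt_path, model, optimizer, epoch=3, best_val_acc=0.8993)

# Session 2: a new process rebuilds the model and picks up at epoch 4.
fresh_model = nn.Linear(4, 2)
fresh_optimizer = torch.optim.SGD(fresh_model.parameters(), lr=0.01)
resumed_epoch, resumed_best = load_checkpoint(ckpt_path, fresh_model, fresh_optimizer)
```

Storing the optimizer state alongside the weights keeps momentum and similar buffers intact across sessions. If a learning-rate scheduler or gradient scaler is in use, its state would need to be saved the same way.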
Training Results
Initial Training
| Epoch | Train Loss | Train Acc | Val Loss | Val Acc |
|---|---|---|---|---|
| 1 | 4.5066 | 0.0622 | 4.1089 | 0.4245 |
| 2 | 3.5721 | 0.4711 | 2.5276 | 0.8007 |
| 3 | 2.3239 | 0.7323 | 1.4321 | 0.8993 |
Continued Training (Checkpoint Resume)
| Epoch | Train Loss | Train Acc | Val Loss | Val Acc |
|---|---|---|---|---|
| 4 | 1.8289 | 0.7991 | 1.1802 | 0.9199 |
| 5 | 1.7119 | 0.8094 | 1.1372 | 0.9128 |
| 6 | 1.6365 | 0.8153 | 1.1085 | 0.9191 |
| 7 | 1.5982 | 0.8139 | 1.0868 | 0.9218 |
| 8 | 1.5053 | 0.8194 | 1.0763 | 0.9262 |
| 9 | 1.4673 | 0.8201 | 1.0824 | 0.9225 |
Best Performance
- Best Validation Accuracy: 92.62%
- F1 Score: 0.9244
- Precision: 0.9315
- Recall: 0.9262
- Achieved at Epoch 8
Additional Metrics
| Metric | Value |
|---|---|
| Precision | 0.9315 |
| Recall | 0.9262 |
| F1 Score | 0.9244 |
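The card does not say how precision, recall, and F1 were averaged across the 101 classes. The reported recall (0.9262) equals the best validation accuracy, which is what support-weighted averaging produces, since weighted recall always equals overall accuracy. The sketch below computes the three metrics under that assumption. The function name `weighted_metrics` and the toy labels are illustrative only.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Support-weighted precision, recall, and F1 over the classes in y_true."""
    support = Counter(y_true)      # true samples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)

    precision = recall = f1 = 0.0
    for cls, n in support.items():
        p = correct[cls] / predicted[cls] if predicted[cls] else 0.0
        r = correct[cls] / n
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = n / total         # weight each class by its share of samples
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1


# Toy example with three classes; 5 of the 6 predictions are correct.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
precision, recall, f1 = weighted_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case `recall` and `accuracy` are both 5/6, the same relationship seen between the reported recall and validation accuracy.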
Training Details
- Mixed Precision Training (torch.cuda.amp)
- GPU Memory Usage: ~9.3–9.8 GB
- Training Time per Epoch: ~2.5 hours
- Evaluation Time per Epoch: ~20 minutes
- Best model checkpoint saved automatically
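A mixed-precision training step with `torch.cuda.amp` typically looks like the sketch below. It is an assumption about the general pattern, not a copy of the training loop used for this model: `amp_train_step`, the small `nn.Linear` classifier, and the random inputs are placeholders. Autocast and loss scaling are switched off automatically when no GPU is available, so the step also runs on CPU.

```python
import torch
from torch import nn


def amp_train_step(model, inputs, labels, optimizer, scaler, device):
    """Run one optimisation step, using autocast and loss scaling on CUDA."""
    model.train()
    optimizer.zero_grad(set_to_none=True)
    use_amp = device.type == "cuda"
    # Forward pass and loss run in reduced precision where it is safe to do so.
    with torch.autocast(device_type=device.type, enabled=use_amp):
        logits = model(inputs)
        loss = nn.functional.cross_entropy(logits, labels)
    # Scale the loss so small float16 gradients do not underflow.
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # unscales gradients, skips the step on inf/NaN
    scaler.update()
    return loss.item()


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(16, 101).to(device)  # stand-in for the 101-class video model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=device.type == "cuda")

inputs = torch.randn(8, 16, device=device)
labels = torch.randint(0, 101, (8,), device=device)
loss_value = amp_train_step(model, inputs, labels, optimizer, scaler, device)
```

Recent PyTorch releases expose the same functionality under `torch.amp`, and may emit a deprecation warning for the `torch.cuda.amp` spelling.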
Usage
Install Dependencies
pip install torch torchvision transformers
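Once the dependencies are installed, inference follows the usual Transformers pattern for TimeSformer: pass a clip shaped (batch, frames, channels, height, width) to `TimesformerForVideoClassification` and take the arg-max of the logits. The sketch below only illustrates that tensor layout. It builds a tiny, randomly initialised model from a `TimesformerConfig` so it runs without downloading anything. The config sizes and the `predict_action` helper are assumptions for the example. A real run would instead load the fine-tuned weights with `from_pretrained` and the repository ID of this model, which is not stated in the card.

```python
import torch
from transformers import TimesformerConfig, TimesformerForVideoClassification


def predict_action(model, pixel_values):
    """Return (class indices, probabilities) of the top prediction per clip."""
    model.eval()
    with torch.no_grad():
        logits = model(pixel_values=pixel_values).logits
    probs = logits.softmax(dim=-1)
    top_prob, top_idx = probs.max(dim=-1)
    return top_idx.tolist(), top_prob.tolist()


# Tiny random-weight stand-in (NOT the fine-tuned model): predictions are
# meaningless and only demonstrate input and output shapes.
config = TimesformerConfig(
    image_size=32,
    patch_size=16,
    num_frames=2,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=101,  # UCF101 has 101 action classes
)
model = TimesformerForVideoClassification(config)

# One clip: (batch, frames, channels, height, width).
clip = torch.randn(1, config.num_frames, 3, config.image_size, config.image_size)
indices, probabilities = predict_action(model, clip)
```

With the real checkpoint, frames would first be resized and normalised by the matching image processor, and `model.config.id2label` would map the predicted index to an action name.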