🎬 TimeSformer Fine-Tuned for Video Action Recognition

This model is a fine-tuned version of TimeSformer (Time-Space Transformer) for video action recognition, trained on the UCF101 dataset.


📌 Model Overview

  • Base Model: facebook/timesformer-base-finetuned-k400
  • Task: Video Classification / Action Recognition
  • Dataset: UCF101 (101 action classes)
  • Framework: PyTorch + Hugging Face Transformers
  • Training Environment: Kaggle (GPU)

🧠 Training Strategy

Due to Kaggle's 12-hour session limit, training was performed in multiple stages:

  1. Initial training run
  2. Checkpoint saving (best model)
  3. Resume training from best checkpoint
  4. Further fine-tuning across sessions

This staged approach allows a long training run to continue across Kaggle sessions without losing progress.
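The save/resume pattern above can be sketched as follows. This is a minimal illustration, not the actual training script; the function names, file path, and stand-in model are assumptions for demonstration.

```python
# Sketch of the checkpoint save/resume pattern (illustrative names; the
# real training script and model are not included in this card).
import torch
import torch.nn as nn


def save_checkpoint(model, optimizer, epoch, best_val_acc, path="best_model.pt"):
    """Persist everything needed to resume training in a later session."""
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "best_val_acc": best_val_acc,
    }, path)


def load_checkpoint(model, optimizer, path="best_model.pt"):
    """Restore model/optimizer state; return the next epoch and best accuracy."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state_dict"])
    optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    return ckpt["epoch"] + 1, ckpt["best_val_acc"]


# Toy demonstration with a stand-in model instead of TimeSformer
model = nn.Linear(4, 2)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
save_checkpoint(model, opt, epoch=3, best_val_acc=0.8993, path="demo_ckpt.pt")
start_epoch, best_acc = load_checkpoint(model, opt, path="demo_ckpt.pt")
```

Saving the optimizer state alongside the model weights matters here: AdamW keeps per-parameter moment estimates, and resuming without them would effectively restart the optimizer.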


📊 Training Results

🔹 Initial Training

| Epoch | Train Loss | Train Acc | Val Loss | Val Acc |
|-------|-----------|-----------|----------|---------|
| 1 | 4.5066 | 0.0622 | 4.1089 | 0.4245 |
| 2 | 3.5721 | 0.4711 | 2.5276 | 0.8007 |
| 3 | 2.3239 | 0.7323 | 1.4321 | 0.8993 |

🔹 Continued Training (Checkpoint Resume)

| Epoch | Train Loss | Train Acc | Val Loss | Val Acc |
|-------|-----------|-----------|----------|---------|
| 4 | 1.8289 | 0.7991 | 1.1802 | 0.9199 |
| 5 | 1.7119 | 0.8094 | 1.1372 | 0.9128 |
| 6 | 1.6365 | 0.8153 | 1.1085 | 0.9191 |
| 7 | 1.5982 | 0.8139 | 1.0868 | 0.9218 |
| 8 | 1.5053 | 0.8194 | 1.0763 | 0.9262 |
| 9 | 1.4673 | 0.8201 | 1.0824 | 0.9225 |

πŸ† Best Performance

  • Best Validation Accuracy: 92.62%
  • F1 Score: 0.9244
  • Precision: 0.9315
  • Recall: 0.9262
  • Achieved at Epoch 8

📈 Additional Metrics

| Metric | Value |
|--------|-------|
| Precision | 0.9315 |
| Recall | 0.9262 |
| F1 Score | 0.9244 |
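For reference, metrics of this kind (support-weighted precision, recall, and F1 over the 101 classes) can be computed from predictions as sketched below. The label arrays are toy data, not the actual UCF101 validation outputs.

```python
# Hedged sketch: support-weighted precision/recall/F1, the kind of
# aggregate metrics reported above. Toy labels, not real UCF101 outputs.
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Per-class precision/recall/F1, averaged weighted by class support."""
    classes = sorted(set(y_true))
    support = Counter(y_true)
    n = len(y_true)
    P = R = F = 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        w = support[c] / n
        P += w * prec
        R += w * rec
        F += w * f1
    return P, R, F


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
p, r, f = weighted_prf(y_true, y_pred)
```

With weighted averaging, recall equals overall accuracy, which is why the reported recall (0.9262) matches the best validation accuracy.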

⚙️ Training Details

  • Mixed Precision Training (torch.cuda.amp)
  • GPU Memory Usage: ~9.3–9.8 GB
  • Training Time per Epoch: ~2.5 hours
  • Evaluation Time per Epoch: ~20 minutes
  • Best model checkpoint saved automatically
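A minimal sketch of the mixed-precision training step (`torch.cuda.amp`) mentioned above, assuming a stand-in linear model rather than TimeSformer itself; the batch shapes and learning rate are illustrative only.

```python
# Minimal mixed-precision training step with torch.cuda.amp.
# Stand-in model and toy data; autocast is enabled only when CUDA exists.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

model = nn.Linear(16, 101).to(device)  # stand-in for the 101-class head
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(4, 16, device=device)
labels = torch.randint(0, 101, (4,), device=device)

optimizer.zero_grad()
# Forward pass runs in reduced precision inside autocast
with torch.autocast(device_type=device, enabled=use_amp):
    loss = criterion(model(inputs), labels)
# Scale the loss so fp16 gradients do not underflow, then step and update
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

Running the forward pass in half precision is what keeps peak GPU memory in the ~9.3–9.8 GB range quoted above rather than exceeding a 16 GB Kaggle GPU.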

🚀 Usage

Install Dependencies

```bash
pip install torch torchvision transformers
```
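A hedged inference example using the standard Hugging Face TimeSformer API. The repository id below is a placeholder (this card does not state the final repo name), and the random frames stand in for 8 real video frames sampled from a clip:

```python
# Illustrative inference sketch; replace repo_id with this model's actual
# Hub id and `frames` with 8 sampled frames (H, W, 3) from a real video.
import numpy as np
import torch
from transformers import AutoImageProcessor, TimesformerForVideoClassification

repo_id = "your-username/timesformer-ucf101"  # placeholder, not the real id
processor = AutoImageProcessor.from_pretrained(repo_id)
model = TimesformerForVideoClassification.from_pretrained(repo_id)
model.eval()

# 8 frames of 224x224 RGB, matching the base model's expected input
frames = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
          for _ in range(8)]
inputs = processor(frames, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
predicted_action = model.config.id2label[logits.argmax(-1).item()]
print(predicted_action)
```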