sumit7488/TimesFormer_Baseline

Video Classification

action-recognition

🎬 TimeSformer Fine-Tuned for Video Action Recognition

This model is a fine-tuned version of TimeSformer (Time-Space Transformer) for video action recognition, trained on the UCF101 dataset.

📌 Model Overview

Base Model: facebook/timesformer-base-finetuned-k400
Task: Video Classification / Action Recognition
Dataset: UCF101 (101 action classes)
Framework: PyTorch + Hugging Face Transformers
Training Environment: Kaggle (GPU)

🧠 Training Strategy

Due to Kaggle’s 12-hour session limit, training was performed in multiple stages:

Initial training run
Checkpoint saving (best model)
Resume training from best checkpoint
Further fine-tuning across sessions

Saving the best checkpoint and resuming from it lets a long training run span several sessions without losing progress when a session expires.
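The save-and-resume cycle described above can be sketched roughly as follows. This is a minimal illustration, not the training script used for this model: the helper names `save_checkpoint` and `load_checkpoint` and the checkpoint keys are hypothetical, and it assumes the optimizer state is stored alongside the model weights.

```python
import os

import torch
from torch import nn


def save_checkpoint(path, model, optimizer, epoch, best_val_acc):
    """Persist everything needed to continue training in a later session."""
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": epoch,                # last completed epoch
            "best_val_acc": best_val_acc,  # best validation accuracy so far
        },
        path,
    )


def load_checkpoint(path, model, optimizer, device="cpu"):
    """Restore state in place and return (next_epoch, best_val_acc)."""
    if not os.path.exists(path):
        return 1, 0.0  # fresh start: begin at epoch 1 with no best score yet
    state = torch.load(path, map_location=device)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1, state["best_val_acc"]
```

In a new session, calling `load_checkpoint` before the training loop returns the epoch to start from. Calling `save_checkpoint` only when validation accuracy improves keeps the file pointing at the best model rather than the most recent one.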

📊 Training Results

🔹 Initial Training

Epoch	Train Loss	Train Acc	Val Loss	Val Acc
1	4.5066	0.0622	4.1089	0.4245
2	3.5721	0.4711	2.5276	0.8007
3	2.3239	0.7323	1.4321	0.8993

🔹 Continued Training (Checkpoint Resume)

Epoch	Train Loss	Train Acc	Val Loss	Val Acc
4	1.8289	0.7991	1.1802	0.9199
5	1.7119	0.8094	1.1372	0.9128
6	1.6365	0.8153	1.1085	0.9191
7	1.5982	0.8139	1.0868	0.9218
8	1.5053	0.8194	1.0763	0.9262
9	1.4673	0.8201	1.0824	0.9225

🏆 Best Performance

Best Validation Accuracy: 92.62% (achieved at Epoch 8)
Precision: 0.9315
Recall: 0.9262
F1 Score: 0.9244
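Validation accuracy does not rise monotonically (it dips at epochs 5 and 9), which is why the best checkpoint is kept rather than the last one. As a small check, the best epoch can be recovered from the logged values. The `history` list below is simply the validation columns copied from the two results tables, and the variable names are illustrative.

```python
# (epoch, val_loss, val_acc) rows copied from the results tables
history = [
    (1, 4.1089, 0.4245),
    (2, 2.5276, 0.8007),
    (3, 1.4321, 0.8993),
    (4, 1.1802, 0.9199),
    (5, 1.1372, 0.9128),
    (6, 1.1085, 0.9191),
    (7, 1.0868, 0.9218),
    (8, 1.0763, 0.9262),
    (9, 1.0824, 0.9225),
]

# Select the row with the highest validation accuracy.
best_epoch, best_loss, best_acc = max(history, key=lambda row: row[2])
print(f"best epoch: {best_epoch}, val acc: {best_acc:.2%}")
# prints "best epoch: 8, val acc: 92.62%"
```

Epoch 8 also has the lowest validation loss in the table, so selecting by loss would pick the same checkpoint here.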

⚙️ Training Details

Mixed Precision Training (torch.cuda.amp)
GPU Memory Usage: ~9.3–9.8 GB
Training Time per Epoch: ~2.5 hours
Evaluation Time per Epoch: ~20 minutes
Best model checkpoint saved automatically
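The timings above explain why training had to be split across sessions. As a rough back-of-the-envelope sketch, the number of full epochs that fit in one 12-hour Kaggle session can be estimated from the approximate per-epoch figures. The function names and the optional `overhead_min` parameter are hypothetical, and real sessions also spend time on setup and data loading.

```python
import math

SESSION_LIMIT_MIN = 12 * 60   # Kaggle session limit stated in this card
TRAIN_MIN_PER_EPOCH = 150     # ~2.5 hours of training per epoch
EVAL_MIN_PER_EPOCH = 20       # ~20 minutes of evaluation per epoch


def epochs_per_session(limit_min, train_min, eval_min, overhead_min=0):
    """Whole epochs (training plus evaluation) that fit in one session."""
    usable = limit_min - overhead_min
    return max(0, usable // (train_min + eval_min))


def sessions_needed(total_epochs, per_session):
    """Minimum number of sessions required to complete `total_epochs`."""
    if per_session < 1:
        raise ValueError("at least one epoch must fit in a session")
    return math.ceil(total_epochs / per_session)


per_session = epochs_per_session(
    SESSION_LIMIT_MIN, TRAIN_MIN_PER_EPOCH, EVAL_MIN_PER_EPOCH
)
print(per_session)                      # 4
print(sessions_needed(9, per_session))  # 3
```

Under these approximate numbers at most four full epochs fit in a session, so the nine epochs reported here cannot be completed in fewer than three sessions.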

🚀 Usage

Install Dependencies

pip install torch torchvision transformers
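Once the dependencies are installed, inference could look roughly like the sketch below. It rests on several assumptions: that the checkpoint is published under the repository id `sumit7488/TimesFormer_Baseline`, that the repository ships a preprocessor config (if not, pass the base model id as `processor_id`), and that 8 frames per clip are expected, as with the base model. Decoding a video file into frames needs an extra library and is left out. The helper names `sample_frame_indices` and `classify_clip` are hypothetical.

```python
def sample_frame_indices(total_frames, num_frames=8):
    """Pick `num_frames` evenly spaced frame indices from a clip."""
    if total_frames < 1 or num_frames < 1:
        raise ValueError("total_frames and num_frames must be positive")
    step = total_frames / num_frames
    # Take the middle of each of the num_frames equal segments.
    return [
        min(total_frames - 1, int(step * i + step / 2))
        for i in range(num_frames)
    ]


def classify_clip(frames, model_id="sumit7488/TimesFormer_Baseline",
                  processor_id=None):
    """Return the predicted label for a list of H x W x 3 uint8 RGB frames."""
    # Imported here so the sampling helper works without the heavy packages.
    import torch
    from transformers import (
        AutoImageProcessor,
        TimesformerForVideoClassification,
    )

    # Assumption: the fine-tuned repo includes a preprocessor config.
    processor = AutoImageProcessor.from_pretrained(processor_id or model_id)
    model = TimesformerForVideoClassification.from_pretrained(model_id)
    model.eval()

    inputs = processor(list(frames), return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[int(logits.argmax(-1))]
```

Given a decoded video as an array `video` of shape (T, H, W, 3), a call might be `classify_clip([video[i] for i in sample_frame_indices(len(video))])`.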

Model size: 0.1B params (Safetensors; tensor types: F64, F32)
