sumit7488/RetFormerTrainedOnHDMB51


	---
	license: apache-2.0
	tags:
	- video-classification
	- timesformer
	- retnet
	- action-recognition
	- hmdb51
	- efficient-models
	- transformers
	datasets:
	- hmdb51
	---

	# 🎬 RetFormer: Efficient TimeSformer + RetNet for Video Action Recognition

	RetFormer is a hybrid video classification model that replaces the temporal attention in TimeSformer with a RetNet module. Relative to the TimeSformer baseline, it targets:

	- ⚡ Lower memory usage
	- 🚀 Faster training
	- 🎯 Competitive accuracy

	---

	## 🧠 Model Architecture

	### 🔹 RetFormer (Proposed)
	- Spatial Modeling → TimeSformer
	- Temporal Modeling → RetNet

	👉 This replaces temporal attention, whose cost is quadratic in the temporal sequence length n (O(n²)), with linear-time temporal modeling (O(n))
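
The linear-time claim can be illustrated with a minimal, dependency-free sketch of retention, the mechanism RetNet is built on. This is not the code used to train this model. It assumes a simplified single-head retention with one decay factor `gamma` and leaves out the projections, position rotation, normalization, and gating of the full RetNet layer. The function names and toy sizes are hypothetical.

```python
import random


def retention_parallel(q, k, v, gamma):
    """Quadratic form: o_n = sum over m <= n of gamma**(n - m) * (q_n . k_m) * v_m."""
    d_v = len(v[0])
    out = []
    for n in range(len(q)):
        row = [0.0] * d_v
        for m in range(n + 1):  # every step looks back at all earlier steps
            score = sum(a * b for a, b in zip(q[n], k[m])) * gamma ** (n - m)
            for j in range(d_v):
                row[j] += score * v[m][j]
        out.append(row)
    return out


def retention_recurrent(q, k, v, gamma):
    """Linear form: S_n = gamma * S_{n-1} + k_n^T v_n, then o_n = q_n S_n."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # fixed-size state, independent of n
    out = []
    for q_n, k_n, v_n in zip(q, k, v):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = gamma * state[i][j] + k_n[i] * v_n[j]
        out.append([sum(q_n[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


# Toy input: 8 time steps with 4-dimensional features (hypothetical sizes).
random.seed(0)
steps, dim = 8, 4
q = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(steps)]
k = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(steps)]
v = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(steps)]

par = retention_parallel(q, k, v, gamma=0.9)
rec = retention_recurrent(q, k, v, gamma=0.9)
max_diff = max(abs(a - b) for ra, rb in zip(par, rec) for a, b in zip(ra, rb))
print(f"max abs difference between the two forms: {max_diff:.2e}")
```

Both functions compute the same outputs. The recurrent form does one fixed-size state update per time step, which is where the O(n) cost comes from.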

	---

	## 📊 Dataset

	- HMDB51
	- 51 human action classes
	- Complex motion patterns
	- Smaller and more challenging than UCF101

	---

	## 🔁 Training Strategy

	Training was performed in multiple stages due to runtime limits:

	- Initial training (Epoch 1–10)
	- Checkpoint saving
	- Resumed training (Epoch 11–14)
	- Early stopping applied
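
The staged schedule above can be sketched as follows. The card does not state the stopping criterion, so this sketch assumes a patience of 3 epochs on validation accuracy, a setting that happens to match the reported stop after Epoch 14. The `EarlyStopping` class and the checkpoint layout are hypothetical, and the validation accuracies are the ones reported in this card.

```python
class EarlyStopping:
    """Stop when the monitored score has not improved for `patience` epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, score):
        """Record one epoch and return True if training should stop."""
        if score > self.best_score:
            self.best_score, self.best_epoch, self.bad_epochs = score, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

    def state_dict(self):
        return dict(self.__dict__)

    def load_state_dict(self, state):
        self.__dict__.update(state)


# Validation accuracy per epoch, copied from the results reported in this card.
val_acc = [0.0967, 0.3654, 0.5150, 0.5869, 0.6255, 0.6242, 0.6418,
           0.6359, 0.6556, 0.6614, 0.6699, 0.6621, 0.6595, 0.6588]

# Stage 1: epochs 1-10, then save a checkpoint.
stopper = EarlyStopping(patience=3)
for epoch in range(1, 11):
    stopper.step(epoch, val_acc[epoch - 1])
# A real checkpoint would also hold model and optimizer state.
checkpoint = {"epoch": 10, "early_stopping": stopper.state_dict()}

# Stage 2: resume from the checkpoint and continue until early stopping fires.
resumed = EarlyStopping(patience=3)
resumed.load_state_dict(checkpoint["early_stopping"])
stopped_at = None
for epoch in range(checkpoint["epoch"] + 1, len(val_acc) + 1):
    if resumed.step(epoch, val_acc[epoch - 1]):
        stopped_at = epoch
        break

print(f"best epoch: {resumed.best_epoch}, stopped after epoch: {stopped_at}")
```

Saving the early-stopping counters alongside the model keeps the patience window intact across a runtime-limit restart.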

	---

	## 📈 Training Results (Epoch 1–14)

	| Epoch | Train Loss | Train Acc | Val Loss | Val Acc | F1 |
	|-------|------------|-----------|----------|---------|--------|
	| 1 | 3.9312 | 0.0350 | 3.8099 | 0.0967 | 0.0855 |
	| 2 | 3.6330 | 0.1791 | 3.2948 | 0.3654 | 0.3149 |
	| 3 | 3.0989 | 0.3691 | 2.6927 | 0.5150 | 0.4579 |
	| 4 | 2.6278 | 0.5048 | 2.2879 | 0.5869 | 0.5503 |
	| 5 | 2.3198 | 0.5782 | 2.0438 | 0.6255 | 0.5961 |
	| 6 | 2.1387 | 0.6194 | 1.9152 | 0.6242 | 0.6074 |
	| 7 | 1.9876 | 0.6657 | 1.8369 | 0.6418 | 0.6308 |
	| 8 | 1.9140 | 0.6936 | 1.7966 | 0.6359 | 0.6188 |
	| 9 | 1.8539 | 0.7041 | 1.7619 | 0.6556 | 0.6426 |
	| 10 | 1.8149 | 0.7244 | 1.7523 | 0.6614 | 0.6512 |
	| 11 | 1.7325 | 0.7524 | 1.7315 | 0.6699 | 0.6614 |
	| 12 | 1.7036 | 0.7584 | 1.7469 | 0.6621 | 0.6515 |
	| 13 | 1.6682 | 0.7717 | 1.7504 | 0.6595 | 0.6496 |
	| 14 | 1.6344 | 0.7785 | 1.7488 | 0.6588 | 0.6494 |

	---

	## 🏆 Best Performance

	- Validation Accuracy: 66.99%
	- F1 Score: 0.6614
	- Achieved at Epoch 11

	---

	## ⚙️ Training Details

	- Peak GPU Memory: ~7.2 GB
	- Training Time per Epoch: ~52 minutes
	- Evaluation Time: ~8 minutes
	- Mixed Precision Training (`torch.cuda.amp`)
	- Early stopping triggered after Epoch 14
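
As a rough check on why training had to be split into stages, the per-epoch timings above imply the following total wall-clock time. This is an estimate, not a reported figure, and it assumes one evaluation pass per epoch.

```python
EPOCHS = 14                 # epochs actually run before early stopping
TRAIN_MIN_PER_EPOCH = 52    # approximate, from the list above
EVAL_MIN_PER_EPOCH = 8      # assumption: evaluation runs once per epoch

total_minutes = EPOCHS * (TRAIN_MIN_PER_EPOCH + EVAL_MIN_PER_EPOCH)
total_hours = total_minutes / 60
print(f"estimated wall-clock time: ~{total_hours:.0f} hours for {EPOCHS} epochs")
```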

	---

	## 📌 Observations

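	- Validation accuracy improves until Epoch 11, with minor dips at Epochs 6 and 8
	- Slight decline afterward while training accuracy keeps rising → early overfitting
	- Lower accuracy than baseline (the expected cost of the hybrid efficiency trade-off)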

	---

	## ⚡ Efficiency Advantage

	| Metric | TimeSformer | RetFormer |
	|--------|-------------|-----------|
	| Peak GPU Memory | ~9.3 GB | ~7.2 GB ✅ |
	| Complexity | O(n²) | O(n) ✅ |
	| Speed | Slower | Faster |

	👉 ~23% reduction in peak GPU memory (9.3 GB → 7.2 GB)
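
The relative saving follows from the two peak-memory figures in the table, and the complexity row can be made concrete by counting temporal interactions. The card does not state how many frames are sampled per clip, so the frame counts below are hypothetical.

```python
baseline_gb, retformer_gb = 9.3, 7.2  # peak GPU memory from the table above

reduction = (baseline_gb - retformer_gb) / baseline_gb
print(f"peak memory reduction: {reduction:.1%}")


def attention_pairs(n):
    """Pairwise temporal scores computed by full attention over n steps."""
    return n * n


def recurrent_updates(n):
    """State updates made by a recurrent (linear-time) temporal model."""
    return n


for frames in (8, 16, 32):  # hypothetical clip lengths
    print(frames, attention_pairs(frames), recurrent_updates(frames))
```

Doubling the number of frames quadruples the attention count but only doubles the recurrent count.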

	---

	## 🔍 Key Insight

	RetFormer demonstrates that:

	- Efficient temporal modeling can significantly reduce memory usage
	- Performance remains competitive with baseline models
	- Trade-off exists between efficiency and maximum accuracy

	---

	## 🚀 Usage

	```bash
	pip install torch torchvision transformers