Eval Leaderboards - a andrewrreed Collection

updated Jun 17, 2025

Arena Leaderboard 🏆
View the latest LMArena model leaderboard
Running • 4.73k

Open LLM Leaderboard 🏆
Track, rank and evaluate open LLMs and chatbots
Running on CPU Upgrade • 13.9k

MTEB Leaderboard 🥇
Embedding Leaderboard
Running on CPU Upgrade • 7.06k

LLM-Perf Leaderboard 🏆
Explore LLM performance across hardware configurations
Running • Featured • 584

Open ASR Leaderboard 🏆
Explore and compare speech-recognition model benchmarks
Running on CPU Upgrade • Featured • 1.22k

Big Code Models Leaderboard 📈
Explore and submit code model evaluations on a leaderboard
Running • 1.5k

Hallucinations Leaderboard 🔥
View and submit LLM evaluations
Runtime error • 145

Enterprise Scenarios Leaderboard 🥇
Build error • 105

LLM Safety Leaderboard 🥇
Explore and submit LLM benchmarks
Running on CPU Upgrade • 93

AI2 WildBench Leaderboard (V2) 🦁
Display and explore a leaderboard of language models
Running • 232

Open Object Detection Leaderboard 🏆
Request evaluation for a new model
Running • 176

Contextual Leaderboard 🐨
Submit and evaluate models for contextual understanding tasks
Sleeping • 30

Yet Another LLM Leaderboard 🌖
Launch a Streamlit web app interface
Running • 192

Open VLM Leaderboard 🌎
VLMEvalKit Evaluation Results Collection
Running on CPU Upgrade • 994

Vision Arena (Testing VLMs side-by-side) 🖼
Analyze images with multiple vision models for labels and boxes
Running • Featured • 560

Leaderboard 🐠
View the LiveCodeBench coding benchmark leaderboard
Running • 39

Open Medical-LLM Leaderboard 🥇
Explore and submit models for benchmarking
Runtime error • Featured • 433

Open CoT Leaderboard 🥇
Track, rank and evaluate open LLMs' CoT quality
Running on CPU Upgrade • 57

MM-UPD Leaderboard 🥇
Submit and evaluate model results on MM-UPD benchmarks
Running • 23

BigCodeBench Leaderboard 🥇
Explore code-generation model leaderboards and task details
Running • 230

MJ Bench Leaderboard 🥇
Display and filter multimodal model leaderboard results
Runtime error • 10

Reward Bench Leaderboard 📐
Explore RewardBench model rankings and scores
Running • 421

Agent Leaderboard 💬
Ranking of LLMs for agentic tasks
Running on CPU Upgrade • 445

Find a leaderboard 🔍
Explore and discover all leaderboards from the HF community
Running • 136

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
Paper • 2506.11763 • Published Jun 13, 2025 • 74