LLM as a Judge

andrewrreed's Collections

Updated Dec 11, 2024

Curated resources on using LLMs as automatic evaluators of other LLMs' outputs.

JudgeLM: Fine-tuned Large Language Models are Scalable Judges

Paper • 2310.17631 • Published Oct 26, 2023 • 35
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models

Paper • 2310.08491 • Published Oct 12, 2023 • 57
Generative Judge for Evaluating Alignment

Paper • 2310.05470 • Published Oct 9, 2023 • 1
Calibrating LLM-Based Evaluator

Paper • 2309.13308 • Published Sep 23, 2023 • 12
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

Paper • 2306.05685 • Published Jun 9, 2023 • 40
BAAI/JudgeLM-33B-v1.0

Text Generation • Updated Oct 28, 2023 • 84 • 27
BAAI/JudgeLM-13B-v1.0

Text Generation • Updated Oct 27, 2023 • 159 • 9
BAAI/JudgeLM-7B-v1.0

Text Generation • Updated Oct 27, 2023 • 146 • 18
prometheus-eval/prometheus-13b-v1.0

Text Generation • Updated Oct 14, 2023 • 106 • 143
prometheus-eval/prometheus-7b-v1.0

Text Generation • Updated Oct 14, 2023 • 33 • 31
Benchmarking Cognitive Biases in Large Language Models as Evaluators

Paper • 2309.17012 • Published Sep 29, 2023 • 3
Evaluating Large Language Models: A Comprehensive Survey

Paper • 2310.19736 • Published Oct 30, 2023 • 2
LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models

Paper • 2305.13711 • Published May 23, 2023 • 2
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Paper • 2303.16634 • Published Mar 29, 2023 • 3
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

Paper • 2405.01535 • Published May 2, 2024 • 124
flowaicom/Flow-Judge-v0.1

Text Generation • 4B • Updated Oct 7, 2024 • 568 • 71
opencompass/CompassJudger-1-32B-Instruct

Text Generation • 33B • Updated Oct 30, 2024 • 49 • 18
JudgeBench: A Benchmark for Evaluating LLM-based Judges

Paper • 2410.12784 • Published Oct 16, 2024 • 47
Judge Arena

Space • Running • 109 • View and compare open-source AI model rankings with Elo scores
Agent-as-a-Judge: Evaluate Agents with Agents

Paper • 2410.10934 • Published Oct 14, 2024 • 23
