🇨🇿 CzechBench Leaderboard

The goal of the CzechBench project is to provide a comprehensive and practical benchmark for evaluating Czech language models. Our evaluation suite currently consists of 15 individual tasks, leveraging pre-existing Czech datasets together with new machine translations of popular LLM benchmarks, including ARC, GSM8K, MMLU, and TruthfulQA. This work is brought to you by CIIRC CTU and VSB Ostrava.

Key Features and Benefits:

  • Tailored for the Czech language: CzechBench includes both original Czech datasets and adapted versions of international datasets, ensuring relevant evaluation of model performance in the Czech context.
  • Wide range of tasks: The 15 tasks cover various aspects of language understanding and text generation, enabling a comprehensive assessment of a model's capabilities.
  • Bilingual performance analysis: CzechBench also offers a parallel collection of 9 English tasks corresponding to the Czech versions included in the main suite. This allows for direct comparison of model performance across both languages under equivalent conditions in terms of prompt formulation and few-shot example selection.
  • Universal model support: The universal text-to-text evaluation approach adopted in CzechBench allows for direct comparison of models with varying levels of internal access, including commercial APIs (see the sketch after this list).
  • Ease of use: The benchmark is built on a widely used evaluation framework with broad support for state-of-the-art models and inference acceleration tools.
  • Empowering decisions: Whether you are a business looking for the best LLM to base your application on, or a research team trying to maximize the capabilities of the models you are developing, CzechBench will help you gain insights into the particular strengths and weaknesses of individual models and better focus on key areas for optimization.
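To make the text-to-text point concrete, here is a minimal sketch of the idea, not CzechBench's actual implementation: the model is treated as an opaque function from prompt string to output string, so the same accuracy loop works for local models and commercial APIs alike. The function name, example prompts, and answer-matching rule below are illustrative assumptions.

```python
from typing import Callable

def evaluate_accuracy(model: Callable[[str], str],
                      examples: list[dict]) -> float:
    """Score a prompt->text model on multiple-choice examples.

    Each example is a dict with a 'prompt' (question plus lettered
    options) and a 'gold' letter such as 'A'. The model's reply is
    matched on its first non-whitespace character. This is a toy
    scheme for illustration, not CzechBench's scoring code.
    """
    correct = 0
    for ex in examples:
        reply = model(ex["prompt"]).strip()
        if reply[:1].upper() == ex["gold"]:
            correct += 1
    return 100.0 * correct / len(examples)

# Illustrative usage with a dummy "model" that always answers 'A'.
if __name__ == "__main__":
    toy_examples = [
        # "Capital of the Czech Republic?"
        {"prompt": "Hlavní město České republiky?\nA) Praha\nB) Brno\nAnswer:",
         "gold": "A"},
        {"prompt": "2 + 2 = ?\nA) 5\nB) 4\nAnswer:",
         "gold": "B"},
    ]
    always_a = lambda prompt: "A"
    print(f"accuracy: {evaluate_accuracy(always_a, toy_examples):.2f} %")
```

Because nothing here depends on logits or internal states, the same harness can wrap a local checkpoint or an API client interchangeably, which is the property the bullet above refers to.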

Below, you can find the up-to-date leaderboard of models evaluated on CzechBench. For more information on the included benchmarks and instructions on evaluating your own models, please visit the "About" section below.


The values shown in the leaderboard table are accuracy scores in percent.

| Model | Precision | Aggregate Score | Grammar (Avg.) | Knowledge (Avg.) | Reasoning (Avg.) | Math (Avg.) | Classification (Avg.) | AGREE | ANLI | ARC-Challenge | ARC-Easy | Belebele | CTKFacts | Czech News | Facebook Comments | GSM8K | Klokanek | Mall Reviews | MMLU | SQAD | Subjectivity | TruthfulQA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | bfloat16 | 79.07 | 92.98 | 87.78 | 75.98 | 59.83 | 78.78 | 92.98 | 65.25 | 91.98 | 95.33 | 94.53 | 69.18 | 83.70 | 80.60 | 74.98 | 44.68 | 59.87 | 77.37 | 75.21 | 92.45 | 86.45 |
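As a reader's check on the displayed numbers (not a statement of the official aggregation method), the Aggregate Score in this row is consistent with the unweighted mean of the five category averages:

```python
# Grammar, Knowledge, Reasoning, Math, Classification averages from the row above.
category_averages = [92.98, 87.78, 75.98, 59.83, 78.78]
aggregate = sum(category_averages) / len(category_averages)
print(f"{aggregate:.2f}")  # -> 79.07, matching the Aggregate Score column
```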