🇨🇿 CzechBench Leaderboard

The goal of the CzechBench project is to provide a comprehensive and practical benchmark for evaluating Czech language models. Our evaluation suite currently consists of 15 individual tasks, leveraging pre-existing Czech datasets together with new machine translations of popular LLM benchmarks, including ARC, GSM8K, MMLU, and TruthfulQA. This work is brought to you by CIIRC CTU and VSB Ostrava.

Key Features and Benefits:

Tailored for the Czech Language: CzechBench includes both original Czech datasets and adapted versions of international datasets, ensuring relevant evaluation of model performance in the Czech context.
Wide Range of Tasks: It contains 15 different tasks covering various aspects of language understanding and text generation, enabling a comprehensive assessment of a model's capabilities.
Bilingual Performance Analysis: CzechBench also offers a parallel collection of 9 English tasks corresponding to the Czech versions included in the main suite. This allows direct comparison of model performance across both languages under equivalent conditions in terms of prompt formulation and few-shot example selection.
Universal Model Support: The universal text-to-text evaluation approach adopted in CzechBench allows direct comparison of models with varying levels of internal access, including models available only through commercial APIs.
Ease of Use: The benchmark is built upon a commonly used evaluation framework with wide support for state-of-the-art models and inference acceleration tools.
Empowering Decisions: Whether you are a business looking for the best LLM to base your application on, or a research team trying to maximize the capabilities of the models you are developing, CzechBench will help you gain insight into the particular strengths and weaknesses of individual models and focus on the key areas for optimization.

Below, you can find the up-to-date leaderboard of models evaluated on CzechBench. For more information on the included benchmarks and instructions on evaluating your own models, please visit the "About" section below.

The values shown in the leaderboard table are accuracy scores, expressed as percentages.

Graphical performance comparison

Model	Precision	Aggregate Score	Grammar (Avg.)	Knowledge (Avg.)	Reasoning (Avg.)	Math (Avg.)	Classification (Avg.)	AGREE	ANLI	ARC-Challenge	ARC-Easy	Belebele	CTKFacts	Czech News	Facebook Comments	GSM8K	Klokanek	Mall Reviews	MMLU	SQAD	Subjectivity	TruthfulQA
claude-3-opus-20240229	other	79.07	92.98	87.78	75.98	59.83	78.78	92.98	65	91.98	95.33	94.53	69.18	83.7	80.6	74.98	44.68	60	77.37	75.21	90.8	86.45
gpt-4o-2024-08-06	other	78.62	93.46	84.74	75.84	58.32	80.74	93.46	65.25	94.54	97.9	93.85	70.61	86.1	84.1	84.84	31.81	60.3	64.51	73.67	92.45	82.02
claude-3-5-sonnet-20240620	other	77.69	93.94	91.78	76.83	46.47	79.42	93.94	61.67	94.71	97.6	94.53	73.66	84.8	80.2	51.48	41.46	59.87	84.8	77.46	92.8	90.02
Meta-Llama-3.1-70B-Instruct	bfloat16	75.11	75.92	86.05	75.33	60.11	78.15	75.92	62.33	89.25	96.97	91.51	73.12	84.7	78	85.06	35.15	60.9	76.81	74.38	89	81.16
gemma-2-27b-it	bfloat16	73.57	78.15	81.44	75.49	54.29	78.49	78.15	62.92	87.03	94.28	91.73	72.58	81.6	79.6	79	29.58	63.2	70.94	74.73	89.55	73.52
gpt-4o-mini-2024-07-18	other	72.31	76.87	80.57	70.77	54.33	78.99	76.87	54.5	87.71	95.12	89.94	63.8	82.8	81.4	80.44	28.22	59.67	68.87	74.85	92.1	70.57
claude-3-haiku-20240307	other	70.32	82.14	74.36	69.13	51.82	74.13	82.14	48.75	77.22	86.87	88.6	70.25	79.4	74.1	76.04	27.6	62.13	67.1	68.92	80.9	66.26
Llama-3.1-Nemotron-70B-Instruct-HF	bfloat16	68.45	68.9	85.4	62.13	47.56	78.27	68.9	41.58	88.05	95.5	91.62	49.46	84.9	81.9	74.45	20.67	60.7	74.93	65.84	85.6	83.13
gemma-2-9b-it	bfloat16	65.62	65.87	74.38	73.05	39.16	75.63	65.87	56.58	82.08	92.13	90.39	69.89	79.9	76.6	50.72	27.6	63.23	59.26	75.33	82.8	64.04
Meta-Llama-3.1-8B-Instruct	bfloat16	55.19	53.11	59.37	66.81	20.81	75.86	53.11	48.33	65.44	78.83	82.79	65.77	79.2	74.5	17.36	24.26	65.23	50.73	70.34	84.5	42.49
gemma-2-2b-it	bfloat16	49.97	44.66	53.7	58.76	24.82	67.93	44.66	42.83	52.13	68.06	67.71	57.71	61.7	74.6	27.37	22.28	65.97	42.89	66.79	69.45	51.72
granite-3.0-8b-instruct	bfloat16	48.36	42.26	50.1	55.81	28.1	65.54	42.26	39.5	50.77	62.84	66.03	55.91	55.7	70.4	39.12	17.08	64.5	38.4	61.8	71.55	48.4
granite-3.0-2b-instruct	bfloat16	37.15	23.44	36.92	46.08	24.01	55.3	23.44	40.17	37.37	47.81	42.79	55.2	24.4	66.8	26.61	21.41	64.63	30.13	46.14	65.35	32.39

Model	Precision	Model URL	Aggregate Score	Grammar (Avg.)	Knowledge (Avg.)	Reasoning (Avg.)	Math (Avg.)	Classification (Avg.)	AGREE	ANLI	ARC-Challenge	ARC-Easy	Belebele	CTKFacts	Czech News	Facebook Comments	GSM8K	Klokanek	Mall Reviews	MMLU	SQAD	Subjectivity	TruthfulQA
Llama-3.1-Nemotron-70B-Instruct-HF	bfloat16	https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF	68.45	68.9	85.4	62.13	47.56	78.27	68.9	41.58	88.05	95.5	91.62	49.46	84.9	81.9	74.45	20.67	60.7	74.93	65.84	85.6	83.13
Meta-Llama-3.1-70B-Instruct	bfloat16	https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct	75.11	75.92	86.05	75.33	60.11	78.15	75.92	62.33	89.25	96.97	91.51	73.12	84.7	78	85.06	35.15	60.9	76.81	74.38	89	81.16
Meta-Llama-3.1-8B-Instruct	bfloat16	https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct	55.19	53.11	59.37	66.81	20.81	75.86	53.11	48.33	65.44	78.83	82.79	65.77	79.2	74.5	17.36	24.26	65.23	50.73	70.34	84.5	42.49
claude-3-5-sonnet-20240620	other	https://docs.anthropic.com/en/docs/about-claude/models#model-names	77.69	93.94	91.78	76.83	46.47	79.42	93.94	61.67	94.71	97.6	94.53	73.66	84.8	80.2	51.48	41.46	59.87	84.8	77.46	92.8	90.02
claude-3-haiku-20240307	other	https://docs.anthropic.com/en/docs/about-claude/models#model-comparison-table	70.32	82.14	74.36	69.13	51.82	74.13	82.14	48.75	77.22	86.87	88.6	70.25	79.4	74.1	76.04	27.6	62.13	67.1	68.92	80.9	66.26
claude-3-opus-20240229	other	https://docs.anthropic.com/en/docs/about-claude/models#model-names	79.07	92.98	87.78	75.98	59.83	78.78	92.98	65	91.98	95.33	94.53	69.18	83.7	80.6	74.98	44.68	60	77.37	75.21	90.8	86.45
gemma-2-27b-it	bfloat16	https://huggingface.co/google/gemma-2-27b-it	73.57	78.15	81.44	75.49	54.29	78.49	78.15	62.92	87.03	94.28	91.73	72.58	81.6	79.6	79	29.58	63.2	70.94	74.73	89.55	73.52
gemma-2-2b-it	bfloat16	https://huggingface.co/google/gemma-2-2b-it	49.97	44.66	53.7	58.76	24.82	67.93	44.66	42.83	52.13	68.06	67.71	57.71	61.7	74.6	27.37	22.28	65.97	42.89	66.79	69.45	51.72
gemma-2-9b-it	bfloat16	https://huggingface.co/google/gemma-2-9b-it	65.62	65.87	74.38	73.05	39.16	75.63	65.87	56.58	82.08	92.13	90.39	69.89	79.9	76.6	50.72	27.6	63.23	59.26	75.33	82.8	64.04
gpt-4o-2024-08-06	other	https://platform.openai.com/docs/models/gpt-4o	78.62	93.46	84.74	75.84	58.32	80.74	93.46	65.25	94.54	97.9	93.85	70.61	86.1	84.1	84.84	31.81	60.3	64.51	73.67	92.45	82.02
gpt-4o-mini-2024-07-18	other	https://platform.openai.com/docs/models/gpt-4o-mini	72.31	76.87	80.57	70.77	54.33	78.99	76.87	54.5	87.71	95.12	89.94	63.8	82.8	81.4	80.44	28.22	59.67	68.87	74.85	92.1	70.57
granite-3.0-2b-instruct	bfloat16	https://huggingface.co/ibm-granite/granite-3.0-2b-instruct	37.15	23.44	36.92	46.08	24.01	55.3	23.44	40.17	37.37	47.81	42.79	55.2	24.4	66.8	26.61	21.41	64.63	30.13	46.14	65.35	32.39
granite-3.0-8b-instruct	bfloat16	https://huggingface.co/ibm-granite/granite-3.0-8b-instruct	48.36	42.26	50.1	55.81	28.1	65.54	42.26	39.5	50.77	62.84	66.03	55.91	55.7	70.4	39.12	17.08	64.5	38.4	61.8	71.55	48.4
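
The category averages and the aggregate score in the tables above appear to be simple unweighted means. The following Python sketch reproduces the claude-3-opus-20240229 row under that assumption. The grouping of tasks into categories is inferred from the published numbers rather than taken from the CzechBench source, and the names `CATEGORIES`, `category_averages`, and `aggregate_score` are illustrative only.

```python
from statistics import mean

# Assumed task-to-category grouping, inferred from the leaderboard numbers
# (not taken from the CzechBench code base).
CATEGORIES = {
    "Grammar": ["AGREE"],
    "Knowledge": ["ARC-Challenge", "ARC-Easy", "MMLU", "TruthfulQA"],
    "Reasoning": ["ANLI", "Belebele", "CTKFacts", "SQAD"],
    "Math": ["GSM8K", "Klokanek"],
    "Classification": [
        "Czech News", "Facebook Comments", "Mall Reviews", "Subjectivity",
    ],
}

# Per-task accuracy (%) for claude-3-opus-20240229, copied from the table.
opus_scores = {
    "AGREE": 92.98, "ANLI": 65.0, "ARC-Challenge": 91.98, "ARC-Easy": 95.33,
    "Belebele": 94.53, "CTKFacts": 69.18, "Czech News": 83.7,
    "Facebook Comments": 80.6, "GSM8K": 74.98, "Klokanek": 44.68,
    "Mall Reviews": 60.0, "MMLU": 77.37, "SQAD": 75.21,
    "Subjectivity": 90.8, "TruthfulQA": 86.45,
}


def category_averages(scores):
    """Unweighted mean of the task scores within each category."""
    return {
        name: mean(scores[task] for task in tasks)
        for name, tasks in CATEGORIES.items()
    }


def aggregate_score(scores):
    """Unweighted mean of the five category averages."""
    return mean(category_averages(scores).values())


if __name__ == "__main__":
    for name, value in category_averages(opus_scores).items():
        print(f"{name}: {value:.2f}")
    print(f"Aggregate: {aggregate_score(opus_scores):.2f}")
```

Under this reading, every category carries equal weight in the aggregate, so the single Grammar task (AGREE) influences the final score as much as the four Knowledge tasks combined.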

Dataset	Language	Task type	Metrics	Samples	Task ID
AGREE	CS (Original)	Subject-Verb Agreement	Acc	627	agree_cs
ANLI	CS (Translated)	Natural Language Inference	Acc, Macro F1	1200	anli_cs
ARC Challenge	CS (Translated)	Knowledge-Based QA	Acc	1172	arc_cs
ARC Easy	CS (Translated)	Knowledge-Based QA	Acc	2376	arc_cs
Belebele	CS (Professional translation)	Reading Comprehension / QA	Acc	895	belebele_cs
CTKFacts	CS (Original)	Natural Language Inference	Acc, Macro F1	558	ctkfacts_cs
Czech News	CS (Original)	News Topic Classification	Acc, Macro F1	1000	czechnews_cs
Facebook Comments	CS (Original)	Sentiment Analysis	Acc, Macro F1	1000	fb_comments_cs
GSM8K	CS (Translated)	Mathematical Inference	EM Acc	1319	gsm8k_cs
Klokánek	CS (Original)	Math/Logical Inference	Acc	808	klokanek_cs
Mall Reviews	CS (Original)	Sentiment Analysis	Acc, Macro F1	3000	mall_reviews_cs
MMLU	CS (Translated)	Knowledge-Based QA	Acc	12408	mmlu_cs
SQAD	CS (Original)	Reading Comprehension / QA	EM Acc, BoW F1	843	sqad_cs
Subjectivity	CS (Original)	Subjectivity Analysis	Acc, Macro F1	2000	subjectivity_cs
TruthfulQA	CS (Translated)	Knowledge-Based QA	Acc	813	truthfulqa_cs
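
Several classification tasks in the table report Macro F1 alongside accuracy. As a rough illustration of why both are useful, the sketch below computes the two metrics from scratch on made-up labels. It follows the standard textbook definitions and is not CzechBench's own scoring code; the function names and the toy labels are hypothetical.

```python
def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    # Toy, imbalanced example: a model that always predicts "pos".
    gold = ["pos"] * 8 + ["neg"] * 2
    pred = ["pos"] * 10
    print(f"Accuracy: {accuracy(gold, pred):.2f}")   # 0.80
    print(f"Macro F1: {macro_f1(gold, pred):.2f}")   # 0.44
```

On imbalanced data, a model that ignores the minority class can still reach high accuracy, while Macro F1 penalizes it because each class contributes equally to the mean.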

Basic Information

Evaluation Process

1. Install CzechBench:

2. Run evaluation

3. Upload results to Leaderboard
