WorldCuisines Leaderboard

Which Visual Language Model (VLM) is the BEST at understanding culture through food?
🏆 Welcome to the WorldCuisines leaderboard! The leaderboard evaluates VLMs' multilinguality and multicultural understanding based on dishes from around the world.

We provide a small test set and a large test set. Both test sets contain the following tasks:
  • Dish Name (No Context): predict the name of a dish based on its image and question without any context.
  • Dish Name (Contextualized): predict the name of a dish based on its image and question with additional context information.
  • Dish Name (Adversarial): predict the name of a dish based on its image and question with adversarial context.
  • Location: predict the location where the dish originated and is commonly consumed, given the dish image, question, and context.

Each test set has two settings:

  • MCQ: multiple-choice questions
  • OEQ: open-ended questions
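
Taken together, the two test sets, four tasks, and two settings form a grid of evaluation configurations. The sketch below enumerates that grid in Python. The identifiers (`TEST_SETS`, `TASKS`, `SETTINGS`, `evaluation_grid`, `mcq_accuracy`) are illustrative and do not come from the WorldCuisines repository. The scoring function assumes that MCQ answers are scored by exact-match accuracy, which this page does not state; see the GitHub README for the actual metrics.

```python
from itertools import product

# Names follow the leaderboard description; the identifiers are hypothetical.
TEST_SETS = ["small", "large"]
TASKS = [
    "Dish Name (No Context)",
    "Dish Name (Contextualized)",
    "Dish Name (Adversarial)",
    "Location",
]
SETTINGS = ["MCQ", "OEQ"]


def evaluation_grid():
    """Return every (test set, task, setting) combination as a list of tuples."""
    return list(product(TEST_SETS, TASKS, SETTINGS))


def mcq_accuracy(predictions, answers):
    """Assumed metric: percentage of predicted options that match the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)
```

For example, `len(evaluation_grid())` gives the number of configurations a full evaluation run would cover under these assumptions.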

How to evaluate your model and submit your results?
Please refer to the guidelines in the GitHub README to evaluate your own model (soon to be released).

โ„น๏ธ The model utilizes an optimized prompt (Check our repository for details) instead of the original one.

Model                   Avg    Dish Name (No Context)  Dish Name (Contextualized)  Dish Name (Adversarial)  Location
Llama 3.2 Instruct 90B  81.17  78.17                   90.43                       82.23                    56.73