---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- reinforcement-learning
- multi-agent
- self-play
- reasoning
base_model:
- Qwen/Qwen3-4B
---

# MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs

### 🎉 Accepted by ICLR 2026

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![Python](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/release/python-390/) [![arXiv](https://img.shields.io/badge/arXiv-2510.15414-b31b1b.svg)](https://arxiv.org/abs/2510.15414)

[**🌐 Project Page**](https://thu-nics.github.io/MARSHAL/) | [**📝 Paper**](https://arxiv.org/abs/2510.15414) | [**💻 Code**](https://github.com/thu-nics/MARSHAL)

---

## 🤗 Model Description

This is the **generalist model** of the **MARSHAL** framework, initialized from **Qwen3-4B**. It has been trained via self-play on a diverse set of strategic games (**Tic-Tac-Toe**, **Kuhn Poker**, and **Mini Hanabi**) that span both competitive and cooperative dynamics, as well as perfect and imperfect information settings.

## 📖 Overview

We introduce **MARSHAL**, an end-to-end reinforcement learning framework designed to incentivize **M**ulti-**A**gent **R**easoning through **S**elf-play wit**H** str**A**tegic **L**LMs in a diverse range of competitive and cooperative games.

MARSHAL addresses the challenge of credit assignment in multi-agent multi-turn self-play through two core mechanisms:

1. **Turn-level Advantage Estimator:** Enables fine-grained credit assignment, allowing the model to accurately attribute long-term outcomes to individual actions and provide learning signals across multiple turns.
2. **Agent-specific Advantage Normalization:** Stabilizes the training process by calibrating advantage estimates relative to the performance of each agent.

### 🔥 Key Results

By leveraging self-play across strategic games, MARSHAL (based on Qwen3-4B) demonstrates notable generalization capabilities:

- **Strategic Games:** Achieves up to **28.7%** performance improvement on held-out games.
- **Reasoning Benchmarks:** When integrated into leading multi-agent systems (MASs), MARSHAL yields consistent gains of up to
  - **+10.0%** on AIME
  - **+7.6%** on GPQA-Diamond
  - **+3.5%** on average across all tested benchmarks.

### 🎮 Featured Games

- **Competitive, perfect-information:** Tic-Tac-Toe, Connect Four.
- **Competitive, imperfect-information:** Kuhn Poker, Leduc Hold'em.
- **Cooperative, imperfect-information:** Mini Hanabi, Simple Hanabi.

---

## 🚀 Method
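
The two mechanisms named in the Overview can be illustrated with a small, self-contained sketch. This is not the authors' implementation: it assumes that the turn-level signal is a discounted return-to-go computed per turn, and that agent-specific normalization means standardizing those returns separately for each agent role. The function names, the `gamma` and `eps` parameters, and the `(agent_id, rewards)` trajectory layout are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def turn_level_returns(rewards, gamma=1.0):
    """Discounted return-to-go for every turn of one agent's trajectory.

    Assumption: one scalar reward per turn (often zero until the game ends).
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each turn accumulates all rewards that follow it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def agent_normalized_advantages(trajectories, gamma=1.0, eps=1e-8):
    """Standardize turn-level returns separately for each agent role.

    trajectories: list of (agent_id, per_turn_rewards) pairs (hypothetical layout).
    Returns one list of per-turn advantages per trajectory, in input order.
    """
    per_traj = [(agent, turn_level_returns(r, gamma)) for agent, r in trajectories]

    # Pool the turn-level returns of each agent across the whole batch.
    pooled = defaultdict(list)
    for agent, rets in per_traj:
        pooled[agent].extend(rets)

    # Mean and population standard deviation per agent role.
    stats = {agent: (mean(v), pstdev(v)) for agent, v in pooled.items()}

    return [
        [(g - stats[agent][0]) / (stats[agent][1] + eps) for g in rets]
        for agent, rets in per_traj
    ]


# Toy batch: two Tic-Tac-Toe-like games seen from both roles.
batch = [
    ("X", [0.0, 0.0, 1.0]),   # X wins game 1
    ("O", [0.0, -1.0]),       # O loses game 1
    ("X", [0.0, 0.0, -1.0]),  # X loses game 2
    ("O", [0.0, 1.0]),        # O wins game 2
]
for (agent, _), adv in zip(batch, agent_normalized_advantages(batch)):
    print(agent, [round(a, 3) for a in adv])
```

Normalizing per role rather than over the whole batch keeps a systematic gap between roles (for example, a first-mover advantage) from dominating the learning signal, which is one plausible reading of "calibrating advantage estimates relative to the performance of each agent".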

> **Figure 1: Overview of MARSHAL.**
> **Left:** Generating player trajectories via self-play in strategic games.
> **Middle:** Naive advantage estimation (e.g., GRPO) often fails in multi-turn settings.
> **Right:** MARSHAL's advantage estimation ensures accurate credit assignment for multi-turn, multi-agent interactions.

## 📊 Results

> **Figure 2: Performance Comparison.**
> Evaluation of MARSHAL against baselines on strategic games and reasoning benchmarks. MARSHAL not only masters strategic games but also generalizes effectively to complex reasoning tasks within multi-agent frameworks like MAD and AutoGen.

---

## 🖊️ Citation

If you find our work helpful, please cite:

```bibtex
@misc{yuan2025marshal,
  title={MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs},
  author={Huining Yuan and Zelai Xu and Zheyue Tan and Xiangmin Yi and Mo Guang and Kaiwen Long and Haojia Hui and Boxun Li and Xinlei Chen and Bo Zhao and Xiao-Ping Zhang and Chao Yu and Yu Wang},
  year={2025},
  eprint={2510.15414},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2510.15414},
}
```