arxiv:2606.12384

APPO: Agentic Procedural Policy Optimization

Published on Jun 10

Submitted by Wang Yong on Jun 15

Authors:

Xucong Wang ,

Abstract

APPO is an agentic reinforcement learning method that improves multi-turn tool-use capabilities by refining where rollouts branch and how credit is assigned, using fine-grained decision points and procedure-level advantage scaling.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Recent advances in agentic Reinforcement Learning (RL) have substantially improved the multi-turn tool-use capabilities of large language model agents. However, most existing methods assign credit over coarse heuristic units, such as tool-call boundaries or fixed workflows, making it difficult to identify which intermediate decisions influence downstream outcomes. In this work, we study agentic RL from two perspectives: where to branch and how to assign credit after branching. Our pilot analysis shows that influential decision points are broadly distributed throughout the generated sequence rather than concentrated at tool calls, while token entropy alone does not reliably reflect their impact on final outcomes. Motivated by these observations, we propose Agentic Procedural Policy Optimization (APPO), which shifts branching and credit assignment from coarse interaction units to fine-grained decision points in the sequence. APPO selects branching locations using a Branching Score that combines token uncertainty with policy-induced likelihood gains of subsequent continuations, enabling more targeted exploration while filtering out spurious high-entropy positions. It further introduces procedure-level advantage scaling to better distribute credit across branched rollouts. Experiments on 13 benchmarks show that APPO consistently improves strong agentic RL baselines by nearly 4 points while keeping tool calls efficient and preserving behavioral interpretability.
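To make the Branching Score idea concrete, here is a minimal sketch of one way the two signals named in the abstract could be combined. The abstract does not give the formula, so everything below is an assumption made for illustration: the product form, the use of a reference policy's log-probabilities to measure "likelihood gain", the `horizon` parameter, and all function names are hypothetical and are not taken from the paper or its repository.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def continuation_gain(policy_logps: Sequence[float],
                      ref_logps: Sequence[float],
                      t: int, horizon: int) -> float:
    """Mean log-likelihood gain of the tokens that follow position t.

    ASSUMPTION: "gain" is taken to be the current policy's log-prob minus a
    reference policy's log-prob, averaged over the next `horizon` tokens.
    """
    window = range(t + 1, min(t + 1 + horizon, len(policy_logps)))
    if len(window) == 0:
        return 0.0
    return sum(policy_logps[i] - ref_logps[i] for i in window) / len(window)


def branching_scores(dists: Sequence[Sequence[float]],
                     policy_logps: Sequence[float],
                     ref_logps: Sequence[float],
                     horizon: int = 4) -> List[float]:
    """Hypothetical score: entropy times the non-negative continuation gain.

    ASSUMPTION: a product is used so that a high-entropy position with no
    downstream gain scores zero, mimicking the "filtering" described above.
    """
    scores = []
    for t, probs in enumerate(dists):
        gain = continuation_gain(policy_logps, ref_logps, t, horizon)
        scores.append(token_entropy(probs) * max(gain, 0.0))
    return scores


def top_k_positions(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring positions, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy sequence of four positions (made-up numbers, not experimental data).
dists = [[0.5, 0.5], [1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
policy_logps = [-0.7, -0.1, -0.7, -0.2]
ref_logps = [-0.7, -0.1, -0.7, -1.2]

scores = branching_scores(dists, policy_logps, ref_logps, horizon=1)
print(top_k_positions(scores, k=1))
```

In this toy input, positions 0 and 2 have identical entropy, but only position 2 is followed by a token whose likelihood rose under the current policy, so only position 2 is selected. That mirrors the abstract's point that entropy alone is not a reliable indicator of influence; the actual scoring rule in APPO may differ.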

Community

seashell11

Paper submitter 3 days ago

APPO: Agentic Procedural Policy Optimization
GitHub: https://github.com/AMAP-ML/APPO

noahml

3 days ago

Neat paper. I like the shift away from relying on tool-call boundaries for credit assignment. It makes a lot of sense that agentic decisions are more distributed than standard RL workflows typically assume, so targeting fine-grained decision points seems like a natural evolution for improving performance.

I'm curious, how does the Branching Score perform in scenarios where the task requires long-range reasoning versus quick tool interaction?

I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/72f53107-572c-4cd8-baa0-265b719c5f14

avahal

1 day ago

the most interesting bit for me is the Branching Score, which neatly combines local token uncertainty with the policy-induced future continuation gains. that blend feels like a clean way to steer exploration toward genuinely consequential decision points rather than tool-call boundaries. one worry i have is how sensitive the Branching Score is to the horizon used for the future gains, and whether early branches could bias the downstream advantage estimates. the arxivlens breakdown helped me parse the method details; it covers how the Branching Score ties into branching and advantage signals (https://arxivlens.com/PaperView/Details/appo-agentic-procedural-policy-optimization-2625-63308724). a quick ablation showing how much of the gain comes from the uncertainty term versus the future-gain term would go a long way toward convincing me this is really the lever here.