Abstract
APPO is an agentic reinforcement learning method that improves multi-turn tool-use capabilities by refining branching decisions and credit assignment through fine-grained decision points and procedure-level advantage scaling.
Recent advances in agentic Reinforcement Learning (RL) have substantially improved the multi-turn tool-use capabilities of large language model agents. However, most existing methods assign credit over coarse heuristic units, such as tool-call boundaries or fixed workflows, making it difficult to identify which intermediate decisions influence downstream outcomes. In this work, we study agentic RL from two perspectives: where to branch and how to assign credit after branching. Our pilot analysis shows that influential decision points are broadly distributed throughout the generated sequence rather than concentrated at tool calls, while token entropy alone does not reliably reflect their impact on final outcomes. Motivated by these observations, we propose Agentic Procedural Policy Optimization (APPO), which shifts branching and credit assignment from coarse interaction units to fine-grained decision points in the sequence. APPO selects branching locations using a Branching Score that combines token uncertainty with policy-induced likelihood gains of subsequent continuations, enabling more targeted exploration while filtering out spurious high-entropy positions. It further introduces procedure-level advantage scaling to better distribute credit across branched rollouts. Experiments on 13 benchmarks show that APPO consistently improves strong agentic RL baselines by nearly 4 points, while keeping tool calls efficient and maintaining behavioral interpretability.
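The abstract does not give the formula for the Branching Score, so the following is only a minimal sketch of the general idea, under stated assumptions: the score at each position is taken to be the token entropy multiplied by a non-negative "likelihood gain", where the gain is assumed to be the log-likelihood of the subsequent continuation under the current policy minus that under a reference policy. The product form, the clipping at zero, and the names `token_entropy`, `branching_scores`, and `top_k_positions` are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def branching_scores(
    step_probs: Sequence[Sequence[float]],
    cont_logp_policy: Sequence[float],
    cont_logp_reference: Sequence[float],
) -> List[float]:
    """Hypothetical score per position: entropy * max(0, likelihood gain).

    ASSUMPTION: the gain is the continuation log-likelihood under the
    current policy minus that under a reference policy. The paper's
    actual definition may differ.
    """
    scores = []
    for probs, lp_new, lp_ref in zip(
        step_probs, cont_logp_policy, cont_logp_reference
    ):
        gain = max(0.0, lp_new - lp_ref)  # clip negative gains (assumption)
        scores.append(token_entropy(probs) * gain)
    return scores


def top_k_positions(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring positions, in sequence order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy example: position 0 has high entropy but zero gain, so it is
# filtered out even though entropy alone would select it.
step_probs = [[0.5, 0.5], [0.9, 0.1], [0.25, 0.25, 0.25, 0.25]]
cont_logp_policy = [-1.0, -2.0, -3.0]
cont_logp_reference = [-1.0, -3.0, -3.5]

scores = branching_scores(step_probs, cont_logp_policy, cont_logp_reference)
branch_at = top_k_positions(scores, k=1)
```

In this toy setup, a multiplicative combination means a position must be both uncertain and consequential to be chosen, which is one way to read the abstract's claim about filtering out spurious high-entropy positions. An additive or weighted combination would behave differently.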
Community
Neat paper. I like the shift away from relying on tool-call boundaries for credit assignment. It makes a lot of sense that agentic decisions are more distributed than standard RL workflows typically assume, so targeting fine-grained decision points seems like a natural evolution for improving performance.
I'm curious, how does the Branching Score perform in scenarios where the task requires long-range reasoning versus quick tool interaction?
I made a podcast on it with ResearchPod, it makes it easy to get the key concepts on the go:
https://researchpod.app/episode/72f53107-572c-4cd8-baa0-265b719c5f14
the most interesting bit for me is the Branching Score, which neatly combines local token uncertainty with the policy-induced future continuation gains. that blend feels like a clean way to steer exploration toward genuinely consequential decision points rather than tool-call boundaries. one worry i have is how sensitive bs is to the horizon used for the future gains, and whether early branches could bias the downstream advantage estimates. the arxivlens breakdown helped me parse the method details, it covers how bs ties into branching and advantage signals (https://arxivlens.com/PaperView/Details/appo-agentic-procedural-policy-optimization-2625-63308724). a quick ablation showing how much of the gain comes from the uncertainty term versus the future-gain term would go a long way to convincing me this is really the lever here.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping (2026)
- PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning (2026)
- Efficient Agentic Reinforcement Learning with On-Policy Intrinsic Knowledge Boundary Enhancement (2026)
- Tool-Aware Optimization with Entropy Guidance for Efficient Agentic Reinforcement Learning (2026)
- AIPO: Learning to Reason from Active Interaction (2026)
- Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning (2026)
- GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning (2026)
Get this paper in your agent:
hf papers read 2606.12384
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash