TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents Paper • 2606.28480 • Published 5 days ago • 42
Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent Paper • 2606.30616 • Published 1 day ago • 64
Agentic Abstention: Do Agents Know When to Stop Instead of Act? Paper • 2606.28733 • Published 4 days ago • 113
CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies Paper • 2606.16613 • Published 16 days ago • 9
The Verification Horizon: No Silver Bullet for Coding Agent Rewards Paper • 2606.26300 • Published 7 days ago • 45
OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning Paper • 2606.26790 • Published 6 days ago • 51
Qwen-AgentWorld: Language World Models for General Agents Paper • 2606.24597 • Published 8 days ago • 141
Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation Paper • 2606.18844 • Published 14 days ago • 18
EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions Paper • 2606.23654 • Published 9 days ago • 78
PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems Paper • 2606.22388 • Published 10 days ago • 95
EvoEmbedding: Evolvable Representations for Long-Context Retrieval and Agentic Memory Paper • 2606.21649 • Published 12 days ago • 32
SkillHarness: Harnessing Safe Skills for Computer-Use Agents Paper • 2606.20636 • Published 29 days ago • 20
Notes2Skills: From Lab Notebooks to Certainty-Aware Scientific Agent Skills Paper • 2606.11897 • Published 21 days ago • 11