arxiv:2606.02373

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

Published on Jun 1 · Submitted by Patrick Jiang on Jun 2 · chromadb chroma

Abstract

A 20B search agent trained with reinforcement learning within a stateful search framework demonstrates superior retrieval performance across multiple domains by separating semantic decision-making from environmental bookkeeping.

Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have actually been checked. We argue that this formulation puts too much routine state management inside the policy: reinforcement learning is forced to optimize both semantic search decisions and recoverable bookkeeping that the environment can maintain more reliably. We introduce Harness-1, a 20B search agent (retrieval subagent) trained with reinforcement learning inside a stateful search harness. The harness maintains environment-side working memory, including a candidate pool, an importance-tagged curated set, compact evidence links, verification records, compressed and deduplicated observations, and budget-aware context rendering. The policy retains the semantic decisions: what to search, which documents to keep or discard, what to verify, and when to stop. Across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA, Harness-1 achieves 0.730 average curated recall, outperforming the next strongest open search subagent by +11.4 points and remaining competitive with much larger frontier-model searchers. Its gains are especially strong on held-out transfer benchmarks, suggesting that reinforcement learning over explicit search state can produce retrieval behaviors that generalize beyond the training domains. Our code is available at https://github.com/pat-jj/harness-1.

Community

🔥 Introducing Harness-1 🔥

Harness-1 is a 20B open search agent trained with state-externalizing harnesses, matching or outperforming several much larger frontier-model searchers on difficult retrieval tasks.

Harness-1 performance

Motivation

Many search agents are trained over growing transcripts. As a result, the model has to search while also doing a lot of implicit bookkeeping:

  • remembering candidate documents,
  • tracking useful evidence,
  • maintaining verification status,
  • recalling search history,
  • and avoiding repeatedly revisiting what has already been seen.

This makes the model responsible not only for search decisions, but also for managing the entire search state inside its context.

Key idea

Harness-1 separates these responsibilities.

The policy still makes the semantic decisions:

  • what to search,
  • what to inspect,
  • what to curate,
  • what to verify,
  • and when to stop.

But the harness maintains the recoverable search state around those decisions, including candidate pools, curated evidence, evidence links, verification records, and budget-aware context rendering.
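To make the division of labor concrete, here is a minimal sketch of what environment-side search state could look like. It is not the Harness-1 implementation: the class name `SearchState`, its fields, its methods, and the character-based budget are all assumptions made for illustration. The policy would call methods like these; the harness would store the results and render a compact view back into the context.

```python
from dataclasses import dataclass, field


@dataclass
class SearchState:
    """Hypothetical harness-side working memory (all names are illustrative)."""

    candidates: dict[str, str] = field(default_factory=dict)  # doc_id -> snippet
    curated: dict[str, int] = field(default_factory=dict)  # doc_id -> importance tag
    evidence_links: dict[str, list[str]] = field(default_factory=dict)  # claim -> doc_ids
    verified: dict[str, bool] = field(default_factory=dict)  # claim -> check result

    def add_candidate(self, doc_id: str, snippet: str) -> bool:
        """Record a search hit once; repeat hits for the same doc_id are dropped."""
        if doc_id in self.candidates:
            return False
        # Collapse whitespace so the stored observation stays compact.
        self.candidates[doc_id] = " ".join(snippet.split())
        return True

    def curate(self, doc_id: str, importance: int) -> None:
        """Move a known candidate into the curated set with an importance tag."""
        if doc_id not in self.candidates:
            raise KeyError(f"unknown document: {doc_id}")
        self.curated[doc_id] = importance

    def link_evidence(self, claim: str, doc_id: str) -> None:
        """Attach a supporting document to a claim."""
        self.evidence_links.setdefault(claim, []).append(doc_id)

    def record_verification(self, claim: str, ok: bool) -> None:
        """Store whether a claim has been checked and what the outcome was."""
        self.verified[claim] = ok

    def render(self, char_budget: int) -> str:
        """Render curated documents, most important first, within a character budget."""
        lines: list[str] = []
        used = 0
        ranked = sorted(self.curated.items(), key=lambda kv: (-kv[1], kv[0]))
        for doc_id, importance in ranked:
            line = f"[{importance}] {doc_id}: {self.candidates[doc_id]}"
            if used + len(line) > char_budget:
                break  # budget exhausted; lower-importance items are omitted
            lines.append(line)
            used += len(line)
        return "\n".join(lines)
```

In this sketch the policy never has to remember which documents it has already seen: a repeated `add_candidate` call is a no-op, and `render` decides what fits into the context given the budget.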

With this setup, RL does not need to teach the model to manage an unstructured transcript from scratch. Instead, it trains the model to operate over a structured search workspace.

Results

Across 8 difficult retrieval benchmarks, Harness-1 reaches 0.730 average curated recall, outperforming the next strongest open search subagent by +11.4 points, while remaining competitive with much larger frontier-model searchers.
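The post does not spell out how curated recall is computed. One plausible reading, sketched below as an assumption rather than the paper's definition, is the fraction of gold relevant documents that end up in the agent's final curated set, averaged over queries within each benchmark and then over benchmarks. The function names and the toy values are hypothetical and are not results from the paper.

```python
def curated_recall(curated: set[str], gold: set[str]) -> float:
    """Fraction of gold relevant documents present in the curated set (assumed definition)."""
    if not gold:
        raise ValueError("gold set must be non-empty")
    return len(curated & gold) / len(gold)


def average_recall(per_benchmark: dict[str, list[float]]) -> float:
    """Macro-average: mean over queries in each benchmark, then mean over benchmarks."""
    benchmark_means = [sum(scores) / len(scores) for scores in per_benchmark.values()]
    return sum(benchmark_means) / len(benchmark_means)


# Toy example with made-up document IDs and scores.
single_query = curated_recall({"a", "b", "x"}, {"a", "b", "c", "d"})  # 2 of 4 gold docs found
overall = average_recall({"web": [1.0, 0.5], "finance": [0.25]})
```

Under this reading, curating an irrelevant document (`"x"` above) does not lower recall; a precision-style metric would be needed to penalize it.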

The most interesting result to us is transfer: the gains are substantially larger on held-out transfer benchmarks than on source-family benchmarks. Ablations also show that removing the harness mechanisms changes agent behavior and hurts recall.

Takeaway

For search agents, the model is not the whole learning system.

The harness (memory layout, action space, curation interface, verification records, and context rendering) is part of what RL learns to use.
