ReVeal: Self-Evolving Code Agents via Reliable Self-Verification

Yiyang Jin²*, Kunzhao Xu³*, Hang Li¹, Xueting Han¹†, Yanmin Zhou², Cheng Li³, Jing Bai¹

¹Microsoft Research Asia
²Tongji University
³University of Science and Technology of China

Introduction

We propose ReVeal, a multi-turn Reinforcement learning framework that interleaves code generation with explicit self-Verification and tool-based evaluation. ReVeal enables LLMs to autonomously generate test cases, invoke external tools for precise feedback, and improve performance via a customized RL algorithm with dense, per-turn rewards. As a result, ReVeal fosters the co-evolution of a model's generation and verification capabilities through RL training and expands the reasoning boundaries of the base model, as demonstrated by significant gains in Pass@k on LiveCodeBench. It also enables test-time scaling into deeper inference regimes, with code quality continuing to improve as the number of turns increases during inference. These findings highlight the promise of ReVeal as a scalable and effective paradigm for building more robust and autonomous AI agents.

ReVeal is a multi-turn RL framework that enables code agents to engage in an iterative generation-verification loop via RL training. This framework decomposes long-horizon reasoning into alternating generation and verification turns. It introduces dense, verifiable rewards at each turn, enabling fine-grained optimization of both code quality and verification accuracy. To prevent adversarial reward gaming (e.g., generating trivial code to hack verification rewards), ReVeal incorporates robustness mechanisms and a customized RL algorithm tailored for the generation-verification interplay. This turn-level supervision not only explicitly optimizes self-verification and iterative refinement, but also enables effective verification-driven test-time scaling.

Multi-turn rollout: ReVeal performs iterative generation and verification. Instead of relying on the commonly used <think>-<answer> prompting, ReVeal adopts a more structured prompting format that explicitly decouples generation, verification, and tool feedback into an iterative loop using distinct tags. As shown in the example, the model first reasons about the problem under <generation-think> and produces a candidate solution within <generation-answer>. It then initiates the verification phase, constructing a plan under <verification-think> by analyzing potential failure modes, edge conditions, and the intended behavior of the code. The resulting test cases, either derived from the problem description or newly synthesized to expose likely errors, are specified under <verification-answer> for direct execution. The <tool-feedback> section captures execution results, including runtime errors and invalid test cases, as well as the expected output, actual output, and pass/fail judgment for each valid test case. This structured feedback provides fine-grained supervision and guidance for the next generation-verification cycle. Based on this feedback, the model identifies failed cases, diagnoses underlying errors, and revises both its code and its verification plan accordingly. This process continues over multiple turns, allowing the model to progressively refine its outputs through further rounds of generation and verification, enabling self-improvement without requiring external critic models or predefined test cases.
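The loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the ReVeal implementation: the closing tags, the `solve` entry point, the encoding of test cases as a Python literal list of `(args, expected)` pairs, the early stop once every test passes, and the helper names `parse_turn`, `run_tests`, `rollout`, and `make_scripted_policy` are all hypothetical. A scripted policy stands in for the LLM, and `exec` stands in for a sandboxed execution tool.

```python
import ast
import re
from typing import Callable

SECTIONS = ("generation-think", "generation-answer",
            "verification-think", "verification-answer")


def parse_turn(text: str) -> dict:
    """Pull the body of each tagged section out of one model turn."""
    parsed = {}
    for tag in SECTIONS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        parsed[tag] = match.group(1).strip() if match else ""
    return parsed


def run_tests(code: str, tests: list) -> list:
    """Toy stand-in for the external tool: exec the code, call solve() per test."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # unsafe outside a sandbox; illustration only
    except Exception as exc:
        return [{"error": repr(exc), "passed": False}]
    feedback = []
    for args, expected in tests:
        record = {"input": args, "expected": expected}
        try:
            record["actual"] = namespace["solve"](*args)
            record["passed"] = record["actual"] == expected
        except Exception as exc:
            record["error"] = repr(exc)
            record["passed"] = False
        feedback.append(record)
    return feedback


def rollout(policy: Callable[[str], str], problem: str, max_turns: int = 3):
    """Alternate generation and verification, appending tool feedback each turn."""
    context, code, feedback, turn = problem, "", [], 0
    for turn in range(1, max_turns + 1):
        sections = parse_turn(policy(context))
        code = sections["generation-answer"]
        # Assumed test encoding: a literal list of (args, expected) pairs.
        tests = ast.literal_eval(sections["verification-answer"])
        feedback = run_tests(code, tests)
        context += f"\n<tool-feedback>{feedback}</tool-feedback>"
        # Assumed stopping rule: finish once every self-generated test passes.
        if feedback and all(item["passed"] for item in feedback):
            break
    return code, feedback, turn


def make_scripted_policy() -> Callable[[str], str]:
    """Hypothetical policy that replays two canned turns: a bug, then a fix."""
    tests = "[((1, 2), 3), ((0, 0), 0)]"
    bodies = ["return a - b", "return a + b"]
    turns = iter(
        "<generation-think>sum two integers</generation-think>"
        f"<generation-answer>def solve(a, b):\n    {body}</generation-answer>"
        "<verification-think>check a nonzero pair and zeros</verification-think>"
        f"<verification-answer>{tests}</verification-answer>"
        for body in bodies
    )
    return lambda context: next(turns)


code, feedback, turns = rollout(make_scripted_policy(), "Return a + b.")
print(turns, [item["passed"] for item in feedback])  # prints: 2 [True, True]
```

In this toy run the first turn fails one self-generated test, the failure is appended to the context as tool feedback, and the second turn repairs the code. Raising `max_turns` at inference time is the knob that corresponds to allowing more turns.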

Main Results

Model	LiveCodeBench (2025.02-2025.05)			CodeContests		
	Pass@1	∆_↑	∆_↓	Pass@1	∆_↑	∆_↓
Existing Baselines
Qwen2.5-32B-Instruct	24.8	-	-	13.3	-	-
DAPO-Qwen2.5-32B	31.1	-	-	18.5	-	-
Qwen2.5-Coder-32B-Instruct	29.5	-	-	14.6	-	-
w/ critic Qwen2.5-Coder	29.6	2.14	3.04	-	-	-
w/ critic GPT-4o	32.9	4.82	2.50	-	-	-
w/ critic CTRL	33.4	3.75	0.89	-	-	-
ReVeal based on DAPO-Qwen2.5-32B
Single-turn RL	32.8	-	-	21.0	-	-
ReVeal x25	38.7	7.50	0.0	33.6	15.69	0.0
Ablation Study: TAPO with Joint Verifiable Rewards
Single-turn x8 RL w/ outcome reward	36.1	4.69	1.32	27.4	9.24	2.36
ReVeal x8 w/ TAPO with joint reward	37.7	5.62	0.0	30.4	12.30	0.0

Table 1: Performance comparison of ReVeal with baseline methods on LiveCodeBench and CodeContests. Pass@1 indicates the success rate; ∆_↑ and ∆_↓ represent the percentages of incorrect solutions corrected and correct solutions degraded after revision, respectively.
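As a rough illustration of how the three columns relate, the sketch below computes them from per-problem correctness before and after revision. The function name `revision_metrics` is hypothetical, and normalizing ∆_↑ and ∆_↓ by the total number of problems is an assumption: the caption does not state the denominator, so treat this as one possible reading rather than the evaluation code behind Table 1.

```python
def revision_metrics(before: list, after: list) -> dict:
    """Pass@1 after revision plus corrected/degraded rates, all in percent.

    before[i] / after[i] are booleans: was problem i solved before and
    after revision. Assumption: both deltas are normalized by the total
    number of problems.
    """
    if len(before) != len(after) or not before:
        raise ValueError("need two non-empty lists of equal length")
    n = len(before)
    corrected = sum(1 for b, a in zip(before, after) if not b and a)
    degraded = sum(1 for b, a in zip(before, after) if b and not a)
    return {
        "pass@1": 100.0 * sum(after) / n,
        "delta_up": 100.0 * corrected / n,
        "delta_down": 100.0 * degraded / n,
    }


# Four toy problems: one stays solved, one is fixed, one stays wrong, one regresses.
print(revision_metrics([True, False, False, True], [True, True, False, False]))
# prints: {'pass@1': 50.0, 'delta_up': 25.0, 'delta_down': 25.0}
```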

1. ReVeal Enables Test-time Scaling into Deeper Inference Regimes

Although the model is trained with a maximum of three reasoning turns, it continues to improve its solutions when more turns are allowed at inference time, leading to progressively higher code accuracy. For instance, on LiveCodeBench, Pass@1 increases from 34.8% at turn 1 to 36.7% at turn 3, and further rises to 38.7% by turn 25. This demonstrates that reliable self-verification and iterative environment feedback can enable compute scaling into deeper inference regimes, allowing ReVeal to solve previously intractable problems and evolve novel solutions. As a result, ReVeal supports self-improvement beyond the training horizon, enabling strong generalization in long-horizon reasoning during inference. Furthermore, these newly discovered solutions can be distilled back into the code LLM to further enhance its reasoning capabilities through continued training.

2. ReVeal Pushes Beyond the Reasoning Boundaries of the Base Model

We compare DAPO-Qwen2.5-32B and the single-turn RL baseline with ReVeal using Pass@k metrics on LiveCodeBench. The RL baseline outperforms the base model when k < 32, but its performance gain gradually diminishes as k increases. In contrast, ReVeal consistently outperforms both the base model and the RL baseline across all k values from 1 to 128, demonstrating its ability to surpass the reasoning boundaries of the base model. We attribute this improvement to ReVeal's verification-driven exploration: tool-assisted verification provides targeted, execution-based feedback and precise judgments that guide the model to explore better solutions more effectively. With this enhanced exploration capability, the model continually self-evolves and grows beyond its initial reasoning capability during RL training. We believe this approach offers a promising path towards developing self-evolving agents with stronger reasoning capabilities.
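For readers unfamiliar with Pass@k, the sketch below shows the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. Whether the Pass@k numbers reported here were computed with this exact estimator is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n generated samples with c correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 128 samples of which 4 are correct, larger k raises the estimate.
for k in (1, 32, 128):
    print(k, round(pass_at_k(128, 4, k), 4))
```

The per-problem estimates are then averaged over the benchmark. Because the value at large k reflects whether any correct sample exists at all, gains at k = 128 are commonly read as evidence that a method solves problems the base model rarely or never solves.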

3. ReVeal Co-evolves the Model's Generation and Verification Capabilities

Final code accuracy steadily improves throughout training and significantly surpasses the single-turn RL baseline. Moreover, comparing Fig. 4(a) and (b) reveals that final solutions consistently outperform those generated at Turn 1, with the performance gap widening over time. This trend indicates that as the model's verification ability strengthens, multi-turn refinement enables the exploration of better solutions, progressively enhancing its capacity to generate and refine code. During RL training, test case accuracy increases substantially, rising from approximately 50% to nearly 88%, as shown in Fig. 3(d). Additionally, for correctly generated test cases, the model achieves over 85% accuracy in judging code correctness. This demonstrates that during inference, the model can reliably generate valid test cases and effectively leverage tools to produce accurate verification signals, which are critical for continuous improvements in code quality. These results provide strong evidence that ReVeal jointly and effectively optimizes both generation and verification, enabling the model to evolve its reasoning capabilities throughout training.

@misc{jin2025revealselfevolvingcodeagents,
  title={ReVeal: Self-Evolving Code Agents via Reliable Self-Verification},
  author={Yiyang Jin and Kunzhao Xu and Hang Li and Xueting Han and Yanmin Zhou and Cheng Li and Jing Bai},
  year={2025},
  eprint={2506.11442},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2506.11442},
}
