
Failure-Aware RL: Reliable Offline-to-Online Reinforcement Learning with Self-Recovery for Real-World Manipulation

Huanyu Li1,2*    Kun Lei1,2*    Sheng Zang4    Kaizhe Hu1,3    Yongyuan Liang6    Bo An4    Xiaoli Li5    Huazhe Xu1,3
1Shanghai Qi Zhi Institute. 2Shanghai Jiao Tong University. 3IIIS, Tsinghua University. 4Nanyang Technological University. 5A*STAR Institute for Infocomm Research. 6University of Maryland, College Park. *Equal contribution.

Abstract

Post-training algorithms based on deep reinforcement learning can push the limits of robotic models for specific objectives, such as generalizability, accuracy, and robustness. However, intervention-requiring failures (e.g., a robot spilling water or breaking fragile glass) inevitably occur during real-world exploration, hindering the practical deployment of this paradigm. To tackle this, we introduce Failure-Aware Offline-to-Online Reinforcement Learning (FARL), a new paradigm that minimizes failures during real-world reinforcement learning. We create FailureBench, a benchmark that incorporates common failure scenarios requiring human intervention, and propose an algorithm that integrates a world-model-based safety critic and a recovery policy trained offline to prevent failures during online exploration. Extensive simulation and real-world experiments demonstrate that FARL significantly reduces such failures while improving performance and generalization during online reinforcement learning post-training. On average, FARL reduces intervention-requiring failures by 73.1% while improving performance by 11.3% during real-world RL post-training.

Method

Overview of the FARL offline-to-online failure-aware RL pipeline.

FARL is a failure-aware offline-to-online RL framework with two phases:

  • Offline: Pre-train task policy \(\pi_{\text{task}}\), recovery policy \(\pi_{\text{rec}}\), and a world model.
    • Policies: behavior cloning → offline PPO-style fine-tuning.
    • World model: predicts short-horizon rewards, values, and a constraint signal capturing near-future failure risks.
  • Online: Fix \(\pi_{\text{rec}}\) and world model; only fine-tune \(\pi_{\text{task}}\) in the real world.
    • World model rolls out \(\pi_{\text{task}}\) and estimates near-future failure cost \(C_H\).
    • If \(C_H \le \varepsilon_{\text{safe}}\), execute the task action; otherwise, switch to recovery (see the sketch below).
    • The task policy is updated via PPO on safety-filtered transitions.
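As a rough illustration of the online safety gate, the sketch below shows how a world-model failure-cost estimate could switch between the task and recovery policies. The interfaces (`world_model.encode`, `world_model.step`, `world_model.constraint_cost`), the horizon `H`, and the threshold `EPS_SAFE` are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of FARL-style online safety gating (assumed interfaces).

H = 8            # imagined rollout horizon for the world model (assumed)
EPS_SAFE = 0.05  # failure-cost threshold epsilon_safe (assumed)

def estimate_failure_cost(world_model, obs, policy, horizon=H):
    """Accumulate the predicted constraint cost C_H over an imagined rollout."""
    c_h, state = 0.0, world_model.encode(obs)
    for _ in range(horizon):
        action = policy(world_model.decode(state))   # act on imagined observation
        state = world_model.step(state, action)      # latent dynamics (assumed)
        c_h += world_model.constraint_cost(state)    # predicted near-future failure risk
    return c_h

def select_action(obs, pi_task, pi_rec, world_model):
    """Execute the task action only if the imagined rollout looks safe."""
    c_h = estimate_failure_cost(world_model, obs, pi_task)
    if c_h <= EPS_SAFE:
        return pi_task(obs), False    # safe: keep exploring with the task policy
    return pi_rec(obs), True          # risky: hand control to the recovery policy
```

Transitions executed under the recovery policy could then be excluded or down-weighted when forming the PPO batch, which is one way to realize the "safety-filtered transitions" mentioned above.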

FARL in the Real World

We demonstrate how FARL and the baseline behave during real-world online training under identical conditions. Starting from the same pre-trained policy and with the same RL exploration noise level, FARL actively detects risky situations and triggers recovery to avoid intervention-requiring failures, while the baseline frequently fails.


Franka Fragile Push Wall

The robot must push a fragile object to a target behind a wall. Collision with the wall may damage the object and is considered a failure.

Vanilla PPO

FARL


Franka Disturbed Push

The robot must push an object to a target while avoiding a dynamic obstacle (a decorative flower) randomly moved by a human, simulating dynamic, unstructured environmental changes.

Vanilla PPO

FARL


Franka Bounded Soccer

The robot uses a UMI gripper to kick a ball toward a target on an uneven turf surface. If the ball rolls beyond the boundaries due to irregular surface dynamics, it is considered a failure.

Vanilla PPO

FARL


Why is Bounded Soccer Challenging?

Even a scripted policy pushing straight forward on the uneven turf produces a highly unpredictable ball trajectory.

Real-World Training Process

We show recordings of the full real-world reinforcement learning training process (50 episodes) for FARL and the baseline on each real-world task.


Franka Fragile Push Wall

Vanilla PPO

FARL


Franka Disturbed Push

Vanilla PPO

FARL


Franka Bounded Soccer

Vanilla PPO

FARL


Total Intervention-Requiring Failures

Total number of failures on three Franka tasks for Uni-O4 (vanilla PPO) and FARL.

Total number of real-world intervention-requiring failures over the full online training process (50 episodes) on the three Franka tasks. FARL (“Ours”) consistently incurs substantially fewer failures than the Uni-O4 baseline (with vanilla PPO online post-training).

FailureBench for Evaluating Failure-Aware RL

We designed four realistic scenarios that are prone to intervention-requiring failures, forming FailureBench, a benchmark suite built on MetaWorld for evaluating failure-aware RL algorithms.

Bounded Push

Push puck to target within boundary. Crossing the boundary triggers a failure.

Bounded Soccer

Hit ball into goal within boundary. The ball's unpredictable dynamics make it prone to failures.

Fragile Push Wall

Push fragile object to target behind wall. Collision with wall triggers a failure. (Object-to-wall distance shown in corner)

Obstructed Push

Push object to goal while avoiding fragile vase. Collision with vase triggers a failure. (Min arm-to-vase distance shown in corner)
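As a concrete illustration, the failure conditions above can be expressed as simple per-step constraint signals. The functions below are a hypothetical sketch in the spirit of FailureBench, not the benchmark's actual implementation; names such as `puck_xy`, `bounds`, and `obj_to_wall_dist` are assumptions.

```python
# Hypothetical failure indicators for FailureBench-style scenarios (assumed
# observation variables and thresholds; the benchmark may define these differently).

def bounded_push_failure(puck_xy, bounds):
    """Failure if the puck leaves the allowed workspace boundary."""
    (x_min, y_min), (x_max, y_max) = bounds
    x, y = puck_xy
    return not (x_min <= x <= x_max and y_min <= y <= y_max)

def fragile_push_wall_failure(obj_to_wall_dist, contact_eps=0.0):
    """Failure if the fragile object contacts the wall (distance at or below eps)."""
    return obj_to_wall_dist <= contact_eps

def constraint_cost(failed: bool) -> float:
    """Per-step constraint cost c_t = 1 on failure, 0 otherwise, as a safety critic target."""
    return 1.0 if failed else 0.0
```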

Simulation Online Training Demos

We compare typical scenarios during online training for FARL and the baseline Uni-O4 (which uses vanilla PPO for online post-training). The baseline frequently triggers intervention-requiring failures, while FARL successfully recovers from risky situations. In the FARL demos, a red border indicates that the world model predicts a potential failure ahead.

Bounded Push

Baseline

FARL

Bounded Soccer

Baseline

FARL

Fragile Push Wall

Baseline

FARL

Obstructed Push

Baseline

FARL

BibTeX

@misc{li2026failureawarerlreliableofflinetoonline,
      title={Failure-Aware RL: Reliable Offline-to-Online Reinforcement Learning with Self-Recovery for Real-World Manipulation}, 
      author={Huanyu Li and Kun Lei and Sheng Zang and Kaizhe Hu and Yongyuan Liang and Bo An and Xiaoli Li and Huazhe Xu},
      year={2026},
      eprint={2601.07821},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2601.07821}, 
}

Contact

Feel free to contact Huanyu Li if you have any questions about this project.