Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models
Embodied-R1.5 Team
Yifu Yuan✻, Yaoting Huang, Xianze Yao, Shuoheng Zhang, Linqi Han, Yutong Li, Pengyi Li, Jiangeng Sun, Wenting Jia, Yucheng Hu, Yuhao Liu, Ruihao Liao, Qiyu Wu, Yuxiao Li, Zhao Zhang, Zibin Dong, Fei Ni, Yan Zheng, Shuyang Gu, Yi Ma✻, Hongyao Tang✻, Han Hu, Jianye Hao
✻ Project Leader. Corresponding authors: Shuyang Gu, Yi Ma, Hongyao Tang, Jianye Hao
yuanyf@tju.edu.cn
Embodied Foundation Model · General Physical Intelligence · Unified Embodied Reasoning
Highlight
TL;DR

Embodied-R1.5 is a unified Embodied Foundation Model (EFM) that integrates comprehensive embodied reasoning capabilities — spanning embodied cognition, task planning, correction, and pointing — within a single 8B-parameter architecture toward general physical intelligence. It achieves SOTA on 16 out of 24 embodied VLM benchmarks, surpassing Gemini-Robotics-ER-1.5 and GPT-5.4, and can be fine-tuned into a VLA that outperforms π₀.₅ across 4 popular manipulation benchmark suites.

Key Contributions
1. Unified Embodied Foundation Model

We formalize the capability requirements of an EFM and demonstrate that embodied cognition, task planning and correction, and pointing and location can coexist and reinforce each other within a single 8B-parameter architecture, eliminating the fragmented multi-model paradigm.

2. Large-Scale Data System & Multi-Task Balanced RL

Leveraging three automated data construction pipelines, we build a large-scale data system of over 15B tokens, and design a multi-task balanced RL recipe with output-type-specific rewards that resolves interference inherent in heterogeneous multi-task training.
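To make the idea of output-type-specific rewards concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the three output types, the reward shapes, and the per-type standardization used as a stand-in for "balancing" are hypothetical and are not taken from the Embodied-R1.5 training recipe.

```python
import math
from statistics import mean, pstdev

# Illustrative assumptions only: these reward shapes are NOT the paper's definitions.

def choice_reward(pred: str, target: str) -> float:
    """Return 1.0 when a multiple-choice answer matches exactly (case-insensitive)."""
    return 1.0 if pred.strip().lower() == target.strip().lower() else 0.0

def point_reward(pred, target_box) -> float:
    """Return 1.0 when the predicted (x, y) lies inside the box (x0, y0, x1, y1)."""
    x, y = pred
    x0, y0, x1, y1 = target_box
    return 1.0 if x0 <= x <= x1 and y0 <= y <= y1 else 0.0

def trace_reward(pred, target, scale: float = 100.0) -> float:
    """Return exp(-RMSE / scale) for two equal-length 2D traces (1.0 means identical)."""
    sq = [(px - tx) ** 2 + (py - ty) ** 2 for (px, py), (tx, ty) in zip(pred, target)]
    return math.exp(-math.sqrt(sum(sq) / len(sq)) / scale)

# Dispatch table: one reward function per output type (hypothetical type names).
REWARD_BY_TYPE = {"choice": choice_reward, "point": point_reward, "trace": trace_reward}

def score(sample: dict) -> float:
    """Score one rollout with the reward that matches its output type."""
    return REWARD_BY_TYPE[sample["type"]](sample["pred"], sample["target"])

def balanced_advantages(samples: list[dict]) -> list[float]:
    """Standardize rewards within each output type so no single type dominates the update."""
    groups: dict[str, list[tuple[int, float]]] = {}
    for index, sample in enumerate(samples):
        groups.setdefault(sample["type"], []).append((index, score(sample)))
    advantages = [0.0] * len(samples)
    for items in groups.values():
        rewards = [reward for _, reward in items]
        mu, sigma = mean(rewards), pstdev(rewards)
        for index, reward in items:
            advantages[index] = (reward - mu) / (sigma + 1e-6)
    return advantages
```

Standardizing within each type is one simple way to keep binary rewards (choice, point) and continuous rewards (trace) on a comparable scale; the actual recipe may balance tasks differently.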

3. Planner-Grounder-Corrector Closed-Loop Framework

We propose a Planner-Grounder-Corrector (PGC) closed-loop framework in which a single model orchestrates the full autonomy stack, from high-level task decomposition and precise spatial grounding to autonomous self-correction over long-horizon tasks.

4. SOTA Results & Full Open-Source Ecosystem

Embodied-R1.5 achieves SOTA on 16 of 24 embodied VLM benchmarks, surpassing Gemini-Robotics-ER-1.5 and GPT-5.4. Fine-tuned as a VLA, it outperforms π₀.₅ across 4 manipulation benchmark suites. We fully open-source model weights, datasets, training code, and EmbodiedEvalKit.

Embodied VLM Evaluation

Per-benchmark scores across all evaluated tasks.

VLA Benchmark Comparisons

Embodied-R1.5-VLA is built upon Embodied-R1.5 as the backbone, augmented with an action head for VLA training — without large-scale action pretraining. We demonstrate that broad improvements in embodied capabilities effectively transfer to downstream VLA manipulation performance.

Models
Embodied-R1.5 Framework
Closed-Loop Autonomy Framework for Long-horizon Tasks

Embodied-R1.5 integrates Planner, Grounder, and Corrector into a unified closed-loop pipeline, where a single model orchestrates the full autonomy stack — from high-level task decomposition and precise spatial grounding to autonomous reflection and self-correction. This enables complex, long-horizon embodied tasks across diverse robotic embodiments without human intervention.
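The control flow of such a closed loop can be sketched as follows. This is a hedged illustration, not the released implementation: `plan`, `ground`, `execute`, and `check` are hypothetical callables, and in the framework described above the planning, grounding, and checking roles would all be served by the same model under different prompts.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class StepResult:
    subtask: str
    point: tuple[int, int]
    attempts: int

def run_closed_loop(
    instruction: str,
    plan: Callable[[str], list[str]],               # Planner: instruction -> ordered subtasks
    ground: Callable[[str], tuple[int, int]],       # Grounder: subtask -> image point
    execute: Callable[[str, tuple[int, int]], None],  # low-level controller (assumed given)
    check: Callable[[str], bool],                   # Corrector: did the subtask succeed?
    max_retries: int = 2,
) -> list[StepResult]:
    """Run plan -> ground -> execute -> check, retrying a subtask when the check fails."""
    results = []
    for subtask in plan(instruction):
        for attempt in range(1, max_retries + 2):
            point = ground(subtask)   # re-ground on every attempt, since the scene may have changed
            execute(subtask, point)
            if check(subtask):
                results.append(StepResult(subtask, point, attempt))
                break
        else:
            # Every attempt failed; a fuller system might replan instead of aborting.
            raise RuntimeError(f"subtask failed after retries: {subtask}")
    return results
```

The retry-on-failure branch is the simplest possible form of correction; the Corrector described above also reflects on the failure, which this sketch does not model.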

Core Result

Compared with its predecessor, Embodied-R1.5 features comprehensive advancements in cognition, planning, and error correction, while significantly refining its pointing and localization precision. It establishes a new SOTA on 16 of 24 embodied VLM benchmarks, with its mean performance markedly exceeding that of leading general-purpose and embodied models. Notably, Embodied-R1.5 demonstrates a clear competitive advantage over Gemini-Robotics-ER-1.5 across all evaluated capability domains.

Embodied VLM Leaderboard

Comprehensive evaluation across Embodied Planning and Correction (4 Tasks), Embodied Pointing and Location (9 Tasks), and Embodied Cognition and Spatial Reasoning (8 Tasks).

Visual trace prediction evaluation across ShareRobot-V (RMSE↓, DFD↓), VABench-V (RMSE↓, DFD↓), and PIO-S3 (GPT-Score↑).
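For readers unfamiliar with the two trace metrics, the sketch below shows one common way to compute them. It assumes that DFD stands for the discrete Fréchet distance and that RMSE is taken over per-point Euclidean errors between equal-length traces; the benchmarks' exact resampling and normalization steps are not specified here and may differ.

```python
import math

Point = tuple[float, float]

def rmse(pred: list[Point], target: list[Point]) -> float:
    """Root of the mean squared Euclidean error between corresponding trace points."""
    if len(pred) != len(target):
        raise ValueError("traces must have the same number of points")
    squared = [(px - tx) ** 2 + (py - ty) ** 2 for (px, py), (tx, ty) in zip(pred, target)]
    return math.sqrt(sum(squared) / len(squared))

def discrete_frechet(p: list[Point], q: list[Point]) -> float:
    """Discrete Fréchet distance via the standard dynamic program (O(len(p) * len(q)))."""
    n, m = len(p), len(q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), d)
    return ca[n - 1][m - 1]
```

Unlike RMSE, the discrete Fréchet distance does not require the two traces to have the same number of points, and it respects the ordering of points along each trace. Lower is better for both.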

General Vision Benchmarks

Embodied-R1.5 maintains strong general vision capabilities while acquiring comprehensive embodied skills, demonstrating minimal performance trade-off compared to its base model Qwen3-VL-8B.

Embodied-R1.5-VLA generalizes well across all three Simpler evaluation suites, achieving the highest overall score in each. On the Google Robot benchmark, it outperforms the leading baseline, pi0, by 21.0 points overall under Visual Matching (92.4 vs. 71.4) and by 16.8 points under Variant Aggregation (71.5 vs. 54.7).

Simpler-Google Robot Benchmark (Visual Matching)

| Model | Pick Coke Can | Move Near | Open/Close Drawer | Open Top Drawer & Place Apple | Overall |
|---|---|---|---|---|---|
| RT-1-X | 56.7 | 31.7 | 59.7 | 21.3 | 42.4 |
| RT-2-X | 78.7 | 77.9 | 25.0 | 3.7 | 46.3 |
| OpenVLA | 16.3 | 46.2 | 35.6 | 0.0 | 24.5 |
| SpatialVLA | 86.0 | 77.9 | 57.4 | 0.0 | 55.3 |
| pi0 | 97.9 | 78.7 | 62.3 | 46.6 | 71.4 |
| pi0-FAST | 75.3 | 67.5 | 42.9 | 0.0 | 46.4 |
| GR00T N1.5 | 51.7 | 54.0 | 27.8 | 7.4 | 35.2 |
| Embodied-R1.5-VLA | 92.3 | 93.8 | 86.1 | 97.2 | 92.4 |

Simpler-Google Robot Benchmark (Variant Aggregation)

| Model | Pick Coke Can | Move Near | Open/Close Drawer | Open Top Drawer & Place Apple | Overall |
|---|---|---|---|---|---|
| RT-1-X | 49.0 | 32.3 | 29.4 | 10.1 | 30.2 |
| RT-2-X | 82.3 | 79.2 | 35.3 | 20.6 | 54.4 |
| OpenVLA | 54.5 | 47.7 | 17.7 | 0.0 | 30.0 |
| SpatialVLA | 88.0 | 72.7 | 41.8 | 6.3 | 52.2 |
| pi0 | 90.1 | 80.7 | 27.6 | 20.5 | 54.7 |
| pi0-FAST | 77.6 | 68.2 | 31.3 | 0.0 | 44.3 |
| GR00T N1.5 | 69.3 | 68.7 | 35.8 | 4.0 | 44.5 |
| Embodied-R1.5-VLA | 80.6 | 72.2 | 58.3 | 75.0 | 71.5 |

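Simpler-WidowX Benchmark (Visual Matching)

| Model | Put Spoon on Towel | Put Carrot on Plate | Stack Blocks | Put Eggplant in Basket | Overall |
|---|---|---|---|---|---|
| RT-1-X | 0.0 | 4.2 | 0.0 | 0.0 | 1.1 |
| CogACT | 71.7 | 50.8 | 15.0 | 67.5 | 51.3 |
| OpenVLA | 4.2 | 0.0 | 0.0 | 12.5 | 4.2 |
| SpatialVLA | 16.7 | 25.0 | 29.2 | 100.0 | 42.7 |
| pi0 | 29.1 | 0.0 | 16.6 | 62.5 | 27.1 |
| pi0-FAST | 29.1 | 21.9 | 10.8 | 66.6 | 32.1 |
| GR00T N1.5 | 75.3 | 54.3 | 57.0 | 61.3 | 62.0 |
| Embodied-R1.5-VLA | 83.3 | 75.0 | 37.5 | 100.0 | 74.0 |
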
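The Overall column in the Simpler tables above appears to be the unweighted mean of the four per-task scores. The short check below reproduces it for two rows of the Google Robot Visual Matching table and computes the resulting gap; the dictionary layout and the `overall` helper are illustrative, and the numbers are copied from the table.

```python
from statistics import mean

# Per-task scores and reported Overall, copied from the Visual Matching table.
visual_matching = {
    "pi0": ([97.9, 78.7, 62.3, 46.6], 71.4),
    "Embodied-R1.5-VLA": ([92.3, 93.8, 86.1, 97.2], 92.4),
}

def overall(task_scores: list[float]) -> float:
    """Unweighted mean over tasks (assumed aggregation rule)."""
    return mean(task_scores)

for model, (scores, reported) in visual_matching.items():
    print(f"{model}: computed {overall(scores):.2f}, reported {reported}")

# Gap between the two reported Overall scores, in points.
gap = visual_matching["Embodied-R1.5-VLA"][1] - visual_matching["pi0"][1]
print(f"gap: {gap:.1f} points")
```

Because the aggregate is a plain mean, a single hard task such as Open Top Drawer & Place Apple can move the Overall score substantially.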

Efficient VLM-to-VLA Adaptation: Embodied-R1.5-VLA achieves high-performance control via direct fine-tuning from a VLM, successfully bypassing the conventional requirement for massive real-robot action pre-training.

Top-Tier Benchmark Performance: The model performs strongly across all four LIBERO suites, with a reported overall score of 97.3, on par with the strongest action-pretrained methods and well above the best method trained without action pretraining (pi0.5 at 93.6).

Superior Architectural Stability: Compared to native backbones like Qwen3-VL, Embodied-R1.5 delivers consistent and stable performance gains throughout the entire training curriculum.

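LIBERO Benchmark (40 Tasks)

| Model | Action Pretraining | Goal | Spatial | Object | Long | Overall |
|---|---|---|---|---|---|---|
| SpatialVLA | Y | 78.6 | 88.2 | 89.9 | 55.5 | 78.1 |
| CoT-VLA | Y | 87.6 | 87.5 | 91.6 | 69.0 | 83.9 |
| GR00T N1 | Y | 93.0 | 94.4 | 97.6 | 90.6 | 93.9 |
| GR00T N1.6 | Y | 97.5 | 97.7 | 98.5 | 94.4 | 97.0 |
| OpenVLA | Y | 79.2 | 84.7 | 88.4 | 53.7 | 76.5 |
| OpenVLA-OFT | Y | 97.9 | 97.6 | 98.4 | 94.5 | 97.1 |
| pi0 | Y | 95.8 | 96.8 | 98.8 | 85.2 | 94.2 |
| pi0-FAST | Y | 88.6 | 96.4 | 96.8 | 60.2 | 85.5 |
| pi0.5 | Y | 98.0 | 98.8 | 98.2 | 92.4 | 96.9 |
| Diffusion Policy | N | 68.3 | 78.3 | 92.5 | 50.5 | 72.4 |
| OpenVLA-OFT | N | 91.7 | 94.3 | 95.2 | 86.5 | 91.9 |
| pi0-FAST | N | 89.0 | 87.0 | 63.0 | 48.0 | 71.8 |
| pi0.5 | N | 94.6 | 96.6 | 97.2 | 85.8 | 93.6 |
| Embodied-R1.5-VLA | N | 97.4 | 97.8 | 99.2 | 93.2 | 97.3 |
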
LIBERO-Plus Robustness Benchmark

Evaluating model robustness across 7 perturbation types: Camera, Robot, Language, Light, Background, Noise, and Layout changes.

VLM Backbone Comparison on LIBERO

Overall success rates across training steps with different VLM backbones and action experts.

ManiSkill PartNet-Mobility: Affordance Prediction

Seen Categories

| Category | Where2Act | FlowBot3D | Implicit3D | ManipLLM | Embodied-R1.5 |
|---|---|---|---|---|---|
| Safe | 0.26 | 0.67 | 0.53 | 0.68 | 0.98 |
| Door | 0.36 | 0.55 | 0.58 | 0.64 | 0.63 |
| Display | 0.19 | 0.20 | 0.35 | 0.36 | 0.60 |
| Fridge | 0.27 | 0.32 | 0.55 | 0.77 | 0.78 |
| Laptop | 0.23 | 0.27 | 0.28 | 0.43 | 0.78 |
| Lighter | 0.11 | 0.31 | 0.66 | 0.62 | 0.63 |
| Microwave | 0.15 | 0.61 | 0.58 | 0.65 | 0.90 |
| Mouse | 0.47 | 0.68 | 0.51 | 0.61 | 0.63 |
| Box | 0.14 | 0.15 | 0.52 | 0.65 | 0.94 |
| TrashCan | 0.24 | 0.28 | 0.57 | 0.52 | 0.41 |
| KitchenPot | 0.13 | 0.36 | 0.45 | 0.53 | 0.98 |
| Suitcase | 0.12 | 0.18 | 0.34 | 0.40 | 0.28 |
| Pliers | 0.56 | 0.21 | 0.41 | 0.64 | 0.34 |
| Storage | 0.68 | 0.70 | 0.54 | 0.71 | 0.84 |
| Remote | 0.07 | 0.18 | 0.39 | 0.60 | 0.50 |
| Bottle | 0.40 | 0.26 | 0.43 | 0.64 | 0.72 |
| FoldingChair | 0.13 | 0.17 | 0.27 | 0.41 | 0.90 |
| Toaster | 0.18 | 0.53 | 0.65 | 0.75 | 0.86 |
| Lamp | 0.13 | 0.29 | 0.20 | 0.44 | 0.35 |
| Dispenser | 0.40 | 0.42 | 0.33 | 0.67 | 0.98 |
| AVG (Seen) | 0.26 | 0.37 | 0.46 | 0.59 | 0.70 |

Unseen Categories

| Category | Where2Act | FlowBot3D | Implicit3D | ManipLLM | Embodied-R1.5 |
|---|---|---|---|---|---|
| Toilet | 0.18 | 0.23 | 0.45 | 0.38 | 0.71 |
| Scissors | 0.35 | 0.10 | 0.17 | 0.22 | 0.31 |
| Table | 0.38 | 0.60 | 0.80 | 0.81 | 0.73 |
| Stapler | 0.28 | 0.39 | 0.53 | 0.86 | 0.75 |
| Kettle | 0.05 | 0.27 | 0.15 | 0.38 | 0.84 |
| USB | 0.21 | 0.42 | 0.69 | 0.85 | 0.30 |
| WashingMachine | 0.17 | 0.28 | 0.41 | 0.42 | 0.92 |
| Oven | 0.20 | 0.51 | 0.31 | 0.83 | 0.92 |
| Faucet | 0.15 | 0.13 | 0.30 | 0.26 | 0.57 |
| Phone | 0.15 | 0.23 | 0.31 | 0.38 | 0.33 |
| AVG (Unseen) | 0.21 | 0.32 | 0.41 | 0.54 | 0.64 |

Embodied-R1.5 predicts interaction affordances across 30 object categories in ManiSkill PartNet-Mobility.

Real-World Experiments

Embodied-R1.5 demonstrates strong real-world manipulation capabilities across 5 diverse tasks, significantly outperforming prior models in both tool affordance understanding and complex multi-step reasoning.

Real-World Manipulation Results

Pick up [X] and put it on the plate (Pick&Place)

Put the third duck toy from the left on the plate (Spatial Reasoning)

Door Open (Articulated Object Manipulation)

Move [X] to the empty space on the right side of the table (Tool Affordance) #1

Move [X] to the empty space on the right side of the table (Tool Affordance) #2

Move [X] to the empty space on the right side of the table (Tool Affordance) #3

Demo Videos

Clean the vase

Open the Drawer

Plug the charger into the socket

Put all the clutter on the table into the box

Put the specified object into the empty space of the table

Stack the cups into three layers

Sweep the garbage on the ground into the dustpan

Make the milk tea

Take out the blue cup and place it on the coaster

EmbodiedEvalKit

A unified evaluation framework for assessing multimodal large language models (MLLMs) on embodied intelligence tasks. The code is available on GitHub.

Multiple Inference Backends

vLLM (tensor parallel and multi-GPU), HuggingFace Transformers, and API backends; switch between them in one config.

20+ Models Supported

GPT, Gemini, Qwen, InternVL, Molmo, Magma, and more: generalist and embodied-specialist families are supported out of the box.

20+ Embodied Benchmarks

Embodied QA, Spatial Reasoning, Embodied Pointing, Affordance and Location, and Embodied Planning, all in one place.

Standardized Dataset Format

All benchmarks are reorganized into HuggingFace Parquet, giving a unified data pipeline for reproducible evaluation.

Unified Pointing Evaluation

One interface for diverse pointing formats, coordinate systems, and model-specific conventions.
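As an illustration of why a unified interface helps, the sketch below converts points from a few coordinate conventions into pixel coordinates before scoring them against a ground-truth mask. The format names (`pixel`, `unit`, `percent`, `per_mille`) and both function signatures are hypothetical and are not EmbodiedEvalKit's API.

```python
def to_pixels(point: tuple[float, float], fmt: str, width: int, height: int) -> tuple[float, float]:
    """Convert an (x, y) point from a named convention to pixel coordinates."""
    x, y = point
    if fmt == "pixel":        # already absolute pixels
        return float(x), float(y)
    if fmt == "unit":         # relative coordinates in [0, 1]
        return x * width, y * height
    if fmt == "percent":      # relative coordinates in [0, 100]
        return x / 100 * width, y / 100 * height
    if fmt == "per_mille":    # relative coordinates in [0, 1000]
        return x / 1000 * width, y / 1000 * height
    raise ValueError(f"unknown point format: {fmt}")

def hits_mask(point_px: tuple[float, float], mask: list[list[int]]) -> bool:
    """Return True when the pixel point falls on a non-zero cell of a row-major mask."""
    x, y = point_px
    col, row = int(x), int(y)
    height, width = len(mask), len(mask[0])
    return 0 <= row < height and 0 <= col < width and bool(mask[row][col])
```

Normalizing every model's output to pixels first means the hit test and any downstream metric are written once, regardless of which convention a model uses.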

Modular & Decoupled Design

Models, benchmarks, and metrics are cleanly separated, so each is easy to extend, swap, or customize independently.
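One common way to achieve this kind of decoupling is a registry per component type, sketched below. This is a generic pattern shown under assumptions: the registries, the `register` decorator, and the toy model, benchmark, and metric are all hypothetical and do not reflect EmbodiedEvalKit's actual code.

```python
from typing import Callable

# Three independent registries (hypothetical names), one per component type.
MODELS: dict[str, Callable] = {}
BENCHMARKS: dict[str, Callable] = {}
METRICS: dict[str, Callable] = {}

def register(registry: dict[str, Callable], name: str):
    """Decorator that stores a component in a registry under the given name."""
    def decorator(obj: Callable) -> Callable:
        registry[name] = obj
        return obj
    return decorator

@register(MODELS, "echo")
def echo_model(prompt: str) -> str:
    """Toy model: answers with the last word of the prompt."""
    return prompt.split()[-1]

@register(BENCHMARKS, "toy-qa")
def toy_qa() -> list[dict]:
    """Toy benchmark with two samples; the second is built to be answered wrongly."""
    return [
        {"prompt": "Reply with the letter A", "answer": "A"},
        {"prompt": "Reply with the letter B", "answer": "C"},
    ]

@register(METRICS, "accuracy")
def accuracy(preds: list[str], answers: list[str]) -> float:
    """Fraction of predictions that match the reference answer exactly."""
    return sum(p == a for p, a in zip(preds, answers)) / len(answers)

def evaluate(model: str, benchmark: str, metric: str) -> float:
    """Look up each component by name, run the model on every sample, and score the result."""
    samples = BENCHMARKS[benchmark]()
    preds = [MODELS[model](s["prompt"]) for s in samples]
    return METRICS[metric](preds, [s["answer"] for s in samples])
```

With this layout, adding a benchmark or metric is a matter of registering one more function; `evaluate` itself does not change.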

Supported Benchmarks (20+)
ERQA, CV-Bench, EmbSpatial, SAT, RoboSpatial, RoboVQA, VABench-P, Where2Place, RefSpatial, Part-Afford, RoboRefit, RoboAfford, VSI-Bench, OpenEQA, EgoPlan2, PIOBench, PointBench, Pixmo Points, Cosmos-Reason, BLINK, RoboFAC, VABench-V, PIO-S3, VLABench, ...
Evaluated Models (20+)
Generalist: GPT, Gemini, Qwen, InternVL, Molmo, ...
Embodied: Embodied-R1.5, Embodied-R1, RoboBrain, Mimo, Pelican, Magma, VeBrain, ...
Acknowledgements and Contribution List

Yifu Yuan proposed the methodology and research direction, developed the complete training dataset, and was responsible for the entire pipeline, including algorithm implementation, model training, inference, and experimental analysis. Yifu Yuan also led the design of the EmbodiedEvalKit framework for evaluation and the drafting of the manuscript.

Yifu Yuan, Hongyao Tang, and Yi Ma served as co-project leads, overseeing the overall execution of the project. The corresponding authors are Shuyang Gu, Yi Ma, Hongyao Tang, and Jianye Hao.

The success of the Embodied-R1.5 project is a collective effort of all contributors. Yaoting Huang, Linqi Han, and Jiangeng Sun contributed to model evaluation. Data construction and cleaning were performed by Yaoting Huang, Linqi Han, Pengyi Li, Jiangeng Sun, Wenting Jia, Yucheng Hu, Zhao Zhang, and Yuxiao Li. The real-world robotic platform setup and experiments were conducted by Shuoheng Zhang, Xianze Yao, Pengyi Li, Yuhao Liu, Yutong Li, Ruihao Liao, Qiyu Wu, and Yuxiao Li. Research guidance and supervision were provided by Shuyang Gu, Zibin Dong, Fei Ni, Yan Zheng, Han Hu, and Jianye Hao.

BibTeX
Embodied-R1.5
@article{yuan2026embodiedr15,
  title={Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models},
  author={Yuan, Yifu and Huang, Yaoting and Yao, Xianze and Zhang, Shuoheng and Han, Linqi and Li, Yutong and Li, Pengyi and Sun, Jiangeng and Jia, Wenting and Hu, Yucheng and Liu, Yuhao and Liao, Ruihao and Wu, Qiyu and Li, Yuxiao and Zhang, Zhao and Dong, Zibin and Ni, Fei and Zheng, Yan and Gu, Shuyang and Ma, Yi and Tang, Hongyao and Hu, Han and Hao, Jianye},
  journal={arXiv preprint},
  year={2026}
}
Embodied-R1 ICLR 2026
@article{yuan2025embodied,
  title={Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation},
  author={Yuan, Yifu and Cui, Haiqin and Huang, Yaoting and Chen, Yibin and Ni, Fei and Dong, Zibin and Li, Pengyi and Zheng, Yan and Hao, Jianye},
  journal={ICLR 2026},
  year={2025}
}
From Seeing to Doing ICLR 2026
@article{yuan2025seeing,
  title={From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation},
  author={Yuan, Yifu and Cui, Haiqin and Chen, Yibin and Dong, Zibin and Ni, Fei and Kou, Longxin and Liu, Jinyi and Li, Pengyi and Zheng, Yan and Hao, Jianye},
  journal={ICLR 2026},
  year={2025}
}