Contact-Anchored Policies: Contact Conditioning Creates Strong Robot Utility Models

Zichen Jeff Cui, Omar Rayyan, Haritheja Etukuru, Bowen Tan, Zavier Andrianarivo, Zicheng Teng, Yihang Zhou, Krish Mehta, Nicholas Wojno, Kevin Yuanbo Wu, Manan H Anjaria, Ziyuan Wu, Manrong Mao, Guangxun Zhang, Binit Shah, Yejin Kim, Soumith Chintala, Lerrel Pinto, Nur Muhammad Mahi Shafiullah

Published: February 9, 2026
Authors: 19
Word Count: 9,607


Contact-anchored policies outperform large, state-of-the-art VLAs by 56% in zero-shot evaluations.

Abstract

The prevalent paradigm in robot learning attempts to generalize across environments, embodiments, and tasks with language prompts at runtime. A fundamental tension limits this approach: language is often too abstract to guide the concrete physical understanding required for robust manipulation. In this work, we introduce Contact-Anchored Policies (CAP), which replace language conditioning with points of physical contact in space. Simultaneously, we structure CAP as a library of modular utility models rather than a monolithic generalist policy. This factorization allows us to implement a real-to-sim iteration cycle: we build EgoGym, a lightweight simulation benchmark, to rapidly identify failure modes and refine our models and datasets prior to real-world deployment. We show that by conditioning on contact and iterating via simulation, CAP generalizes to novel environments and embodiments out of the box on three fundamental manipulation skills while using only 23 hours of demonstration data, and outperforms large, state-of-the-art VLAs in zero-shot evaluations by 56%. All model checkpoints, codebase, hardware, simulation, and datasets will be open-sourced. Project page: https://cap-policy.github.io/
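The two design choices in the abstract, conditioning on a contact point instead of a language prompt and keeping a library of per-skill utility models instead of one generalist policy, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the CAP codebase: the `ContactObservation` type, the camera-frame convention, the skill names `pick` and `open`, and the toy `make_reach_policy` function, which stands in for a learned model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Point3D = Tuple[float, float, float]
Action = List[float]


@dataclass(frozen=True)
class ContactObservation:
    """Hypothetical policy input: visual features plus a contact point, no text prompt."""
    image_features: List[float]
    contact_point: Point3D  # assumed: camera frame, meters


Policy = Callable[[ContactObservation], Action]


def make_reach_policy(step_size: float) -> Policy:
    """Toy stand-in for a learned utility model: step toward the contact point."""
    def policy(obs: ContactObservation) -> Action:
        x, y, z = obs.contact_point
        norm = (x * x + y * y + z * z) ** 0.5
        if norm <= step_size:
            # Close enough: move the remaining distance in one step.
            return [x, y, z]
        scale = step_size / norm
        return [x * scale, y * scale, z * scale]
    return policy


# One small model per skill rather than a single monolithic policy.
# The skill names are hypothetical placeholders.
UTILITY_MODELS: Dict[str, Policy] = {
    "pick": make_reach_policy(step_size=0.05),
    "open": make_reach_policy(step_size=0.02),
}


def act(skill: str, obs: ContactObservation) -> Action:
    """Select the utility model for a skill and query it with a contact-conditioned input."""
    return UTILITY_MODELS[skill](obs)
```

In this sketch the caller selects a skill and supplies a point in space; the policy never receives a text instruction. A real utility model would presumably map images and the contact point to actions with a trained network rather than a hand-written rule.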

Key Takeaways

1. Contact-anchored policies outperform large, state-of-the-art vision-language-action models (VLAs) by 56% in zero-shot evaluations while using only 23 hours of demonstration data.
2. Conditioning robot policies on points of physical contact in space gives more concrete guidance for manipulation than abstract language instructions.
3. The approach generalizes to novel environments and robot embodiments out of the box, without retraining.

Limitations

Language-based models require massive computational overhead and billions of parameters for robot control.
Language cannot convey precise spatial information needed for accurate robot manipulation tasks.

Keywords

Contact-Anchored Policies, language conditioning, physical contact, modular utility models, real-to-sim iteration, EgoGym, manipulation skills, zero-shot evaluation, demonstration data
