
ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training

ManiFlow performs diverse dexterous skills on humanoid and bimanual robot systems, generalizing to unseen objects and backgrounds. All skills shown in the videos are autonomous, played at 4x speed.


1University of Washington   2University of California San Diego   3NVIDIA   4Allen Institute for Artificial Intelligence

Conference on Robot Learning (CoRL) 2025


Abstract

We introduce ManiFlow, a visuomotor imitation learning policy for general robot manipulation that generates precise, high-dimensional actions conditioned on diverse visual, language, and proprioceptive inputs. We leverage flow matching with consistency training to enable high-quality dexterous action generation in just 1-2 inference steps. To handle diverse input modalities efficiently, we propose DiT-X, a diffusion transformer architecture with adaptive cross-attention and AdaLN-Zero conditioning that enables fine-grained feature interactions between action tokens and multi-modal observations. ManiFlow demonstrates consistent improvements across diverse simulation benchmarks and nearly doubles success rates on real-world tasks across single-arm, bimanual, and humanoid robot setups with increasing dexterity. Extensive evaluation further demonstrates ManiFlow's strong robustness and generalizability to novel objects and background changes, and highlights its scaling capability with larger-scale datasets.



Overview of ManiFlow

ManiFlow Pipeline

ManiFlow processes 2D or 3D visual observations, robot proprioceptive state, and language instructions as inputs and outputs a sequence of actions. We leverage a DiT-X transformer architecture to efficiently optimize a flow matching model with a continuous-time consistency training objective, ensuring high-quality action generation for challenging dexterous tasks.
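To make the conditioning scheme concrete, below is a minimal PyTorch sketch of a single DiT-X-style block, written only from the description above: action tokens self-attend, cross-attend to the multi-modal observation tokens, and every sublayer is modulated by AdaLN-Zero conditioning (zero-initialized gates, so each block starts as the identity). The class name DiTXBlock, the sublayer ordering, and the exact modulation layout are our assumptions, not the released implementation.

# Hedged sketch of one DiT-X-style block (assumed layout, not the official code).
import torch
import torch.nn as nn

class DiTXBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # AdaLN-Zero: regress per-sublayer shift/scale/gate from the conditioning
        # vector; zero-init the final projection so each block starts as identity.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 9 * dim))
        nn.init.zeros_(self.ada[-1].weight)
        nn.init.zeros_(self.ada[-1].bias)

    def forward(self, x, obs_tokens, cond):
        # x: (B, T, D) action tokens; obs_tokens: (B, N, D) multi-modal
        # observation tokens; cond: (B, D) global conditioning vector.
        (shift1, scale1, gate1, shift2, scale2, gate2,
         shift3, scale3, gate3) = self.ada(cond).chunk(9, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.self_attn(h, h, h)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        x = x + gate2.unsqueeze(1) * self.cross_attn(h, obs_tokens, obs_tokens)[0]
        h = self.norm3(x) * (1 + scale3.unsqueeze(1)) + shift3.unsqueeze(1)
        return x + gate3.unsqueeze(1) * self.mlp(h)

In this sketch, cond would carry the flow time and any global features, while per-token visual and language features enter through the adaptive cross-attention path.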


Consistency Flow Training

Given a flow path that smoothly transforms actions to noise, we sample multiple intermediate points via linear interpolation. During training, the model learns to map any intermediate point on the flow trajectory back to its origin, and we enforce self-consistency between sampled points on the same trajectory. With consistency flow training, ManiFlow generates high-quality action sequences in just 1-2 inference steps, significantly improving efficiency and robustness over traditional diffusion or flow matching models that require 10-100 steps.
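As a sketch of how this objective and the few-step sampler might look, assume the linear path x_t = (1 - t) * action + t * noise, with the clean action at t = 0. The hypothetical consistency_flow_step below regresses a prediction at time t back to the trajectory origin and matches it against an EMA target network's prediction at an earlier point s on the same trajectory; the EMA target and the unweighted loss sum are common consistency-training choices assumed here, not confirmed details of ManiFlow. Observation conditioning is omitted for brevity.

# Hedged sketch of consistency flow training and few-step sampling
# (function and argument names are hypothetical).
import torch
import torch.nn.functional as F

def consistency_flow_step(model, ema_model, actions):
    # `model(x, t)` is assumed to predict the clean action at the
    # origin of the flow trajectory.
    B = actions.shape[0]
    noise = torch.randn_like(actions)
    t = torch.rand(B, device=actions.device)          # intermediate time
    s = t * torch.rand(B, device=actions.device)      # earlier point, s < t
    tb = t.view(B, *([1] * (actions.dim() - 1)))
    sb = s.view(B, *([1] * (actions.dim() - 1)))
    x_t = (1 - tb) * actions + tb * noise             # linear interpolation
    x_s = (1 - sb) * actions + sb * noise             # same trajectory
    pred_t = model(x_t, t)                            # map back to the origin
    with torch.no_grad():
        pred_s = ema_model(x_s, s)                    # self-consistency target
    return F.mse_loss(pred_t, actions) + F.mse_loss(pred_t, pred_s)

@torch.no_grad()
def sample(model, shape, device, steps=2):
    # Few-step sampling: jump to the predicted origin, then re-noise to the
    # next intermediate time; with steps=1 this is a single forward pass.
    x = torch.randn(shape, device=device)             # start from noise, t = 1
    times = torch.linspace(1.0, 0.0, steps + 1, device=device)
    for t, t_next in zip(times[:-1], times[1:]):
        x0 = model(x, t.expand(shape[0]))             # predicted origin
        x = (1 - t_next) * x0 + t_next * torch.randn_like(x0)
    return x

With steps=2 the sampler matches the 1-2 step regime described above, whereas a vanilla flow matching policy would integrate its learned velocity field over many more steps.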

Policy Generalization

Our policy demonstrates strong generalization capabilities across different objects and scenes. Below we showcase examples of how our policy generalizes to novel objects and environments.

Generalization to Diverse Unseen Objects

[Video grids]
Humanoid Pouring: 4 unseen objects
Humanoid Grasping: 9 unseen objects
Bimanual Pouring: 3 pouring tasks and 1 handover task
Bimanual Grasping: 6 unseen objects

Generalization to Unseen Backgrounds

[Video grids]
Pouring in New Scenes: 1 seen background and multiple unseen backgrounds
Grasping in New Scenes: 4 unseen scenes

Policy Robustness

ManiFlow demonstrates robustness across diverse dexterous manipulation tasks. The policy performs long-horizon three-layer cup stacking with consistent precision, remains resilient when a human intervenes by adding new distractors and rearranging objects during execution, and sorts long sequences of objects by shape regardless of color or initial position.

Precision Stacking
Long-horizon stacking with precision.
Grasp with Human Intervention
Grasp diverse toys despite human intervention and added distractors.
Sort Shapes
Sort different shapes in sequence.
Any Order
Sort shapes in any order.

ManiFlow vs. π0

We evaluate ManiFlow and π0 on multiple bimanual dexterous tasks from the RoboTwin 2.0 benchmark with challenging domain randomizations, including cluttered scenes with random distractors, novel objects, diverse background textures, varied lighting conditions, and table height changes, after training with 50 domain-randomized demonstrations per task. Compared to the large-scale pre-trained π0 model, ManiFlow shows superior learning efficiency and generalization while being trained from scratch with only point cloud input. ManiFlow demonstrates strong robustness across all randomization conditions, maintaining high success rates even when multiple domain shifts occur simultaneously.

[Videos: side-by-side rollouts under randomized scenes and distractors]
Lift Pot: π0 vs. ManiFlow
Pick Dual Bottles: π0 vs. ManiFlow
Put Object in Cabinet: π0 vs. ManiFlow

Conclusions

In this work, we introduce ManiFlow, a robust and efficient dexterous manipulation model. ManiFlow improves upon prior flow matching policies by introducing a continuous-time consistency training objective, a superior time sampling strategy, and a novel DiT-X block. The proposed DiT-X architecture effectively handles diverse input modalities through its dual conditioning mechanisms, enabling strong performance across varied manipulation tasks. Our comprehensive evaluation spanning 64 simulation tasks and 8 real-world scenarios demonstrates ManiFlow's effectiveness, particularly in challenging real-world bimanual dexterous manipulation, where it achieves a 98.3% relative improvement over existing approaches.

Limitations

We observe that ManiFlow fails on tasks that require detailed contact information and precise force feedback, such as delicate assembly or compliant insertion. This limitation stems from ManiFlow's design focus on kinematic control rather than force-based interaction: the system lacks the tactile sensing and force control capabilities necessary for tasks where contact dynamics are critical to success. We believe that incorporating tactile feedback as an additional input modality would significantly enhance ManiFlow's ability to handle contact-rich manipulation tasks and broaden its applicability.

Hardware

ManiFlow has been successfully deployed across three distinct real-world robot platforms, demonstrating superior performance in challenging manipulation tasks with increasing dexterity requirements.
Humanoid Robot (Unitree H1) - Full-sized humanoid with 28 DoF, including 7-DoF arms, 6-DoF anthropomorphic Inspire hands, and a 2-DoF active head for coordinated head-arm movements. Features a gimbal-mounted ZED stereo camera for active egocentric sensing and Apple Vision Pro teleoperation via OpenTeleVision.
Bimanual System (Dual xArm7) - Two UFACTORY xArm 7 robotic arms equipped with PSYONIC Ability Hands (26 DoF total) for precise bimanual coordination and dexterous manipulation. Uses an Intel RealSense L515 LiDAR camera with a static front-facing view and Apple Vision Pro teleoperation through Bunny-VisionPro.
Single-Arm Setup (Franka Panda) - Industrial-grade 7-DoF Franka Emika Panda robot with a Robotiq parallel gripper for high-precision tasks. Equipped with a statically mounted Intel RealSense D455 RGB-D camera and an Oculus VR headset for teleoperated data collection.

BibTeX

@article{Yan2025ManiFlow,
  title   = {ManiFlow: A Dexterous Manipulation Policy via Flow Matching},
  author  = {Ge Yan and Jiyue Zhu and Yuquan Deng and Shiqi Yang and Ri-Zhao Qiu and Xuxin Cheng and Marius Memmel and Ranjay Krishna and Ankit Goyal and Xiaolong Wang and Dieter Fox},
  year    = {2025},
  journal = {arXiv preprint arXiv:}
}

Website modified from Behavior and iDP3.