ViSA-Flow: Accelerating Robot Skill Learning via Large-Scale Video Semantic Action Flow

Abstract

Enabling robots to acquire complex manipulation skills remains a significant challenge, primarily bottlenecked by the prohibitive cost of collecting large-scale robot demonstration data. Humans, by contrast, learn efficiently by watching others interact with their environment. To bridge this gap, we introduce semantic action flow as a core intermediate representation capturing the essential spatio-temporal manipulator-object interactions, invariant to superficial visual differences. We present ViSA-Flow, a framework that learns this representation in a self-supervised manner from unlabeled large-scale video data. First, a generative model is pre-trained on semantic action flows automatically extracted from large-scale human-object interaction videos, learning a robust prior over manipulation structure. Second, this prior is efficiently adapted to a target robot by fine-tuning on a small set of robot demonstrations processed through the same semantic abstraction pipeline. Through extensive experiments on the CALVIN benchmark and real-world tasks, we demonstrate that ViSA-Flow achieves state-of-the-art performance, particularly in low-data regimes, outperforming prior methods by effectively transferring knowledge from human video observation to robotic execution.
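The abstract describes a two-stage recipe: pre-train a generative prior on semantic action flows extracted from human videos, then fine-tune on a small set of robot demonstrations passed through the same abstraction. The sketch below illustrates that training flow only; every module name, shape, and loss (FlowPrior, FLOW_DIM, the MSE objectives) is an assumption for illustration, not the paper's actual architecture.

```python
# Minimal sketch of the two-stage ViSA-Flow training recipe as described on
# this page. All module names, dimensions, and losses are illustrative
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn

FLOW_DIM = 128   # assumed dimensionality of one semantic action flow step
ACT_DIM = 7      # assumed robot action dimension (e.g., end-effector pose + gripper)


class FlowPrior(nn.Module):
    """Assumed generative prior over semantic action flow sequences."""

    def __init__(self, dim=FLOW_DIM):
        super().__init__()
        self.encoder = nn.GRU(dim, 256, batch_first=True)
        self.decoder = nn.Linear(256, dim)

    def forward(self, flow_seq):
        # Predict the next flow step from the sequence so far.
        hidden, _ = self.encoder(flow_seq)
        return self.decoder(hidden)


def pretrain_on_human_videos(prior, flow_batches, epochs=1, lr=1e-4):
    """Stage 1: learn a manipulation prior from flows extracted from human videos."""
    opt = torch.optim.Adam(prior.parameters(), lr=lr)
    for _ in range(epochs):
        for flow_seq in flow_batches:                 # (B, T, FLOW_DIM)
            pred = prior(flow_seq[:, :-1])            # predict steps 2..T
            loss = nn.functional.mse_loss(pred, flow_seq[:, 1:])
            opt.zero_grad()
            loss.backward()
            opt.step()


def finetune_on_robot_demos(prior, policy_head, demo_batches, epochs=1, lr=1e-4):
    """Stage 2: adapt the prior with a small robot dataset using the same abstraction."""
    params = list(prior.parameters()) + list(policy_head.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for flow_seq, actions in demo_batches:        # robot flows + actions
            hidden, _ = prior.encoder(flow_seq)
            pred_actions = policy_head(hidden)
            loss = nn.functional.mse_loss(pred_actions, actions)
            opt.zero_grad()
            loss.backward()
            opt.step()


if __name__ == "__main__":
    # Toy stand-ins for the flow-extraction pipeline, which is not specified here.
    human_flows = [torch.randn(4, 16, FLOW_DIM) for _ in range(8)]
    robot_demos = [(torch.randn(4, 16, FLOW_DIM), torch.randn(4, 16, ACT_DIM))
                   for _ in range(4)]

    prior = FlowPrior()
    policy_head = nn.Linear(256, ACT_DIM)
    pretrain_on_human_videos(prior, human_flows)
    finetune_on_robot_demos(prior, policy_head, robot_demos)
```

The key design point mirrored here is that both stages consume the same flow representation, so the prior learned from human videos transfers directly to the robot fine-tuning data.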

Simulation Videos (2x Speed)

Sequence 1:
  1. Open the drawer
  2. Turn off the lightbulb
  3. Rotate the red block to the right
  4. Lift the red block from the table
  5. Place the red block into the drawer

Sequence 2:
  1. Open the drawer
  2. Move the slider to the right
  3. Lift the red block from the slider
  4. Place the red block into the drawer
  5. Turn off the lightbulb

Real-world Videos (1x Speed)

MoveContainer
PickEggplant

Real-world Long Horizon (1x Speed)

MoveContainer → PickEggplant

Real-world Flow Representation Comparison

Decoded vs. ground-truth flow representation z_t for MoveContainer and PickEggplant.