[Teaser videos: the Stack the pots task executed In-Distribution and in OOD scenes #1–#3]
Robotics research continues to be constrained by data scarcity. Even the largest robot datasets are orders of magnitude smaller and less diverse than the datasets that have driven recent progress in language and vision. At the same time, the internet contains an abundance of egocentric human videos—capturing real-world, long-horizon manipulation skills in diverse settings. The challenge is that these videos do not contain any action labels, and they depict humans, not robots, leading to a large visual embodiment gap. Masquerade addresses this gap by transforming in-the-wild human videos into visually consistent “robotized” demonstrations, and then using them, in combination with real robot data, to train robust manipulation policies that generalize to unseen environments.
Masquerade first edits in-the-wild human videos to bridge the visual embodiment gap between humans and robots.
We pre-train a ViT-based encoder on the edited clips to predict future robot keypoints, then co-train it with a diffusion-policy head on 50 robot demonstrations from only one scene, continuing the pre-training objective during policy learning.
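As a concrete (if simplified) picture of this setup, the PyTorch sketch below wires a shared ViT encoder to a future-keypoint head and a diffusion-policy action head and sums the two losses at every co-training step. The module names, the MLP stand-in for the diffusion denoiser, the keypoint parameterization, and the loss weight `lam` are illustrative placeholders, not the actual implementation; only the high-level recipe comes from the description above.

```python
# Hedged sketch of the co-training objective: a shared ViT encoder, a future-keypoint
# head trained on edited ("robotized") human clips, and a diffusion-policy head trained
# on robot demonstrations. All concrete shapes and modules are assumptions.
import torch
import torch.nn as nn


class MasqueradeStylePolicy(nn.Module):
    def __init__(self, vit_encoder: nn.Module, feat_dim: int = 768,
                 num_keypoints: int = 4, horizon: int = 16, action_dim: int = 14):
        super().__init__()
        self.encoder = vit_encoder  # shared ViT image encoder (assumed to output a feat_dim vector)
        # Future-keypoint head: predicts 2D robot keypoints several steps ahead.
        self.keypoint_head = nn.Linear(feat_dim, num_keypoints * 2)
        # Simple MLP stand-in for a diffusion-policy denoiser over an action chunk.
        self.action_head = nn.Sequential(
            nn.Linear(feat_dim + horizon * action_dim + 1, 512), nn.GELU(),
            nn.Linear(512, horizon * action_dim),
        )

    def keypoint_loss(self, frames, future_keypoints):
        """Pre-training objective on edited human clips, kept during co-training."""
        feats = self.encoder(frames)                       # (B, feat_dim)
        pred = self.keypoint_head(feats)                   # (B, num_keypoints * 2)
        return nn.functional.mse_loss(pred, future_keypoints.flatten(1))

    def diffusion_loss(self, frames, action_chunk, timesteps, noise_scheduler):
        """Diffusion-policy denoising loss on robot demonstrations."""
        feats = self.encoder(frames)                       # (B, feat_dim)
        noise = torch.randn_like(action_chunk)             # (B, horizon, action_dim)
        noisy = noise_scheduler.add_noise(action_chunk, noise, timesteps)
        inp = torch.cat([feats, noisy.flatten(1), timesteps[:, None].float()], dim=-1)
        pred_noise = self.action_head(inp).view_as(action_chunk)
        return nn.functional.mse_loss(pred_noise, noise)


def cotrain_step(model, robot_batch, human_batch, noise_scheduler, opt, lam=1.0):
    """One co-training step: robot imitation loss plus the continued human-video objective."""
    loss = model.diffusion_loss(*robot_batch, noise_scheduler) \
         + lam * model.keypoint_loss(*human_batch)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

The design choice this reflects is that the human-video objective is not discarded after pre-training: every co-training step still backpropagates the keypoint loss through the shared encoder alongside the robot imitation loss.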
Below we show some sample edited videos from the Epic Kitchens dataset.
We evaluate Masquerade on three long-horizon, bimanual kitchen tasks. Each policy is trained on demonstrations from a single scene and tested in three OOD scenes. Videos are at 5x speed.
(The stop-and-go motion arises from action chunking in Diffusion Policy. While it can be reduced by compensating for model inference latency during rollouts, execution speed was not the focus of this work.)
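For readers unfamiliar with action chunking, the sketch below shows a minimal receding-horizon rollout loop and where latency compensation would slot in. The policy and environment interfaces, control rate, and chunk length are assumed placeholders, not our experimental settings.

```python
# Minimal sketch of action-chunked execution: the policy predicts a chunk of actions,
# the robot plays them back, then the next chunk is predicted. All interfaces here
# (predict_action_chunk, env.step returning (obs, done)) are hypothetical.
import time

CONTROL_HZ = 10  # assumed control frequency, for illustration only


def run_episode(policy, env, max_steps=300):
    """Receding-horizon execution of predicted action chunks."""
    obs = env.reset()
    step = 0
    while step < max_steps:
        t0 = time.monotonic()
        chunk = policy.predict_action_chunk(obs)            # blocking diffusion inference
        latency_steps = min(int((time.monotonic() - t0) * CONTROL_HZ), len(chunk) - 1)

        # Executing the full chunk while the arm idles during inference is what produces
        # the stop-and-go motion in the videos above. A simple form of latency
        # compensation, sketched here, drops the first `latency_steps` actions, whose
        # time slots already elapsed while the chunk was being computed.
        for action in chunk[latency_steps:]:
            obs, done = env.step(action)                     # placeholder env interface
            step += 1
            if done or step >= max_steps:
                return obs
    return obs
```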
[Rollout videos for each of the three tasks: In-Distribution, OOD #1, OOD #2, OOD #3]
We compare our method to three existing vision representations: (1) HRP, a model finetuned on 150K egocentric in-the-wild human videos, (2) an ImageNet-pretrained encoder, and (3) DINOv2. We evaluate each model over three OOD scenes (10 rollouts per scene, 30 total per method). Masquerade outperforms all baselines in every OOD scene we test, raising the average success rate from 12% to 74%, a gain of 62 percentage points.
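The evaluation protocol amounts to the loop sketched below, where `run_rollout`, the scene names, and the environment factory are hypothetical stand-ins: each method is rolled out 10 times per OOD scene, and success rates are averaged per scene and overall.

```python
# Hedged sketch of the OOD evaluation protocol: 10 rollouts per scene, 3 scenes,
# 30 rollouts per method, aggregated into per-scene and overall success rates.
def evaluate_ood(policy, make_env, run_rollout,
                 scenes=("ood_scene_1", "ood_scene_2", "ood_scene_3"),
                 rollouts_per_scene=10):
    per_scene = {}
    for scene in scenes:
        env = make_env(scene)                            # hypothetical environment factory
        successes = sum(bool(run_rollout(policy, env))   # run_rollout -> True on task success
                        for _ in range(rollouts_per_scene))
        per_scene[scene] = successes / rollouts_per_scene
    overall = sum(per_scene.values()) / len(per_scene)
    return per_scene, overall
```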
We evaluate the importance of two components—editing human videos with robot overlays and co-training on both human and robot data—by systematically removing each. In one variant, we replaced edited clips with the original, unmodified human videos while keeping all other settings identical. In another, we removed co-training, fine-tuning only on robot demonstrations. Both changes resulted in substantial drops in performance, showing that explicit visual editing and continued co-training are essential for Masquerade’s performance.
[Figures: ablation results (left) and success rate vs. amount of human co-training data (right)]
To confirm the contribution of edited human videos to policy learning, we measured performance as a function of the amount of co-training data. As the figure above (right) shows, success rates rise steadily with more human-video data. This clear upward trend demonstrates that increasing the amount of in-the-wild human videos directly boosts robot performance and suggests further gains could be realized by scaling beyond the current dataset size.
We compare the performance of our method on the original in-distribution training scene and OOD Scene 1 for the Sweep Chilis task. Unlike all baselines, which suffer large drops, Masquerade maintains similar in-distribution and out-of-distribution performance—demonstrating its robustness to scene shifts.
@misc{lepert2025masqueradevideos,
  title={Masquerade: Learning from In-the-wild Human Videos using Data-Editing},
  author={Marion Lepert and Jiaying Fang and Jeannette Bohg},
  year={2025},
  eprint={2508.09976},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2508.09976},
}