Leveraging unlabeled or weakly labeled egocentric video via self-supervised objectives

Investigate whether, and how, adding weakly labeled or unlabeled egocentric human video to pretraining through self-supervised objectives improves the performance and generalization of the EgoScale flow-based Vision–Language–Action policy for dexterous manipulation.

Background

The EgoScale framework shows that large-scale, action-labeled egocentric human data can serve as a scalable supervision source for dexterous manipulation, and that reductions in validation loss predict robot performance. However, acquiring action labels at extreme scale is costly, and the authors suggest that weakly labeled or unlabeled video, used with self-supervised objectives, may further amplify these benefits.

This leaves a concrete question that the paper does not resolve: how effective are self-supervised objectives on large unlabeled egocentric datasets, and how much do they improve dexterous manipulation performance and generalization?
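
One way to picture the question is a pretraining batch that mixes action-labeled and unlabeled clips. The sketch below is illustrative only and is not taken from the EgoScale paper. It assumes a flow-style action loss (squared error on predicted velocities) applied only to labeled clips, plus a stand-in self-supervised loss (squared error on predicted future-frame features) applied to every clip. The function name `mixed_batch_loss`, its arguments, the choice of future-feature prediction, and the `ssl_weight` default are all hypothetical.

```python
import numpy as np

def mixed_batch_loss(vel_pred, vel_target, feat_pred, feat_target,
                     has_action, ssl_weight=0.5):
    """Hypothetical combined loss for a batch of labeled and unlabeled clips.

    vel_pred, vel_target:   (B, D) predicted / target action velocities
    feat_pred, feat_target: (B, F) predicted / target future-frame features
    has_action:             (B,) booleans, True where a clip has action labels
    ssl_weight:             assumed weight on the self-supervised term
    """
    has_action = np.asarray(has_action, dtype=bool)

    # Per-clip mean squared errors for each objective.
    action_err = np.mean((np.asarray(vel_pred) - np.asarray(vel_target)) ** 2, axis=1)
    ssl_err = np.mean((np.asarray(feat_pred) - np.asarray(feat_target)) ** 2, axis=1)

    # The action term averages over labeled clips only; unlabeled clips
    # contribute nothing to it. It is zero if the batch has no labels.
    n_labeled = int(has_action.sum())
    action_loss = action_err[has_action].sum() / n_labeled if n_labeled > 0 else 0.0

    # The self-supervised term uses every clip, labeled or not.
    ssl_loss = ssl_err.mean()

    return float(action_loss + ssl_weight * ssl_loss)
```

Under these assumptions, the open question amounts to whether adding the second term, and scaling up the clips that only feed it, lowers validation loss on the first term and carries over to robot performance.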

References

Looking forward, several directions remain open. As egocentric human data continues to grow, incorporating weaker or unlabeled video via self-supervised objectives may further amplify these benefits.

EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data (2602.16710 - Zheng et al., 18 Feb 2026) in Conclusion