OmniVTLA: Teaching Robots to Feel What They See

This presentation introduces OmniVTLA, a breakthrough vision-tactile-language-action model that gives robots the ability to combine touch with vision and language understanding. By semantically aligning tactile sensing with visual and language data through a dual-path encoder architecture and the new ObjTac dataset, OmniVTLA achieves a 21.9% improvement in contact-rich manipulation tasks. The talk explores how this semantic alignment transforms robotic precision in tasks requiring physical interaction, demonstrating that touch-aware AI can dramatically improve task success rates and motion smoothness.
Script
Most robots navigate the world through vision alone, like trying to cook while wearing thick gloves. The authors of this paper ask: what if robots could feel what they touch and understand those sensations as deeply as they understand images and words?
Current vision-language-action models can recognize a wine glass and understand the instruction to pick it up, but they're blind to the pressure needed to grasp it without shattering. The authors identified that tactile data has been systematically excluded from the semantic understanding that makes modern robotic AI work.
So they built a model that speaks three sensory languages at once.
The breakthrough lies in semantic alignment. By training a specialized tactile vision transformer to understand touch in the same conceptual space where the model processes images and language, OmniVTLA translates pressure patterns into meaning. The ObjTac dataset provided the Rosetta Stone: thousands of objects captured simultaneously through text descriptions, video, and force-based tactile sensors.
In experiments with a UR5 arm and dexterous hand, OmniVTLA didn't just work better, it worked fundamentally differently. The 21.9% success rate improvement came from understanding when to adjust grip pressure mid-grasp and how to respond to unexpected contact, capabilities that vision alone cannot provide.
The authors acknowledge this is a foundation, not a finished cathedral. Complex assembly tasks, diverse robotic platforms, and more efficient tactile encodings remain open challenges that will determine whether touch-aware AI becomes ubiquitous or niche.
Robots that can feel are robots that can truly interact with our physical world, not just observe it from behind a camera lens. Visit EmergentMind.com to explore more research at the frontier of robotic perception and create your own videos.