VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs
Video-Action Models (VAMs) achieve strong long-horizon task performance through visual reasoning, but they fail to capture the fine-grained force and contact information critical for contact-rich physical interaction. This work introduces VTAM, a Video-Tactile-Action Model that integrates tactile sensing to address this limitation of vision-only VAMs. Conditioning action prediction on touch yields more precise and stable behavior in scenarios where critical interaction states are not fully observable from vision alone.
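The abstract does not specify VTAM's architecture. A minimal sketch of the core idea, encoding video and tactile signals separately and fusing them to condition action prediction, might look as follows; all module names, dimensions, and the late-fusion design (`VideoEncoder` stand-ins, `tactile_dim`, the action horizon) are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: the paper does not describe VTAM's internals.
# Module names, dimensions, and the late-fusion design are assumptions.
import torch
import torch.nn as nn

class VTAMSketch(nn.Module):
    """Hypothetical video-tactile-action policy: project each modality
    into a shared space, fuse the embeddings, and decode an action chunk."""

    def __init__(self, video_dim=512, tactile_dim=64, action_dim=7, horizon=8):
        super().__init__()
        # Stand-ins for real encoders (e.g., a pretrained video backbone
        # and a small network over force/contact sensor readings).
        self.video_proj = nn.Linear(video_dim, 256)
        self.tactile_proj = nn.Linear(tactile_dim, 256)
        self.fuse = nn.Sequential(nn.Linear(512, 256), nn.ReLU())
        # Predict a short horizon of actions from the fused state.
        self.action_head = nn.Linear(256, action_dim * horizon)
        self.action_dim, self.horizon = action_dim, horizon

    def forward(self, video_feat, tactile_feat):
        # video_feat: (B, video_dim) pooled visual features
        # tactile_feat: (B, tactile_dim) force/contact readings
        z = torch.cat([self.video_proj(video_feat),
                       self.tactile_proj(tactile_feat)], dim=-1)
        actions = self.action_head(self.fuse(z))
        return actions.view(-1, self.horizon, self.action_dim)

policy = VTAMSketch()
acts = policy(torch.randn(2, 512), torch.randn(2, 64))
print(acts.shape)  # torch.Size([2, 8, 7])
```

The late-fusion choice here is only one option; a cross-attention or early-fusion scheme would serve the same purpose of letting tactile input disambiguate interaction states that vision alone cannot resolve.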