24 March 2026

Daily Robotics Digest

10 curated items from arXiv, industry news, and the community

Executive Summary

This edition of the XRollout Daily Robotics Digest centers entirely on recent advances and open questions in vision-language-action (VLA) models and large model-powered embodied AI, a fast-growing area enabling general-purpose robotic capabilities. The collection spans foundational surveys, novel model architectures, evaluation and testing frameworks, and new approaches to humanoid control and long-horizon manipulation. Together, these works map out key open challenges and introduce practical technical improvements that will shape the near-term development of embodied robotic intelligence.

📄 New Research Papers

10 items
1

10 Open Challenges Steering the Future of Vision-Language-Action Models

This AAAI paper outlines 10 core open challenges that will guide the future development of vision-language-action (VLA) models for robotics. It highlights key risks, including potential harm from embodied AI systems deployed in unstructured environments such as disaster zones, and frames the central goal as enabling versatile, general-purpose robotic manipulation. This work provides a research agenda for the community working to extend VLA paradigms to real-world embodied systems.

2

EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models

This preprint introduces EgoActor, a novel framework that grounds humanoid robot task planning into spatially aware egocentric actions using pretrained visual-language models. The approach integrates locomotion policies with vision-language-action modeling to unify task planning and action execution, treating manipulation capability as a core requirement for general embodied performance. It addresses the open challenge of better aligning high-level task reasoning with robot-specific egocentric action execution.
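As a rough illustration of this pattern only (not EgoActor's actual interface), the sketch below shows a pretrained vision-language model decomposing a task and an egocentric frame into spatially grounded action primitives, which are then dispatched to locomotion or manipulation policies; every name in it is hypothetical.

    from dataclasses import dataclass

    @dataclass
    class EgoAction:
        skill: str         # e.g. "walk_to", "reach", "grasp" (hypothetical skill set)
        target_xyz: tuple  # target expressed in the robot's egocentric camera frame

    def plan_egocentric_actions(vlm, task, ego_image):
        """Ask a pretrained VLM for a step plan grounded in the egocentric view."""
        prompt = f"Task: {task}\nList skills with 3D targets in the camera frame."
        steps = vlm.generate(prompt, image=ego_image)   # assumed VLM interface
        return [EgoAction(s["skill"], tuple(s["target"])) for s in steps]

    def execute(robot, actions):
        for a in actions:
            if a.skill == "walk_to":
                robot.locomotion_policy.step_to(a.target_xyz)         # assumed policy API
            else:
                robot.manipulation_policy.run(a.skill, a.target_xyz)  # assumed policy API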

3

Large Model Empowered Embodied AI: A Survey on Decision-Making and Embodied Learning

This paper presents a comprehensive survey of large model-powered embodied AI, focusing specifically on recent advances in decision-making and embodied learning for robotic systems. It covers a range of embodied platforms including humanoid and quadruped robots, and details how large model integration can enable precise, structured actions such as movement primitives for manipulation and navigation. The survey organizes current progress and identifies gaps to guide future research in large foundation model-powered robotics.

4

A Survey on Evaluation of Embodied AI

This work provides a systematic survey of evaluation methodologies for embodied AI, with a focus on benchmarking embodied manipulation capabilities that require physical tool use. It discusses how standardized evaluation protocols advance robots' embodied understanding of real-world environments, and introduces a new comprehensive real-world visual-language-action dataset to support more robust evaluation. This survey addresses a critical community need for consistent, reliable assessment of new embodied AI systems.

5

FutureVLA: Joint Visuomotor Prediction for Vision-Language-Action Model

FutureVLA is a new vision-language-action model that introduces joint visuomotor prediction under a unified robot embodiment framework. The method is validated on four manipulation tasks in the SimplerEnv simulated testbed and on a physical Franka Emika robot arm. It improves spatial representation learning for VLA models, enabling more accurate visuomotor coordination for robotic manipulation.
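To make "joint visuomotor prediction" concrete, one common way such an objective can be set up is a shared encoder with one head regressing the next actions and another predicting future visual features, trained with a combined loss. The module sizes and names below are assumptions for illustration, not FutureVLA's architecture.

    import torch.nn as nn
    import torch.nn.functional as F

    class JointVisuomotorModel(nn.Module):
        """Shared encoder with an action head and a future-observation head (illustrative)."""
        def __init__(self, obs_dim=512, act_dim=7, horizon=8):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
            self.action_head = nn.Linear(256, act_dim * horizon)  # future action chunk
            self.future_head = nn.Linear(256, obs_dim)            # future visual features

        def forward(self, obs_feat):
            h = self.encoder(obs_feat)
            return self.action_head(h), self.future_head(h)

    def joint_loss(model, obs_feat, actions_gt, next_obs_feat, w=0.5):
        # actions_gt flattened to (batch, act_dim * horizon); next_obs_feat is (batch, obs_dim)
        pred_act, pred_next = model(obs_feat)
        return F.mse_loss(pred_act, actions_gt) + w * F.mse_loss(pred_next, next_obs_feat)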

6

Metamorphic Testing of Vision-Language Action-Enabled Robots

This preprint proposes a metamorphic testing framework for identifying flaws and reliability issues in vision-language-action enabled robotic systems. The framework is tested across two different robotic platforms and four standard manipulation benchmarks to demonstrate its generalizability. It addresses the understudied problem of validating and ensuring the reliability of VLA-powered robots before real-world deployment.
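For readers unfamiliar with metamorphic testing, the idea is to check relations that must hold between related inputs rather than exact expected outputs. A minimal sketch of one plausible relation for a VLA-driven robot (a paraphrased instruction should not change task success) is shown below; the helper names and environment interface are assumptions, not the paper's framework.

    def relation_paraphrase(run_episode, env, instruction, paraphrase):
        """Relation: paraphrasing the instruction must not change task success."""
        env.reset(seed=0)                          # assumed Gym-style environment
        outcome_a = run_episode(env, instruction)  # e.g. True if the task succeeded
        env.reset(seed=0)                          # identical initial scene
        outcome_b = run_episode(env, paraphrase)
        return outcome_a == outcome_b              # False signals a potential flaw

    def run_suite(run_episode, env, cases):
        """cases: iterable of (instruction, paraphrase) pairs; returns violations."""
        return [pair for pair in cases
                if not relation_paraphrase(run_episode, env, *pair)]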

7

Action-Sketcher: From Reasoning to Action via Visual Sketches for Long-Horizon Robotic Manipulation

Action-Sketcher is a new approach to long-horizon robotic manipulation that bridges high-level reasoning and low-level action execution via intermediate visual sketches. The method leverages vision-language-action modeling and is validated on a diverse set of embodied reasoning benchmarks, showing improved performance on sequential, multi-step manipulation tasks. It addresses the common challenge of planning long-horizon tasks from raw language instructions by adding an interpretable intermediate reasoning step.
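A minimal pipeline sketch of the general "reason, then sketch, then act" pattern appears below; all interfaces are assumed and this is not the authors' implementation.

    def run_long_horizon_task(reasoner, sketcher, vla_policy, env, instruction):
        """Decompose the task, render a visual sketch per subgoal, act with a VLA policy."""
        obs = env.reset()
        for subgoal in reasoner.generate_plan(instruction, obs):   # high-level reasoning
            sketch = sketcher.render(subgoal, obs)                 # interpretable visual sketch
            while not env.subgoal_done(subgoal):
                action = vla_policy.predict(obs, sketch, subgoal)  # sketch-conditioned action
                obs = env.step(action)                             # assumed to return next observation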

8

ST4VLA: Spatially Guided Training for Vision-Language-Action Models

ST4VLA is a novel dual-system vision-language-action framework that incorporates explicit spatial priors into training for robotic manipulation. The approach explicitly integrates spatial knowledge into embodied robot control, improving alignment between visual-language understanding and motor execution. This method addresses the common limitation of standard VLA models that lack structured spatial awareness for precise physical interaction.
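One common way such spatial guidance is realized in practice is an auxiliary loss that makes the model predict the referred object's location alongside its action output. The sketch below illustrates that idea under assumed interfaces; it is not ST4VLA's exact training recipe.

    import torch.nn.functional as F

    def spatially_guided_loss(model, batch, spatial_weight=0.1):
        """Imitation loss plus an auxiliary spatial-grounding term (illustrative)."""
        out = model(batch["image"], batch["instruction"])              # assumed model interface
        action_loss = F.mse_loss(out["action"], batch["action_gt"])
        # auxiliary head predicts the referred object's position (pixel or 3D)
        spatial_loss = F.mse_loss(out["target_pos"], batch["target_pos_gt"])
        return action_loss + spatial_weight * spatial_loss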

9

A Cognitive Architecture for Embodied AI

This book chapter outlines a new cognitive architecture design for embodied AI that builds on recent advances in vision-language-action modeling. It proposes that robots improve task performance through continuous iterative refinement, covering core capabilities including object manipulation, locomotion, and navigation. The work integrates VLA capabilities into a broader cognitive framework to support general-purpose embodied robot operation.

10

ZeroWBC: Learning Natural Visuomotor Humanoid Control Directly from Human Egocentric Video

ZeroWBC is a new method for learning natural visuomotor humanoid control directly from unlabeled human egocentric video, without requiring supervised robot demonstration data. It addresses key limitations of existing humanoid control frameworks, including the absence of force feedback that constrains precise manipulation, and connects to broader advances in vision-language-action modeling for embodied AI. The approach enables more data-efficient learning of general humanoid capabilities by leveraging abundant human video data.
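A hedged sketch of the broad recipe that learning-from-human-video methods tend to follow is given below: estimate body pose from egocentric frames, retarget it to the humanoid as pseudo action labels, and behavior-clone a visuomotor policy on the result. All function names are assumptions, not ZeroWBC's code.

    def build_pseudo_dataset(videos, pose_estimator, retarget):
        """Turn unlabeled human egocentric video into (observation, action) pairs."""
        dataset = []
        for frames in videos:
            for t in range(len(frames) - 1):
                next_pose = pose_estimator(frames[t + 1])   # where the human body moves next
                action = retarget(next_pose)                # map to humanoid joint targets
                dataset.append((frames[t], action))         # pseudo-labeled training pair
        return dataset

    def behavior_clone(policy, dataset, optimizer, loss_fn):
        """Standard behavior cloning on the pseudo-labeled pairs (assumed torch-style objects)."""
        for obs, action in dataset:
            loss = loss_fn(policy(obs), action)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()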