27 March 2026

Daily Robotics Digest

36 curated items from arXiv, industry news, and the community

Executive Summary

This daily robotics digest features 36 new research preprints focused heavily on advances in vision-language-action (VLA) models, spanning autonomous driving, robotic manipulation, and aerial robotics. The collection also includes new work in navigation, multi-robot coordination, data generation for robot learning, human-robot interaction, and safety engineering, highlighting the field's rapid progress toward more generalizable, personalized, and adaptable robotic systems. This batch of preprints reflects the community's growing focus on integrating natural language interaction across robotic domains while addressing core challenges in data scaling, long-horizon performance, and robustness.

📄 New Research Papers

36 items
1

A Cognitive Architecture for Embodied AI

This book chapter outlines a cognitive architecture for embodied AI that centers on continuous performance refinement for robotic systems. It positions the work within recent advances in vision-language-action frameworks for core robotic tasks, including locomotion and object manipulation. The work emphasizes the role of iterative improvement in enabling robust embodied behavior.

2

Vega: Learning to Drive with Natural Language Instructions

Sicheng Zuo, Yuxuan Li, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu

Existing vision-language-action models for autonomous driving typically use language only for scene description or reasoning, and lack the flexibility to follow diverse, personalized user driving instructions. To address this gap, the authors construct InstructScene, a large-scale dataset of 100,000 driving scenes annotated with natural language instructions and matching trajectories. They also introduce Vega, a unified vision-language-world-action model designed specifically for instruction-based personalized autonomous driving.

3

Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving

Zehao Wang, Huaide Jiang, Shuaiwu Dong, Yuping Wang, Hang Qiu, Jiachen Li

Human driving behavior is inherently personalized, with consistent individual differences in acceleration, braking, merging, and other maneuvers across situations. Most current autonomous driving systems only optimize for generic performance or use fixed driving modes, and cannot adapt to individual user preferences or interpret natural language intent. Authors propose Drive My Way (DMW), a personalized vision-language-action driving framework that aligns with users' long-term driving preferences and natural language intents.

4

SoftMimicGen: A Data Generation System for Scalable Robot Learning in Deformable Object Manipulation

Masoud Moghani, Mahdi Azizian, Animesh Garg, Yuke Zhu, Sean Huver, Ajay Mandlekar

Large-scale datasets are critical for learning robotic manipulation skills, but collecting real-world data is time-intensive, costly, and difficult to scale. While synthetic simulation data has shown promise for reducing real-world data requirements, scalable data generation for deformable object manipulation remains underdeveloped. Authors introduce SoftMimicGen, a simulation-based data generation system designed to enable scalable robot learning for deformable object manipulation tasks.

5

Intelligent Navigation and Obstacle-Aware Fabrication for Mobile Additive Manufacturing Systems

Yifei Li, Ruizhe Fu, Huihang Liu, Guha Manogharan, Feng Ju, Ilya Kovalenko

Growing demand for mass customization requires more flexible additive manufacturing (AM) systems, which are currently limited by fixed equipment layouts. Integrating mobile robots with additive manufacturing creates mobile AM robots (MAMbots) that can adapt to changing production requirements to increase flexibility. This work develops an integrated system for intelligent navigation and obstacle-aware fabrication for MAMbots to enable flexible on-demand production.

6

Persistent Robot World Models: Stabilizing Multi-Step Rollouts via Reinforcement Learning

Jai Bardhan, Patrik Drozdik, Josef Sivic, Vladimir Petrik

Action-conditioned robot world models predict future scene frames from given action sequences, and are a promising alternative to traditional physics engines for hard-to-model tasks. However, these models are typically optimized for short-term prediction, and errors compound during autoregressive multi-step rollouts, leading to rapid degradation in visual quality. Authors introduce a reinforcement learning-based post-training scheme that stabilizes long rollouts, creating persistent robot world models that maintain visual quality over extended prediction horizons.
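
For context on why long rollouts degrade, the minimal sketch below (illustrative only, not the paper's model; `toy_model` is a made-up stand-in for a learned predictor) shows the autoregressive pattern in which each predicted frame is fed back as the next input, so any prediction error compounds over the horizon:

```python
import numpy as np

def rollout(world_model, frame, actions):
    """Autoregressively roll out an action-conditioned world model.

    Each predicted frame is fed back as the next input, so any
    prediction error compounds over the horizon -- the degradation
    that post-training for persistence aims to suppress.
    """
    frames = [frame]
    for a in actions:
        frame = world_model(frame, a)   # prediction becomes the next input
        frames.append(frame)
    return frames

# Toy stand-in for a learned model: a noisy linear update.
def toy_model(frame, action):
    return 0.99 * frame + 0.01 * action + np.random.normal(0, 0.01, frame.shape)

traj = rollout(toy_model, np.zeros(4), [np.ones(4)] * 50)
```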

7

Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance

Wenxuan Song, Jiayi Chen, Shuai Chen, Jingbo Wang, Pengxiang Ding, Han Zhao, Yikai Qin, Xinhu Zheng, Donglin Wang, Yan W...

While advanced finetuning methods for vision-language-action (VLA) models can improve performance and reduce convergence steps compared to standard supervised finetuning (SFT), they add significant computational overhead from auxiliary task losses. This work aims to achieve the performance benefits of auxiliary training while retaining the simplicity of standard SFT. Authors propose Fast-dVLA, an approach that decouples auxiliary and supervised training objectives to accelerate discrete diffusion VLA models to real-time performance.

8

A Mentalistic Interface for Probing Folk-Psychological Attribution to Non-Humanoid Robots

Giulio Pisaneschi, Pierpaolo Serio, Estelle Gerbier, Andrea Dan Ryals, Lorenzo Pollini, Mario G. C. A. Cimino

Human observers often adopt an 'intentional stance' toward robots, attributing intentional mental states to their behavior, but it is unclear how explanatory language shapes this attribution. This work introduces a controlled experimental platform for studying how explanatory framing affects intentional-state attribution toward non-humanoid robots. The platform combines simulated robots, realistic task environments, and a large language model that can explain the same robot behavior in mentalistic, teleological, or mechanistic terms, enabling rigorous studies of human perception of robots.

9

Accurate Surface and Reflectance Modelling from 3D Radar Data with Neural Radiance Fields

Judith Treffler, Vladimír Kubelka, Henrik Andreasson, Martin Magnusson

Robust scene representation in low-visibility environments (such as fog, smoke, or dust) is critical for safe autonomous robot operation, and radar outperforms cameras and lidars in these conditions. However, radar data is inherently sparse and noisy, making reliable 3D surface reconstruction very challenging. Authors propose a neural implicit approach based on neural radiance fields that jointly models scene geometry and view-dependent radar intensities from 3D radar point clouds, enabling accurate surface and reflectance modeling for robust navigation in poor visibility.

10

Towards Generalizable Robotic Data Flywheel: High-Dimensional Factorization and Composition

Yuyang Xiao, Yifei Zhou, Haoran Wang, Wenxuan Ou, Yuxiao Liu

A shortage of diverse training data and low data efficiency remain major bottlenecks for developing generalist robotic models, and there are few systematic strategies for collecting and curating broadly useful datasets. Task diversity arises from implicit multi-dimensional factors that are difficult to explicitly define and sample for training. Authors propose F-ACIL, a factor-aware compositional iterative learning framework that enables structured factorization of existing robotic data and promotes compositional generalization, advancing the goal of a generalizable robotic data flywheel.

11

Towards Embodied AI with MuscleMimic: Unlocking full-body musculoskeletal motor learning at scale

Chengkun Li, Cheryl Wang, Bianca Ziliotto, Merkourios Simos, Jozsef Kovecses, Guillaume Durandau, Alexander Mathis

Scalable motor learning for muscle-driven musculoskeletal humanoid models has been held back by high computational costs for accurate biomechanical simulation and a lack of validated open full-body models. Authors introduce MuscleMimic, an open-source framework for scalable full-body musculoskeletal motor learning with physiologically accurate muscle-actuated humanoid models. The framework provides two validated embodiments (a 126-muscle upper-body model for bimanual manipulation and a 416-muscle full-body model for locomotion) plus a motion retargeting pipeline, unlocking large-scale research in embodied musculoskeletal motor control.

12

LILAC: Language-Conditioned Object-Centric Optical Flow for Open-Loop Trajectory Generation

Motonari Kambara, Koki Seno, Tomoya Kaichi, Yanan Wang, Komei Sugiura

Flow-based trajectory generation for language-conditioned robotic manipulation enables training on human and web video data, and requires minimal embodiment-specific training data. The key challenge for this approach is aligning natural language instructions to generated object trajectories accurately. Authors propose LILAC, a flow-based vision-language-action model that generates object-centric 2D optical flow from pre-manipulation images and natural language instructions for open-loop trajectory generation, reducing data requirements for new manipulation tasks.

13

Visualizing Impedance Control in Augmented Reality for Teleoperation: Design and User Evaluation

Gijs van den Brandt, Femke van Beek, Elena Torta

Contact-rich robotic teleoperation remains challenging when using low-cost motion-only interfaces that lack haptic feedback, as operators cannot perceive or regulate contact forces effectively. To address this limitation, authors propose an augmented reality visualization that displays an impedance controller's target pose and its displacement from the robot end effector, conveying real-time contact force information to operators without haptic hardware. A user evaluation of the system confirms that the visualization improves task performance for contact-rich teleoperation tasks.
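
As background on why the displacement overlay conveys force: under a Cartesian impedance law, the offset between the controller's target pose and the measured end-effector pose maps directly to the commanded force. The sketch below is a generic illustration with made-up gains, not the paper's controller or visualization:

```python
import numpy as np

# Hypothetical stiffness/damping gains; the paper's actual controller
# parameters are not given in the summary.
K = np.diag([400.0, 400.0, 400.0])   # N/m translational stiffness
D = np.diag([40.0, 40.0, 40.0])      # Ns/m damping

def implied_contact_force(x_target, x_ee, v_ee):
    """Force commanded by a Cartesian impedance law.

    The AR overlay described above shows x_target and its displacement
    (x_target - x_ee); with a known stiffness, that displacement is a
    direct proxy for the applied force.
    """
    return K @ (x_target - x_ee) - D @ v_ee

f = implied_contact_force(np.array([0.50, 0.0, 0.30]),
                          np.array([0.49, 0.0, 0.30]),
                          np.zeros(3))
print(f)  # ~[4, 0, 0] N for a 1 cm displacement at 400 N/m
```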

14

Modernising Reinforcement Learning-Based Navigation for Embodied Semantic Scene Graph Generation

Roman Kueble, Marco Hueller, Mrunmai Phatak, Rainer Lienhart, Joerg Haehner

Semantic scene graphs enable embodied agents to reason about objects, their relations, and spatial context, which is critical for objective-driven self-adaptation under uncertainty. The core challenge for on-device embodied semantic scene graph generation is collecting high-quality observations within a limited action budget, and existing exploration strategies are not optimized for this constraint. This work modernizes reinforcement learning-based navigation strategies specifically for the task of embodied semantic scene graph generation, improving model quality within finite action horizons.

15

MMaDA-VLA: Large Diffusion Vision-Language-Action Model with Unified Multi-Modal Instruction and Generation

Yang Liu, Pengxiang Ding, Tengyue Jiang, Xudong Wang, Wenxuan Song, Minghui Lin, Han Zhao, Hongyin Zhang, Zifeng Zhuang,...

Existing vision-language-action (VLA) models for robotic manipulation often rely on hierarchical or autoregressive architectures that introduce overhead, suffer from temporal inconsistency and long-horizon error accumulation, and require extra modules to capture environment dynamics. Authors present MMaDA-VLA, a fully native large pre-trained diffusion-based VLA model that unifies multi-modal instruction understanding and action generation in a single framework. The native discrete diffusion formulation addresses key limitations of prior VLA architectures, reducing error accumulation and enabling better capture of environment dynamics.

16

System Design for Maintaining Internal State Consistency in Long-Horizon Robotic Tabletop Games

Guangyu Zhao, Ceyao Zhang, Chengdong Ma, Tao Wu, Yiyang Song, Haoxuan Ru, Yifan Zhong, Ruilin Yan, Lingfeng Li, Ruochong...

Long-horizon multi-human robotic tabletop games pose a unique systems challenge: small initial perceptual or execution errors accumulate over time, invalidate the robot's internal task state, and propagate through decision-making modules to derail the interaction. Rather than improving individual components, this work studies how deliberate system design can maintain a consistent internal state over long interaction horizons. Using Mahjong as a testbed, the authors present an integrated architecture that explicitly maintains separate consistent states for perception, execution, and interaction, mitigating error propagation in long-horizon multi-agent robotic interaction.

17

LaMP: Learning Vision-Language-Action Policies with 3D Scene Flow as Latent Motion Prior

Xinkai Wang, Chenyi Wang, Yifu Xu, Mingzhe Ye, Fu-Cheng Zhang, Jialin Tian, Xinyu Zhan, Lifeng Zhu, Cewu Lu, Lixin Yang

Existing vision-language-action models regress robot actions directly from 2D semantic features, forcing them to implicitly learn complex 3D physical interactions, which degrades performance in unfamiliar environments with novel spatial dynamics. Authors introduce LaMP, a dual-expert VLA framework that incorporates dense 3D scene flow as a latent motion prior for robotic manipulation. The framework aligns a motion expert that generates 3D flow predictions with an action expert that predicts robot commands via gated cross-attention, improving performance on out-of-distribution manipulation tasks.
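
As a point of reference for the gating mechanism mentioned above, the snippet below shows a generic tanh-gated cross-attention block in the Flamingo style; the tensor names, sizes, and initialization are assumptions for illustration and do not come from the paper:

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Generic tanh-gated cross-attention block.

    Here the query stream stands in for action-expert tokens and the
    context stream for motion-expert (scene-flow) features; names and
    dimensions are illustrative, not taken from the paper.
    """
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # gate starts closed
        self.norm = nn.LayerNorm(dim)

    def forward(self, action_tokens, motion_feats):
        attended, _ = self.attn(self.norm(action_tokens), motion_feats, motion_feats)
        # tanh(gate) = 0 at init, so the motion prior is blended in gradually.
        return action_tokens + torch.tanh(self.gate) * attended

block = GatedCrossAttention(dim=256)
out = block(torch.randn(2, 16, 256), torch.randn(2, 64, 256))
```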

18

UMBRELLA: Uncertainty-aware Multi-robot Reactive Coordination under Dynamic Temporal Logic Tasks

Qisheng Zhao, Meng Guo, Hengxuan Du, Lars Lindemann, Zhongkui Li

Most existing multi-robot coordination methods assume static tasks or simply replan from scratch when the environment changes, and struggle with dynamic collaborative tasks involving moving targets. Authors propose UMBRELLA, a coordination framework that explicitly models uncertainty in target motion prediction via Conformal Prediction, while respecting spatial-temporal constraints specified by Linear Temporal Logic. The framework enables reactive, uncertainty-aware coordination for multi-robot teams working on dynamic temporal logic tasks.
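
For readers unfamiliar with Conformal Prediction, the sketch below shows the standard split-conformal step of turning held-out prediction errors into a calibrated uncertainty region; it is a generic illustration, not UMBRELLA's actual predictor or constraint handling:

```python
import numpy as np

def conformal_radius(pred_errors: np.ndarray, alpha: float = 0.1) -> float:
    """Split conformal prediction on held-out target-motion errors.

    pred_errors: distances between predicted and observed target positions
    on a calibration set. The returned radius defines a region that covers
    the true position with probability >= 1 - alpha (under exchangeability).
    """
    n = len(pred_errors)
    q = np.ceil((n + 1) * (1 - alpha)) / n          # finite-sample corrected quantile
    return float(np.quantile(pred_errors, min(q, 1.0), method="higher"))

calib = np.abs(np.random.normal(0, 0.2, size=500))   # toy calibration errors (m)
r = conformal_radius(calib, alpha=0.1)
print(f"90% prediction region radius: {r:.2f} m")
```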

19

IntentReact: Guiding Reactive Object-Centric Navigation via Topological Intent

Yanmei Jiao, Anpeng Lu, Wenhan Hu, Rong Xiong, Yue Wang, Huajin Tang, Wen-an Zhang

Object-goal visual navigation requires robots to reason about semantic structure under partial observability, and recent object-level topological map approaches enable long-horizon navigation without dense geometric reconstruction. However, these methods suffer from a gap between global topological guidance and local perception-driven control, as local decisions only use current egocentric observations and do not incorporate out-of-view global intent. Authors propose IntentReact, a method that integrates topological global intent into local reactive navigation, closing the gap between global planning and local control to improve long-horizon navigation performance.

20

Integrating Deep RL and Bayesian Inference for ObjectNav in Mobile Robotics

João Castelo-Branco, José Santos-Victor, Alexandre Bernardino

Autonomous object search for indoor mobile robots is challenging due to partial observability, perceptual uncertainty, and the need to balance exploration and navigation efficiency. Classical probabilistic methods explicitly model uncertainty but rely on handcrafted heuristics, while deep reinforcement learning enables adaptive policies but suffers from slow convergence and poor interpretability. Authors propose a hybrid framework that integrates Bayesian inference with deep reinforcement learning for object navigation, combining the strengths of both approaches to improve search performance.

21

Bayesian Learning-Enhanced Navigation with Deep Smoothing for Inertial-Aided Navigation

Nadav Cohen, Itzik Klein

Accurate post-processing navigation is critical for survey and mapping applications, where full measurement history can be used to refine past state estimates. Fixed-interval smoothing algorithms are optimal under Gaussian assumptions, but loosely coupled INS/GNSS systems inherit persistent systematic position bias from raw GNSS measurements that model-based smoothers cannot resolve. Authors propose BLENDS, a method that integrates Bayesian learning with deep smoothing to reduce position bias and improve navigation accuracy for inertial-aided post-processing navigation.
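
As background, a fixed-interval smoother refines every past state using the full measurement history by running a forward Kalman pass followed by a backward correction pass. The sketch below is a textbook Rauch-Tung-Striebel smoother on a toy 1-D constant-velocity model with made-up noise parameters, not BLENDS itself; a purely model-based pass like this is exactly what, per the summary, cannot remove the systematic GNSS position bias the paper targets:

```python
import numpy as np

# Minimal Rauch-Tung-Striebel (fixed-interval) smoother on a 1-D
# constant-velocity model; all parameters are illustrative.
dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])      # state transition
H = np.array([[1.0, 0.0]])                 # position measurement
Q = 1e-3 * np.eye(2)                       # process noise
R = np.array([[0.5]])                      # measurement noise

def rts_smooth(zs):
    n = len(zs)
    xf = np.zeros((n, 2)); Pf = np.zeros((n, 2, 2))   # filtered estimates
    xp = np.zeros((n, 2)); Pp = np.zeros((n, 2, 2))   # one-step predictions
    x, P = np.zeros(2), np.eye(2)
    for k, z in enumerate(zs):                         # forward Kalman pass
        xp[k], Pp[k] = F @ x, F @ P @ F.T + Q
        S = H @ Pp[k] @ H.T + R
        K = Pp[k] @ H.T @ np.linalg.inv(S)
        x = xp[k] + (K @ (z - H @ xp[k])).ravel()
        P = (np.eye(2) - K @ H) @ Pp[k]
        xf[k], Pf[k] = x, P
    xs = xf.copy()
    for k in range(n - 2, -1, -1):                     # backward RTS pass
        C = Pf[k] @ F.T @ np.linalg.inv(Pp[k + 1])
        xs[k] = xf[k] + C @ (xs[k + 1] - xp[k + 1])
    return xs

zs = [np.array([0.1 * k + np.random.normal(0, 0.7)]) for k in range(100)]
smoothed = rts_smooth(zs)
```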

22

SafeGuard ASF: SR Agentic Humanoid Robot System for Autonomous Industrial Safety

Thanh Nguyen Canh, Thang Tran Viet, Thanh Tuan Tran, Ben Wei Lim

The growth of unmanned 'dark factories' with no on-site human workers demands autonomous safety systems that can detect and respond to multiple types of industrial hazards. Authors present SafeGuard ASF (Agentic Security Fleet), a comprehensive framework that uses humanoid robots for autonomous industrial hazard detection and response. The system integrates RGB-D multi-modal perception, a ReAct-based agentic reasoning framework, and learned locomotion policies on the Unitree G1 humanoid platform to address common hazard scenarios including fire detection and abnormal temperature monitoring.

23

Connectivity-Aware Representations for Constrained Motion Planning via Multi-Scale Contrastive Learning

Suhyun Jeon, Yumin Lim, Woo-Jeong Baek, Hyeonseo Kim, Suhan Park, Jaeheung Park

Constrained motion planning requires connecting start and goal configurations while meeting task-specific constraints, but becomes inefficient or infeasible when feasible configurations are split into essentially mutually disconnected (EMD) components. Additional complexity comes from constraints that restrict the feasible space to low-dimensional submanifolds and from kinematic redundancy that creates discrete self-motion manifolds for single end-effector poses. Authors address these challenges using connectivity-aware representations learned via multi-scale contrastive learning, improving planning performance for constrained problems.

24

A Minimum-Energy Control Approach for Redundant Mobile Manipulators in Physical Human-Robot Interaction Applications

Davide Tebaldi, Niccolò Paradisi, Fabio Pini, Luigi Biagiotti

Mobile manipulators that can physically interact with humans enable new collaborative tasks that fixed-base manipulators cannot perform, but their additional degrees of freedom make controller design more challenging. A key open problem is optimizing controller performance for physical human-robot interaction scenarios. This work proposes a minimum-energy control approach for redundant mobile manipulators designed specifically for physical human-robot interaction applications, optimizing performance by leveraging the system's extra degrees of freedom.
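
For background on how extra degrees of freedom are typically exploited, the sketch below shows the classical pseudoinverse-plus-null-space redundancy resolution, where the pseudoinverse term yields the minimum-norm joint velocity and the null-space term accommodates secondary objectives; this is the textbook building block, not the paper's minimum-energy controller:

```python
import numpy as np

def redundancy_resolution(J, x_dot, q_dot_secondary=None):
    """Textbook redundancy resolution for a kinematically redundant robot.

    The pseudoinverse term gives the minimum-norm joint velocity realizing
    the task-space velocity x_dot; the null-space term lets a secondary
    objective use the extra DOFs without disturbing the task.
    """
    J_pinv = np.linalg.pinv(J)
    q_dot = J_pinv @ x_dot
    if q_dot_secondary is not None:
        N = np.eye(J.shape[1]) - J_pinv @ J        # null-space projector
        q_dot += N @ q_dot_secondary
    return q_dot

J = np.random.randn(6, 9)   # toy Jacobian for a 9-DOF mobile manipulator
qd = redundancy_resolution(J, np.array([0.1, 0, 0, 0, 0, 0]))
```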

25

The Competence Shadow: Theory and Bounds of AI Assistance in Safety Engineering

Umair Siddique

As AI assistants become more common in safety engineering workflows for physical AI and robotic systems, it is critical to understand whether they improve safety analysis or introduce systematic blind spots that only emerge after deployment. This work argues that safety engineering cannot be properly evaluated via standard benchmarking because safety competence is multidimensional, context-dependent, inherently incomplete, and subject to legitimate expert disagreement. Authors formalize this as the 'competence shadow' and provide theoretical bounds on AI assistance for safety engineering, highlighting unaddressed risks of blind spots in AI-aided safety work.

26

Dissimilarity-Based Persistent Coverage Control of Multi-Robot Systems for Improving Solar Irradiance Prediction Accuracy in Solar Thermal Power Plants

Haruki Kawase, Taiga Sugawara, A. Daniel Carnerero

Accurate solar irradiance forecasting is critical for effective control of solar thermal power plants, but existing kriging-based prediction methods lack a dynamic sampling strategy to position mobile sensors to optimize prediction accuracy in real time. Authors introduce a dissimilarity map derived from a kriging prediction model, and propose a dissimilarity-based persistent coverage control algorithm that guides mobile sensors to positions that maximize forecast accuracy. The approach enables more accurate irradiance prediction with fewer mobile sensors, improving performance of solar thermal power plants.
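
To illustrate the general idea of steering sensors toward where a kriging model is least trustworthy, the sketch below computes the simple-kriging (Gaussian process) posterior variance along a 1-D transect and picks the most uncertain location; the paper's dissimilarity map is defined differently, so treat this only as a conceptual stand-in with made-up sensor positions and kernel length scale:

```python
import numpy as np

def rbf(a, b, ell=2.0):
    """Squared-exponential kernel between 1-D location arrays."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

xs_obs = np.array([0.0, 3.0, 4.0, 9.0])          # current sensor positions (km)
xq = np.linspace(0.0, 10.0, 200)                  # candidate positions
K = rbf(xs_obs, xs_obs) + 1e-4 * np.eye(len(xs_obs))
k_star = rbf(xq, xs_obs)
# Posterior variance k(x,x) - k_*^T K^-1 k_*; high variance marks locations
# where an additional mobile irradiance sensor would help the most.
var = 1.0 - np.sum(k_star @ np.linalg.inv(K) * k_star, axis=1)
next_pos = xq[np.argmax(var)]
print(f"send a sensor near x = {next_pos:.1f} km")
```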

27

CTS-PLL: A Robust and Anytime Framework for Collaborative Task Sequencing and Multi-Agent Path Finding

Junkai Jiang, Yitao Xu, Ruochen Li, Shaobing Xu, Jianqiang Wang

The collaborative task sequencing and multi-agent path finding (CTS-MAPF) problem requires multi-robot teams to complete sequences of tasks while avoiding collisions, and is extremely computationally complex. Authors introduce CTS-PLL, a hierarchical framework that extends existing configuration-based CTS-MAPF planning with two key improvements: a lock agent detection and release mechanism that uses complete planning for local re-planning, and an anytime refinement procedure based on Large Neighborhood Search. The approach improves robustness in dense environments and produces progressively better solutions over time.
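
For readers new to Large Neighborhood Search, the sketch below shows the generic anytime destroy-and-repair loop on a toy task-sequencing instance; the destroy and repair operators, instance, and parameters are invented for illustration and are not CTS-PLL's:

```python
import random

def lns(initial, cost, destroy, repair, iters=1000):
    """Generic anytime Large Neighborhood Search loop.

    Destroy part of the incumbent solution, repair it, and keep the result
    if it improves; the best solution so far is always available (anytime).
    """
    best, best_cost = initial, cost(initial)
    for _ in range(iters):
        cand = repair(destroy(best))
        c = cost(cand)
        if c < best_cost:
            best, best_cost = cand, c
    return best, best_cost

# Toy instance: order tasks on a line to minimize travel; destroy removes a
# random chunk, repair greedily re-inserts it. Purely illustrative.
tasks = list(range(20))
pos = {t: random.uniform(0, 100) for t in tasks}
cost = lambda seq: sum(abs(pos[a] - pos[b]) for a, b in zip(seq, seq[1:]))

def destroy(seq, k=5):
    removed = random.sample(seq, k)
    return [t for t in seq if t not in removed], removed

def repair(partial_and_removed):
    seq, removed = partial_and_removed
    for t in removed:
        best_i = min(range(len(seq) + 1),
                     key=lambda i: cost(seq[:i] + [t] + seq[i:]))
        seq = seq[:best_i] + [t] + seq[best_i:]
    return seq

sol, c = lns(random.sample(tasks, len(tasks)), cost, destroy, repair)
```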

28

ThermoAct: Thermal-Aware Vision-Language-Action Models for Robotic Perception and Decision-Making

Young-Chae Son, Dae-Kwan Ko, Yoon-Ji Choi, Soo-Chul Lim

Modern human-robot collaboration environments increasingly benefit from integrating diverse sensor data beyond standard RGB vision to improve safety and efficiency, but thermal data has been largely overlooked in prior vision-language-action (VLA) research. Authors propose ThermoAct, a novel VLA framework that incorporates thermal sensor information for robotic perception and task execution. The framework uses a vision-language model as a high-level planner to interpret complex natural language instructions, leveraging thermal data to improve safety and task performance in human-robot collaboration.

29

π, But Make It Fly: Physics-Guided Transfer of VLA Models to Aerial Manipulation

Johnathan Tucker, Denis Liu, Aiden Swann, Allen Ren, Javier Yu, Jiankai Sun, Brandon Kim, Lachlain McGranahan, Quan Vuon...

Pretrained vision-language-action (VLA) foundation models like π₀ have achieved strong generalization across fixed-base manipulators, but transferring these models to aerial manipulation remains challenging due to the mismatch between quasi-static fixed-base dynamics and underactuated highly dynamic flight dynamics. Authors introduce AirVLA, a system that investigates the transferability of manipulation-pretrained VLA models to aerial pick-and-place tasks. The work finds that visual representations transfer effectively from ground to aerial platforms, but control parameters require physics-guided adaptation to account for the different dynamics of flight.

30

Learning Rollout from Sampling: An R1-Style Tokenized Traffic Simulation Model

Ziyan Wang, Peng Chen, Ding Li, Chiwei Li, Qichao Zhang, Zhongpu Xia, Guizhen Yu

High-fidelity learned traffic simulations trained on human driving data are critical for evaluating autonomous driving systems, and recent work has applied the next-token prediction paradigm from large language models to this problem with supervised fine-tuning. However, next-token prediction approaches limit active exploration of potentially valuable motion tokens, especially in underrepresented suboptimal regions of the state space. Authors propose an R1-style tokenized traffic simulation model that uses entropy-based uncertainty to drive exploration, enabling better coverage of rare traffic scenarios for more robust autonomous driving evaluation.
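
As a rough illustration of entropy-driven exploration over motion tokens, the sketch below samples from the token distribution when its entropy is high and acts greedily when it is low; the threshold and decision rule are assumptions for illustration, not the paper's R1-style training procedure:

```python
import numpy as np

def next_motion_token(logits, entropy_threshold=1.5, rng=None):
    """Entropy-gated token selection for a tokenized rollout policy.

    When the model is confident (low entropy), take the greedy token; when
    it is uncertain, sample so that under-represented motion tokens are
    still explored.
    """
    rng = rng or np.random.default_rng()
    p = np.exp(logits - logits.max())
    p /= p.sum()
    entropy = -np.sum(p * np.log(p + 1e-12))
    if entropy < entropy_threshold:
        return int(np.argmax(p))         # exploit: confident prediction
    return int(rng.choice(len(p), p=p))  # explore: sample rare tokens

tok = next_motion_token(np.random.randn(256))
```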