26 March 2026

Daily Robotics Digest

63 curated items from arXiv, industry news, and the community

Executive Summary

This edition of the XRollout Daily Robotics Digest collects 63 new preprints and academic works spanning core robotics research domains, with a strong focus on advancing vision-language-action models, multi-agent coordination, and generalist robot control. Key contributions address longstanding challenges in tactile perception for contact-rich interaction, sim-to-real transfer, safe control for humanoids, and scalable simulation asset generation. The collection spans application areas from surgical robotics and agricultural automation to autonomous driving and swarm drone displays, offering new tools and insights for both researchers and practitioners.

📄

New Research Papers

63 items
1

VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

Haoran Yuan, Weigang Yi, Zhenyu Zhang, Wendi Chen, Yuchen Mo, Jiashi Yin, Xinzhuo Li, Xiangyu Zeng, Chuan Wen, Cewu Lu, ...

Video-Action Models (VAMs) enable strong long-horizon task performance via visual reasoning, but fail to capture fine-grained force and contact information critical for contact-rich physical interactions. This work introduces VTAM, a Video-Tactile-Action Model that integrates tactile sensing to address the limitations of vision-only VAMs. The model enables more precise and stable behavior in scenarios where critical interaction states are not fully observable from vision alone.

2

Planning over MAPF Agent Dependencies via Multi-Dependency PIBT

Zixiang Jiang, Yulun Zhang, Rishi Veerapaneni, Jiaoyang Li

Modern Multi-Agent Path Finding (MAPF) requires efficient algorithms that can plan for hundreds to thousands of agents in congested environments within tight time constraints. The popular PIBT and its extension EPIBT are limited by their rule-based design, which restricts planning to conflicts involving at most one other agent, reducing generality. This work presents Multi-Dependency PIBT, a new approach that expands PIBT to handle multiple concurrent agent dependencies for more flexible multi-agent planning.
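For context, the two standard MAPF conflict types that PIBT-style planners must resolve can be sketched in a few lines. This is an illustrative helper, not the paper's Multi-Dependency algorithm; the path representation (a list of grid cells indexed by timestep, with agents waiting at their final cell) is an assumption.

```python
def first_conflict(path_a, path_b):
    """Return the first MAPF conflict between two timed paths, or None.

    Two standard conflict types: a vertex conflict (both agents occupy the
    same cell at time t) and an edge conflict (the agents swap cells
    between t and t + 1).
    """
    horizon = max(len(path_a), len(path_b))
    # Agents wait at their final cell once their path is exhausted.
    at = lambda p, t: p[min(t, len(p) - 1)]
    for t in range(horizon):
        if at(path_a, t) == at(path_b, t):
            return ("vertex", t, at(path_a, t))
        if (t + 1 < horizon
                and at(path_a, t + 1) == at(path_b, t)
                and at(path_b, t + 1) == at(path_a, t)):
            return ("edge", t, (at(path_a, t), at(path_b, t)))
    return None
```

PIBT resolves such conflicts one neighbor at a time via priority inheritance; the Multi-Dependency extension summarized above generalizes this to several concurrent dependencies.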

3

Rectify, Don't Regret: Avoiding Pitfalls of Differentiable Simulation in Trajectory Prediction

Harsh Yadav, Christian Bohn, Tobias Meisen

Open-loop trajectory prediction models for autonomous driving suffer from compounding errors that cascade from small initial deviations. While differentiable closed-loop simulators aim to solve this problem, they are prone to shortcut learning, where future ground-truth information leaks into model predictions via gradient flow. This work argues for rectifying this leakage rather than training around it, addressing the core issue of non-causal error correction in differentiable simulation.

4

SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM

Chuanrui Zhang, Minghan Qin, Yuang Wang, Baifeng Xie, Hang Li, Ziwei Wang

High-quality sim-ready articulated 3D assets are critical for embodied AI and physical simulation, but modern 3D generation focuses primarily on static meshes, creating a supply gap. Existing articulated asset creation methods use multi-stage pipelines that accumulate error, while unified MLLM-based approaches face high memory overhead from dense voxel tokenization that limits scalability. This work introduces SIMART, a method that decomposes monolithic static meshes into sim-ready articulated assets using an MLLM-based approach optimized for lower memory usage and scalability.

5

ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment

Yuzhi Chen, Ronghan Chen, Dongjie Huo, Yandan Yang, Dekang Qi, Haoyun Liu, Tong Lin, Shuang Zeng, Junjin Xiao, Xinyuan C...

Existing video-based world models for robotic manipulation often generate physically implausible behavior such as object penetration and anti-gravity motion, due to training on generic visual data and likelihood objectives that ignore physical constraints. This work presents ABot-PhysWorld, a 14B-parameter Diffusion Transformer model designed to generate physically plausible, action-controllable manipulation videos. The model is trained on a curated dataset of 3 million physics-annotated manipulation clips and uses a novel DPO-based post-training alignment to enforce physical consistency.
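The DPO-based alignment mentioned above builds on the generic Direct Preference Optimization objective, which can be sketched for a single preference pair. This is a hedged illustration of the standard DPO loss, not the paper's exact physics-alignment variant; the log-probabilities and `beta` are placeholder inputs.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: reward the policy for
    raising the (reference-relative) log-probability of the preferred
    sample, e.g. a physically plausible clip, over the rejected one.
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

When the policy prefers the plausible clip no more than the reference does, the margin is zero and the loss sits at log 2; alignment training pushes the margin positive and the loss down.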

6

PinPoint: Monocular Needle Pose Estimation for Robotic Suturing via Stein Variational Newton and Geometric Residuals

Jesse F. d'Almeida, Tanner Watts, Susheela Sharma Stern, James Ferguson, Alan Kuntz, Robert J. Webster

Reliable 3D needle pose estimation is critical for autonomous robotic suturing, but nearly all existing methods rely on stereoscopic vision. In common monocular endoscopic settings, depth ambiguity and rotational symmetry create a multimodal distribution of feasible poses rather than a single well-defined estimate, making the problem inherently ill-posed. This work introduces PinPoint, a probabilistic variational inference framework based on Stein Variational Newton and geometric residuals that directly accounts for pose ambiguity to enable accurate monocular needle pose estimation.
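The particle-based variational idea behind PinPoint can be illustrated with first-order Stein variational gradient descent (SVGD), the simpler cousin of the Stein Variational Newton method the paper uses. This toy sketch fits 1-D particles to a standard Gaussian; the kernel bandwidth, step size, and target are all placeholder choices, and the point is only that a particle set can represent an ambiguous, multimodal posterior instead of collapsing to one estimate.

```python
import numpy as np

def svgd_step(x, grad_logp, h=0.5, eps=0.1):
    """One SVGD update for 1-D particles x: each particle is pulled toward
    high-density regions (kernel-weighted score) and pushed apart by the
    kernel gradient, so the set approximates the posterior as a whole."""
    diff = x[:, None] - x[None, :]            # diff[i, j] = x_i - x_j
    k = np.exp(-diff**2 / (2 * h))            # RBF kernel matrix
    drive = k @ grad_logp(x)                  # attraction toward high density
    repulse = (diff * k).sum(axis=1) / h      # repulsion keeps particles spread
    return x + eps * (drive + repulse) / len(x)

# Toy target: standard normal, so grad log p(x) = -x.
particles = np.linspace(-4.0, 4.0, 20)
for _ in range(2000):
    particles = svgd_step(particles, lambda p: -p)
```

After the loop the particles approximate the target distribution rather than a single point, which is the behavior needed when depth ambiguity makes several needle poses equally consistent with one monocular view.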

7

Edge Radar Material Classification Under Geometry Shifts

Jannik Hohmann, Dong Wang, Andreas Nüchter

Material classification improves robotic navigation and interaction in conditions where cameras and LiDAR degrade in performance. This work presents a lightweight mmWave radar material classification pipeline optimized for ultra-low-power edge devices that achieves 94.2% macro-F1 score under nominal training geometry. The work also identifies a significant performance drop under realistic geometry shifts such as sensor height changes and small tilts, highlighting a key open challenge for edge radar perception.
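The 94.2% figure above is a macro-F1 score, which weights every material class equally regardless of how often it appears. A minimal sketch of the metric (the class labels are placeholders):

```python
def macro_f1(y_true, y_pred, classes):
    """Macro-F1: unweighted mean of per-class F1 scores, so a rare
    material class counts as much as a common one (unlike accuracy)."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)
```

This equal weighting is why the reported geometry-shift drop matters: degradation on even one material class pulls the macro average down directly.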

8

Strain-Parameterized Coupled Dynamics and Dual-Camera Visual Servoing for Aerial Continuum Manipulators

Niloufar Amiri, Farrokh Janabi-Sharifi

Tendon-driven aerial continuum manipulators combine the maneuverability of UAVs with the compliance of continuum robots, but existing coupled dynamic models have high computational cost and do not explicitly account for underactuation of the aerial base. This work presents a generalized dynamic formulation for underactuated coupled TD-ACMs that integrates a strain-parameterized Cosserat rod model with a rigid-body UAV model into a unified framework. The approach also includes a dual-camera visual servoing control scheme for this class of manipulators.

9

Learning Multi-Agent Local Collision-Avoidance for Collaborative Carrying Tasks with Coupled Quadrupedal Robots

Francesca Bray, Simone Tolomei, Andrei Cramariuc, Cesar Cadena, Marco Hutter

Collaborative carrying by multiple quadrupedal robots has great potential for warehouse and construction applications, but existing coordination methods mostly assume obstacle-free environments or rely on pre-recorded maps and off-line planning, making them unsuitable for most real-world scenarios. This work focuses on local collision avoidance for two mechanically coupled quadrupedal robots performing collaborative carrying. The work proposes a learned approach that enables adaptive, on-the-fly collision avoidance without prior maps, supporting deployment in unstructured real environments.

10

A Multimodal Framework for Human-Multi-Agent Interaction

Shaid Hasan, Breenice Lee, Sujan Sarker, Tariq Iqbal

Human-robot interaction is increasingly moving toward multi-robot socially interactive environments, but existing systems struggle to unify multimodal perception, embodied expression, and coordinated decision-making into a single scalable framework. This work introduces a multimodal framework for human-multi-agent interaction where each individual robot acts as an autonomous cognitive agent with integrated multimodal perception and LLM-driven planning grounded in embodiment. A central team-level coordination module manages shared interaction goals to enable natural human interaction with a robot team in shared physical spaces.

11

Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy via Spherical Harmonics for Robot Manipulation

Qinglun Zhang, Shen Cheng, Tian Dan, Haoqiang Fan, Guanghui Liu, Shuaicheng Liu

SE(3)-equivariant policies improve data efficiency for robotic manipulation, but existing methods suffer from high computational cost, reliance on single-modality inputs, and instability when combined with fast sampling methods. This work introduces E3Flow, a hybrid SE(3)-equivariant visuomotor flow policy framework built on spherical harmonic representations that unifies efficient rectified flow with stable multi-modal equivariant learning for the first time. The approach addresses key limitations of existing equivariant diffusion policies, enabling more practical deployment for manipulation tasks.

12

AeroScene: Progressive Scene Synthesis for Aerial Robotics

Nghia Vu, Tuong Do, Dzung Tran, Binh X. Nguyen, Hoan Nguyen, Erman Tjiputra, Quang D. Tran, Hai-Nguyen Nguyen, Anh Nguye...

Drone simulators currently rely heavily on manual scene creation, which is time-consuming and difficult to scale, despite the growing impact of generative models across robotics. This work introduces AeroScene, a hierarchical diffusion model for progressive 3D scene synthesis specifically for aerial robotics simulation. The approach uses hierarchy-aware tokenization and multi-branch feature extraction to jointly reason about global scene layout and local details, ensuring physical plausibility of generated scenes and reducing manual effort for simulation environment creation.

13

Path Planning and Reinforcement Learning-Driven Control of On-Orbit Free-Flying Multi-Arm Robots

Álvaro Belmonte-Baeza, José Luis Ramón, Leonard Felicetti, Miguel Cazorla, Jorge Pomares

On-orbit servicing requires reliable motion planning and control for free-flying multi-arm robots, which must handle dynamic and kinematic constraints as well as uncertainty in the space environment. This work presents a hybrid approach that integrates trajectory optimization (TO) for feasible path generation with reinforcement learning (RL) for adaptive trajectory tracking under uncertainty. The multi-arm robot design includes thrusters for body control, enabling redundancy and stability for complex space operations, while the hybrid approach reduces tracking error compared to single-method baselines.

14

LiZIP: An Auto-Regressive Compression Framework for LiDAR Point Clouds

Aditya Shibu, Kayvan Karim, Claudio Zito

The large data volume generated by LiDAR sensors in autonomous vehicles creates processing and V2X transmission bottlenecks. Existing lossless compression methods face a tradeoff between adaptability and computational cost: standard algorithms like LASzip lack adaptability, while deep learning approaches have prohibitive computational overhead. This work introduces LiZIP, a lightweight, near-lossless zero-drift compression framework based on neural predictive coding that uses a compact MLP to predict point coordinates from local context, balancing compression performance and computational efficiency.
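The predictive-coding idea behind LiZIP can be sketched in miniature. The paper uses a compact MLP as the predictor; the sketch below swaps in a simple linear extrapolation on one integer coordinate stream to show the round trip, and the example values are invented.

```python
def encode(coords):
    """Predictive coding: transmit residuals against a predictor instead
    of raw values. On smooth scan lines the residuals are near zero, so a
    downstream entropy coder (not shown) can pack them into far fewer bits.
    A linear extrapolation p[i] = 2*p[i-1] - p[i-2] stands in for LiZIP's
    learned MLP predictor."""
    residuals = list(coords[:2])                  # first two sent verbatim
    for i in range(2, len(coords)):
        pred = 2 * coords[i - 1] - coords[i - 2]
        residuals.append(coords[i] - pred)
    return residuals

def decode(residuals):
    """Invert the encoder exactly: rerun the predictor and add residuals."""
    coords = list(residuals[:2])
    for r in residuals[2:]:
        coords.append(2 * coords[-1] - coords[-2] + r)
    return coords
```

Because encoder and decoder run the same predictor, integer reconstruction is exact, which is the "zero-drift" property the summary highlights.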

15

PHANTOM Hand

Teng Yan, Jiongxu Chen, Qixiang Hua, Yue Yu, Zihang Wang, Yaohua Liu, Bingzhuo Zhong

Tendon-driven underactuated robotic hands excel at adaptive grasping but suffer from kinematic unpredictability and nonlinear force transmission, limiting precise shaping and reliable payload handling in complex manipulation. This work introduces the PHANTOM Hand, a 1:1 human-scale modular underactuated hand with 6 actuators and 15 degrees of freedom. The proposed unified framework bridges the gap between precise analytic motion shaping and robust compliant grasping, addressing the core limitations of traditional underactuated designs.

16

Active Robotic Perception for Disease Detection and Mapping in Apple Trees

Hayden Feddock, Francisco Yandun, Srđan Aćimović, Abhisesh Silwal

Large-scale commercial apple orchards require timely disease monitoring, but manual scouting is labor-intensive, expensive, and often detects outbreaks too late and only at coarse spatial resolution. This work presents an autonomous mobile active perception system that scans dormant trees for targeted detection and high-resolution mapping of fire blight, one of the most devastating diseases affecting apple trees. The system integrates flash-illuminated stereo RGB sensing to enable automated, scalable disease monitoring that can improve orchard management outcomes.

17

AirSimAG: A High-Fidelity Simulation Platform for Air-Ground Collaborative Robotics

Yangjie Cui, Xin Dong, Boyang Gao, Jinwu Xiang, Daochun Li, Zhan Tu

Heterogeneous air-ground collaborative robot systems have strong potential for applications like search and rescue, surveillance, and environmental monitoring, but existing simulation platforms are mostly designed for single-agent dynamics and lack dedicated tools for interactive air-ground collaboration. This work presents AirSimAG, a high-fidelity simulation platform specifically for air-ground collaborative robotics built on the existing AirSim framework. The platform enables realistic testing and development of heterogeneous multi-agent collaboration algorithms for real-world applications.

18

Learning Actuator-Aware Spectral Submanifolds for Precise Control of Continuum Robots

Paul Leonard Wolff, Hugo Buurmeijer, Luis Pabon, John Irvin Alora, Mark Leone, Roshan S. Kaundinya, Amirhossein Kazemipo...

Continuum robots have high-dimensional nonlinear dynamics that are tightly coupled with their actuation mechanisms, making accurate and efficient control challenging. Spectral submanifold reduction is a leading method for reducing high-dimensional nonlinear systems to low-dimensional invariant manifolds, but existing approaches do not explicitly incorporate actuation. This work introduces control-augmented spectral submanifolds (caSSMs) that explicitly include control inputs in the state representation to capture nonlinear state-actuation couplings, enabling more precise control of continuum robots.

19

YOLOv10 with Kolmogorov-Arnold networks and vision-language foundation models for interpretable object detection and trustworthy multimodal AI in computer vision perception

Marios Impraimakis, Daniel Vazquez, Feiyu Zhou

Autonomous vehicle perception systems lack transparency about the reliability of object detection confidence scores in visually degraded or ambiguous scenes, creating a safety challenge for deployment. This work examines a modified YOLOv10 detector that uses Kolmogorov-Arnold networks as an interpretable post-hoc surrogate to model detection trustworthiness using geometric and semantic features. The approach improves the interpretability and trustworthiness of object detection for autonomous driving and other robotic perception applications.

20

Generative Event Pretraining with Foundation Model Alignment

Jianwen Cao, Jiaxu Xing, Nico Messikommer, Davide Scaramuzza

Event cameras offer robust visual sensing under fast motion and challenging illumination, but their unique data format and limited labeled data make it difficult to train transferable event-based visual foundation models. This work introduces Generative Event Pretraining (GEP), a two-stage framework that transfers semantic knowledge from large-scale internet image datasets to event data while learning event-specific temporal features. The approach addresses the data scarcity challenge for event-based perception, enabling better transfer learning across downstream robotics tasks.

21

Task-Aware Positioning for Improvisational Tasks in Mobile Construction Robots via an AI Agent with Multi-LMM Modules

Seongju Jang, Francis Baek, SangHyun Lee

Construction sites are highly dynamic, requiring robots to handle improvisational tasks where task locations, timing, and context are not known in advance, but existing mobile construction robot work rarely addresses this class of tasks. This work proposes an LMM-based AI agent that understands natural language instructions for improvisational tasks, identifies the required task location, and positions the robot accordingly. The agent decomposes functionality into three parallel Large Multimodal Model modules, enabling robust performance on unstructured construction site tasks.

22

Agile-VLA: Few-Shot Industrial Pose Rectification via Implicit Affordance Anchoring

Teng Yan, Zhengyang Pei, Chengyu Shi, Yue Yu, Yikun Chen, Zilong Zhu, Zelin Fang, Kaile Guo, Zihang Wang, Peigen Tian, B...

Deploying Vision-Language-Action (VLA) models on resource-constrained edge devices faces a fundamental conflict between high-latency semantic inference and the high-frequency control required for dynamic industrial manipulation. This work introduces Agile-VLA, a hierarchical framework for industrial pose rectification designed for edge devices like the NVIDIA Jetson Orin Nano. The core innovation is Implicit Affordance Anchoring, which maps geometric visual cues directly to structured parametric action predictions, reducing latency to enable real-time edge control.

23

Grounding Sim-to-Real Generalization in Dexterous Manipulation: An Empirical Study with Vision-Language-Action Models

Ruixing Jin, Zicheng Zhu, Ruixiang Ouyang, Sheng Xu, Bo Yue, Zhizheng Wu, Guiliang Liu

Sim-to-real transfer is critical for learning dexterous manipulation policies, as real-world data collection is prohibitively expensive, but there is a lack of empirical research grounding sim-to-real methods in real-world dexterous manipulation tasks, especially for generalist Vision-Language-Action models. This work presents a systematic empirical study of how different sim-to-real generalization approaches perform for VLA-based dexterous manipulation. The study provides empirical insights to guide future development of more reliable sim-to-real methods for generalist dexterous manipulation policies.

24

DecompGrind: A Decomposition Framework for Robotic Grinding via Cutting-Surface Planning and Contact-Force Adaptation

Shunsuke Araki, Takumi Hachimine, Yuki Saito, Kouhei Ohnishi, Jun Morimoto, Takamitsu Matsubara

Robotic grinding is a widely used manufacturing process, but automating efficient grinding for workpieces of varying shapes and material hardness remains challenging due to variable removal resistance and the difficulty of modeling shape transitions and learning across diverse conditions. This work introduces DecompGrind, a decomposition framework for robotic grinding that splits the problem into cutting-surface planning and contact-force adaptation subproblems. The decomposition approach addresses the challenges of varying contact conditions, enabling more efficient and flexible automated grinding without requiring large amounts of task-specific training data.

25

CATNAV: Cached Vision-Language Traversability for Efficient Zero-Shot Robot Navigation

Aditya Potnis, Francisco Affonso, Shreya Gummadi, Naveen Kumar Uppalapati, Girish Chowdhary

Zero-shot robot navigation in unstructured environments requires assessing traversability relative to a robot's specific embodiment, but existing approaches require task-specific training or high rates of expensive VLM inference. This work introduces CATNAV, a cost-aware traversability navigation framework that uses multimodal LLMs to enable zero-shot embodiment-aware costmap generation without task-specific training. A novel visuosemantic caching mechanism reduces online VLM queries by 85.7% by reusing prior risk assessments for semantically similar frames, enabling efficient real-time deployment.
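The caching idea behind that 85.7% query reduction can be sketched as a similarity-gated lookup. This is an illustrative sketch, not CATNAV's implementation: the embedding source, the cosine threshold, and the linear scan over entries are all assumptions.

```python
class VisuosemanticCache:
    """Reuse a prior VLM risk assessment when a new frame embedding is
    close enough (by cosine similarity) to one already assessed; only
    cache misses trigger an expensive online VLM query."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []                      # list of (embedding, assessment)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    def lookup(self, embedding):
        """Return a cached assessment, or None (caller must query the VLM)."""
        best = max(self.entries,
                   key=lambda e: self._cosine(e[0], embedding),
                   default=None)
        if best and self._cosine(best[0], embedding) >= self.threshold:
            return best[1]
        return None

    def store(self, embedding, assessment):
        self.entries.append((embedding, assessment))
```

The threshold trades safety for savings: a looser threshold reuses more assessments but risks applying a stale risk estimate to a genuinely novel scene.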

26

PhotoAgent: A Robotic Photographer with Spatial and Aesthetic Understanding

Lirong Che, Zhenfeng Gan, Yanbo Chen, Junbo Tan, Xueqian Wang

Embodied robotic photographers must bridge the semantic gap between high-level natural language aesthetic commands and low-level geometric camera control, a challenge that has not been well addressed in prior work. This work introduces PhotoAgent, which integrates LMM chain-of-thought reasoning with an analytical control framework to solve this problem. PhotoAgent first translates subjective aesthetic goals into geometric constraints to compute an initial high-quality viewpoint, then iteratively refines the pose via visual reflection in a photorealistic internal simulator, enabling high-quality robotic photography from natural language instructions.

27

Instrument-Splatting++: Towards Controllable Surgical Instrument Digital Twin Using Gaussian Splatting

Shuojue Yang, Zijian Wu, Chengjiaao Liao, Qian Li, Daiyun Shen, Chang Han Low, Septimiu E. Salcudean, Yueming Jin

Controllable high-fidelity digital twins of surgical instruments are critical for Real2Sim transfer and synthetic data generation for robot-assisted surgery. This work presents Instrument-Splatting++, a monocular 3D Gaussian Splatting framework that reconstructs surgical instruments as fully controllable high-fidelity digital assets. The pipeline uses part-wise geometry pretraining to inject CAD priors into Gaussian primitives, enabling part-aware semantic rendering and controllable pose adjustment for simulation and training.

28

DiSCo: Diffusion Sequence Copilots for Shared Autonomy

Andy Wang, Xu Yan, Brandon McMahan, Michael Zhou, Yuyang Yuan, Johannes Y. Lee, Ali Shreif, Matthew Li, Zhenghao Peng, B...

Shared autonomy combines human input with AI copilot correction to improve performance on complex control tasks like robotic teleoperation, but existing copilot methods struggle to generate action sequences that align consistently with past user behavior and goals. This work introduces DiSCo (Diffusion Sequence Copilots), a diffusion-based shared autonomy method that plans full action sequences consistent with past user actions. The approach significantly improves task performance for human control of high-dimensional robotic systems by generating context-aware corrective action sequences.
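For context, the classic shared-autonomy baseline that sequence-level copilots like DiSCo improve on is simple linear arbitration between the human command and an autonomous correction. This sketch is that baseline, not the paper's diffusion sampler; the blending weight `alpha` and the per-step action format are assumptions.

```python
def arbitrate(user_action, copilot_action, alpha):
    """Linear shared-autonomy arbitration for one timestep: alpha = 0 is
    full human control, alpha = 1 is full autonomy."""
    return [(1 - alpha) * u + alpha * c
            for u, c in zip(user_action, copilot_action)]

def arbitrate_sequence(user_seq, copilot_seq, alpha):
    """Apply the same per-step blend over a whole action sequence; unlike
    DiSCo, this baseline has no model of the user's past behavior or goal."""
    return [arbitrate(u, c, alpha) for u, c in zip(user_seq, copilot_seq)]
```

The limitation the summary describes is visible here: each step is corrected independently, so nothing enforces consistency with the user's earlier actions, which is the gap a sequence-level diffusion copilot targets.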

29

SG-VLA: Learning Spatially-Grounded Vision-Language-Action Models for Mobile Manipulation

Ruisen Tu, Arth Shukla, Sohyun Yoo, Xuanlin Li, Junxi Li, Jianwen Xie, Hao Su, Zhuowen Tu

Vision-Language-Action models show promise for generalist robotic control, but their performance remains subpar for mobile manipulation in complex household environments, which requires reasoning about global scene layout, fine-grained geometry, and high-dimensional continuous actions that exceed the capabilities of standard imitation learning. This work introduces SG-VLA, a framework for learning spatially-grounded VLA models for mobile manipulation that strengthens perception and representation via auxiliary task co-training and multi-modal input enhancement. The approach improves performance on 13-dimensional continuous control for mobile manipulation tasks in complex household environments.