Memory for Robotics: Enhancing Temporal Decision-Making

Overview

This article breaks down the MEM (Multi-scale Embodied Memory) approach from the Physical Intelligence (PI) research project. MEM enables robots to handle long-horizon tasks (up to 15 minutes) by maintaining a structured memory of what they've done, what's still left to do, and where objects are located.

Core Idea

The problem: Current robots can do short, simple tasks, but they fail at long jobs like cleaning an entire kitchen. To complete these long tasks, robots need memory to:

  • Track progress through subtasks
  • Remember where objects are located
  • Avoid repeating failed attempts
  • Adapt when something changes

The solution: A multi-scale memory architecture that handles different time scales with different memory representations.

METHOD: Multi-Scale Embodied Memory

Three Memory Components

1. Short-term: Efficient Video Encoder

  • Stores and encodes the recent frame history
  • Handles the immediate sensory input
  • Compresses recent observations into a usable embedding
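The short-term component can be pictured as a rolling buffer of recent frames compressed into one fixed-size vector. The sketch below is illustrative only: the article does not specify MEM's encoder architecture, so a stand-in linear "backbone" with mean-pooling over time is used, and names like `FrameHistoryEncoder` are our own.

```python
from collections import deque
import numpy as np

class FrameHistoryEncoder:
    """Rolling buffer of recent frames, compressed into one embedding.

    Illustrative stand-in: a random linear projection plays the role of
    a per-frame visual backbone, and features are mean-pooled over time.
    """

    def __init__(self, max_frames=16, feat_dim=64, seed=0):
        self.buffer = deque(maxlen=max_frames)   # keeps only recent history
        rng = np.random.default_rng(seed)
        # Stand-in "backbone": flattened 32x32 RGB pixels -> feat_dim
        self.proj = rng.standard_normal((3 * 32 * 32, feat_dim)) * 0.01

    def observe(self, frame):
        """Add one (32, 32, 3) RGB frame to the short-term buffer."""
        feat = frame.reshape(-1) @ self.proj     # per-frame feature
        self.buffer.append(feat)

    def embedding(self):
        """Compress the buffered history into a single usable vector."""
        return np.mean(self.buffer, axis=0)

# Usage: stream frames in; older frames are evicted automatically.
enc = FrameHistoryEncoder()
for _ in range(20):
    enc.observe(np.random.rand(32, 32, 3))
emb = enc.embedding()
print(emb.shape)                                 # (64,)
```

The `deque(maxlen=...)` gives the "short-term" property for free: memory cost is bounded no matter how long the episode runs.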

2. Long-term: Textual "Notes" about Task Progress

  • Maintains high-level, symbolic memory in natural language
  • Actively selected by the model: only what matters is stored
  • Examples: "Mug already placed on counter", "Coffee beans already added", "Window left pane still dirty"
  • Compact, human-readable, easy for language models to reason over

3. Subtask Selection Reasoning

  • Model chooses what to remember based on current context
  • Decides which subtask to work on next based on memory
  • Enables in-context adaptation when things go wrong

Key Design Choice: Mixing video (dense perceptual) and text (sparse symbolic) gives the best of both worlds. Video captures geometry and appearance that text can't. Text compresses high-level progress compactly and maintains information over much longer horizons than pure video approaches.
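The interplay between the long-term notes and subtask selection can be sketched as a tiny data structure. This is a hypothetical interface, not the paper's implementation: method names like `add_note` and `next_subtask`, and the "done"-note convention, are our own illustration of actively selected symbolic memory driving subtask choice.

```python
class MultiScaleMemory:
    """Minimal sketch of MEM's long-term notes plus subtask selection.

    Hypothetical interface: the article describes these components at a
    high level; the names and note format here are assumptions.
    """

    def __init__(self, subtasks):
        self.notes = []               # long-term: sparse symbolic notes
        self.subtasks = list(subtasks)

    def add_note(self, note):
        """Actively selected memory: store only what matters, once."""
        if note not in self.notes:
            self.notes.append(note)

    def next_subtask(self):
        """Pick the first subtask not yet marked done in the notes."""
        for task in self.subtasks:
            if f"{task} done" not in self.notes:
                return task
        return None                   # all subtasks complete

# Usage: a note like "Mug already placed on counter" maps to "place mug done".
mem = MultiScaleMemory(["place mug", "add beans", "brew coffee"])
mem.add_note("place mug done")
print(mem.next_subtask())             # add beans
```

Because the notes are plain text, the same structure also supports failure notes ("grasp of mug failed from the left"), which is how in-context adaptation after failures becomes possible.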

Experimental Results

MEM was tested on six long-horizon robotic tasks:

| Task | Description | Duration |
| --- | --- | --- |
| Swap 3 Mugs | Exchange positions of three mugs on counter | ~1 min |
| Find Object | Search for hidden object in kitchen | ~2 min |
| Unpack Groceries | Put all grocery items into correct cabinets | ~5 min |
| Scoop Coffee | Fill coffee maker with ground coffee | ~3 min |
| Grilled Cheese | Full preparation of grilled cheese sandwich | ~10 min |
| Window Cleaning | Clean both sides of entire window | ~15 min |

Baseline Comparison

MEM outperformed all baselines across all tasks:

  • No memory: a pure image-to-action policy; fails on anything beyond a few steps
  • Pooling memory: averages all frames, losing temporal structure
  • Proprioceptive memory: remembers only joint positions, missing object state
  • MEM (full): achieves the highest success rate across all task lengths

Ablation Studies

The full π₀.₆-MEM model was compared against ablated versions:

  • No memory: low success rate, especially on longer tasks
  • Video only: better than no memory, but still degrades on long horizons
  • Naive text+video: better, but less robust than active selection
  • Text only: lacks perceptual detail; fails on spatially precise actions
  • Full MEM: best overall performance, especially on tasks longer than 5 minutes

As task duration increases, MEM's performance degrades more slowly than that of the baselines.

Key Findings

  1. MEM enables in-context adaptation: Robots can adjust strategies after failures by recalling what didn't work.
  2. Robust to occlusions and partial observability: Memory maintains state information even when objects aren't visible.
  3. Handles truly long-horizon tasks: Successfully completes 10-15 minute tasks like making grilled cheese or cleaning a full window.
  4. Multi-scale is better than single-scale: Combining video (short-term dense) with text (long-term sparse) outperforms any single representation.
  5. Active selection matters: Letting the model choose what to remember is better than just storing everything.

Why This Matters

Most robot learning today focuses on short-horizon tasks in controlled settings. But for robots to be useful in real homes, they need to complete long, complex tasks. Memory is the missing component that makes this possible.

MEM shows that a hierarchical, multi-scale approach works: different time scales need different memory representations. Dense perceptual compression for the recent past, sparse symbolic notes for the long-term.

**Project Homepage:** <https://www.pi.website/research/memory>

**Paper PDF:** <https://www.pi.website/download/Mem.pdf>

A Taxonomy of Memory Approaches

RoboMME recently introduced a comprehensive benchmark and taxonomy for evaluating memory in VLA models. It classifies memory approaches along two dimensions: cognitive function (what memory is for) and representation (how memory is stored).

Functional Classification (What memory is for)

  • Temporal memory: Tracks event counts, sequence ordering, and when to transition to next subtask. Example: "we need three wiping passes on the window".
  • Spatial memory: Maintains object locations even when they're occluded. Example: "I put the mug there earlier, now it's behind the box but I still remember where it is".
  • Object memory: Preserves referential consistency over time. Example: keeping track of which object is "the red mug" even when similar objects are present.
  • Procedural memory: Stores demonstrated motion patterns for imitation. Example: recalling how you successfully flipped the pancake last time.

Representation Classification (How memory is stored)

  • Symbolic: Uses discrete natural language tokens to store memory. Compact, human-readable, good for high-level facts. MEM uses this approach for long-term memory.
  • Perceptual: Stores selected visual tokens from past frames. Two common selection strategies:
      • Token dropping: remove redundant patches based on RGB differences
      • Uniform sampling: evenly sample frames from history
  • Recurrent: Uses recurrent neural networks (LSTM/GRU) to maintain a hidden state. Differentiable end-to-end, but can forget over very long horizons.
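The two perceptual selection strategies above can be sketched in a few lines. This is a simplified illustration under stated assumptions: real systems typically score tokens in feature space rather than raw RGB, and the patch size and keep fraction here are arbitrary choices.

```python
import numpy as np

def drop_redundant_patches(prev_frame, frame, keep_fraction=0.25):
    """Token dropping: keep only the patches that changed most relative
    to the previous frame, scored by mean absolute RGB difference.
    Frames are (H, W, 3) arrays with H and W divisible by the patch size.
    """
    P = 8  # patch size (assumption for this sketch)
    diff = np.abs(frame.astype(float) - prev_frame.astype(float))
    h, w = frame.shape[0] // P, frame.shape[1] // P
    # Mean change per P x P patch, flattened row-major to (h * w,)
    scores = diff.reshape(h, P, w, P, 3).mean(axis=(1, 3, 4)).ravel()
    k = max(1, int(keep_fraction * scores.size))
    return np.argsort(scores)[-k:]          # indices of the kept patches

def uniform_sample_frames(num_frames, budget):
    """Uniform sampling: evenly spaced frame indices from the history."""
    return np.linspace(0, num_frames - 1, budget).round().astype(int)

# Usage: pick 5 representative frames out of a 100-frame history.
print(uniform_sample_frames(100, 5))
```

Token dropping preserves spatial detail where the scene changed (useful for the spatial tasks noted below), while uniform sampling gives even temporal coverage at a fixed token budget.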

Key Findings from RoboMME

The main empirical finding from RoboMME is that no single memory method dominates all tasks. Effectiveness depends on the task type:

  • Symbolic memory excels at temporal and object memory tasks because counting and references are naturally symbolic
  • Perceptual memory with grounding works better for spatial tasks where exact visual context matters
  • Recurrent memory is simple but degrades on extremely long sequences

MEM fits squarely into this taxonomy as a hybrid approach: it uses perceptual/video encoding for short-term memory and symbolic/text for long-term memory, combining the strengths of both representations.

References

  1. Physical Intelligence (PI) Research. (2024). "MEM: Multi-scale Embodied Memory for Long-Horizon Robotic Tasks."
  2. Project website: https://www.pi.website/research/memory
  3. Liu et al. (2026). "RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies." arXiv:2603.04639
