From Evaluation to Closed-Loop Improvement: How Community Feedback Makes Robots Smarter

The Core Question: How Do We Know Our Robot Works?

When we build a robot or train a new model, we face a fundamental question: how do we verify that it actually works safely and reliably across all the conditions it might encounter?

In traditional software development, we write unit tests, integration tests, and end-to-end tests. But robotics is different—our software interacts with the physical world, which is infinitely variable. A model that works perfectly in one environment might fail completely in another. How do we design benchmarks that capture this complexity?

This question leads us to two dominant evaluation paradigms: open-loop evaluation and closed-loop evaluation. Each has its strengths and weaknesses, and neither alone provides the full solution.


Open-Loop Evaluation: The Offline Approach

How It Works

Open-loop evaluation follows a simple recipe:

1. Record sensor data (observations) from a real robot or human teleoperation across various scenarios.
2. Replay the recorded observations sequentially to your model.
3. Score the model's outputs against the expert demonstrations using various metrics.

Common metrics include:

- Mean squared error (MSE) between the model's output and the expert action
- KL divergence between action distributions
- Success rate on classification and command-following tasks
- Various correlation metrics
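
To make this concrete, here is a minimal sketch of an open-loop MSE pass. The dataset format and the `policy.predict(obs) -> action` interface are illustrative assumptions, not a specific library's API:

```python
import numpy as np

def open_loop_mse(policy, dataset) -> float:
    """Replay recorded observations and score the model against the expert."""
    errors = []
    for obs, expert_action in dataset:
        predicted = policy.predict(obs)  # the environment never sees this action
        predicted = np.asarray(predicted)
        expert_action = np.asarray(expert_action)
        errors.append(np.mean((predicted - expert_action) ** 2))
    return float(np.mean(errors))
```

Note that the predicted action is scored and then discarded; the next observation always comes from the recording. That detail is exactly the limitation discussed below.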

Why It's Popular

Open-loop evaluation is attractive because it's easy—you only need one dataset and one model. You don't need a physical robot, you don't need to run the system in real-time, and you can parallelize evaluation across many GPUs. It's great for rapid iteration during model development.

The Fundamental Problem

The problem with open-loop evaluation is right there in the name—it's open loop. Because evaluation doesn't feed the model's output back into the next observation, it ignores the closed-loop dynamics of real control systems.

Consider this: a small error early in a trajectory can compound over time, leading to catastrophic failure later. In open-loop evaluation, this error never compounds, because each step is scored from the ground-truth observation rather than from the state the model's own actions would have produced. The model might get a good score, but when you actually run it on a robot, it diverges and fails.

In other words: good open-loop performance doesn't guarantee good closed-loop performance. It's a necessary but not sufficient indicator of real-world capability.
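
A deliberately crude toy illustrates the gap. Here we assume accumulated deviation is amplified by a constant factor each step; real dynamics differ, but the qualitative difference between the two scoring regimes is the point:

```python
import numpy as np

rng = np.random.default_rng(0)
T, eps = 100, 0.05        # trajectory length, per-step error scale
drift_gain = 1.1          # accumulated error is slightly amplified each step

# Open-loop scoring: every step restarts from ground truth,
# so the measured error stays bounded around eps.
open_loop_error = np.abs(rng.normal(0.0, eps, size=T)).mean()

# Closed-loop execution: the model acts on states produced by its own
# previous actions, so small errors accumulate and grow.
deviation = 0.0
for _ in range(T):
    deviation = drift_gain * deviation + abs(rng.normal(0.0, eps))

print(f"mean open-loop error:        {open_loop_error:.3f}")  # stays small
print(f"final closed-loop deviation: {deviation:.2f}")        # blows up
```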


Closed-Loop Evaluation: The Online Approach

How It Works

Closed-loop evaluation addresses this issue by closing the loop:

1. The model receives an observation from the environment.
2. The model outputs an action.
3. The environment (simulator or real robot) updates based on that action and produces a new observation.
4. Repeat until the task completes or fails.

The model is in full control—just as it would be in deployment. This means errors can compound, just like they do in reality. Success or failure is measured directly by whether the task gets done.
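
Here is a minimal closed-loop rollout sketch, assuming a Gymnasium-style environment and the same illustrative `policy.predict` interface as before (these are assumptions, not a specific XRollout API):

```python
import gymnasium as gym

def closed_loop_rollout(env: gym.Env, policy, max_steps: int = 500) -> bool:
    """Run the policy in full control and report whether the task succeeded."""
    obs, _ = env.reset()
    for _ in range(max_steps):
        action = policy.predict(obs)  # the model's output drives the world
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            # How success is reported varies by environment; reading it from
            # `info` is an assumption, not a Gymnasium guarantee.
            return bool(info.get("is_success", False))
    return False  # ran out of time
```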

Why It's Better (And Harder)

Closed-loop evaluation is more realistic because it actually simulates (or executes) the full control loop. It measures what actually matters—can the robot complete the task starting from the initial condition?

But this realism comes at a cost:

- Latency challenges: you need to measure not just whether the model succeeds, but whether it can act within the real-time constraints the system requires (see the sketch after this list).
- Simulation vs. reality: if you evaluate in simulation, you have to deal with the sim-to-real gap; a model that works in simulation might fail on hardware because the simulator doesn't perfectly capture the real world.
- Scaling costs: evaluating on real hardware is slow and expensive; you can't parallelize it across thousands of GPUs the way you can with open-loop evaluation.
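
A simple latency check might look like the sketch below. The 30 Hz control rate is an assumption for illustration (real systems vary), and `policy.predict` is the same hypothetical interface used above:

```python
import statistics
import time

def meets_realtime_budget(policy, observations, hz: float = 30.0) -> bool:
    """Check p95 inference latency against a control-rate budget."""
    budget_s = 1.0 / hz
    latencies = []
    for obs in observations:
        start = time.perf_counter()
        policy.predict(obs)
        latencies.append(time.perf_counter() - start)
    p95 = statistics.quantiles(latencies, n=20)[-1]  # 95th percentile
    print(f"p95 latency: {p95 * 1e3:.1f} ms (budget: {budget_s * 1e3:.1f} ms)")
    return p95 <= budget_s
```

Using a high percentile rather than the mean matters here: a control loop that is fast on average but occasionally stalls can still destabilize a robot.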


The Real-World Problem: What About Unseen Scenarios?

Both open-loop and closed-loop evaluation assume you have a fixed set of test scenarios. But in the real world:

Your robot will encounter situations your evaluators never thought of.

Suppose you're a company building a home robot. You test extensively in your lab, fix every issue you find, and ship the product. Then a customer takes it home, and it fails to navigate their particular living-room arrangement, something nobody in your lab ever tested.

Or suppose you're an open-source community maintaining a general-purpose robot model. A user deploys it on their custom hardware in a completely new environment (a farm, a warehouse, a hospital) that wasn't in your training or test data. It fails. What happens then?

In the traditional closed-source model:

- The user reports the issue.
- If enough users complain, maybe the company will collect data and fix it in the next software update.
- This process is slow, centralized, and dependent on the company's priorities.

What if we could do better?


The XRollout Vision: Community-Powered Data Closed-Loop

We believe that in an open-source world, the community itself can close the improvement loop. Here's how it works:

Step 1: Everyone Can Report Failures

If your robot fails in your environment, report it. With XRollout, you can easily:

- Upload a video or recording of the failure
- Add a description of what went wrong
- Tag the scenario type (e.g., "slippery floor", "low lighting", "cluttered kitchen")

This isn't just "filing a bug ticket"—you're contributing data that can be used to fix the problem.
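
As a rough picture of what such a contribution carries, here is a hypothetical failure-report payload. XRollout's actual schema isn't specified here; every field name below is an illustrative assumption:

```python
from dataclasses import dataclass, field

@dataclass
class FailureReport:
    video_path: str                                # recording of the failure
    description: str                               # what went wrong
    tags: list[str] = field(default_factory=list)  # scenario labels

report = FailureReport(
    video_path="recordings/kitchen_fail.mp4",
    description="Gripper missed the mug handle twice under dim lighting.",
    tags=["low lighting", "cluttered kitchen"],
)
```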

Step 2: The Community Collects Diverse Data

The beauty of a large community is that collectively, we see more scenarios than any single lab or company could ever test. One user has pets at home, another works in a dusty factory, another has a dark basement, another tests on grass outdoors.

Every failure reported is an opportunity:

- It reveals a gap in the current model.
- It provides real-world data from that scenario.
- It lets the community prioritize what to fix next.

Step 3: Retrain and Improve

With this new data, the community can:

1. Add the failure scenarios to the evaluation benchmark
2. Retrain the model on the newly aggregated data
3. Validate that the fix actually works in the previously failing scenarios
4. Ship an improved model back to the community

This creates a continuous improvement loop:

Deploy → Fail in New Scenario → Report + Collect Data → Retrain → Improve → Deploy
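
The validation step (item 3 above) can act as a regression gate before shipping. This sketch assumes reported failures have been turned into reproducible benchmark scenarios; `make_env` and the scenario list are illustrative assumptions, and `closed_loop_rollout` is the helper sketched earlier:

```python
def regression_gate(policy, failing_scenarios, make_env) -> bool:
    """Require the retrained model to pass every previously failing scenario."""
    for scenario in failing_scenarios:
        env = make_env(scenario)
        if not closed_loop_rollout(env, policy):
            print(f"still failing: {scenario}")
            return False  # don't ship a model that regresses
    return True
```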

Why This Works

1. Diversity is Strength

No organization can replicate the diversity of environments that a global community encounters daily. Your "edge case" is someone else's "everyday case." By collecting data from everyone, we cover more of the real-world distribution.

2. Faster Iteration

In the traditional model, users wait for the vendor to fix issues. In the community model, any qualified researcher or developer can work on fixing the issue immediately. Multiple people can try different approaches in parallel. The best solution wins.

3. Transparent Safety

Safety is critical in robotics. When failures are reported openly, everyone can see them, everyone can understand the risks, and the community can work together to fix them. There's no hiding known issues—transparency builds trust.

4. Everyone Benefits

When one user contributes data from their scenario, every other user who encounters that same scenario benefits. It's a public good—like open-source software itself. As the saying goes: "A rising tide lifts all boats."


The Chicken-and-Egg Problem

Of course, building this closed-loop system isn't easy. We need:

- Enough users to report enough diverse failures
- Enough contributors to help label and process the data
- Enough compute to retrain the model regularly

This is a classic network-effect problem: the system becomes more valuable as more people join, but it needs early participants before that value exists. That's why we're starting now, building the infrastructure and the community together. Every contribution, no matter how small, helps make the system better for everyone.

You don't need to be a big company with a big robot to contribute. If you're an enthusiast with a small robot at home, and you encounter a failure, your data is just as valuable as data from a fancy research lab. Real-world data from real users is what makes this work.


Closing Thoughts

Evaluation is not a one-time step you do before shipping. Evaluation is the beginning of the improvement loop.

  • Open-loop evaluation gives us fast metrics for rapid iteration during development
  • Closed-loop evaluation in simulation or on real hardware gives us realistic assessment before deployment
  • But the real improvement comes from closing the loop after deployment—letting the community report failures and contribute data from unseen scenarios

This is the core idea behind XRollout: we build it together, we improve it together. Every failure is an opportunity to learn, and every contribution makes the robot smarter for everyone.

If you've ever had a robot fail in a situation no one tested before—join us. Report the failure. Contribute the data. Help us make the model better for the next person who encounters that same scenario.

Together, we can build robot intelligence that works in all the real world, not just the scenarios we thought to test.


Built by the community, improved by the community—for all the real world's diversity.
