Human Data Collection Pipeline

A cost-effective, distributed approach to scaling human demonstration data for robot learning — collecting thousands of trajectories at a fraction of traditional in-lab costs.

Scale · Distributed Collection · Cost Advantage · Imitation Learning · Crowdsourcing
🎯
What Is the Human Data Collection Pipeline?
Why we need distributed human data collection
💡 The Human Data Collection Pipeline (HDCP) collects human demonstration data from crowd workers around the world instead of relying solely on expensive in-lab expert demonstrations. This dramatically reduces cost while increasing data volume and behavioral diversity.
💰
Ultra Low Cost
Collect 1,000+ demonstrations for a few hundred to a couple of thousand dollars. Cost per trajectory is typically 10–100× lower than in-lab collection.
📈
Massive Scale
Parallel collection from hundreds of workers simultaneously. Scale to tens of thousands of trajectories in weeks, not months.
🌍
Greater Diversity
Many demonstrators naturally capture variation in styles and approaches, improving model generalization.
Fast Turnaround
Launch a data collection task and get completed trajectories back within days. Iterate on task design quickly.
🖥️
No Hardware Required
Workers use their own browser. No robot hardware needed for teleoperation data collection in simulation.
🔄
Continuous Data Flow
Keep the pipeline running to continuously add new data for ongoing model improvement over time.
📊
Cost Advantage: Traditional vs HDCP
Comparing approaches for 1,000 demonstration trajectories
Factor | Traditional In-Lab | HDCP Distributed
Expert Labor Cost | $50,000 – $150,000 | $500 – $2,000
Hardware Investment | $10,000+ | $0
Time to Complete | 3 – 6 months | 1 – 2 weeks
Demonstrator Diversity | 1 – 5 people | 50 – 200 people
Scaling to 10k trajectories | Prohibitive | Straightforward
Example HDCP Cost Breakdown: 1,000 trajectories
Worker payment per trajectory | $0.50 – $1.50
Platform fees (MTurk, Prolific, etc.) | +20% markup
Quality filtering (automated) | ~$50
Total Estimated Cost | $600 – $1,800 (worker payment plus platform fees; roughly $650 – $1,850 with filtering included)
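The breakdown above is simple arithmetic and can be reproduced in a few lines. This is a minimal sketch using only the figures listed; the function name `estimate_cost` and its parameter names are illustrative, not part of any HDCP tooling.

```python
# Back-of-the-envelope cost estimate for a crowd-sourced collection run.
# The numbers are the ones in the breakdown above; names are illustrative.

def estimate_cost(n_trajectories, pay_per_trajectory,
                  platform_fee_rate=0.20, filtering_cost=50.0):
    """Return (payment_plus_fees, total_including_filtering) in dollars."""
    worker_payment = n_trajectories * pay_per_trajectory
    payment_plus_fees = worker_payment * (1.0 + platform_fee_rate)
    return payment_plus_fees, payment_plus_fees + filtering_cost


low = estimate_cost(1000, 0.50)   # cheapest case: $0.50 per trajectory
high = estimate_cost(1000, 1.50)  # most expensive case: $1.50 per trajectory

print(f"payment + fees: ${low[0]:,.0f} to ${high[0]:,.0f}")
print(f"with filtering: ${low[1]:,.0f} to ${high[1]:,.0f}")
```

Changing `n_trajectories` to 10,000 gives a quick feel for how the budget scales, assuming the per-trajectory price and fee rate stay the same.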
100× cost reduction vs. in-lab
10× faster time to dataset
20× more diverse demonstrations
⚙️
The Pipeline Step-by-Step
The complete workflow from task design to final dataset
01 · Setup · Task & Environment Design
Define goals, success conditions, and build the simulation environment.
  • Define task goal, success conditions, and reward function
  • Build simulation environment with proper camera views and rendering
  • Create reproducible environment resets for each trajectory
  • Define action space (joint angles, gripper commands, deltas)
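The items above (a success condition, a fixed action space, and reproducible resets) can be captured in a small task specification. The sketch below is hypothetical: `TaskSpec`, its fields, the push-block task, and the state layout are all assumptions made for illustration, and a real simulator would supply its own reset and state APIs.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical task description; field names are illustrative."""
    name: str
    max_steps: int            # episode is cut off (timeout) after this many steps
    action_dim: int           # e.g. 6 joint deltas + 1 gripper command
    success_tolerance: float  # object-to-goal distance that counts as success


def sample_initial_state(spec: TaskSpec, episode_seed: int) -> dict:
    """Derive the reset state deterministically from a seed.

    Storing episode_seed with each trajectory lets the same start
    configuration be recreated later for replay or evaluation.
    """
    rng = random.Random(episode_seed)
    return {
        "object_xy": (rng.uniform(-0.2, 0.2), rng.uniform(-0.2, 0.2)),
        "goal_xy": (rng.uniform(-0.2, 0.2), rng.uniform(-0.2, 0.2)),
    }


def is_success(spec: TaskSpec, state: dict) -> bool:
    """Success condition: the object lies within tolerance of the goal."""
    dx = state["object_xy"][0] - state["goal_xy"][0]
    dy = state["object_xy"][1] - state["goal_xy"][1]
    return (dx * dx + dy * dy) ** 0.5 <= spec.success_tolerance


spec = TaskSpec(name="push_block", max_steps=600, action_dim=7,
                success_tolerance=0.02)
```

Keeping the seed as the only source of randomness is one simple way to make resets reproducible; other designs (for example, storing the full initial state) work as well.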
02 · Interface · Worker Interface Development
Web-based teleoperation UI accessible from any browser.
  • Keyboard, mouse, and joystick input support
  • Clear instructions, tutorial video, and practice trials before recording
  • Real-time visual feedback on task progress and success signal
  • One-click submission when complete; no installation required
03 · Calibrate · Pilot & Calibration
Small batch (n=50–100) to validate the interface and calibrate pricing.
  • Check if workers understand instructions correctly and complete tasks
  • Measure average completion time per trajectory
  • Set fair price — target ~$10–15 per hour for workers
  • Identify common failure modes and misunderstandings
  • Identify early which workers consistently produce high-quality data
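The pricing step above can be sketched as a one-line conversion from measured completion time to a per-trajectory price. The timings below are made up for illustration, and `price_per_trajectory` is a hypothetical helper; a real pilot might also budget for time spent on instructions and practice trials.

```python
from statistics import median


def price_per_trajectory(completion_times_s, target_hourly_wage=12.0):
    """Price one trajectory so a median-speed worker earns the target wage."""
    median_s = median(completion_times_s)
    return round(target_hourly_wage * median_s / 3600.0, 2)


# Made-up pilot timings in seconds (not real data).
pilot_times_s = [95, 110, 120, 130, 150, 105, 140]

# A median of 120 s at $15/hour works out to $0.50 per trajectory.
price = price_per_trajectory(pilot_times_s, target_hourly_wage=15.0)
print(f"price per trajectory: ${price:.2f}")
```

Using the median rather than the mean keeps a few very slow pilot workers from inflating the price estimate.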
04 · Collect · Large-Scale Parallel Collection
100–500 workers collecting in parallel on crowdsourcing platforms.
  • Launch tasks on MTurk (where they are called HITs), Prolific, or similar crowdsourcing platforms
  • Release batches gradually (e.g., 100 tasks at a time) to maintain quality control
  • Auto-save trajectories every few seconds to cloud storage
  • Store raw video + states + actions separately for flexibility
  • Monitor a real-time progress dashboard; track per-worker statistics
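Gradual batch release, as described above, can be sketched as a plan of batch sizes plus a gate that checks the previous batch before the next one goes out. Both helpers are hypothetical, and the 60% gate is an assumption borrowed from the per-worker threshold used in the filtering step, not a figure the source gives for batches.

```python
def plan_batches(total_tasks, batch_size=100):
    """Split the total number of tasks into batches released one at a time."""
    full, rest = divmod(total_tasks, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


def should_release_next(batch_results, min_success_rate=0.6):
    """Gate the next batch on the success rate of the previous one.

    batch_results is a list of booleans, one per submitted trajectory.
    The 0.6 default is an assumed threshold, chosen for illustration.
    """
    if not batch_results:
        return False  # nothing submitted yet, so keep waiting
    return sum(batch_results) / len(batch_results) >= min_success_rate


print(plan_batches(250))                                  # [100, 100, 50]
print(should_release_next([True] * 7 + [False] * 3))      # True
```

If the gate fails, the natural next move is to inspect the failed trajectories and revise the instructions before spending more budget.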
05 · Filter · Automated Quality Filtering
Remove failed, outlier, and duplicate trajectories automatically.
  • Filter out timeout, failure, and too-short trajectories
  • Outlier detection based on trajectory length and success rate distributions
  • Clustering to remove duplicate or near-identical behavior patterns
  • Keep workers with >60% success rate; block poor performers early
  • Typical retention: 70–85% of all collected trajectories pass filtering
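A minimal sketch of the rule-based part of this step follows: it drops failed, timed-out, and too-short trajectories, then keeps only workers above the 60% success threshold. The `Trajectory` record and the `min_steps` value are assumptions for illustration, and the outlier detection and clustering bullets are not covered here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical per-trajectory metadata record."""
    worker_id: str
    n_steps: int
    success: bool
    timed_out: bool = False


def filter_trajectories(trajs, min_steps=20, min_worker_success=0.6):
    """Keep successful, long-enough trajectories from reliable workers."""
    # Per-worker success rate over everything that worker submitted.
    counts = defaultdict(lambda: [0, 0])  # worker_id -> [successes, total]
    for t in trajs:
        counts[t.worker_id][0] += int(t.success)
        counts[t.worker_id][1] += 1
    good_workers = {w for w, (s, n) in counts.items()
                    if s / n > min_worker_success}

    return [t for t in trajs
            if t.success
            and not t.timed_out
            and t.n_steps >= min_steps
            and t.worker_id in good_workers]
```

Dividing `len(filter_trajectories(trajs))` by `len(trajs)` gives the retention rate, which can be compared against the 70–85% range quoted above.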
06 · Process · Data Processing & Format Conversion
Normalize and convert raw trajectories into a training-ready dataset.
  • Resample observations to a consistent frequency (e.g., 10 Hz)
  • Extract action tensors from raw teleoperation input (joint angles, gripper)
  • Normalize observations and actions to zero-mean, unit-variance
  • Split into train / validation / test sets (e.g., 80 / 10 / 10)
  • Convert to framework dataset format (RLDS, HDF5, JSON, etc.)
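The resampling, normalization, and splitting steps above can be sketched with the standard library alone. The version below handles a single scalar channel; real pipelines would usually apply the same idea per dimension with an array library. All function names are hypothetical, and fitting the normalizer on the training split only is an assumption that reflects common practice rather than something the source specifies.

```python
import random
from statistics import mean, pstdev


def resample(timestamps, values, hz=10.0):
    """Linearly interpolate a 1-D signal onto a fixed-rate time grid.

    Assumes at least two samples and strictly increasing timestamps.
    """
    period = 1.0 / hz
    n_out = int((timestamps[-1] - timestamps[0]) / period + 1e-9) + 1
    out, j = [], 0
    for i in range(n_out):
        t = timestamps[0] + i * period
        # Advance to the input segment [t0, t1] that contains t.
        while j + 1 < len(timestamps) - 1 and timestamps[j + 1] < t:
            j += 1
        t0, t1 = timestamps[j], timestamps[j + 1]
        w = (t - t0) / (t1 - t0)
        out.append(values[j] + w * (values[j + 1] - values[j]))
    return out


def fit_normalizer(samples):
    """Return (mean, std); std falls back to 1.0 for constant signals."""
    return mean(samples), (pstdev(samples) or 1.0)


def normalize(samples, mu, sigma):
    """Shift and scale samples to zero mean and unit variance."""
    return [(x - mu) / sigma for x in samples]


def split_ids(ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle trajectory ids and split them into train / val / test."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_train = int(fractions[0] * len(ids))
    n_val = int(fractions[1] * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```

Splitting by whole trajectory id, rather than by individual timestep, keeps frames from the same demonstration from leaking across the train and test sets.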
07 · Train · Train & Iterate
Train your policy, evaluate performance, identify gaps, and repeat.
  • Train policy via imitation learning (behavior cloning, Diffusion Policy, ACT, etc.)
  • Evaluate success rate and generalization on held-out test set
  • Identify under-represented scenarios and task failure modes
  • Collect additional targeted data for those gaps
  • Repeat until target performance is reached — the flywheel compounds
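One way to make the gap-finding step concrete is to flag scenarios that are either rare in the training data or fail often at evaluation time. The sketch below assumes each trajectory and each evaluation rollout carries a scenario label; the labels, the `find_gaps` helper, and both thresholds are hypothetical.

```python
from collections import Counter


def find_gaps(train_scenarios, eval_results, min_share=0.05, min_success=0.7):
    """Return scenarios that are under-represented or frequently failed.

    train_scenarios: one scenario label per training trajectory.
    eval_results:    (scenario, succeeded) pairs from policy evaluation.
    Both thresholds are assumed values chosen for illustration.
    """
    counts = Counter(train_scenarios)
    total = sum(counts.values())

    outcomes = {}  # scenario -> (successes, rollouts)
    for scenario, ok in eval_results:
        wins, n = outcomes.get(scenario, (0, 0))
        outcomes[scenario] = (wins + int(ok), n + 1)

    gaps = set()
    for scenario, (wins, n) in outcomes.items():
        share = counts.get(scenario, 0) / total if total else 0.0
        if share < min_share or wins / n < min_success:
            gaps.add(scenario)
    return sorted(gaps)
```

The returned list can drive the next collection round, for example by launching a targeted batch that samples only the flagged scenarios.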
Key Success Factors
Best practices for getting high-quality data
🎮 Make It Easy to Control
Support multiple input modalities (mouse, keyboard, gamepad) with automatic sensitivity adjustment for each worker's setup.
📝 Clear Instructions
A 60-second video tutorial is worth 1,000 words. Show exactly what to do, what counts as success, and what to avoid.
⚡ Early Quality Filtering
Check quality after the first few trajectories from each worker. Block poor performers early to save money and maintain dataset quality.
💰 Fair Payment
Pay at least $10/hour. Better pay attracts better workers who produce higher quality data — this directly impacts model performance.
🔄 Allow Multiple Attempts
Workers improve with practice. Allowing retries produces better trajectories and reduces frustration-driven abandonment.
🎯 Auto-Reset Environment
One-click reset for failed attempts. Low-friction workflows keep workers engaged and completing more high-quality demonstrations.
⚠️ Common pitfall: asking non-expert workers to demonstrate overly complex tasks. Start with simple, atomic tasks that can be completed in 1–2 minutes, and chain these simpler skills together instead of collecting one monolithic task.
🤔
When Should You Use This Approach?
Suitable and unsuitable use cases
✓  Good For
Simulation-based tasks: Any task where you need many demonstrations in a sim environment
Imitation learning: Behavior cloning needs massive, diverse demonstration data
Multi-task skill collection: Collect many different skills from different workers in parallel
Finetuning pre-trained policies: Adding diverse trajectories to improve generalization
✗  Less Suitable For
Real-world physical robots: Physical hardware still needs in-lab collection
Ultra-precision tasks: Tasks requiring expert-level precision may still need domain experts
Safety-critical tasks: When failure is extremely expensive, prefer expert-supervised collection
Very long-horizon tasks: Tasks over 10 minutes are hard for crowd workers, so split them into smaller steps
🚀
Getting Started
Your first data collection project checklist
1. Start small
Begin with a simple task taking 1–2 minutes per demonstration. Don't start with your most complex task.
2. Build the web interface
Use HTML5/JavaScript so workers just click a link, with no installation needed. Three.js or Unity WebGL work well.
3. Make a tutorial
Record a 60-second screencast showing how to do the task. This single step dramatically improves data quality.
4. Run a pilot of 50 trajectories
Check results, see where workers struggle, and adjust instructions and difficulty before scaling.
5. Scale up in batches
Release 100 trajectories at a time, monitor quality continuously, and retain your best workers.
6. Process and train
Run quality filters, convert to your dataset format, start training, and identify gaps for the next iteration.
💡 The XRollout platform provides built-in tools for distributed human data collection. Join our community to get access to the infrastructure.
📄
Original Document
Download the original PDF

This article summarizes the Human Data Collection Pipeline approach originally published at Physical Intelligence (π).
