A cost-effective, distributed approach to scaling human demonstration data for robot learning — collecting thousands of trajectories at a fraction of traditional in-lab costs.
Scale · Distributed Collection · Cost Advantage · Imitation Learning · Crowdsourcing
🎯
What is the Human Data Collection Pipeline?
Why we need distributed human data collection
💡HDCP enables distributed collection of human demonstration data from crowd workers around the world, instead of relying solely on expensive in-lab expert demonstrations. This dramatically reduces costs while increasing data volume and behavioral diversity.
💰
Ultra Low Cost
Collect 1,000+ demonstrations for hundreds of dollars. Cost per trajectory is typically 10–100× lower than in-lab collection.
📈
Massive Scale
Parallel collection from hundreds of workers simultaneously. Scale to tens of thousands of trajectories in weeks, not months.
🌍
Greater Diversity
Many demonstrators naturally capture variation in styles and approaches, improving model generalization.
⚡
Fast Turnaround
Launch a data collection task and get completed trajectories back within days. Iterate on task design quickly.
🖥️
No Hardware Required
Workers use their own browser. No robot hardware needed for teleoperation data collection in simulation.
🔄
Continuous Data Flow
Keep the pipeline running to continuously add new data for ongoing model improvement over time.
📊
Cost Advantage: Traditional vs HDCP
Comparing approaches for 1,000 demonstration trajectories
| Factor | Traditional In-Lab | HDCP Distributed |
| --- | --- | --- |
| Expert Labor Cost | $50k – $150k | $500 – $2,000 |
| Hardware Investment | $10,000+ | $0 |
| Time to Complete | 3 – 6 months | 1 – 2 weeks |
| Demonstrator Diversity | 1 – 5 people | 50 – 200 people |
| Scaling to 10k trajectories | Prohibitive | Straightforward |
Example HDCP Cost Breakdown — 1,000 trajectories
| Cost Item | Amount |
| --- | --- |
| Worker payment per trajectory | $0.50 – $1.50 |
| Platform fees (MTurk, Prolific, etc.) | +20% markup |
| Quality filtering (automated) | ~$50 |
| Total Estimated Cost | $600 – $1,800 |
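As a quick sanity check on the totals above, here is a back-of-the-envelope calculation; the per-trajectory rate is an illustrative midpoint of the range, not a platform quote:

```python
# Back-of-the-envelope cost estimate for 1,000 trajectories.
n_trajectories = 1_000
pay_per_trajectory = 1.00    # midpoint of the $0.50-$1.50 range
platform_fee = 0.20          # typical ~20% platform markup
filtering_cost = 50.0        # flat automated quality-filtering cost

worker_cost = n_trajectories * pay_per_trajectory
total = worker_cost * (1 + platform_fee) + filtering_cost
print(f"Estimated total: ${total:,.2f}")  # -> Estimated total: $1,250.00
```

The result lands comfortably inside the $600 – $1,800 range; worker payment dominates, so pricing calibration (step 03 below) is where most of the budget leverage sits.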
100× cost reduction vs. in-lab
10× faster time to dataset
20× more diverse demonstrations
⚙️
The Pipeline Step-by-Step
The complete flow from task design to the final dataset
01
Setup
Task & Environment Design
Define goals, success conditions, and build the simulation environment; see the sketch after this list.
Define task goal, success conditions, and reward function
Build simulation environment with proper camera views and rendering
Create reproducible environment resets for each trajectory
Define action space (joint angles, gripper commands, deltas)
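As a concrete starting point for this step, here is a minimal environment skeleton using the Gymnasium API; the pick-and-place task, observation/action dimensions, success threshold, and placeholder dynamics are all illustrative assumptions, not part of the original pipeline:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class PickPlaceEnv(gym.Env):
    """Hypothetical single-arm pick-and-place task for teleoperation."""

    def __init__(self):
        # Observation: 7 joint angles + 1 gripper state + 7-DoF object pose.
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(15,))
        # Action: joint-angle deltas (7) + gripper open/close command (1).
        self.action_space = spaces.Box(-1.0, 1.0, shape=(8,))

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)          # seeded -> reproducible resets
        self._state = self.np_random.uniform(-0.1, 0.1, size=15)
        return self._state.astype(np.float32), {}

    def step(self, action):
        self._state[:8] += 0.05 * action           # placeholder dynamics
        success = bool(np.linalg.norm(self._state[8:11]) < 0.02)
        reward = 1.0 if success else 0.0           # sparse success reward
        return self._state.astype(np.float32), reward, success, False, {}
```

Seeding the reset is what makes each trajectory reproducible, which the filtering and processing steps below rely on.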
02
Interface
Worker Interface Development
Web-based teleoperation UI accessible from any browser; a backend sketch follows this list.
Web-based teleoperation UI — keyboard, mouse, and joystick support
Clear instructions, tutorial video, and practice trials before recording
Real-time visual feedback on task progress and success signal
One-click submission when complete; no installation required
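On the server side, the browser UI needs an endpoint to receive finished trajectories. A minimal sketch using Flask, assuming a JSON payload of per-step observations and actions plus worker metadata; the route, field names, and local storage path are illustrative, not a prescribed schema:

```python
import json, time, uuid
from pathlib import Path
from flask import Flask, request, jsonify

app = Flask(__name__)
STORAGE = Path("trajectories")            # swap for cloud storage at scale
STORAGE.mkdir(exist_ok=True)

@app.post("/submit")                      # hypothetical route
def submit_trajectory():
    payload = request.get_json()
    required = {"worker_id", "task_id", "observations", "actions", "success"}
    if payload is None or not required <= payload.keys():
        return jsonify(error="missing fields"), 400
    record = {**payload, "received_at": time.time()}
    path = STORAGE / f"{payload['task_id']}_{uuid.uuid4().hex}.json"
    path.write_text(json.dumps(record))
    return jsonify(status="ok", trajectory_id=path.stem)
```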
03
Calibrate
Pilot & Calibration
Small batch (n=50–100) to validate the interface and calibrate pricing; see the pricing sketch after this list.
Check that workers understand the instructions and complete tasks correctly
Measure average completion time per trajectory
Set a fair price — target ~$10–15 per hour for workers
Identify common failure modes and misunderstandings
Spot early which workers produce consistently high-quality data
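The pricing rule is simple arithmetic: pick a target hourly rate and scale by the median completion time measured in the pilot. A minimal sketch, where the pilot timings and the price floor are made-up numbers:

```python
import statistics

# Completion times (seconds) from a 10-trajectory pilot -- made-up numbers.
pilot_times = [95, 110, 87, 140, 102, 98, 125, 90, 115, 105]

TARGET_HOURLY = 12.0          # aim for roughly $10-15/hour
MIN_PRICE = 0.50              # floor so very short tasks still pay fairly

median_s = statistics.median(pilot_times)
price = max(TARGET_HOURLY * median_s / 3600, MIN_PRICE)
print(f"median {median_s:.0f}s -> pay ${price:.2f} per trajectory")
```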
04
Collect
Large-Scale Parallel Collection
100–500 workers collecting in parallel on crowdsourcing platforms; a batch-release sketch follows this list.
Launch HITs on MTurk, Prolific, or similar crowdsourcing platforms
Release batches gradually (e.g., 100 HITs at a time) to maintain quality control
Auto-save trajectories every few seconds to cloud storage
Store raw video + states + actions separately for flexibility
Monitor a real-time progress dashboard; track per-worker statistics
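For MTurk specifically, gradual batch release can be scripted with boto3's `create_hit`; a hedged sketch, where the teleoperation URL, reward, timeouts, and batch size are placeholders (Prolific and similar platforms expose analogous APIs):

```python
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# ExternalQuestion embeds the hosted teleoperation UI (placeholder URL).
QUESTION_XML = """<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.com/teleop</ExternalURL>
  <FrameHeight>800</FrameHeight>
</ExternalQuestion>"""

def release_batch(n_hits, reward="1.00"):
    """Release one batch of HITs; rerun after quality checks pass."""
    for _ in range(n_hits):
        mturk.create_hit(
            Title="Teleoperate a simulated robot arm",
            Description="Complete a short pick-and-place task in your browser.",
            Reward=reward,                         # dollars, passed as a string
            MaxAssignments=1,                      # one worker per trajectory
            LifetimeInSeconds=24 * 3600,           # HIT visible for one day
            AssignmentDurationInSeconds=15 * 60,   # per-attempt time limit
            Question=QUESTION_XML,
        )

release_batch(100)   # e.g., 100 HITs at a time, per the batching guideline
```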
05
Filter
Automated Quality Filtering
Remove failed, outlier, and duplicate trajectories automatically; see the filtering sketch after this list.
Filter out timed-out, failed, and too-short trajectories
Outlier detection based on trajectory length and success rate distributions
Clustering to remove duplicate or near-identical behavior patterns
Keep workers with >60% success rate; block poor performers early
Typical retention: 70–85% of all collected trajectories pass filtering
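A minimal sketch of these filters, assuming each trajectory is a dict with `success`, `actions`, and `worker_id` keys (illustrative names); near-duplicate clustering is omitted for brevity, and length outliers are rejected with a simple z-score:

```python
import numpy as np
from collections import defaultdict

def filter_trajectories(trajs, min_len=10, z_max=3.0, min_worker_success=0.6):
    """Drop failures, too-short runs, length outliers, and weak workers."""
    # Pass 1: per-worker success rates (>60% keeps a worker, per the guideline).
    stats = defaultdict(lambda: [0, 0])       # worker_id -> [successes, total]
    for t in trajs:
        stats[t["worker_id"]][0] += int(t["success"])
        stats[t["worker_id"]][1] += 1
    ok_workers = {w for w, (s, n) in stats.items() if s / n >= min_worker_success}

    # Pass 2: drop failures, too-short runs, and blocked workers up front.
    kept = [t for t in trajs
            if t["success"] and len(t["actions"]) >= min_len
            and t["worker_id"] in ok_workers]

    # Pass 3: z-score outlier rejection on trajectory length.
    lengths = np.array([len(t["actions"]) for t in kept], dtype=float)
    mu, sigma = lengths.mean(), lengths.std() + 1e-8
    return [t for t, l in zip(kept, lengths) if abs(l - mu) / sigma <= z_max]
```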
06
Process
Data Processing & Format Conversion
Normalize and convert raw trajectories into a training-ready dataset; a processing sketch follows this list.
Resample observations to a consistent frequency (e.g., 10 Hz)
Extract action tensors from raw teleoperation input (joint angles, gripper)
Normalize observations and actions to zero-mean, unit-variance
Split into train / validation / test sets (e.g., 80 / 10 / 10)
Convert to framework dataset format (RLDS, HDF5, JSON, etc.)
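A sketch of the core processing utilities, assuming each trajectory's observations and actions are already stacked as (T, D) NumPy arrays; index-striding resampling stands in for proper interpolation:

```python
import numpy as np

def resample(arr, src_hz, dst_hz=10):
    """Subsample a (T, D) array from src_hz down to dst_hz by index striding."""
    idx = np.arange(0, len(arr), src_hz / dst_hz).astype(int)
    return arr[idx]

def normalize(arr, eps=1e-8):
    """Zero-mean, unit-variance per dimension; return stats for de-normalizing."""
    mean, std = arr.mean(axis=0), arr.std(axis=0) + eps
    return (arr - mean) / std, (mean, std)

def split(trajs, seed=0, frac=(0.8, 0.1, 0.1)):
    """Shuffle whole trajectories, then cut into train/val/test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(trajs))
    n_tr, n_va = int(frac[0] * len(trajs)), int(frac[1] * len(trajs))
    return ([trajs[i] for i in order[:n_tr]],
            [trajs[i] for i in order[n_tr:n_tr + n_va]],
            [trajs[i] for i in order[n_tr + n_va:]])
```

Splitting at the trajectory level (rather than the timestep level) matters: frames from one demonstration are highly correlated, so mixing them across splits would leak test information into training.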
07
Train
Train & Iterate
Train your policy, evaluate performance, identify gaps, and repeat; see the training sketch after this list.
Train the policy via imitation learning (BC, Diffusion Policy, ACT, etc.)
Evaluate success rate and generalization on held-out test set
Identify under-represented scenarios and task failure modes
Collect additional targeted data for those gaps
Repeat until target performance is reached — the flywheel compounds
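To close the loop, a minimal behavior-cloning sketch in PyTorch; the MLP policy, random placeholder tensors, and hyperparameters are illustrative stand-ins for your processed dataset:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data standing in for normalized (obs, action) pairs.
obs = torch.randn(5000, 15)      # matches the 15-dim observation sketch above
acts = torch.randn(5000, 8)      # matches the 8-dim action sketch above
loader = DataLoader(TensorDataset(obs, acts), batch_size=256, shuffle=True)

policy = nn.Sequential(
    nn.Linear(15, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 8),
)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

for epoch in range(10):
    for o, a in loader:
        loss = nn.functional.mse_loss(policy(o), a)  # BC: regress actions
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```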
✅
Key Success Factors
Best practices for getting high-quality data
🎮 Make It Easy to Control
Support multiple input modalities (mouse, keyboard, gamepad) with automatic sensitivity adjustment for each worker's setup.
📝 Clear Instructions
A 60-second video tutorial is worth 1,000 words. Show exactly what to do, what counts as success, and what to avoid.
⚡ Early Quality Filtering
Check quality after the first few trajectories from each worker. Block poor performers early to save money and maintain dataset quality.
💰 Fair Payment
Pay at least $10/hour. Better pay attracts better workers who produce higher quality data — this directly impacts model performance.
🔄 Allow Multiple Attempts
Workers improve with practice. Allowing retries produces better trajectories and reduces frustration-driven abandonment.
🎯 Auto-Reset Environment
One-click reset for failed attempts. Low-friction workflows keep workers engaged and completing more high-quality demonstrations.
⚠️Common pitfall: Collecting overly complex tasks from non-expert workers. Start with simple, atomic tasks completable in 1–2 minutes, and chain simpler skills together rather than recording one complex monolithic task.
🤔
When Should You Use This Approach?
Where this approach fits, and where it does not
✓ Good For
Simulation-based tasks: any task where you need many demonstrations in a sim environment
Imitation learning: behavior cloning needs massive, diverse demonstration data
Multi-task skill collection: collect many different skills from different workers in parallel
Finetuning pre-trained policies: adding diverse trajectories to improve generalization
✗ Less Suitable For
Real-world physical robots: physical hardware still needs in-lab collection
Ultra-precision tasks: tasks requiring expert-level precision may still need domain experts
Safety-critical tasks: when failure is extremely expensive, prefer expert-supervised collection
Very long-horizon tasks: tasks over 10 minutes are hard for crowd workers — split into smaller steps
🚀
Getting Started
Your first data collection project checklist
1
Start small
Begin with a simple task taking 1–2 minutes per demonstration. Don't start with your most complex task.
2
Build the web interface
Use HTML5/JavaScript so workers just click a link — no installation needed. Three.js or Unity WebGL work well.
3
Make a tutorial
Record a 60-second screencast showing how to do the task. This single step dramatically improves data quality.
4
Run a pilot of 50 trajectories
Check results, see where workers struggle, adjust instructions and difficulty before scaling.
5
Scale up in batches
Release 100 trajectories at a time, monitor quality continuously, and retain your best workers.
6
Process and train
Run quality filters, convert to your dataset format, start training, and identify gaps for the next iteration.
💡The XRollout platform provides built-in tools for distributed human data collection. Join our community to get access to the infrastructure.
📄
Original Document
This article summarizes the Human Data Collection Pipeline approach originally published at Physical Intelligence (π).