In 2026, the race to build the best humanoid robot is no longer just about hardware. Tesla Optimus, Figure 02, and Unitree H1 are all competing on a front that gets far less coverage but may prove decisive: training data. Whoever owns the largest, most diverse, highest-quality dataset is best placed to shape the future of humanoid robotics.
This article is the first in a six-part series — a full map of the data ownership landscape. By the end, you'll understand the four datasets reshaping the industry, how the "data flywheel" mechanism works, and why AgiBot's claim of "30% improvement over Open X-Embodiment" is a strategic signal that matters far beyond a benchmark number.
Series Roadmap: Who Owns Humanoid Robot Data in 2026?
| Part | Title | Focus |
|---|---|---|
| 1 (this article) | The Data War: Who Owns Humanoid Robot Data? | Map 4 major datasets, data flywheel mechanics, strategic analysis |
| 2 | Teleoperation: Real-World Data Collection | How AgiBot, Figure, Unitree collect data with teleop hardware |
| 3 | Human Video Mining: Learning from Humans | Using YouTube and internet video to pre-train robot policies |
| 4 | Synthetic Data Pipelines: From Sim to Real | Isaac Lab, MuJoCo, and synthetic trajectory generation at scale |
| 5 | VLA Data Scaling Laws | Scaling laws, data diversity vs quantity, diminishing returns |
| 6 | Data Strategy: What Should You Collect? | Practical guide for small teams and startups |
Why Data Is the New Oil for Humanoid Robots
Imagine teaching a robot to fold laundry. The classic approach — hardcoded kinematic programming — fails because every shirt is shaped differently, fabric state changes with each fold, and lighting conditions vary constantly. You can't "code" every possible situation.
Instead, Vision-Language-Action (VLA) models learn from thousands of demonstrations, each one slightly different, across many environments. This works, but it demands data that is large enough, diverse enough, and clean enough. Meeting all three requirements at once is arguably the hardest open problem in the field right now.
The Data Flywheel: A Self-Reinforcing Spiral
Collect more data
↓
Train better models
↓
Deploy more robots in the real world
↓
Collect more data
(loop continues)
The "data flywheel" concept sounds simple but is enormously powerful: whoever starts spinning it first gains a compounding advantage. Better model → more robot deployments → more real-world data → even better model.
In the LLM world, OpenAI understood this early: ChatGPT wasn't just a product — it was a massive RLHF data collection machine. Robotics in 2026 is replaying that lesson, but with a much harder constraint: collecting physical interaction data costs orders of magnitude more than collecting text.
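To make the compounding effect concrete, here is a toy simulation of the loop above. Every number in it (trajectories per robot, the saturation point of the quality curve, the deployment rate) is an invented assumption chosen for illustration, not a figure from any company. It only shows the shape of the dynamic: a player who starts a few turns earlier keeps widening its data lead.

```python
from dataclasses import dataclass


@dataclass
class FlywheelState:
    trajectories: float  # cumulative demonstrations collected so far
    robots: float        # robots currently deployed


def model_quality(trajectories: float, half_point: float = 100_000.0) -> float:
    """Toy saturating curve in [0, 1): more data helps, with diminishing returns."""
    return trajectories / (trajectories + half_point)


def step(state: FlywheelState, traj_per_robot: float = 500.0,
         max_new_robots: float = 50.0) -> FlywheelState:
    """One turn of the loop: collect data, train, deploy."""
    trajectories = state.trajectories + state.robots * traj_per_robot  # collect
    quality = model_quality(trajectories)                              # train
    robots = state.robots + max_new_robots * quality                   # deploy
    return FlywheelState(trajectories, robots)


def simulate(initial: FlywheelState, turns: int) -> FlywheelState:
    state = initial
    for _ in range(turns):
        state = step(state)
    return state


def lead_over_time(head_start: int, turns: int) -> list[float]:
    """Trajectory gap between a leader with a head start and an identical follower."""
    leader = simulate(FlywheelState(0.0, 10.0), head_start)
    follower = FlywheelState(0.0, 10.0)
    gaps = []
    for _ in range(turns):
        leader, follower = step(leader), step(follower)
        gaps.append(leader.trajectories - follower.trajectories)
    return gaps


gaps = lead_over_time(head_start=4, turns=10)
print(f"gap after 1 turn:   {gaps[0]:,.0f} trajectories")
print(f"gap after 10 turns: {gaps[-1]:,.0f} trajectories")
```

Both players follow identical rules here, so the widening gap comes purely from starting earlier. A real flywheel also involves cost, and as noted above, each physical trajectory is far more expensive than a text sample.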
Four Datasets Reshaping the Game
1. AgiBot World — "The Million-Trajectory Army"
Paper: AgiBot World Colosseo (arXiv:2503.06669)
IROS 2025 Best Paper Award Finalist | IEEE TRO 2026
AgiBot World is the largest and most ambitious dataset in this group, built by AgiBot — a Chinese robotics startup backed by Alibaba. The numbers:
| Metric | Value |
|---|---|
| Total trajectories | 1,001,552 (~1M+) |
| Specific tasks | 217 |
| Skills covered | 87 |
| Distinct scenes | 106 |
| Total data duration | 2,976 hours |
| Deployment scenarios | 5 (kitchen, living room, warehouse, etc.) |
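A quick back-of-the-envelope calculation helps put these figures in perspective. The sketch below uses only the numbers from the table and derives simple averages: roughly 10.7 seconds per trajectory and about 4,600 trajectories per task. Averages hide the real distribution, so treat them as rough orientation only.

```python
# Figures copied from the table above.
TRAJECTORIES = 1_001_552
TASKS = 217
SCENES = 106
HOURS = 2_976

avg_seconds = HOURS * 3600 / TRAJECTORIES   # mean length of one trajectory
per_task = TRAJECTORIES / TASKS             # mean demonstrations per task
per_scene = TRAJECTORIES / SCENES           # mean demonstrations per scene

print(f"average trajectory length: {avg_seconds:.1f} s")
print(f"trajectories per task:     {per_task:,.0f}")
print(f"trajectories per scene:    {per_scene:,.0f}")
```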
The dataset uses a standardized collection pipeline with human-in-the-loop verification — every trajectory is quality-checked by a human before entering the dataset. This is a meaningful distinction from purely automated collection.
The paper introduces Genie Operator-1 (GO-1), a generalist policy that uses latent action representations to maximize data utilization. According to the paper, GO-1 achieves a success rate above 60% on complex tasks and outperforms the earlier RDT approach by 32%.
The strategic point: AgiBot World is open-source (dataset, tools, and models are all public). But this isn't philanthropy — it's a deliberate move to build an ecosystem, attract global talent, and position AgiBot as the "Google of robotics data." Their flywheel keeps spinning; the most valuable part — the ability to collect new data every day from deployed robots — remains their private advantage.
2. Open X-Embodiment — "The Academic Open Standard"
Paper: Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv:2310.08864)
Published in 2023 through a collaboration between 21 research institutions worldwide, Open X-Embodiment (OXE) was the first attempt at a "common language" for robot training data:
| Metric | Value |
|---|---|
| Institutions | 21 organizations, 34 research labs |
| Robot embodiments | 22 different robot types |
| Skills | 527 (~160,000 tasks) |
| License | CC-BY (fully open, commercial use allowed) |
OXE's most important contribution was data format standardization — you can combine data from a Stanford robot with data from Google DeepMind in the same training pipeline. The RT-X (Robot Transformer X) model trained on this dataset demonstrated positive transfer: learning from robot A measurably improves performance on robot B.
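What "format standardization" buys you can be sketched in a few lines. The example below is not OXE's actual schema (OXE distributes its data as RLDS episodes). It is a simplified illustration in which two hypothetical labs, `lab_a` and `lab_b`, log the same kind of information with different field names and units, and a small converter per source maps both into one shared `Step` record. All field names and the 7-value action layout are assumptions made up for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple

Action = Tuple[float, ...]  # (dx, dy, dz, droll, dpitch, dyaw, gripper)


@dataclass(frozen=True)
class Step:
    """One timestep in a shared, robot-agnostic schema (hypothetical)."""
    embodiment: str    # which robot produced the step
    instruction: str   # natural-language task description
    image_ref: str     # pointer to the camera frame
    action: Action     # metres, radians, gripper in [0, 1]


def from_lab_a(raw: dict) -> Step:
    """Hypothetical lab A: already metres and radians, gripper in [0, 1]."""
    action = tuple(raw["delta_pose"]) + (raw["grip"],)
    return Step("lab_a_arm", raw["task"], raw["frame"], action)


def from_lab_b(raw: dict) -> Step:
    """Hypothetical lab B: translation in millimetres, gripper as a string."""
    dx, dy, dz = (v / 1000.0 for v in raw["move_mm"])      # mm -> m
    gripper = 1.0 if raw["gripper_state"] == "open" else 0.0
    action = (dx, dy, dz, *raw["rot_rad"], gripper)
    return Step("lab_b_arm", raw["language"], raw["img"], action)


CONVERTERS: Dict[str, Callable[[dict], Step]] = {
    "lab_a": from_lab_a,
    "lab_b": from_lab_b,
}


def unify(records: Iterable[Tuple[str, dict]]) -> List[Step]:
    """Convert (source, raw_record) pairs into one homogeneous list of steps."""
    return [CONVERTERS[source](raw) for source, raw in records]


mixed = unify([
    ("lab_a", {"task": "pick up the cup", "frame": "a/000.png",
               "delta_pose": [0.01, 0.0, -0.02, 0.0, 0.0, 0.1], "grip": 1.0}),
    ("lab_b", {"language": "open the drawer", "img": "b/000.png",
               "move_mm": [5.0, -10.0, 0.0], "rot_rad": [0.0, 0.0, 0.0],
               "gripper_state": "closed"}),
])
for s in mixed:
    print(s.embodiment, s.instruction, s.action)
```

Once every source lands in the same record type, a single training loop can sample across robots, which is the precondition for the cross-embodiment transfer that RT-X reports.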
The weakness: OXE was built by academia, with many robot types but relatively few trajectories per task. It's a dataset that is wide but not deep — high diversity, but low data density per individual task, especially compared to AgiBot World's 1M+ tightly focused trajectories.
3. Physical Intelligence π0 — "The Silent Empire"
Paper: π₀: A Vision-Language-Action Flow Model for General Robot Control (arXiv:2410.24164)
Physical Intelligence (physicalintelligence.company) was co-founded by Sergey Levine (UC Berkeley) alongside top researchers from Google, Stanford, and CMU. π0 is their flagship model, and here is what is fascinating: we know almost nothing about its actual dataset.
What the paper tells us:
| Aspect | Public information |
|---|---|
| Data sources | OXE + in-house proprietary data |
| Platforms | Single-arm, dual-arm, mobile manipulators |
| Tasks | Dexterous, multi-step (laundry folding, assembly) |
| Duration | 100 seconds to several minutes per task |
| Actual in-house scale | Not disclosed |
| Architecture | Pre-trained VLM + action expert with flow matching loss |
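The "flow matching loss" in the last row deserves a short illustration. The sketch below implements a generic linear-path conditional flow matching objective in plain Python: blend a demonstrated action with Gaussian noise, then train a network to predict the velocity that moves noise toward the action. It is a minimal sketch under stated assumptions, not π0's implementation. The paper's exact parameterization, time sampling, and conditioning on images and language may differ, and `velocity_fn` is a hypothetical stand-in for the action expert.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]  # (noisy action, flow time) -> velocity


def flow_matching_loss(action: Sequence[float], velocity_fn: VelocityFn,
                       rng: random.Random, samples: int = 64) -> float:
    """Monte Carlo estimate of a linear-path conditional flow matching loss."""
    total = 0.0
    for _ in range(samples):
        tau = rng.random()                                 # flow time in [0, 1)
        noise = [rng.gauss(0.0, 1.0) for _ in action]      # sample from N(0, I)
        # Point on the straight line from noise (tau=0) to the action (tau=1).
        noisy = [(1.0 - tau) * n + tau * a for n, a in zip(noise, action)]
        target = [a - n for n, a in zip(noise, action)]    # velocity of that line
        predicted = velocity_fn(noisy, tau)
        total += sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(action)
    return total / samples


def sample_action(velocity_fn: VelocityFn, dim: int, rng: random.Random,
                  steps: int = 10) -> Vector:
    """Euler-integrate the velocity field from pure noise (tau=0) to tau=1."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Sanity check with an oracle that already knows the demonstrated action.
demo = [0.2, -0.1, 0.05]


def oracle(noisy: Vector, tau: float) -> Vector:
    return [(a - x) / (1.0 - tau) for a, x in zip(demo, noisy)]


rng = random.Random(0)
print("oracle loss:", flow_matching_loss(demo, oracle, rng))
print("sampled action:", sample_action(oracle, len(demo), rng))
```

With the oracle, the loss is numerically negligible and sampling lands on the demonstration. A trained network only approximates this velocity field, which is why the quality and coverage of the demonstrations matter so much.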
The real scale of their in-house dataset, task distribution details, and data collection infrastructure are all undisclosed. Physical Intelligence is playing the post-GPT-2 OpenAI playbook: open paper, closed weights, closed data.
Why this matters: By keeping their best datasets and model weights proprietary while still publishing research, Physical Intelligence earns academic credibility without surrendering competitive advantage. This is a long-term moat strategy — no competitor can replicate their data by simply reading the paper.
4. LeRobot — "The Democratization Movement"
GitHub: huggingface/lerobot
HuggingFace's LeRobot is the open-source community's answer to the entire data war. Rather than competing on raw scale with AgiBot or Physical Intelligence, LeRobot focuses on standardization and accessibility:
| Feature | Detail |
|---|---|
| Datasets on Hub | 181+ (and growing) |
| Notable datasets | DROID-100, ALOHA, ALOHA-2, RoboCasa, SO-100 |
| Format | Standardized with PyTorch loaders |
| LeRobotDataset v3 | Streaming — no need to download entire datasets |
| Hardware recipe | SO-100 arm (~$100), Koch v1.1 |
LeRobot's most distinctive contribution is hardware recipes — instructions for building cheap robots so that anyone can start collecting data in a standardized format. When thousands of contributors worldwide collect data with the same format, the aggregate community dataset can compete on diversity — even if not on the concentrated depth that a company like AgiBot can achieve.
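The streaming idea from the table above is easy to illustrate in the abstract. The sketch below does not use LeRobot's API. It is a generic, standard-library illustration of how a training loop can consume episodes one at a time and still get approximate shuffling through a bounded buffer, without ever holding the full dataset locally. `remote_episodes` is a hypothetical stand-in for a remote dataset.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def remote_episodes(n: int) -> Iterator[dict]:
    """Stand-in for a remote dataset: yields one episode at a time (hypothetical)."""
    for i in range(n):
        yield {"episode_index": i, "num_frames": 100 + i}


def shuffle_buffer(stream: Iterable[T], buffer_size: int,
                   rng: random.Random) -> Iterator[T]:
    """Approximately shuffle a stream while holding at most buffer_size items."""
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)          # still filling the buffer
            continue
        slot = rng.randrange(buffer_size)
        yield buffer[slot]               # emit a random buffered item
        buffer[slot] = item              # and replace it with the new one
    rng.shuffle(buffer)                  # drain what is left at the end
    yield from buffer


order = [ep["episode_index"]
         for ep in shuffle_buffer(remote_episodes(20), buffer_size=5,
                                  rng=random.Random(0))]
print(order)
```

Memory use is bounded by `buffer_size` rather than by dataset size, which is the property that makes very large community datasets usable on a laptop.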
Strategic Analysis: Open vs Proprietary Flywheel
┌──────────────────────────────────────────────────────────────────┐
│ Data Strategy Landscape 2026 │
├──────────────────────────┬───────────────────────────────────────┤
│ OPEN (Academic/OSS) │ PROPRIETARY / HYBRID │
├──────────────────────────┼───────────────────────────────────────┤
│ Open X-Embodiment (OXE) │ Physical Intelligence (π0) │
│ • 21 institutions │ • Scale: undisclosed │
│ • CC-BY license │ • Strong commercial advantage │
│ • 527 skills, 22 robots │ • Multi-platform in-house data │
│ • Sets academic standard │ • The silent leader? │
├──────────────────────────┼───────────────────────────────────────┤
│ LeRobot (HuggingFace) │ AgiBot World (open + strategic) │
│ • 181+ community sets │ • 1M+ trajectories (publicly open) │
│ • Cheap hardware recipe │ • Alibaba-backed infrastructure │
│ • Streaming format │ • GO-1 policy open-source │
│ • Community-driven │ • Flywheel continues closed-door │
└──────────────────────────┴───────────────────────────────────────┘
One thing is clear from this map: "open" doesn't mean "without strategy." AgiBot World opens its dataset while retaining competitive advantages in hardware and collection infrastructure. OXE is fully open but limited by its distributed academic model. LeRobot is the most open but depends on voluntary community contributions.
Physical Intelligence is the most interesting case study: by keeping the best data and weights proprietary while still publishing papers, they collect academic credit without surrendering commercial advantage.
Why AgiBot's "30% Improvement" Claim Matters Strategically
AgiBot's official claim: "Policies pre-trained on AgiBot World achieve an average performance improvement of 30% over those trained on Open X-Embodiment."
Technically, this is not an entirely apples-to-apples comparison (different hardware setups, different evaluation protocols, different task distributions). But strategically, it sends four important signals:
1. Scaling appears to pay off in robotics. According to AgiBot, the same policy architecture trained on larger, more uniform data shows significant performance gains. Given the comparability caveats above, this is suggestive empirical evidence rather than a settled scaling law.
2. Systems engineering beats pure research. AgiBot collected 1M+ trajectories not because they had more brilliant researchers, but because they built industrial-scale collection infrastructure with quality control pipelines. The gap between OXE and AgiBot World is about execution, not intelligence.
3. Benchmarks are a communication strategy. When AgiBot publishes this number, they're not just talking to the research community — they're signaling to investors, industry partners, and potential talent that "we are leading." Benchmark games are an integral part of the data war.
4. Opening data doesn't mean losing advantage. AgiBot exposes the dataset while retaining advantages in robot hardware, deployment infrastructure, and most importantly — the ability to collect new data every day from robots operating in the real world. The current dataset is the "show card"; the real flywheel keeps spinning behind closed doors.
The 2026 Landscape: Where Does Everyone Stand?
| Dataset | Data Scale | Openness | Quality Control | Strategic Position |
|---|---|---|---|---|
| AgiBot World | ★★★★★ | ★★★★☆ | ★★★★★ | Aggressive challenger |
| Open X-Embodiment | ★★★☆☆ | ★★★★★ | ★★★☆☆ | Academic foundation |
| Physical Intelligence | Unknown | ★☆☆☆☆ | Estimated high | Silent leader? |
| LeRobot | ★★★☆☆ | ★★★★★ | ★★★☆☆ | Community enabler |
There's no clear "winner" at this point; each player is winning by its own metrics. But if robotics follows the same trajectory as NLP (from the pre-GPT-3 years to the ChatGPT era), the game will be decided when someone achieves critical mass: enough data, diverse enough, to train a model that genuinely generalizes across embodiments and environments.
The core question is: who gets there first? And when they do, will open-source still be able to compete?
The next articles in this series will go deeper into how each dataset is collected, why teleoperation remains the "gold standard" for high-quality data, and ultimately — what you need to do if you want to participate in this game with limited resources.
Read next: Part 2: Teleoperation — Real-World Data Collection.