The Data War: Who Owns Humanoid Robot Data in 2026?

In 2026, the race to build the best humanoid robot is no longer just about hardware. Tesla Optimus, Figure 02, Unitree H1 — they're all competing on a front that gets far less coverage but decides everything: training data. Whoever owns the largest, most diverse, highest-quality dataset owns the future of humanoid robotics.

This article is the first in a seven-part series — a full map of the data ownership landscape. By the end, you'll understand the four datasets reshaping the industry, how the "data flywheel" mechanism works, and why AgiBot's claim of "30% improvement over Open X-Embodiment" is a strategic signal that matters far beyond a benchmark number.

Series Roadmap: Who Owns Humanoid Robot Data in 2026?

Part	Title	Focus
1 (this article)	The Data War: Who Owns Humanoid Robot Data?	Map 4 major datasets, data flywheel mechanics, strategic analysis
2	Teleoperation: Real-World Data Collection	How AgiBot, Figure, Unitree collect data with teleop hardware
3	Human Video Mining: Learning from Humans	Using YouTube and internet video to pre-train robot policies
4	Synthetic Data Pipelines: From Sim to Real	Isaac Lab, MuJoCo, and synthetic trajectory generation at scale
5	VLA Data Scaling Laws	Scaling laws, data diversity vs quantity, diminishing returns
6	Data Strategy: What Should You Collect?	Practical guide for small teams and startups
7	Open vs Closed: Licenses, Data Moats & What's Next	Dataset licenses (CC-BY-NC vs Apache), data marketplaces, 2027 outlook

Why Data Is the New Oil for Humanoid Robots

Imagine teaching a robot to fold laundry. The classic approach — hardcoded kinematic programming — fails because every shirt is shaped differently, fabric state changes with each fold, and lighting conditions vary constantly. You can't "code" every possible situation.

Instead, Vision-Language-Action (VLA) models learn by watching thousands of demonstrations, each one slightly different, across many environments. This works, but it demands data that is large enough, diverse enough, and clean enough. Collecting that data is arguably the hardest open problem in the field right now.

The Data Flywheel: A Self-Reinforcing Spiral

      Collect more data
             ↓
      Train better models
             ↓
  Deploy more robots in the real world
             ↓
      Collect more data
        (loop continues)

The "data flywheel" concept sounds simple but is enormously powerful: whoever starts spinning it first gains a compounding advantage. Better model → more robot deployments → more real-world data → even better model.
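To see why an early start compounds, the loop can be written as a toy simulation. This is only a sketch: the diminishing-returns curve, the fleet cap, and the hours collected per robot are invented parameters, not figures from any company.

```python
import math


def model_quality(data_hours: float) -> float:
    """Hypothetical diminishing-returns curve mapping data volume to quality in [0, 1)."""
    return 1.0 - math.exp(-data_hours / 5000.0)


def simulate_flywheel(seed_hours: float, rounds: int,
                      max_fleet: int = 200,
                      hours_per_robot: float = 10.0) -> list[float]:
    """Return cumulative data hours after each turn of the flywheel."""
    data_hours = seed_hours
    history = []
    for _ in range(rounds):
        quality = model_quality(data_hours)      # train on everything collected so far
        fleet = int(max_fleet * quality)         # better model -> more robots deployed
        data_hours += fleet * hours_per_robot    # deployed robots collect new data
        history.append(data_hours)
    return history


# Two players with identical dynamics; one starts three rounds earlier.
early = simulate_flywheel(seed_hours=500.0, rounds=10)
late = simulate_flywheel(seed_hours=500.0, rounds=7)
gap = early[-1] - late[-1]
print(f"early mover: {early[-1]:.0f} h, late mover: {late[-1]:.0f} h, gap: {gap:.0f} h")
```

Both players follow the same rules, so the gap comes purely from the head start. Changing the assumed curve in model_quality changes how quickly that gap widens or saturates.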

In the LLM world, OpenAI understood this early: ChatGPT wasn't just a product — it was a massive RLHF data collection machine. Robotics in 2026 is replaying that lesson, but with a much harder constraint: collecting physical interaction data costs orders of magnitude more than collecting text.

Four Datasets Reshaping the Game

1. AgiBot World — "The Million-Trajectory Army"

Paper: AgiBot World Colosseo (arXiv:2503.06669)
IROS 2025 Best Paper Award Finalist | IEEE TRO 2026

AgiBot World is the largest and most ambitious dataset in this group, built by AgiBot — a Chinese robotics startup backed by Alibaba. The numbers:

Metric	Value
Total trajectories	1,001,552 (~1M+)
Specific tasks	217
Skills covered	87
Distinct scenes	106
Total data duration	2,976 hours
Deployment scenarios	5 (kitchen, living room, warehouse, etc.)

The dataset uses a standardized collection pipeline with human-in-the-loop verification — every trajectory is quality-checked by a human before entering the dataset. This is a meaningful distinction from purely automated collection.
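A few derived figures help put the table in perspective. The short calculation below takes the published totals at face value and computes simple means. It assumes nothing about how trajectories are actually distributed across tasks or scenes, which the table does not say.

```python
# Totals copied from the table above.
TRAJECTORIES = 1_001_552
TASKS = 217
SCENES = 106
TOTAL_HOURS = 2_976

avg_seconds = TOTAL_HOURS * 3600 / TRAJECTORIES   # mean length of one trajectory
per_task = TRAJECTORIES / TASKS                   # mean trajectories per task
per_scene = TRAJECTORIES / SCENES                 # mean trajectories per scene

print(f"mean trajectory length: {avg_seconds:.1f} s")
print(f"mean trajectories per task: {per_task:.0f}")
print(f"mean trajectories per scene: {per_scene:.0f}")
```

On these numbers, a trajectory averages roughly 10.7 seconds and each task has about 4,600 demonstrations on average. That depth per task is what the comparison with Open X-Embodiment later in this article turns on.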

The paper introduces Genie Operator-1 (GO-1) — their latest policy using latent action representations to maximize data utilization. GO-1 achieves 60%+ success rate on complex tasks and outperforms the previous RDT approach by 32%.

The strategic point: AgiBot World is open-source (dataset, tools, and models are all public). But this isn't philanthropy — it's a deliberate move to build an ecosystem, attract global talent, and position AgiBot as the "Google of robotics data." Their flywheel keeps spinning; the most valuable part — the ability to collect new data every day from deployed robots — remains their private advantage.

2. Open X-Embodiment — "The Academic Open Standard"

Paper: Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv:2310.08864)

Published in 2023 through a collaboration between 21 research institutions worldwide, Open X-Embodiment (OXE) was the first attempt at a "common language" for robot training data:

Metric	Value
Institutions	21 organizations, 34 research labs
Robot embodiments	22 different robot types
Skills	527 (~160,000 tasks)
License	CC-BY (fully open, commercial use allowed)

OXE's most important contribution was data format standardization — you can combine data from a Stanford robot with data from Google DeepMind in the same training pipeline. The RT-X (Robot Transformer X) model trained on this dataset demonstrated positive transfer: learning from robot A measurably improves performance on robot B.
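What format standardization buys you can be sketched in a few lines. OXE itself distributes data in the RLDS episode format. The schema, field names, and the two "lab" record layouts below are hypothetical simplifications for illustration, not the real OXE schema.

```python
from dataclasses import dataclass

ACTION_DIM = 7  # dx, dy, dz, droll, dpitch, dyaw, gripper (assumed layout)


@dataclass
class Step:
    """One timestep in a shared, embodiment-agnostic schema (hypothetical)."""
    image_key: str
    instruction: str
    action: list[float]
    embodiment: str


def from_lab_a(record: dict) -> Step:
    """Hypothetical lab A logs a 6-D end-effector delta plus a gripper value."""
    action = list(record["ee_delta"]) + [record["gripper"]]
    return Step(record["frame"], record["task"], action, "lab_a_arm")


def from_lab_b(record: dict) -> Step:
    """Hypothetical lab B logs translation only, in millimetres, with a boolean gripper."""
    xyz_m = [v / 1000.0 for v in record["xyz_mm"]]          # convert mm to m
    gripper = 1.0 if record["closed"] else 0.0
    return Step(record["img"], record["prompt"],
                xyz_m + [0.0, 0.0, 0.0] + [gripper], "lab_b_arm")


raw_a = [{"frame": "a/0001.png", "task": "pick up the cup",
          "ee_delta": [0.01, 0.0, -0.02, 0.0, 0.0, 0.1], "gripper": 1.0}]
raw_b = [{"img": "b/0001.png", "prompt": "open the drawer",
          "xyz_mm": [5.0, -10.0, 0.0], "closed": False}]

# After conversion, both sources can feed the same training loop.
mixed = [from_lab_a(r) for r in raw_a] + [from_lab_b(r) for r in raw_b]
for step in mixed:
    print(step.embodiment, step.instruction, step.action)
```

The adapters are the unglamorous part: units, action dimensions, and field names all have to be reconciled once per source before any cross-robot training can happen.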

The weakness: OXE was built by academia, with many robot types but relatively few trajectories per task. It's a dataset that is wide but not deep — high diversity, but low data density per individual task, especially compared to AgiBot World's 1M+ tightly focused trajectories.

3. Physical Intelligence π0 — "The Silent Empire"

Paper: π₀: A Vision-Language-Action Flow Model for General Robot Control (arXiv:2410.24164)

Physical Intelligence (physicalintelligence.company) was co-founded by Sergey Levine (UC Berkeley) alongside top researchers from Google, Stanford, and CMU. π0 is their flagship model, and here's what's fascinating: we know almost nothing about its actual dataset.

What the paper tells us:

Aspect	Public information
Data sources	OXE + in-house proprietary data
Platforms	Single-arm, dual-arm, mobile manipulators
Tasks	Dexterous, multi-step (laundry folding, assembly)
Duration	100 seconds to several minutes per task
Actual in-house scale	Not disclosed
Architecture	Pre-trained VLM + action expert with flow matching loss

The real scale of their in-house dataset, task distribution details, and data collection infrastructure are all undisclosed. Physical Intelligence is playing a variant of the post-GPT-2 OpenAI playbook: open paper, closed data. The company did later release π0 weights and code through its openpi repository, but the in-house data behind those weights has not been published.

Why this matters: By keeping its datasets proprietary while still publishing research, Physical Intelligence earns academic credibility without surrendering its competitive advantage. This is a long-term moat strategy: no competitor can replicate the data by simply reading the paper.
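The "flow matching loss" in the architecture row is less exotic than it sounds. Below is a minimal single-sample sketch of a generic flow matching objective on a straight noise-to-action path. It is not π0's exact formulation: the 7-D action, the stand-in models, and the use of plain Python lists instead of tensors are all assumptions for illustration.

```python
import random


def flow_matching_loss(predict_velocity, demo_action, obs, rng):
    """Single-sample flow matching loss on a straight noise-to-action path."""
    noise = [rng.gauss(0.0, 1.0) for _ in demo_action]
    t = rng.random()  # interpolation time in [0, 1)
    # Point on the straight line from noise (t=0) to the demonstrated action (t=1).
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, demo_action)]
    # The velocity along that line is constant: action minus noise.
    target = [a - n for n, a in zip(noise, demo_action)]
    pred = predict_velocity(x_t, t, obs)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(demo_action)


demo = [0.10, -0.05, 0.02, 0.0, 0.0, 0.3, 1.0]  # hypothetical 7-D action


def zero_model(x_t, t, obs):
    """Untrained stand-in that always predicts zero velocity."""
    return [0.0] * len(x_t)


def oracle_model(x_t, t, obs):
    """Cheating stand-in that knows the demo; ideal velocity is (a - x_t) / (1 - t)."""
    return [(a - x) / (1.0 - t) for a, x in zip(demo, x_t)]


loss_untrained = flow_matching_loss(zero_model, demo, None, random.Random(0))
loss_oracle = flow_matching_loss(oracle_model, demo, None, random.Random(0))
print(f"untrained: {loss_untrained:.4f}, oracle: {loss_oracle:.2e}")
```

In a real VLA, predict_velocity would be the action expert conditioned on features from the pre-trained VLM (passed here as obs), and the loss would be averaged over large batches of demonstrations. That is why the quantity and quality of the demonstration data matter so much.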

4. LeRobot — "The Democratization Movement"

GitHub: huggingface/lerobot

HuggingFace's LeRobot is the open-source community's answer to the entire data war. Rather than competing on raw scale with AgiBot or Physical Intelligence, LeRobot focuses on standardization and accessibility:

Feature	Detail
Datasets on Hub	181+ (and growing)
Notable datasets	DROID-100, ALOHA, ALOHA-2, RoboCasa, SO-100
Format	Standardized with PyTorch loaders
LeRobotDataset v3	Streaming — no need to download entire datasets
Hardware recipe	SO-100 arm (~$100), Koch v1.1

LeRobot's most distinctive contribution is hardware recipes — instructions for building cheap robots so that anyone can start collecting data in a standardized format. When thousands of contributors worldwide collect data with the same format, the aggregate community dataset can compete on diversity — even if not on the concentrated depth that a company like AgiBot can achieve.
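The streaming idea from the table is worth a short illustration. The sketch below is not the LeRobot API. It uses plain Python generators and made-up shard contents to show the principle: frames are decoded one at a time, so a consumer can start training without the full dataset on disk. The field names are borrowed loosely and should be treated as hypothetical.

```python
import itertools
import json
from typing import Iterable, Iterator

counter = {"produced": 0}  # tracks how many frames the fake shards actually generated


def fake_shard(episode_index: int, num_frames: int) -> Iterator[str]:
    """Stand-in for a remote shard: yields one JSON line per frame, lazily."""
    for i in range(num_frames):
        counter["produced"] += 1
        yield json.dumps({"episode_index": episode_index,
                          "frame_index": i,
                          "action": [0.0] * 6})


def stream_frames(shards: Iterable[Iterable[str]]) -> Iterator[dict]:
    """Yield decoded frames one by one; only the current line is held in memory."""
    for shard in shards:
        for line in shard:
            yield json.loads(line)


# 50 episodes of 1000 frames each exist "remotely", but nothing is generated yet.
shards = (fake_shard(ep, 1000) for ep in range(50))
first_batch = list(itertools.islice(stream_frames(shards), 8))

print(len(first_batch), "frames consumed;", counter["produced"], "frames produced")
```

Only the frames actually requested are ever produced. For a small team with limited storage, that is the difference between experimenting with a large community dataset and not touching it at all.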

Strategic Analysis: Open vs Proprietary Flywheel

┌──────────────────────────────────────────────────────────────────┐
│                    Data Strategy Landscape 2026                  │
├──────────────────────────┬───────────────────────────────────────┤
│  OPEN (Academic/OSS)     │  PROPRIETARY / HYBRID                 │
├──────────────────────────┼───────────────────────────────────────┤
│ Open X-Embodiment (OXE)  │ Physical Intelligence (π0)            │
│ • 21 institutions        │ • Scale: undisclosed                  │
│ • CC-BY license          │ • Strong commercial advantage         │
│ • 527 skills, 22 robots  │ • Multi-platform in-house data        │
│ • Sets academic standard │ • The silent leader?                  │
├──────────────────────────┼───────────────────────────────────────┤
│ LeRobot (HuggingFace)    │ AgiBot World (open + strategic)       │
│ • 181+ community sets    │ • 1M+ trajectories (publicly open)    │
│ • Cheap hardware recipe  │ • Alibaba-backed infrastructure       │
│ • Streaming format       │ • GO-1 policy open-source             │
│ • Community-driven       │ • Flywheel continues closed-door      │
└──────────────────────────┴───────────────────────────────────────┘

One thing is clear from this map: "open" doesn't mean "without strategy." AgiBot World opens its dataset while retaining competitive advantages in hardware and collection infrastructure. OXE is fully open but limited by its distributed academic model. LeRobot is the most open but depends on voluntary community contributions.

Physical Intelligence is the most interesting case study: by keeping its in-house data proprietary while still publishing papers, it collects academic credit without surrendering commercial advantage.

Why AgiBot's "30% Improvement" Claim Matters Strategically

AgiBot's official claim: "Policies pre-trained on AgiBot World achieve an average performance improvement of 30% over those trained on Open X-Embodiment."

Technically, this is not an entirely apples-to-apples comparison (different hardware setups, different evaluation protocols, different task distributions). But strategically, it sends four important signals:

1. Scaling appears to hold in robotics. Policies pre-trained on larger, more uniform data show significant performance gains. Given the comparison caveats above, this is an encouraging data point rather than a confirmed scaling law, but it moves the idea from hypothesis toward empirical evidence.

2. Systems engineering beats pure research. AgiBot collected 1M+ trajectories not because they had more brilliant researchers, but because they built industrial-scale collection infrastructure with quality control pipelines. The gap between OXE and AgiBot World is about execution, not intelligence.

3. Benchmarks are a communication strategy. When AgiBot publishes this number, they're not just talking to the research community — they're signaling to investors, industry partners, and potential talent that "we are leading." Benchmark games are an integral part of the data war.

4. Opening data doesn't mean losing advantage. AgiBot exposes the dataset while retaining advantages in robot hardware, deployment infrastructure, and most importantly — the ability to collect new data every day from robots operating in the real world. The current dataset is the "show card"; the real flywheel keeps spinning behind closed doors.
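One more reason to read the headline number carefully: as quoted, "an average performance improvement of 30%" does not say whether the gain is relative or in absolute percentage points. The baseline success rate below is invented purely to show how far apart the two readings can be.

```python
def relative_gain(baseline: float, new: float) -> float:
    """Improvement as a fraction of the baseline score."""
    return (new - baseline) / baseline


def absolute_gain_points(baseline: float, new: float) -> float:
    """Improvement in percentage points."""
    return (new - baseline) * 100.0


baseline = 0.40  # hypothetical success rate of an OXE-pretrained policy

relative_reading = baseline * 1.30   # "30% better" read as a relative gain
absolute_reading = baseline + 0.30   # "30% better" read as +30 percentage points

print(f"relative reading: {relative_reading:.2f} success rate")
print(f"absolute reading: {absolute_reading:.2f} success rate")
print(f"the absolute reading is a {relative_gain(baseline, absolute_reading):.0%} relative gain")
```

With this made-up baseline, the same sentence describes either a 0.52 or a 0.70 success rate. Any serious comparison needs the underlying per-task scores, not just the averaged percentage.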

The 2026 Landscape: Where Does Everyone Stand?

Dataset	Data Scale	Openness	Quality Control	Strategic Position
AgiBot World	★★★★★	★★★★☆	★★★★★	Aggressive challenger
Open X-Embodiment	★★★☆☆	★★★★★	★★★☆☆	Academic foundation
Physical Intelligence	Unknown	★☆☆☆☆	Estimated high	Silent leader?
LeRobot	★★★☆☆	★★★★★	★★★☆☆	Community enabler

There's no clear "winner" at this point: each player is winning by their own metrics. But if robotics follows the same trajectory as NLP (from the pre-GPT-3 years to the ChatGPT era), the game will be decided when someone achieves critical mass: enough data, diverse enough, to train a model that genuinely generalizes across embodiments and environments.

The core question is: who gets there first? And when they do, will open-source still be able to compete?

The next articles in this series will go deeper into how each dataset is collected, why teleoperation remains the "gold standard" for high-quality data, and ultimately — what you need to do if you want to participate in this game with limited resources.

Part 2: Teleoperation — Real-World Data Collection — Teleop systems from AgiBot, Figure, Unitree and why human demonstration remains the gold standard
AgiBot World Dataset: Technical Deep Dive — Deep dive into GO-1 policy architecture and AgiBot World Colosseo's design principles
Embodied AI 2026: The Full Landscape — A wider view of where robotics AI stands today and where it's heading

Nguyễn Anh Tuấn
