Open vs Closed: Licenses, Data Moats, and What's Next

Compare robot dataset licenses, commercial training risk, closed fleet data moats, and the likely open vs closed landscape in 2027.

Nguyễn Anh Tuấn | June 12, 2026 | 16 min read
Final part: open data does not always mean free for commercial use

The first six posts in this series covered the humanoid data landscape, teleoperation, human video mining, synthetic pipelines, VLA scaling laws, and practical data strategy for small teams. If you are joining here, start with Part 1: the humanoid data war landscape, Part 5: VLA data scaling, and Part 6: data strategy for small teams. This final post asks the strategic question: should robotics teams bet on open datasets, closed proprietary data, or a hybrid model?

The short answer: in 2026, "open" helps you learn faster, but it does not automatically grant commercial rights. A dataset can be publicly downloadable, backed by a strong paper, and surrounded by useful code examples, while still prohibiting use in a paid product. Conversely, a private dataset with a clear commercial license can be a much stronger asset if it comes from real robots, real environments, real failures, and clean usage rights.

This is not legal advice. It is an engineering and strategy checklist for robotics founders, ML engineers, and product teams: how to read dataset licenses, where commercial training risk appears, why Tesla and Figure have closed data moats, where open systems like LeRobot and AgiBot help, and what the 2027 open vs closed landscape may look like.

Robotics lab and control systems

One license table that prevents big mistakes

Before talking about moats, look at four dataset families and ecosystems that come up often in embodied AI discussions:

| Dataset / ecosystem | Access model | Main license | Meaning for commercial models |
| --- | --- | --- | --- |
| AgiBot World / AgiBotWorld2026 | Public download, very large dataset | CC BY-NC-SA 4.0 | Not usable for commercial purposes without separate rights. ShareAlike increases risk when distributing derivatives. |
| Open X-Embodiment | Unified collection of datasets from many labs | Repository is Apache-2.0; each subset needs metadata and original-license review | Do not assume the entire mixture is commercial-safe. Audit subsets before training a product model. |
| Ego4D | Requires reviewing and executing a license agreement, with time-limited credentials | Custom license agreement, not a simple CC license | Allows training and developing models within the stated "Purpose", including commercial product development under the agreement, but does not allow redistributing the database or giving third parties access to it. |
| LeRobot / Hugging Face datasets | Open tooling, community-published datasets | Depends on each dataset card: Apache-2.0, MIT, CC-BY, CC-BY-NC, custom | LeRobot is the format and tooling layer. Commercial rights depend on the specific dataset, not the library. |

The key point: the license of the code is not automatically the license of the data. The Open X-Embodiment repository uses Apache-2.0 for code, but the README also points users to metadata for the contributed datasets. If you train a commercial VLA on the full mixture, you need to know where each subset came from, which license applies, whether it contains human data, and whether downstream restrictions exist.
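As a rough illustration of a per-subset audit, the sketch below splits a mixture into subsets that are allowed, excluded, or still need review. The `Subset` class, the subset names, and the `COMMERCIAL_ALLOWLIST` contents are all hypothetical; which licenses belong on such a list is a decision for your own legal review, not something this code settles.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical allowlist: license identifiers this team has decided to treat
# as acceptable for commercial training. An assumption, not legal advice.
COMMERCIAL_ALLOWLIST = {"apache-2.0", "mit", "cc-by-4.0"}


@dataclass(frozen=True)
class Subset:
    name: str
    license_id: Optional[str]  # None means the license could not be determined


def partition_mixture(subsets):
    """Split subsets into (allowed, excluded, needs_review) lists of names."""
    allowed, excluded, needs_review = [], [], []
    for subset in subsets:
        if subset.license_id is None:
            # Unknown provenance is never silently accepted.
            needs_review.append(subset.name)
        elif subset.license_id.lower() in COMMERCIAL_ALLOWLIST:
            allowed.append(subset.name)
        else:
            # Anything off the allowlist stays out until someone reviews it.
            excluded.append(subset.name)
    return allowed, excluded, needs_review


# Made-up subset names, for illustration only.
mixture = [
    Subset("lab_a_pick_place", "Apache-2.0"),
    Subset("lab_b_kitchen", "CC-BY-NC-4.0"),
    Subset("lab_c_drawers", None),
]
allowed, excluded, needs_review = partition_mixture(mixture)
print("allowed:", allowed)
print("excluded:", excluded)
print("needs review:", needs_review)
```

Defaulting to exclusion is deliberate: a custom license may turn out to permit commercial use, but it should earn its way onto the allowlist rather than pass by omission.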

For AgiBot World, the official sources say that data and code in the repository are under CC BY-NC-SA 4.0. Creative Commons explains that NonCommercial bars uses primarily intended for commercial advantage or monetary compensation, and ShareAlike requires adaptations that are distributed to use the same license. For a startup, that is concrete: you can use AgiBot to learn, benchmark, run internal research, and study pipelines, but if a checkpoint enters a paid product, you need a separate commercial license or you need to remove NC data from the training lineage.

Ego4D is different. You do not simply download a public dataset; you execute an agreement. The start-here documentation says users must review and accept the license before receiving AWS credentials. The license draft states that users retain IP in software, algorithms, machine learning models, annotations, techniques, and technologies developed from using the database, and that those outputs may be used for academic, commercial, or noncommercial purposes, subject to the agreement. The same agreement also prohibits selling, renting, sublicensing, transferring, or giving third-party access to the database. In other words, permission to train a model is not permission to package raw video into your product or upload the data to a hub.

Three questions before training a commercial model

When reading a license, do not stop at "open-source" or "publicly available". Answer three questions:

| Question | Why it matters | Example risk |
| --- | --- | --- |
| Does the dataset allow commercial use? | This is the first gate. If it is NonCommercial, paid products are high risk. | Training a factory policy on CC BY-NC-SA data. |
| Are model weights treated as derivative/adapted material? | Laws around models trained on data remain unsettled across jurisdictions. Contracts can be broader than copyright. | You cannot prove that a checkpoint did not learn from a restricted subset. |
| Do you have an audit trail? | During fundraising, enterprise sales, or customer reviews, you need to prove where data came from. | Nobody knows the dataset version, subset list, download date, or license snapshot. |

A minimum audit checklist for robotics teams:

dataset_audit:
  dataset_name: "example_robot_dataset"
  source_url: "https://..."
  downloaded_at: "2026-06-12"
  license_name: "CC-BY-4.0 / Apache-2.0 / custom"
  commercial_use_allowed: true
  contains_humans_or_faces: false
  contains_customer_ip: false
  redistribution_allowed: false
  model_training_allowed: true
  attribution_required: true
  sharealike_or_copyleft: false
  subsets_excluded:
    - "all CC-BY-NC subsets"
    - "all datasets without clear robot action logs"
  legal_review_required_before:
    - "shipping paid product"
    - "publishing checkpoint"
    - "enterprise contract"
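The checklist becomes more useful once a training pipeline can read it. The following is a minimal sketch, assuming the audit record is loaded into a Python object whose field names mirror the checklist above. The `DatasetAudit` class and the `commercial_blockers` function are hypothetical names, and the rules are deliberately conservative examples rather than a legal standard.

```python
from dataclasses import dataclass


@dataclass
class DatasetAudit:
    # Field names mirror the checklist; the class itself is illustrative.
    dataset_name: str
    license_name: str
    commercial_use_allowed: bool
    model_training_allowed: bool
    contains_humans_or_faces: bool
    contains_customer_ip: bool
    sharealike_or_copyleft: bool


def commercial_blockers(audit):
    """Return reasons this dataset should not yet enter a commercial training run."""
    blockers = []
    if not audit.commercial_use_allowed:
        blockers.append("license does not allow commercial use")
    if not audit.model_training_allowed:
        blockers.append("license does not allow model training")
    if audit.contains_customer_ip:
        blockers.append("customer IP needs contract review")
    if audit.contains_humans_or_faces:
        blockers.append("people in the data need consent and privacy review")
    if audit.sharealike_or_copyleft:
        blockers.append("ShareAlike/copyleft terms need legal review")
    return blockers


# A made-up record for a NonCommercial, ShareAlike dataset.
nc_dataset = DatasetAudit(
    dataset_name="example_nc_dataset",
    license_name="CC BY-NC-SA 4.0",
    commercial_use_allowed=False,
    model_training_allowed=True,
    contains_humans_or_faces=False,
    contains_customer_ip=False,
    sharealike_or_copyleft=True,
)
for reason in commercial_blockers(nc_dataset):
    print("BLOCKED:", reason)
```

An empty list means no blocker was detected by these particular rules, not that the dataset is cleared; the items under `legal_review_required_before` still apply.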

Many small teams skip this because the work begins as "just research". Embodied AI is different from many text-only ML workflows. Robot data often includes factory video, private homes, faces, voices, workplace behavior, logos, production-line layout, or objects under NDA. The license is only one layer. Privacy, consent, trade secrets, export controls, and customer contracts are separate layers.

The hardest issue is not naming the license. It is working out what training creates in legal terms.

In traditional software, if you copy GPL code into a product, the risk is relatively legible. In AI, model weights do not copy the dataset in a way humans can directly inspect, but they may memorize, reconstruct, or encode patterns from the data. Robot policies add a practical risk: a policy trained on Factory A data may encode layout, process details, or unusual objects from that factory. If the data includes people, a model can learn gestures, faces, voices, or identifying information.

For commercial robotics, divide the risk into four layers:

| Layer | Question | Risk reduction |
| --- | --- | --- |
| Input data | Do you have rights to use this data for training? | License audit, provider contracts, remove NC or unknown subsets. |
| Training mixture | Did you mix clean and restricted data? | Versioned manifests, file hashes, experiment tracking. |
| Model artifact | Are you allowed to publish or sell the checkpoint? | Publish only when the license permits, with attribution, and without restricted data lineage. |
| Product behavior | Can the robot reveal or reproduce sensitive information? | Privacy evals, red-teaming, policy filters, customer-specific isolation. |
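For the training-mixture layer, a versioned manifest with file hashes can be as simple as the sketch below. It uses only the Python standard library. The manifest layout, the function names, and the example file are assumptions chosen for illustration; real pipelines often store this alongside experiment-tracking metadata.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large video files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, dataset_name, license_name):
    """Record every file under root with its SHA-256, keyed by relative path."""
    root = Path(root)
    files = {
        p.relative_to(root).as_posix(): sha256_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    return {
        "dataset_name": dataset_name,
        "license_name": license_name,
        "files": files,
    }


# Demo on a throwaway directory holding one fake episode file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "episode_000.bin").write_bytes(b"fake episode bytes")
    manifest = build_manifest(tmp, "example_robot_dataset", "Apache-2.0")

print(json.dumps(manifest, indent=2))
```

Committing the manifest next to the training config lets you answer later which exact files, under which license, went into a given run.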

A safer practice is to separate three checkpoint classes:

research_checkpoint:
  - may use NC or custom research-license datasets
  - not used for paid customers
  - not deployed in product

commercial_pretrain_checkpoint:
  - uses only commercial-safe datasets
  - has a complete audit trail
  - can be used in product

customer_finetune_checkpoint:
  - trained on a specific customer's data
  - usage rights governed by that customer contract
  - not merged back into the general model unless rights allow it

This is where startups often get hurt: they use an experimental checkpoint in a customer demo, the demo works, and nobody remembers that the checkpoint contains lineage from a NonCommercial dataset. Six months later, enterprise due diligence asks for data provenance. Treat robot data lineage with the same seriousness as dependency license management in SaaS.
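One way to keep a research checkpoint out of a customer demo is to make lineage checkable in code. The sketch below models the three checkpoint classes and walks a checkpoint's ancestry before it is served. The `Checkpoint` class, the `can_serve` rule, and all names in the example are hypothetical; a real system would read lineage from its experiment tracker.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CheckpointClass(Enum):
    RESEARCH = "research_checkpoint"
    COMMERCIAL_PRETRAIN = "commercial_pretrain_checkpoint"
    CUSTOMER_FINETUNE = "customer_finetune_checkpoint"


@dataclass(frozen=True)
class Checkpoint:
    name: str
    checkpoint_class: CheckpointClass
    parent: Optional["Checkpoint"] = None  # checkpoint this one was trained from
    customer: Optional[str] = None         # set only for customer finetunes


def lineage(ckpt):
    """Yield the checkpoint and all of its ancestors, newest first."""
    while ckpt is not None:
        yield ckpt
        ckpt = ckpt.parent


def can_serve(ckpt, customer):
    """Allow serving only if no ancestor is research-only or another customer's."""
    for c in lineage(ckpt):
        if c.checkpoint_class is CheckpointClass.RESEARCH:
            return False
        if (
            c.checkpoint_class is CheckpointClass.CUSTOMER_FINETUNE
            and c.customer != customer
        ):
            return False
    return True


research = Checkpoint("vla-research-v3", CheckpointClass.RESEARCH)
demo = Checkpoint("demo-ft", CheckpointClass.CUSTOMER_FINETUNE, research, "acme")
clean = Checkpoint("vla-clean-v1", CheckpointClass.COMMERCIAL_PRETRAIN)
acme = Checkpoint("acme-ft", CheckpointClass.CUSTOMER_FINETUNE, clean, "acme")

print(can_serve(demo, "acme"))    # research ancestor in the lineage
print(can_serve(acme, "acme"))    # clean lineage, same customer
print(can_serve(acme, "globex"))  # finetuned on another customer's data
```

The demo checkpoint fails the check even though it was finetuned for the right customer, because its parent is a research checkpoint. That is exactly the case that tends to surface during due diligence.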

Closed data moats: Tesla and Figure are not just building robots

Tesla and Figure are often discussed as humanoid hardware companies. In the data war, that hardware doubles as the sensing and actuation layer of a data flywheel.

Tesla states on its AI & Robotics page that it develops and deploys autonomy at scale in vehicles, robots, and beyond, using vision, planning, and inference hardware. Optimus does not have a public dataset like AgiBot, but Tesla's strategic advantage is a closed systems culture: in-house hardware, inference chips, data engines, deployment loops, and manufacturing environments. If Optimus is deployed inside Tesla factories, every failure, pause, intervention, and operator correction can become internal training signal. Competitors cannot download that from Hugging Face.

Figure is also building closed fleet data. Its Helix logistics report describes an internal VLA model, a low-level visuomotor policy, and strong results from just 8 hours of well-curated demonstration data for package manipulation. Project Go-Big is even more revealing: Figure says it used egocentric human video collected in real Brookfield homes to train Helix for navigation, with no robot demonstrations required for the initial result. Whether or not you believe every generalization claim, the strategic message is clear: Figure wants deployment, real-estate partnerships, and human video collection to become a private data engine.

Closed moats are powerful for three reasons:

| Advantage | Why it is hard to copy |
| --- | --- |
| Real fleet feedback | Public datasets are snapshots. Fleet data is a continuous stream from real robots. |
| Private task distribution | Tesla factories, Figure logistics, and home partnerships do not match public lab distributions. |
| Natural operational labels | Intervention, success, failure, recovery, and operator correction are high-value labels. |

The downside is cost. You must build robots, deploy them, maintain them, run teleoperation or human support, build the data pipeline, handle privacy, and retrain continuously. A closed data moat is a game for companies with capital, supply chain, and real deployment customers.

The open ecosystem: LeRobot, AgiBot, and the power of shared standards

The open ecosystem does not win by secrecy. It wins by spreading knowledge quickly.

LeRobot does something basic but crucial: it standardizes how robotics teams record, store, stream, visualize, and train robot datasets. The LeRobotDataset v3 documentation describes a unified format for multimodal time-series data, sensorimotor signals, multi-camera video, metadata, and streaming directly from the Hugging Face Hub. When many labs use the same format, the community can reuse dataloaders, visualizers, evaluators, and training scripts.

AgiBot sits in a more unusual position. Its public dataset is large, but the CC BY-NC-SA license makes it closer to a research commons than a commercial commons. That is still extremely valuable. It helps students, labs, and startups learn pipelines, benchmark policies, analyze task distributions, test architectures, and build format conversion tools. But if the goal is a paid product, AgiBot cannot be the default commercial foundation unless a separate agreement exists.

Open X-Embodiment teaches another lesson: a large multi-embodiment mixture can push cross-embodiment research and RT-X-style models, but because the mixture comes from many contributors, license and quality are also mixed. That points to the likely future of open robotics: not one dataset to rule them all, but many datasets with standardized metadata and clear licenses so training teams can choose appropriate subsets.

A quick comparison:

| Model | Strength | Weakness | Best users |
| --- | --- | --- | --- |
| Closed fleet | Proprietary, product-matched, continuous feedback | Capital intensive, operationally hard, high privacy risk | Tesla, Figure, 1X, Unitree, and teams with real deployment |
| Research open | Fast learning, strong benchmarks, broad community | May prohibit commercial use, may not match product distribution | Labs, students, prototype-stage startups |
| Commercial open | Clear commercial rights, attribution, auditability | Smaller, more expensive, or governance-heavy | Startups planning to ship product |
| Hybrid | Open for pretraining/tooling, proprietary data for finetuning | Lineage management is harder | Most practical teams |

Data marketplaces: what robotics still lacks in 2026

In LLMs, data markets are familiar: web crawl, licensed text, synthetic instructions, human preference data, and enterprise documents. Robotics lacks a mature data marketplace because the data is not just a file. A good robot dataset needs:

| Component | Why it matters |
| --- | --- |
| Multi-camera video | VLA observations, occlusion debugging, and context. |
| Synchronized state/action | Without action logs, video is mostly perception pretraining data. |
| Robot metadata | Embodiment, joint order, gripper, camera pose, control frequency. |
| Task and success labels | You need to know whether the episode succeeded and what instruction was given. |
| Consent and privacy metadata | Especially for egocentric human video, private homes, and workplaces. |
| Machine-readable license | Training pipelines can automatically exclude incompatible subsets. |
| Standard eval split | Not only training data, but benchmarks for comparing policies. |
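A buyer or training pipeline could check these components automatically before accepting an episode. The sketch below is one possible shape for that check. The key names in `REQUIRED_KEYS` are hypothetical and do not correspond to any specific dataset format; map them to whatever your format actually calls these fields.

```python
# Hypothetical per-episode keys, one per component in the table above.
REQUIRED_KEYS = (
    "camera_streams",      # multi-camera video
    "states_and_actions",  # synchronized state/action logs
    "robot_metadata",      # embodiment, joint order, control frequency
    "task_instruction",    # what the robot was asked to do
    "success",             # bool; False is a valid label, not a missing one
    "consent_record",      # consent and privacy metadata
    "license_id",          # machine-readable license
)


def missing_components(episode):
    """Return required keys that are absent or empty in an episode record."""
    missing = []
    for key in REQUIRED_KEYS:
        value = episode.get(key)
        # Compare against empties explicitly so success=False is not flagged.
        if value is None or value == "" or value == [] or value == {}:
            missing.append(key)
    return missing


# A made-up episode: video only, with a failure label and a license.
episode = {
    "camera_streams": ["front.mp4", "wrist.mp4"],
    "robot_metadata": {"control_hz": 30},
    "task_instruction": "pick up the package",
    "success": False,
    "license_id": "cc-by-4.0",
}
print(missing_components(episode))
```

In this example the episode lacks action logs and a consent record, so it would be usable at most as perception pretraining data, and only after the consent question is resolved.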

The 2027 robotics data marketplace may not look like "buy a zip file". It may look more like "buy rights to use a distribution":

Package handling data:
  robot: dual-arm mobile manipulator
  environment: warehouse conveyor
  episodes: 50,000
  modalities: front camera, wrist camera, joint state, gripper force
  labels: barcode visible, grasp success, reorientation success
  license: commercial training allowed, no redistribution
  privacy: no faces, no customer labels, sanitized backgrounds
  eval: 2,000 held-out episodes across unseen packages

The key word is provenance. Buyers will not only ask "how many hours of video?" They will ask "who owns the rights?", "did operators consent?", "does this include customer data?", "can it train a foundation model?", "can we sell the checkpoint?", and "does the license force us to share derivatives?"

2027 forecast: open wins tooling, closed wins deployment

I do not think one side wins everything in 2027. The more likely outcome is a layered landscape:

| Layer | Likely 2027 advantage |
| --- | --- |
| Tooling, formats, visualizers, dataloaders | Open ecosystem, especially LeRobot/Hugging Face and similar standards. |
| Foundation model research | Open-weight models and research datasets, but commercial licenses remain fragmented. |
| Production manipulation in a specific domain | Closed fleet data from companies deploying real robots. |
| Human video pretraining | Players with large partnerships and clear consent. |
| Enterprise robotics | Hybrid: open tooling, commercial-safe pretraining, customer-specific finetuning. |

In 2027, "we have a VLA model" will not be enough. The real questions will be:

What data was this model trained on?
Does the license allow commercial deployment?
Has the robot seen a distribution like the customer's workflow?
When the robot fails, does failure data return to the training loop?
Can the team prove lineage to legal and procurement teams?

That turns data governance into a core engineering capability. A serious robotics team needs more than ML engineers and controls engineers. It needs data engineers who understand Parquet, video, and time sync; ML ops engineers who understand checkpoint lineage; product engineers who understand customer workflows; and legal/compliance support early enough to avoid building on unusable data.

A practical strategy for small teams

If you are a startup or lab, do not try to copy Tesla. You do not have an Optimus fleet. Also do not download AgiBot and assume you have a data moat. Your advantage is narrower: choose a focused domain, understand the local environment, collect data close to the real task, and use open tooling to move quickly.

If you need a more hands-on starting point, read the LeRobot Humanoid $2500 guide and the GR00T N1 + G1 data collection guide. Those posts go deeper into setup; this one focuses on usage rights and data moat strategy.

A reasonable strategy:

| Stage | Open assets to use | Proprietary data to collect | License posture |
| --- | --- | --- | --- |
| Learn the pipeline | LeRobot, Open X-Embodiment samples, AgiBot for research | 20-50 toy-task episodes | Do not commercialize the research checkpoint. |
| Product prototype | LeRobotDataset format, commercial-safe open-weight models | 100-500 episodes from the real task | Use only commercial-safe data for sales demos. |
| Customer pilot | Open tooling and clean pretrained models | Customer-site data under contract | Do not merge customer data into the general model unless rights allow it. |
| Scale | Marketplace data or data partnerships | Fleet failure and intervention data | Lineage and audit are mandatory. |

Put simply:

Use open to learn fast.
Use closed to build a moat.
Use commercial-safe data to ship.
Use a hybrid strategy to survive.

Series conclusion: who owns humanoid robot data?

After seven posts, the answer is not a single company.

AgiBot owns one of the largest and most useful research datasets for the community, but its NonCommercial license limits product paths. Open X-Embodiment owns the cross-embodiment mixture idea and a unification layer, but users must audit each subset. Ego4D shows that human video can be a huge pretraining asset, but access runs through a strict license agreement. LeRobot does not own all robot data, but it may own the standardization layer that makes robot data easier to share. Tesla, Figure, and other deployment-focused humanoid companies own the hardest thing to publish: failure distributions from real robots in real environments.

So the 2027 winner may not be the team with the most terabytes. The winner is the team that combines four things:

| Ingredient | Why it matters |
| --- | --- |
| Data from the right distribution | The robot must learn the work customers actually need. |
| Clean usage rights | Enterprise robotics cannot scale on blind license risk. |
| Deploy-feedback loop | New failure data must return to the model quickly. |
| Good open standards | Shared tooling lowers cost and attracts community. |

The humanoid data war is not simply "open versus closed". It is a contest between teams that understand data as a product asset and teams that treat data as training files. Open makes the industry learn faster. Closed helps companies create durable advantage. Licenses decide whether that advantage can become a real product.

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
