After two theory posts on MDP formulation and reward engineering, it is time to get our hands dirty with real training. In this post, we will train a complete walking policy for the Unitree G1 — from zero to walking robot — using PPO in Isaac Lab.
## Unitree G1 — Technical Specifications
The Unitree G1 is a compact humanoid well-suited for research and practical applications:
| Parameter | Value |
|---|---|
| Height | 1.27 m |
| Weight | 35 kg |
| Total DOF | 23 (legs 12, arms 10, waist 1) |
| Leg DOF (per side) | 6 (hip roll/pitch/yaw, knee, ankle roll/pitch) |
| Max speed | ~2 m/s |
| Battery life | ~2 hours |
| Onboard compute | Jetson Orin NX |
| Price | ~$16,000 USD |
Compared to the H1 (1.8 m, 47 kg, 19 DOF), the G1 is smaller and easier to experiment with. But its lower center of mass also means different dynamics, much as children and adults walk differently.
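The inverted-pendulum model of walking makes this concrete: the natural frequency scales as sqrt(g/h), so a lower center of mass "swings" faster and favors a higher step cadence. A quick sketch below; the CoM heights (~0.72 m for the G1, ~1.0 m for the H1) are rough assumptions for illustration, not official figures.

```python
import math

def pendulum_step_frequency(com_height_m, g=9.81):
    """Natural frequency (Hz) of an inverted-pendulum model of walking."""
    omega = math.sqrt(g / com_height_m)  # rad/s
    return omega / (2 * math.pi)

# Assumed CoM heights: ~0.72 m (G1) vs ~1.0 m (H1)
f_g1 = pendulum_step_frequency(0.72)
f_h1 = pendulum_step_frequency(1.0)
print(f"G1: {f_g1:.2f} Hz, H1: {f_h1:.2f} Hz")  # the shorter pendulum swings faster
```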
## Step 1: Load the G1 Model (USD) into Isaac Lab
### Robot Model Configuration

```python
"""
Step 1: Configure Unitree G1 in Isaac Lab.
File: g1_env_cfg.py
"""
import omni.isaac.lab.sim as sim_utils
from omni.isaac.lab.actuators import DCMotorCfg
from omni.isaac.lab.assets import ArticulationCfg
from omni.isaac.lab.envs import ManagerBasedRLEnvCfg
from omni.isaac.lab.managers import ObservationGroupCfg, ObservationTermCfg
from omni.isaac.lab.managers import RewardTermCfg, TerminationTermCfg
from omni.isaac.lab.scene import InteractiveSceneCfg
from omni.isaac.lab.terrains import TerrainImporterCfg
from omni.isaac.lab.utils import configclass


@configclass
class G1RobotCfg:
    """Unitree G1 robot configuration."""

    robot = ArticulationCfg(
        prim_path="/World/envs/env_.*/Robot",
        spawn=sim_utils.UsdFileCfg(
            usd_path="datasets/robots/unitree/g1/g1.usd",
            activate_contact_sensors=True,
        ),
        init_state=ArticulationCfg.InitialStateCfg(
            pos=(0.0, 0.0, 0.75),
            joint_pos={
                # === Left leg ===
                "left_hip_yaw": 0.0,
                "left_hip_roll": 0.0,
                "left_hip_pitch": -0.15,
                "left_knee": 0.35,
                "left_ankle_pitch": -0.20,
                "left_ankle_roll": 0.0,
                # === Right leg ===
                "right_hip_yaw": 0.0,
                "right_hip_roll": 0.0,
                "right_hip_pitch": -0.15,
                "right_knee": 0.35,
                "right_ankle_pitch": -0.20,
                "right_ankle_roll": 0.0,
                # Arms: slightly bent
                ".*shoulder.*": 0.0,
                ".*elbow.*": 0.3,
            },
        ),
        actuators={
            "legs": DCMotorCfg(
                joint_names_expr=[".*hip.*", ".*knee.*", ".*ankle.*"],
                stiffness={
                    ".*hip.*": 150.0,
                    ".*knee.*": 200.0,
                    ".*ankle.*": 40.0,
                },
                damping={
                    ".*hip.*": 5.0,
                    ".*knee.*": 8.0,
                    ".*ankle.*": 2.0,
                },
                effort_limit=100.0,
            ),
            "arms": DCMotorCfg(
                joint_names_expr=[".*shoulder.*", ".*elbow.*"],
                stiffness=80.0,
                damping=4.0,
                effort_limit=50.0,
            ),
        },
    )
```
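The stiffness and damping entries are PD gains: each actuator converts a position target into torque as tau = Kp * (q_des - q) - Kd * qdot, saturated at the effort limit. A minimal sketch of that law (a deliberate simplification of the simulator's DC-motor model, which additionally caps torque as a function of joint speed):

```python
def pd_torque(q_des, q, qd, kp, kd, effort_limit):
    """PD position control with torque saturation (simplified actuator model)."""
    tau = kp * (q_des - q) - kd * qd
    return max(-effort_limit, min(effort_limit, tau))

# Knee joint (kp=200, kd=8, limit=100 N*m) tracking a 0.25 rad target from rest
print(pd_torque(0.25, 0.0, 0.0, kp=200.0, kd=8.0, effort_limit=100.0))  # 50.0
# A large error saturates at the effort limit
print(pd_torque(1.0, 0.0, 0.0, kp=200.0, kd=8.0, effort_limit=100.0))   # 100.0
```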
### Full Environment Configuration

```python
@configclass
class G1FlatWalkEnvCfg(ManagerBasedRLEnvCfg):
    """Complete environment for G1 flat-ground walking."""

    # === Simulation ===
    sim = sim_utils.SimulationCfg(
        dt=0.005,  # 200 Hz physics
        render_interval=4,
        gravity=(0.0, 0.0, -9.81),
    )

    # === Scene ===
    scene = InteractiveSceneCfg(
        num_envs=4096,  # 4096 parallel environments
        env_spacing=2.5,
    )

    # === Terrain (flat ground) ===
    terrain = TerrainImporterCfg(
        prim_path="/World/ground",
        terrain_type="plane",
        physics_material=sim_utils.RigidBodyMaterialCfg(
            static_friction=1.0,
            dynamic_friction=1.0,
            restitution=0.0,
        ),
    )

    # === Robot ===
    robot = G1RobotCfg.robot

    # === Observations ===
    observations = ObservationGroupCfg(
        policy=ObservationGroupCfg(
            concatenate_terms=True,
            terms={
                "base_ang_vel": ObservationTermCfg(
                    func="base_ang_vel", noise={"mean": 0, "std": 0.05}
                ),
                "projected_gravity": ObservationTermCfg(
                    func="projected_gravity",
                ),
                "velocity_commands": ObservationTermCfg(
                    func="generated_commands",
                    params={"command_name": "base_velocity"},
                ),
                "joint_pos": ObservationTermCfg(
                    func="joint_pos_rel",
                    noise={"mean": 0, "std": 0.01},
                ),
                "joint_vel": ObservationTermCfg(
                    func="joint_vel_rel",
                    noise={"mean": 0, "std": 0.1},
                    scale=0.05,
                ),
                "actions": ObservationTermCfg(
                    func="last_action",
                ),
            },
        ),
    )

    # === Control ===
    decimation = 4  # Policy runs at 50 Hz (200 Hz / 4)
    episode_length_s = 20.0

    # === Commands ===
    commands = {
        "base_velocity": {
            "ranges": {
                "lin_vel_x": (-0.5, 1.0),  # Forward/backward (m/s)
                "lin_vel_y": (-0.3, 0.3),  # Lateral (m/s)
                "ang_vel_z": (-0.5, 0.5),  # Yaw rate (rad/s)
            },
            "resampling_time": 10.0,
        },
    }
```
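As a sanity check, the policy's observation width implied by this config can be tallied term by term. The sketch assumes all 23 joints appear in joint_pos, joint_vel, and actions; adjust the counts if only a subset is actuated:

```python
# Per-term observation sizes implied by the config above (assumed 23 actuated joints)
obs_terms = {
    "base_ang_vel": 3,        # roll/pitch/yaw rates
    "projected_gravity": 3,   # gravity direction in the base frame
    "velocity_commands": 3,   # (lin_vel_x, lin_vel_y, ang_vel_z)
    "joint_pos": 23,
    "joint_vel": 23,
    "actions": 23,            # previous action
}
obs_dim = sum(obs_terms.values())
print(obs_dim)  # 78
```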
## Step 2: Configure PPO Hyperparameters
PPO (Proximal Policy Optimization) is the most widely used algorithm for locomotion due to its stability and efficiency:
"""
Step 2: PPO training configuration.
File: g1_ppo_cfg.py
"""
class G1PPORunnerCfg:
"""PPO configuration for G1 walking."""
# === Algorithm ===
algorithm = {
"class_name": "PPO",
"clip_param": 0.2, # PPO clip range
"entropy_coef": 0.01, # Encourage exploration
"gamma": 0.99, # Discount factor
"lam": 0.95, # GAE lambda
"learning_rate": 1e-3, # Initial LR
"lr_schedule": "adaptive", # Reduce LR when KL divergence is high
"desired_kl": 0.01,
"max_grad_norm": 1.0,
"num_learning_epochs": 5, # Epochs per update
"num_mini_batches": 4,
"value_loss_coef": 1.0,
}
# === Policy network ===
policy = {
"class_name": "ActorCritic",
"activation": "elu",
"actor_hidden_dims": [512, 256, 128],
"critic_hidden_dims": [512, 256, 128],
"init_noise_std": 1.0,
}
# === Runner ===
runner = {
"num_steps_per_env": 24, # Steps per env per update
"max_iterations": 5000, # Total iterations
"save_interval": 500,
"log_interval": 10,
}
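These numbers fix the batch arithmetic of each PPO update, which is worth working out once:

```python
num_envs = 4096
num_steps_per_env = 24
num_mini_batches = 4
num_learning_epochs = 5

batch_size = num_envs * num_steps_per_env            # transitions collected per update
mini_batch_size = batch_size // num_mini_batches     # transitions per gradient step
grad_steps = num_learning_epochs * num_mini_batches  # gradient steps per update
print(batch_size, mini_batch_size, grad_steps)  # 98304 24576 20
```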
### Key Hyperparameter Explanations
| Parameter | Value | Why? |
|---|---|---|
| `num_envs` | 4096 | Enough data for stable gradients. Increase to 8192 if VRAM allows |
| `clip_param` | 0.2 | Standard PPO value. Reduce to 0.1 if training is unstable |
| `entropy_coef` | 0.01 | Too high = random actions; too low = local optima |
| `learning_rate` | 1e-3 | Adaptive schedule auto-reduces when needed |
| `actor_hidden_dims` | [512, 256, 128] | Enough capacity for 23 DOF. Too small = underfitting |
| `num_steps_per_env` | 24 | ~0.48 s of simulated time per update. Increase to 48 for long-horizon tasks |
| `init_noise_std` | 1.0 | High initial exploration, auto-decays |
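The adaptive schedule is worth unpacking: rsl_rl-style trainers adjust the learning rate from the measured KL divergence between the old and new policy. The sketch below mirrors that heuristic; the exact thresholds and bounds here are typical defaults and may differ in your version:

```python
def adaptive_lr(lr, kl, desired_kl=0.01, lr_min=1e-5, lr_max=1e-2):
    """KL-based learning-rate heuristic (rsl_rl-style)."""
    if kl > desired_kl * 2.0:          # policy moved too far: slow down
        lr = max(lr_min, lr / 1.5)
    elif 0.0 < kl < desired_kl / 2.0:  # policy barely moved: speed up
        lr = min(lr_max, lr * 1.5)
    return lr

print(adaptive_lr(1e-3, kl=0.05))   # KL too high, LR shrinks
print(adaptive_lr(1e-3, kl=0.001))  # KL tiny, LR grows
```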
## Step 3: Reward Function for G1
Applying knowledge from the reward engineering post:
"""
Step 3: Reward terms for G1 walking.
File: g1_rewards.py
"""
import torch
from omni.isaac.lab.managers import RewardTermCfg
class G1RewardsCfg:
"""Reward configuration for G1 flat walking."""
# === Primary tracking ===
lin_vel_xy_tracking = RewardTermCfg(
func="track_lin_vel_xy_exp",
weight=1.5,
params={"std": 0.25, "command_name": "base_velocity"},
)
ang_vel_z_tracking = RewardTermCfg(
func="track_ang_vel_z_exp",
weight=0.8,
params={"std": 0.25, "command_name": "base_velocity"},
)
# === Posture ===
flat_orientation = RewardTermCfg(
func="flat_orientation_l2", weight=-1.0,
)
base_height = RewardTermCfg(
func="base_height_l2",
weight=-0.5,
params={"target_height": 0.72},
)
# === Regularization ===
action_rate = RewardTermCfg(
func="action_rate_l2", weight=-0.01,
)
joint_torques = RewardTermCfg(
func="joint_torques_l2", weight=-5e-5,
)
joint_accel = RewardTermCfg(
func="joint_accel_l2", weight=-1e-4,
)
# === Gait ===
feet_air_time = RewardTermCfg(
func="feet_air_time",
weight=0.2,
params={"threshold": 0.3, "sensor_cfg": "contact_forces"},
)
undesired_contacts = RewardTermCfg(
func="undesired_contacts",
weight=-1.0,
params={
"sensor_cfg": "contact_forces",
"threshold": 1.0,
"body_names": [".*knee.*", ".*torso.*", ".*hip.*"],
},
)
# === Termination ===
termination_penalty = RewardTermCfg(
func="is_terminated", weight=-200.0,
)
# === Velocity penalty ===
lin_vel_z = RewardTermCfg(
func="lin_vel_z_l2", weight=-0.5,
)
ang_vel_xy = RewardTermCfg(
func="ang_vel_xy_l2", weight=-0.05,
)
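Both tracking terms use an exponential kernel: the raw reward is 1 at zero error and decays smoothly as the squared error grows relative to std. A scalar sketch of that shape (written to match the usual track_lin_vel_xy_exp form; verify the exact normalization against your Isaac Lab version):

```python
import math

def tracking_reward_exp(squared_error, std=0.25):
    """Exponential tracking kernel: 1.0 at zero error, decays with error^2 / std^2."""
    return math.exp(-squared_error / std**2)

print(tracking_reward_exp(0.0))             # 1.0 -- perfect tracking
print(round(tracking_reward_exp(0.25), 3))  # 0.018 -- a 0.5 m/s error earns almost nothing
```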
## Step 4: Run Training
### Training Commands

```bash
# Training on an RTX 4090 takes approximately 45-60 minutes
cd IsaacLab
python source/standalone/workflows/rsl_rl/train.py \
    --task=Isaac-Velocity-Flat-G1-v0 \
    --num_envs=4096 \
    --max_iterations=5000 \
    --headless

# With WandB logging
python source/standalone/workflows/rsl_rl/train.py \
    --task=Isaac-Velocity-Flat-G1-v0 \
    --num_envs=4096 \
    --max_iterations=5000 \
    --headless \
    --logger wandb \
    --wandb_project g1-locomotion
```
### Typical Training Schedule

- Iterations 0-200: robot falls constantly, reward ~ -100
- Iterations 200-500: starts standing, reward ~ -20 to 0
- Iterations 500-1500: first steps! Reward increases to 5-10
- Iterations 1500-3000: stable walking, reward ~ 10-15
- Iterations 3000-5000: fine-tuning, reward ~ 15-20 (plateau)
### Monitoring Training

```python
"""
Script to monitor training progress.
File: monitor_training.py
"""
import matplotlib.pyplot as plt

# (metric key, panel title) pairs to plot
PANELS = [
    ("reward/total", "Total Reward"),
    ("reward/lin_vel_tracking", "Velocity Tracking Reward"),
    ("diagnostics/mean_episode_length", "Episode Length (s)"),  # longer = robot doesn't fall
    ("loss/policy", "Policy Loss"),
    ("loss/value", "Value Loss"),
    ("reward/termination", "Termination Rate"),
]


def load_metrics(log_dir):
    """Read logged scalars for each metric key.

    Left as a stub: rsl_rl writes TensorBoard event files, so fill this in
    with your logger's reader (e.g. tensorboard's EventAccumulator).
    """
    return {key: [] for key, _ in PANELS}


def plot_training_curves(log_dir):
    """Plot key metrics during training."""
    metrics = load_metrics(log_dir)
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    for ax, (key, title) in zip(axes.flat, PANELS):
        ax.set_title(title)
        ax.set_xlabel("Iteration")
        ax.plot(metrics[key])
    plt.tight_layout()
    plt.savefig("training_curves.png", dpi=150)
    print("Saved training_curves.png")


if __name__ == "__main__":
    plot_training_curves("logs/g1_flat_walk")
```
## Step 5: Evaluate the Policy
### Running the Trained Policy

```bash
# Evaluate with rendering
python source/standalone/workflows/rsl_rl/play.py \
    --task=Isaac-Velocity-Flat-G1-v0 \
    --num_envs=16 \
    --checkpoint=logs/g1_flat_walk/model_5000.pt

# Record video
python source/standalone/workflows/rsl_rl/play.py \
    --task=Isaac-Velocity-Flat-G1-v0 \
    --num_envs=1 \
    --checkpoint=logs/g1_flat_walk/model_5000.pt \
    --video \
    --video_length=300
```
### Quantitative Evaluation

```python
"""
Evaluate walking policy quality.
File: evaluate_g1.py
"""
import numpy as np
import torch

G1_MASS_KG = 35.0


class G1PolicyEvaluator:
    """Evaluate walking policy on multiple criteria."""

    def __init__(self, env, policy, num_eval_steps=1000):
        self.env = env
        self.policy = policy
        self.num_eval_steps = num_eval_steps

    def evaluate(self):
        """Run full evaluation."""
        metrics = {
            "velocity_tracking_error": [],
            "energy_per_meter": [],
            "gait_symmetry": [],
        }
        obs = self.env.reset()
        for step in range(self.num_eval_steps):
            action = self.policy(obs)
            obs, reward, done, info = self.env.step(action)

            # 1. Velocity tracking error (commanded vs. actual forward speed)
            cmd_vel = info["velocity_command"][:, 0]
            actual_vel = info["base_lin_vel"][:, 0]
            vel_error = torch.abs(cmd_vel - actual_vel).mean().item()
            metrics["velocity_tracking_error"].append(vel_error)

            # 2. Energy consumption (Cost of Transport: P / (m * g * v))
            torques = info["applied_torques"]
            joint_vel = info["joint_velocities"]
            power = torch.sum(torch.abs(torques * joint_vel), dim=1)
            velocity = torch.norm(info["base_lin_vel"][:, :2], dim=1)
            cot = power / (G1_MASS_KG * 9.81 * torch.clamp(velocity, min=0.1))
            metrics["energy_per_meter"].append(cot.mean().item())

            # 3. Gait symmetry (left/right contact times should match)
            left_phase = info["left_foot_contact_time"]
            right_phase = info["right_foot_contact_time"]
            symmetry = 1.0 - torch.abs(left_phase - right_phase).mean().item()
            metrics["gait_symmetry"].append(symmetry)

        # Aggregate results
        results = {
            key: {"mean": np.mean(values), "std": np.std(values)}
            for key, values in metrics.items()
        }

        # Print report
        print("=" * 50)
        print("G1 Walking Policy Evaluation")
        print("=" * 50)
        print(f"Vel tracking error: {results['velocity_tracking_error']['mean']:.3f} "
              f"+/- {results['velocity_tracking_error']['std']:.3f} m/s")
        print(f"Cost of Transport: {results['energy_per_meter']['mean']:.2f} "
              f"+/- {results['energy_per_meter']['std']:.2f}")
        print(f"Gait symmetry: {results['gait_symmetry']['mean']:.3f}")
        print("(Human CoT ~ 0.2, good robot CoT < 1.0)")
        return results
```
## Tips for Stable Training
### 1. Observation Normalization (REQUIRED)

```python
# In config
normalize_observations = True
clip_observations = 100.0
```
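Under the hood, observation normalization keeps running mean/variance statistics and clips the standardized result. A minimal scalar sketch of the idea (real implementations vectorize this across observation dimensions and environments):

```python
class RunningNorm:
    """Running mean/variance normalizer (Welford's algorithm) with clipping."""

    def __init__(self, eps=1e-8, clip=100.0):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.eps, self.clip = eps, clip

    def update(self, x):
        """Fold one sample into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        """Standardize x with the current statistics, then clip."""
        var = self.m2 / max(self.n - 1, 1)
        z = (x - self.mean) / (var + self.eps) ** 0.5
        return max(-self.clip, min(self.clip, z))

norm = RunningNorm()
for v in [1.0, 2.0, 3.0, 4.0, 5.0]:
    norm.update(v)
print(norm.mean)            # 3.0
print(norm.normalize(3.0))  # 0.0 (a sample at the mean)
```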
### 2. Reward Scaling

```python
# Clip rewards to avoid outliers
clip_rewards = True
max_reward = 50.0
min_reward = -200.0
```
### 3. Action Smoothing Penalty

```python
# If the robot is jerky, increase the action_rate weight
action_rate_weight = -0.01  # Default
# If still jerky:
action_rate_weight = -0.05  # Increase 5x
```
### 4. Gradient Clipping

```python
max_grad_norm = 1.0  # Clip gradient L2 norm
# Prevents gradient explosion when reward has spikes
```
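Clipping by global norm rescales all gradients together whenever their combined L2 norm exceeds the threshold; in PyTorch this is what torch.nn.utils.clip_grad_norm_ does. A scalar sketch of the operation:

```python
def clip_grad_norm(grads, max_norm=1.0):
    """Rescale gradients so their global L2 norm is at most max_norm."""
    total_norm = sum(g * g for g in grads) ** 0.5
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

clipped, norm = clip_grad_norm([3.0, 4.0], max_norm=1.0)
print(norm)     # 5.0
print(clipped)  # rescaled so the new norm is 1.0
```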
### 5. Early Termination Conditions

```python
class G1TerminationsCfg:
    """Episode termination conditions."""

    # Fall detection: torso or head touching the ground
    base_contact = TerminationTermCfg(
        func="illegal_contact",
        params={
            "sensor_cfg": "contact_forces",
            "threshold": 1.0,
            "body_names": ["torso", ".*head.*"],
        },
    )

    # Excessive tilting (> 60 degrees)
    bad_orientation = TerminationTermCfg(
        func="bad_orientation",
        params={"limit_angle": 1.05},  # ~60 degrees in radians
    )

    # Timeout
    time_out = TerminationTermCfg(
        func="time_out",
        time_out=True,  # Don't apply the termination penalty for timeouts
    )
```
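The bad_orientation limit can be read straight off the projected-gravity observation: when the base is upright, the gravity direction in the base frame is (0, 0, -1), and the tilt angle is the arccosine of the negated z-component. A small sketch:

```python
import math

def tilt_angle(projected_gravity_z):
    """Base tilt (rad) from the z-component of gravity in the base frame."""
    return math.acos(max(-1.0, min(1.0, -projected_gravity_z)))

print(round(tilt_angle(-1.0), 3))  # 0.0 -- perfectly upright
print(round(tilt_angle(-0.5), 3))  # 1.047 -- ~60 degrees, right at the limit
```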
For more on Isaac Lab, see Isaac Lab: GPU-Accelerated Simulation. For traditional humanoid control, see Humanoid Control Methods.
## Summary
In this post, we completed the full training pipeline for the Unitree G1:
- Loaded the G1 model (USD) with appropriate joint configuration and PD gains
- Configured PPO with 4096 envs, adaptive learning rate
- Designed 12 reward terms balancing tracking, posture, and smoothness
- Trained for ~1 hour on RTX 4090, achieving stable walking
- Evaluated velocity tracking, energy efficiency, gait symmetry
Typical results: velocity tracking error < 0.15 m/s, Cost of Transport < 0.8, gait symmetry > 0.9.
Next post — Unitree G1: Terrain Adaptation — extends the policy to complex terrains: stairs, slopes, and rough ground.
## References
- Humanoid-Gym: Reinforcement Learning for Humanoid Robot with Zero-Shot Sim-to-Real Transfer — Gu et al., 2024
- Proximal Policy Optimization Algorithms — Schulman et al., 2017
- Isaac Lab: A Unified Framework for Robot Learning — Mittal et al., 2025
- Learning Agile and Dynamic Motor Skills for Legged Robots — Hwangbo et al., Science Robotics, 2019