Why K3s for Edge Robot Fleet?
Managing Docker on a single edge device is simple, but once your fleet scales to 50, 100, or 500 robots you need an orchestrator. K3s is a lightweight Kubernetes distribution developed by Rancher (now part of SUSE) specifically for edge and IoT. The binary is only ~70MB and runs comfortably on ARM64 with 512MB of RAM, which makes it a good fit for robots built on a Jetson Nano, Raspberry Pi, or any other single-board computer.
Unlike full Kubernetes, K3s strips out components that edge deployments rarely need (in-tree cloud controllers, heavy storage drivers) and replaces etcd with SQLite or a much lighter embedded etcd. The result: you get the full Kubernetes API without needing beefy servers.
Architecture: Control Plane + K3s Agents
The deployment model for a robot fleet has two layers:
┌──────────────────────────────────────────────┐
│ CONTROL PLANE (Cloud/Server) │
│ ┌────────────┐ ┌──────────┐ ┌─────────────┐│
│ │ K3s Server │ │ FluxCD │ │ Prometheus ││
│ │ (API) │ │ (GitOps) │ │ + Grafana ││
│ └────────────┘ └──────────┘ └─────────────┘│
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────── Tailscale VPN Mesh ──────┐ │
└──────┼──────────────────────────────────┼────┘
│ │
┌──────▼──────┐ ┌──────────────┐ ┌─────▼───────┐
│ Robot #1 │ │ Robot #2 │ │ Robot #N │
│ K3s Agent │ │ K3s Agent │ │ K3s Agent │
│ ARM64 │ │ ARM64 │ │ ARM64 │
│ Jetson │ │ RPi 4 │ │ Jetson │
└─────────────┘ └──────────────┘ └─────────────┘
The control plane runs on a cloud server (an OCI free-tier ARM64 instance works), hosting the K3s server, the GitOps controller, and the monitoring stack. Each robot runs the K3s agent and joins the cluster automatically over the VPN mesh.
Installing K3s Server and Agent
K3s Server (on cloud)
# Install K3s server with embedded etcd
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server \
  --cluster-init \
  --tls-san=k3s.vnrobo.com \
  --disable=traefik \
  --write-kubeconfig-mode=644" sh -
# Get token for agents to join
cat /var/lib/rancher/k3s/server/node-token
K3s Agent (on each robot)
# Install K3s agent — just one command
curl -sfL https://get.k3s.io | K3S_URL="https://k3s.vnrobo.com:6443" \
  K3S_TOKEN="<server-token>" \
  INSTALL_K3S_EXEC="agent \
    --node-label=robot-type=welding \
    --node-label=factory=hanoi-01 \
    --node-label=zone=production" sh -
Labels let you target deployments to specific robot groups — for example, update only welding robots at Hanoi factory.
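The selection semantics behind this are simple subset matching, which can be illustrated with a few lines of plain Python (this sketches the nodeSelector rule, not the real Kubernetes scheduler; the robot names and labels are hypothetical):

```python
# Illustration of Kubernetes nodeSelector semantics: a pod is eligible for a
# node only if every selector key/value pair appears in the node's label map.
def matches(node_labels: dict, selector: dict) -> bool:
    """True if every selector pair is present in the node's labels."""
    return all(node_labels.get(k) == v for k, v in selector.items())

# Hypothetical fleet inventory (node name -> labels set at agent install time)
fleet = {
    "robot-hanoi-01": {"robot-type": "welding", "factory": "hanoi-01", "zone": "production"},
    "robot-hanoi-02": {"robot-type": "painting", "factory": "hanoi-01", "zone": "production"},
    "robot-danang-01": {"robot-type": "welding", "factory": "danang-01", "zone": "staging"},
}

# Target only welding robots at the Hanoi factory:
selector = {"robot-type": "welding", "factory": "hanoi-01"}
targets = [name for name, labels in fleet.items() if matches(labels, selector)]
print(targets)  # ['robot-hanoi-01']
```

Adding a key to the selector only ever narrows the match, so labels compose naturally: `robot-type` alone targets all welding robots, and `robot-type` plus `factory` targets one site.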
Docker Multi-Stage Build for ARM64
Each robot runs its own containerized application. A multi-stage build keeps images small for the edge:
# === Build stage ===
FROM --platform=$TARGETPLATFORM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
COPY src/ ./src/
# Pre-compile Python files
RUN python -m compileall src/

# === Runtime stage ===
FROM --platform=$TARGETPLATFORM python:3.11-slim
RUN groupadd -r robot && useradd -r -g robot -d /app robot
COPY --from=builder /install /usr/local
COPY --from=builder /app/src /app/src
WORKDIR /app
USER robot
# raise_for_status() makes non-2xx responses fail the health check
HEALTHCHECK --interval=30s --timeout=5s \
  CMD python -c "import requests; requests.get('http://localhost:8080/health', timeout=3).raise_for_status()"
CMD ["python", "src/main.py"]
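The HEALTHCHECK assumes that `src/main.py` serves a `/health` endpoint on port 8080. A minimal stdlib-only sketch of such an endpoint (a real controller would verify hardware and broker state before answering "ok"):

```python
# Minimal /health endpoint matching the Dockerfile HEALTHCHECK (stdlib only).
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep container logs quiet

def serve(port: int = 8080) -> HTTPServer:
    """Start the health server in a daemon thread and return it."""
    server = HTTPServer(("0.0.0.0", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

# Smoke test on an ephemeral port (the container would use 8080):
server = serve(port=0)
url = "http://127.0.0.1:%d/health" % server.server_address[1]
with urllib.request.urlopen(url, timeout=5) as resp:
    status = json.loads(resp.read())["status"]
print(status)  # ok
server.shutdown()
```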
Build multi-architecture and push to registry:
# Create a buildx builder
docker buildx create --name fleet-builder --use

# Build ARM64 + AMD64, push to ghcr.io
docker buildx build \
  --platform linux/arm64,linux/amd64 \
  -t ghcr.io/vnrobo/robot-controller:v2.1.0 \
  -t ghcr.io/vnrobo/robot-controller:latest \
  --push .
The image comes out at ~85MB instead of ~900MB, which saves significant bandwidth during OTA updates over 4G/5G.
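The savings compound across the fleet. A quick back-of-envelope calculation using the figures above (fleet size is illustrative):

```python
# Rough OTA bandwidth saved per fleet-wide update (illustrative fleet size).
fleet_size = 500
slim_mb, fat_mb = 85, 900  # multi-stage vs. naive image size, from above
saved_gb = fleet_size * (fat_mb - slim_mb) / 1024
print(f"~{saved_gb:.0f} GB saved per fleet-wide update")  # ~398 GB
```

On metered 4G/5G links that difference is often what makes frequent updates affordable at all.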
FluxCD: GitOps for Robot Fleet
GitOps means the Git repository is the single source of truth. You push manifests to Git, and FluxCD automatically reconciles the cluster state. No SSH into individual robots, no manual kubectl apply commands.
Installing FluxCD
# Bootstrap FluxCD into the K3s cluster
flux bootstrap github \
  --owner=vnrobo \
  --repository=fleet-config \
  --path=clusters/production \
  --personal
Git Repository Structure
fleet-config/
├── clusters/
│ └── production/
│ ├── flux-system/ # FluxCD components
│ └── kustomization.yaml # Entry point
├── apps/
│ ├── robot-controller/
│ │ ├── deployment.yaml
│ │ ├── service.yaml
│ │ └── kustomization.yaml
│ └── telemetry-agent/
│ ├── daemonset.yaml
│ └── kustomization.yaml
└── infrastructure/
├── monitoring/
│ ├── prometheus.yaml
│ └── grafana.yaml
└── networking/
└── tailscale.yaml
Kubernetes Manifests for Robots
# apps/robot-controller/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: robot-controller
  namespace: fleet
spec:
  replicas: 1  # one pod per Deployment; use a DaemonSet for one pod per robot node
  selector:
    matchLabels:
      app: robot-controller
  template:
    metadata:
      labels:
        app: robot-controller
    spec:
      nodeSelector:
        robot-type: welding  # deploy only to welding robots
      tolerations:
        - key: "edge"
          operator: "Exists"
      containers:
        - name: controller
          image: ghcr.io/vnrobo/robot-controller:v2.1.0
          resources:
            limits:
              memory: "256Mi"
              cpu: "500m"
            requests:
              memory: "128Mi"
              cpu: "250m"
          env:
            - name: ROBOT_ID
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: MQTT_BROKER
              value: "mqtt.vnrobo.com"
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 30
OTA Updates: Rolling and Canary
Rolling Update — Sequential Update
By default, a Kubernetes rolling update replaces pods in batches (25% of replicas at a time). With a robot fleet, you want tighter control:
# apps/robot-controller/deployment.yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1  # at most 1 robot offline at a time
      maxSurge: 0        # no extra pods (edge has no spare resources)
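With `maxSurge: 0` and `maxUnavailable: 1`, the rollout proceeds strictly one pod at a time, so total rollout time grows linearly with fleet size. A quick sketch of that trade-off (the 90-second per-robot figure is an assumed image-pull-plus-readiness time, not a measurement):

```python
import math

# Estimate rollout duration for a strictly bounded rolling update.
# per_robot_s is an assumed time to pull the image and pass readiness.
def rollout_minutes(robots: int, per_robot_s: int = 90, max_unavailable: int = 1) -> float:
    rounds = math.ceil(robots / max_unavailable)
    return rounds * per_robot_s / 60

print(rollout_minutes(100))                      # 150.0 -> 2.5 hours, fully serial
print(rollout_minutes(100, max_unavailable=5))   # 30.0  -> raising the cap trades safety for speed
```

This is why larger fleets often accept a slightly higher `maxUnavailable`, or split the rollout by label (per factory, per zone) instead.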
Canary Deployment — Test on Few Robots First
Use Flagger (a FluxCD add-on) for canary deployments:
# apps/robot-controller/canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: robot-controller
  namespace: fleet
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: robot-controller
  progressDeadlineSeconds: 600
  analysis:
    interval: 60s
    threshold: 3    # 3 failures -> rollback
    iterations: 5   # 5 verification rounds
    metrics:
      - name: error-rate
        thresholdRange:
          max: 1    # rollback if error rate > 1%
        interval: 60s
      - name: latency-p99
        thresholdRange:
          max: 500  # rollback if p99 > 500ms
        interval: 60s
Canary workflow: you push the new image tag to Git, FluxCD detects the change, Flagger deploys it to 10% of robots and monitors metrics; if they stay healthy the rollout continues, and if they degrade it rolls back automatically. Zero human intervention.
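The analysis loop can be approximated in a few lines. This is an illustration of the decision logic configured above (threshold 3, iterations 5), not Flagger's actual implementation:

```python
# Sketch of a canary analysis loop: check metrics each interval; enough
# failed checks trigger rollback, otherwise the canary is promoted.
def analyze(samples, iterations=5, threshold=3,
            max_error_pct=1.0, max_p99_ms=500.0):
    """samples: list of (error_pct, p99_ms) per interval.
    Returns 'promote' or 'rollback'."""
    failures = 0
    for error_pct, p99_ms in samples[:iterations]:
        if error_pct > max_error_pct or p99_ms > max_p99_ms:
            failures += 1
            if failures >= threshold:
                return "rollback"
    return "promote"

# Healthy canary: all 5 rounds within thresholds
print(analyze([(0.2, 120), (0.5, 180), (0.3, 140), (0.4, 200), (0.2, 130)]))  # promote
# Unhealthy canary: third failed check aborts the rollout
print(analyze([(2.5, 120), (3.0, 700), (0.2, 130), (1.8, 140), (0.3, 120)]))  # rollback
```

Note that `threshold` counts failed checks, not metric values: transient spikes in one round do not kill the rollout, but repeated failures do.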
Monitoring: Prometheus + Grafana on Edge
Each robot exposes metrics via a /metrics endpoint; Prometheus on the control plane scrapes them over the VPN:
# infrastructure/monitoring/prometheus-scrape.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: robot-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: robot-controller
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
  namespaceSelector:
    matchNames:
      - fleet
Key metrics to collect from robots:
| Metric | Description | Alert Threshold |
|---|---|---|
| `robot_cpu_temp` | CPU temperature | > 80°C |
| `robot_battery_pct` | Battery remaining | < 20% |
| `robot_task_latency_ms` | Processing latency | > 200ms (p95) |
| `robot_connection_status` | Connection status | == 0 (offline) |
| `robot_error_count` | Cumulative errors | > 10/min |
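On the robot side, these metrics are served in Prometheus' text exposition format. A stdlib-only sketch of the rendering (in production the `prometheus_client` library is the usual choice; the readings here are placeholder values, not real sensor data):

```python
# Render fleet metrics in Prometheus text exposition format.
# Real code would read sensors; these values are placeholders.
def render_metrics(readings: dict) -> str:
    lines = []
    for name, (value, help_text) in readings.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

readings = {
    "robot_cpu_temp": (62.5, "CPU temperature in Celsius"),
    "robot_battery_pct": (87.0, "Battery remaining in percent"),
    "robot_task_latency_ms": (142.0, "Task processing latency in milliseconds"),
    "robot_connection_status": (1, "1 = online, 0 = offline"),
    "robot_error_count": (3, "Cumulative error count"),
}
print(render_metrics(readings))
```

Serving this string from the same /metrics HTTP endpoint the ServiceMonitor above scrapes is all Prometheus needs.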
A Grafana dashboard displays the fleet overview: a factory map with real-time robot status, historical performance charts, and alerts on anomalies.
Networking: Tailscale VPN Mesh
Robots in factories typically sit behind NAT and complex firewalls. Tailscale (built on WireGuard) creates a peer-to-peer mesh VPN: each robot connects directly to the control plane without opening any inbound ports on the factory router:
# On each robot — single command
curl -fsSL https://tailscale.com/install.sh | sh
tailscale up --authkey=tskey-auth-xxxxx --hostname=robot-$(hostname)
Advantages over traditional VPN:
- Zero-config firewall: no need to open factory router ports
- Auto-reconnect: robots reconnect automatically after network loss
- Granular ACL: control plane accesses robots, but robots don't access each other
- MagicDNS: reference robots by name (`robot-hanoi-01` instead of an IP)
For a self-hosted setup, Headscale is an open-source alternative to the Tailscale coordination server.
End-to-End Deployment Process
Complete workflow summary from code to robot:
Developer pushes code
│
▼
GitHub Actions: build + test + push ARM64 image
│
▼
Developer updates image tag in fleet-config repo
│
▼
FluxCD detects change (polls every 60s)
│
▼
Flagger canary deploys: 10% of robots
│
▼
Prometheus checks metrics (5 rounds × 60s)
│
├── OK → Rollout 100% fleet
└── FAIL → Auto rollback
The entire process requires no SSH into any robot. You just push code and push manifests, and the system handles the rest.
Conclusion
Docker + K3s + FluxCD is a powerful combination for robot fleet management. K3s brings Kubernetes power to small edge devices, FluxCD makes Git your single source of truth, and Tailscale solves complex factory networking. With canary deployment and automatic monitoring, you can confidently update hundreds of robots without downtime worries.
If your fleet is under 20 robots, Docker Compose + Watchtower may be enough (see Docker for IoT article). But when you scale, K3s is the natural next step — and you won't regret it.
Related Articles
- Kubernetes for Robot Fleet: From Docker to K8s — Overview of Kubernetes architecture for robotics
- Deploying IoT Applications with Docker and Docker Compose — Docker fundamentals for edge devices
- Robot Fleet Management: Monitoring and Coordination — Fleet management overview