
Docker + K3s on Edge: GitOps for Robot Fleet

A guide to deploying Docker and K3s on edge devices: manage, update over the air, and monitor hundreds of robots with a GitOps workflow.

Nguyen Anh Tuan · March 10, 2026 · 7 min read

Why K3s for Edge Robot Fleet?

Managing Docker on a single edge device is simple, but when your fleet scales to 50, 100, or 500 robots, you need an orchestrator. K3s is a lightweight Kubernetes distribution, originally developed by Rancher (now part of SUSE) for edge and IoT use. The binary is only ~70MB and runs on ARM64 devices with as little as 512MB of RAM, which suits robots built on a Jetson Nano, Raspberry Pi, or another single-board computer.

Unlike full K8s, K3s drops components that are unnecessary at the edge (cloud controller, heavy storage drivers) and uses SQLite as its default datastore instead of etcd, with embedded etcd available when you need a multi-server setup. The result: you get the full Kubernetes API without needing beefy servers.

[Figure: Server rack and edge devices in an industrial robot system]

Architecture: Control Plane + K3s Agents

The deployment model for a robot fleet has two layers:

┌──────────────────────────────────────────────┐
│          CONTROL PLANE (Cloud/Server)         │
│  ┌────────────┐ ┌──────────┐ ┌─────────────┐│
│  │ K3s Server │ │ FluxCD   │ │ Prometheus  ││
│  │ (API)      │ │ (GitOps) │ │ + Grafana   ││
│  └────────────┘ └──────────┘ └─────────────┘│
│          │          │              │          │
│          ▼          ▼              ▼          │
│      ┌─────── Tailscale VPN Mesh ──────┐    │
└──────┼──────────────────────────────────┼────┘
       │                                  │
┌──────▼──────┐  ┌──────────────┐  ┌─────▼───────┐
│  Robot #1   │  │  Robot #2    │  │  Robot #N   │
│  K3s Agent  │  │  K3s Agent   │  │  K3s Agent  │
│  ARM64      │  │  ARM64       │  │  ARM64      │
│  Jetson     │  │  RPi 4       │  │  Jetson     │
└─────────────┘  └──────────────┘  └─────────────┘

The control plane runs on a cloud server (an OCI free-tier ARM64 instance could work) and hosts the K3s server, the GitOps controller, and the monitoring stack. Each robot runs a K3s agent that joins the cluster over the VPN mesh.

Installing K3s Server and Agent

K3s Server (on cloud)

# Install K3s server with embedded etcd
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server \
  --cluster-init \
  --tls-san=k3s.vnrobo.com \
  --disable=traefik \
  --write-kubeconfig-mode=644" sh -

# Get token for agents to join
# (the token file is readable by root only)
sudo cat /var/lib/rancher/k3s/server/node-token

K3s Agent (on each robot)

# Install K3s agent — just one command
curl -sfL https://get.k3s.io | K3S_URL="https://k3s.vnrobo.com:6443" \
  K3S_TOKEN="<server-token>" \
  INSTALL_K3S_EXEC="agent \
    --node-label=robot-type=welding \
    --node-label=factory=hanoi-01 \
    --node-label=zone=production" sh -

Labels let you target deployments at specific robot groups, for example updating only the welding robots at the Hanoi factory.
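To make the targeting concrete, here is a minimal Python sketch of how an equality-based label selector picks nodes. The node names, the extra `hcm-02` factory label, and the `select_nodes` helper are hypothetical and exist only for illustration. In a real cluster the API server does this matching, for example via `kubectl get nodes -l robot-type=welding,factory=hanoi-01`.

```python
# Illustrative only: mimics equality-based label selection over an
# in-memory inventory. Node names and the hcm-02 factory are made up.
from typing import Dict, List


def select_nodes(nodes: Dict[str, Dict[str, str]],
                 selector: Dict[str, str]) -> List[str]:
    """Return the names of nodes whose labels contain every selector pair."""
    return sorted(
        name for name, labels in nodes.items()
        if all(labels.get(key) == value for key, value in selector.items())
    )


fleet = {
    "robot-001": {"robot-type": "welding", "factory": "hanoi-01", "zone": "production"},
    "robot-002": {"robot-type": "painting", "factory": "hanoi-01", "zone": "production"},
    "robot-003": {"robot-type": "welding", "factory": "hcm-02", "zone": "production"},
}

# Only welding robots at the Hanoi factory match both pairs.
welding_hanoi = select_nodes(fleet, {"robot-type": "welding", "factory": "hanoi-01"})
print(welding_hanoi)
```

An empty selector matches every node, which mirrors how a manifest without a nodeSelector can be scheduled anywhere.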

Docker Multi-Stage Build for ARM64

Each robot runs its own containerized application. A multi-stage build keeps the image small for edge devices:

# === Build stage ===
FROM --platform=$TARGETPLATFORM python:3.11-slim AS builder

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

COPY src/ ./src/
# Pre-compile Python files
RUN python -m compileall src/

# === Runtime stage ===
FROM --platform=$TARGETPLATFORM python:3.11-slim

RUN groupadd -r robot && useradd -r -g robot -d /app robot
COPY --from=builder /install /usr/local
COPY --from=builder /app/src /app/src

WORKDIR /app
USER robot

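# urlopen raises on connection errors and on 4xx/5xx responses,
# so the check fails whenever the service is unhealthy (stdlib only)
HEALTHCHECK --interval=30s --timeout=5s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health', timeout=3)"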

CMD ["python", "src/main.py"]

Build a multi-architecture image and push it to the registry:

# Create buildx builder
docker buildx create --name fleet-builder --use

# Build ARM64 + AMD64, push to ghcr.io
docker buildx build \
  --platform linux/arm64,linux/amd64 \
  -t ghcr.io/vnrobo/robot-controller:v2.1.0 \
  -t ghcr.io/vnrobo/robot-controller:latest \
  --push .

The resulting image is only ~85MB instead of ~900MB, which saves bandwidth during OTA updates over 4G/5G.
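A rough back-of-the-envelope calculation shows what that difference means at fleet scale. This sketch assumes a hypothetical fleet of 100 robots, that every robot downloads the whole image once per update (no cached layers), and decimal units (1 GB = 1000 MB); real transfers are usually smaller because unchanged layers are reused.

```python
# Rough estimate of cellular data used by one fleet-wide update.
# Assumptions: 100 robots, full image pull per robot, 1 GB = 1000 MB.

def fleet_transfer_gb(image_mb: float, robots: int) -> float:
    """Total data pulled across the fleet for a single image, in GB."""
    return image_mb * robots / 1000


slim = fleet_transfer_gb(85, 100)
large = fleet_transfer_gb(900, 100)

print(f"slim image:  {slim:.1f} GB per update")
print(f"large image: {large:.1f} GB per update")
print(f"saved:       {large - slim:.1f} GB per update")
```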

FluxCD: GitOps for Robot Fleet

GitOps means the Git repository is the single source of truth. You push manifests to Git, and FluxCD automatically reconciles the cluster state to match. There is no need to SSH into individual robots or run kubectl apply by hand.
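The idea of reconciliation can be sketched in a few lines of Python. This is a conceptual illustration, not how FluxCD is implemented: `plan_reconcile`, the resource names, and the version strings are hypothetical, and state is reduced to a name-to-version mapping. In Flux, deletion of resources that vanished from Git only happens when pruning is enabled on the Kustomization.

```python
# Conceptual sketch of a GitOps reconcile step: diff desired state (Git)
# against actual state (cluster) and list the actions needed to converge.
from typing import Dict, List, Tuple


def plan_reconcile(desired: Dict[str, str],
                   actual: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return sorted (action, resource) pairs that make actual match desired."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return sorted(actions)


desired = {"robot-controller": "v2.1.0", "telemetry-agent": "v1.0.0"}
actual = {"robot-controller": "v2.0.0", "old-agent": "v0.9.0"}

plan = plan_reconcile(desired, actual)
for action, resource in plan:
    print(action, resource)
```

When the two mappings are identical the plan is empty, which is the steady state a GitOps controller keeps returning to on every polling interval.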

Installing FluxCD

# Bootstrap FluxCD into the K3s cluster
# (flux reads a GitHub personal access token from the GITHUB_TOKEN environment variable)
flux bootstrap github \
  --owner=vnrobo \
  --repository=fleet-config \
  --path=clusters/production \
  --personal

Git Repository Structure

fleet-config/
├── clusters/
│   └── production/
│       ├── flux-system/         # FluxCD components
│       └── kustomization.yaml   # Entry point
├── apps/
│   ├── robot-controller/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── kustomization.yaml
│   └── telemetry-agent/
│       ├── daemonset.yaml
│       └── kustomization.yaml
└── infrastructure/
    ├── monitoring/
    │   ├── prometheus.yaml
    │   └── grafana.yaml
    └── networking/
        └── tailscale.yaml

Kubernetes Manifests for Robots

# apps/robot-controller/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: robot-controller
  namespace: fleet
spec:
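  # Note: a Deployment runs this many pods cluster-wide, not one per node;
  # use a DaemonSet if every matching robot needs its own pod
  replicas: 1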
  selector:
    matchLabels:
      app: robot-controller
  template:
    metadata:
      labels:
        app: robot-controller
    spec:
      nodeSelector:
        robot-type: welding        # Deploy only to welding robots
      tolerations:
        - key: "edge"
          operator: "Exists"
      containers:
        - name: controller
          image: ghcr.io/vnrobo/robot-controller:v2.1.0
          resources:
            limits:
              memory: "256Mi"
              cpu: "500m"
            requests:
              memory: "128Mi"
              cpu: "250m"
          env:
            - name: ROBOT_ID
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: MQTT_BROKER
              value: "mqtt.vnrobo.com"
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 30
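
For the livenessProbe above to succeed, the container has to answer GET /health on port 8080. The article does not show src/main.py, so the following is only a minimal sketch of such an endpoint using the Python standard library; `HealthHandler`, `make_server`, and the JSON response shape are assumptions made for illustration. It also reads the ROBOT_ID variable that the manifest injects from spec.nodeName.

```python
# Minimal sketch of a /health endpoint (standard library only).
# HealthHandler, make_server, and the response body are hypothetical.
import json
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Injected by the manifest from spec.nodeName; falls back for local runs.
ROBOT_ID = os.environ.get("ROBOT_ID", "unknown-robot")


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok", "robot_id": ROBOT_ID}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep periodic probe requests out of the logs


def make_server(port: int = 8080) -> ThreadingHTTPServer:
    """Create (but do not start) the HTTP server for the probe endpoint."""
    return ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)

# To run it in the container: make_server().serve_forever()
```

A real controller would likely report a failing status when its control loop stalls, so that the kubelet restarts the pod instead of leaving a hung process in place.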

[Figure: Dashboard showing system monitoring and metrics]

OTA Updates: Rolling and Canary

Rolling Update — Sequential Update

By default, a Kubernetes rolling update replaces pods in batches (maxUnavailable and maxSurge both default to 25%). With a robot fleet, you want tighter control:

# apps/robot-controller/deployment.yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1       # Maximum 1 robot offline at a time
      maxSurge: 0             # No new pods (edge has no spare resources)
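
The effect of these two settings can be illustrated with a small Python simulation. It is a simplification under stated assumptions: `rolling_batches` and the robot names are hypothetical, and the real controller also waits for each new pod to become ready before moving on. With maxSurge at 0 no extra pod is started first, so each step takes at most maxUnavailable robots out of service.

```python
# Simplified model of a rolling update with maxSurge = 0: robots are
# updated in consecutive batches of at most max_unavailable.
from typing import Iterator, List


def rolling_batches(robots: List[str],
                    max_unavailable: int = 1) -> Iterator[List[str]]:
    """Yield the groups of robots that are offline together at each step."""
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1 when maxSurge is 0")
    for start in range(0, len(robots), max_unavailable):
        yield robots[start:start + max_unavailable]


robots = ["robot-001", "robot-002", "robot-003"]
batches = list(rolling_batches(robots, max_unavailable=1))
for step, batch in enumerate(batches, start=1):
    print(f"step {step}: updating {batch}")
```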

Canary Deployment — Test on Few Robots First

Use Flagger, the progressive delivery operator from the Flux project, for canary deployments:

# apps/robot-controller/canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: robot-controller
  namespace: fleet
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: robot-controller
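  service:
    port: 8080                # Required by Flagger; matches the containerPort
  progressDeadlineSeconds: 600
  analysis:
    interval: 60s
    threshold: 3              # 3 failed checks → rollback
    iterations: 5             # 5 verification rounds
    metrics:
      - name: error-rate
        templateRef:          # Assumes a MetricTemplate named error-rate exists
          name: error-rate
          namespace: fleet
        thresholdRange:
          max: 1              # Rollback if error > 1%
        interval: 60s
      - name: latency-p99
        templateRef:          # Assumes a MetricTemplate named latency-p99 exists
          name: latency-p99
          namespace: fleet
        thresholdRange:
          max: 500            # Rollback if p99 > 500ms
        interval: 60s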

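Canary workflow: commit the new image tag to Git, FluxCD detects the change, and Flagger rolls the new version out to 10% of the fleet and monitors the metrics. If they stay within the thresholds, the rollout continues; if not, Flagger rolls back automatically, with no human intervention. Note that Flagger weights traffic rather than picking nodes, so a 10% split depends on a supported service mesh or ingress controller.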
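
The decision logic behind the analysis settings can be sketched as follows. This is a simplified model, not Flagger's actual state machine: `run_analysis`, `THRESHOLDS`, and the sampler callbacks are hypothetical, and each round simply compares one metrics sample against the maximums from the manifest (error rate 1%, p99 latency 500 ms). The rollout is rolled back once the number of failed checks reaches the threshold, otherwise it is promoted after the last round.

```python
# Simplified canary analysis: run a fixed number of checks and roll back
# once the count of failed checks reaches the configured threshold.
from typing import Callable, Dict

# Maximum allowed values, taken from the Canary manifest.
THRESHOLDS = {"error-rate": 1.0, "latency-p99": 500.0}


def run_analysis(sample: Callable[[int], Dict[str, float]],
                 iterations: int = 5,
                 threshold: int = 3) -> str:
    """Return "promote" or "rollback" after evaluating the canary."""
    failures = 0
    for round_index in range(iterations):
        metrics = sample(round_index)
        breached = any(metrics[name] > limit for name, limit in THRESHOLDS.items())
        if breached:
            failures += 1
            if failures >= threshold:
                return "rollback"
    return "promote"


# Hypothetical samplers standing in for Prometheus queries.
healthy = lambda i: {"error-rate": 0.2, "latency-p99": 120.0}
broken = lambda i: {"error-rate": 5.0, "latency-p99": 120.0}

print(run_analysis(healthy))
print(run_analysis(broken))
```

Allowing a few failed checks before rolling back keeps a single noisy sample, such as a brief 4G dropout, from aborting an otherwise healthy release.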

Monitoring: Prometheus + Grafana on Edge

Each robot exposes metrics on a /metrics endpoint. Prometheus on the control plane scrapes them over the VPN, configured here with a Prometheus Operator ServiceMonitor:

# infrastructure/monitoring/prometheus-scrape.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: robot-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: robot-controller
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
  namespaceSelector:
    matchNames:
      - fleet
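
On the robot side, the /metrics response is plain text in the Prometheus exposition format. The sketch below renders gauge values by hand to show what the scraper receives; `render_metrics` and the sample values are hypothetical, and it assumes the robot ID contains no quotes or backslashes that would need escaping. In practice the official prometheus_client library is the usual way to produce this output.

```python
# Hand-rolled rendering of gauges in the Prometheus text exposition format.
# render_metrics and the sample values are illustrative assumptions.
from typing import Dict


def render_metrics(robot_id: str, values: Dict[str, float]) -> str:
    """Render each value as a gauge labelled with the robot's ID."""
    lines = []
    for name in sorted(values):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f'{name}{{robot_id="{robot_id}"}} {values[name]}')
    return "\n".join(lines) + "\n"


payload = render_metrics(
    "robot-001",
    {"robot_cpu_temp": 61.5, "robot_battery_pct": 87.0},
)
print(payload, end="")
```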

Key metrics to collect from robots:

Metric                  | Description        | Alert Threshold
robot_cpu_temp          | CPU temperature    | > 80°C
robot_battery_pct       | Battery remaining  | < 20%
robot_task_latency_ms   | Processing latency | > 200ms (p95)
robot_connection_status | Connection status  | == 0 (offline)
robot_error_count       | Cumulative errors  | > 10/min
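
The thresholds in the table translate directly into simple comparisons. The following Python sketch evaluates one metrics sample against them; `ALERT_RULES` and `firing_alerts` are hypothetical names, and robot_error_count is left out because its threshold is a per-minute rate that needs two samples over time rather than a single value. In a real setup these conditions would normally live in Prometheus alerting rules.

```python
# Evaluate a single metrics sample against the alert thresholds above.
# robot_error_count is omitted: its threshold is a rate, not a point value.
import operator
from typing import Dict, List

ALERT_RULES = {
    "robot_cpu_temp": (operator.gt, 80.0),           # > 80°C
    "robot_battery_pct": (operator.lt, 20.0),        # < 20%
    "robot_task_latency_ms": (operator.gt, 200.0),   # > 200 ms (p95)
    "robot_connection_status": (operator.eq, 0.0),   # == 0 (offline)
}


def firing_alerts(sample: Dict[str, float]) -> List[str]:
    """Return the sorted names of metrics that breach their threshold."""
    return sorted(
        name for name, (compare, limit) in ALERT_RULES.items()
        if name in sample and compare(sample[name], limit)
    )


sample = {
    "robot_cpu_temp": 85.0,
    "robot_battery_pct": 50.0,
    "robot_task_latency_ms": 120.0,
    "robot_connection_status": 1.0,
}
print(firing_alerts(sample))
```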

A Grafana dashboard displays the fleet overview: a factory map with real-time robot status, historical performance charts, and alerts for anomalies.

Networking: Tailscale VPN Mesh

Robots in factories typically sit behind NAT and restrictive firewalls. Tailscale (built on WireGuard) creates a peer-to-peer mesh VPN, so each robot can reach the control plane without any inbound ports being opened:

# On each robot — single command
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up --authkey=tskey-auth-xxxxx --hostname=robot-$(hostname)

Advantages over traditional VPN:

For a self-hosted setup, Headscale is an open-source implementation of the Tailscale control server.

End-to-End Deployment Process

The complete workflow, from code to robot:

Developer pushes code
       │
       ▼
GitHub Actions: build + test + push ARM64 image
       │
       ▼
Developer updates image tag in fleet-config repo
       │
       ▼
FluxCD detects change (polls every 60s)
       │
       ▼
Flagger canary deploys: 10% of robots
       │
       ▼
Prometheus checks metrics (5 rounds × 60s)
       │
       ├── OK → Rollout 100% fleet
       └── FAIL → Auto rollback

The entire process requires no SSH access to any robot. You push code and push manifests, and the system handles the rest.
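
The one manual step in the workflow, updating the image tag in the fleet-config repo, is easy to script. Below is a minimal Python sketch that rewrites the tag inside a manifest string; `bump_image_tag` is a hypothetical helper and v2.2.0 is a made-up next version. It assumes the image is referenced by tag (not by digest) and fails loudly if the image is not found, so a typo cannot silently produce an unchanged commit.

```python
# Rewrite the tag of a given image inside a Kubernetes manifest string.
# bump_image_tag and the v2.2.0 tag are illustrative assumptions.
import re


def bump_image_tag(manifest: str, image: str, new_tag: str) -> str:
    """Return the manifest with every `image: <image>:<tag>` set to new_tag."""
    pattern = re.compile(rf"(image:\s*{re.escape(image)}):[\w.\-]+")
    updated, count = pattern.subn(lambda m: f"{m.group(1)}:{new_tag}", manifest)
    if count == 0:
        raise ValueError(f"image {image!r} not found in manifest")
    return updated


manifest = """\
      containers:
        - name: controller
          image: ghcr.io/vnrobo/robot-controller:v2.1.0
"""

updated = bump_image_tag(manifest, "ghcr.io/vnrobo/robot-controller", "v2.2.0")
print(updated, end="")
```

A CI job could run a helper like this and commit the result, leaving the Git history as the audit trail of which version each rollout targeted.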

Conclusion

Docker + K3s + FluxCD is a powerful combination for robot fleet management. K3s brings Kubernetes power to small edge devices, FluxCD makes Git your single source of truth, and Tailscale solves complex factory networking. With canary deployment and automatic monitoring, you can confidently update hundreds of robots without downtime worries.

If your fleet is under 20 robots, Docker Compose + Watchtower may be enough (see Docker for IoT article). But when you scale, K3s is the natural next step — and you won't regret it.

