Production Deployment Guide¶

Warm Cache Volumes¶

The Azure Pipelines agent binary (~550 MB) is downloaded at container startup, which can take several minutes on a cold pod. The operator avoids repeating that download by keeping a pool of pre-provisioned PVCs, each bound exclusively to one pod at a time.

How It Works¶

When cacheVolumes is configured, the operator:

Pre-provisions maxAgents PVCs per cache template at pool creation time, labeled as free
Assigns one PVC per template exclusively to each new agent pod (labeled assigned, pod name recorded)
Releases the PVC back to the free pool when the pod completes

Because each PVC is reused by later pods, the agent cache directory survives pod replacement, and subsequent agent pods start in seconds rather than minutes.
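The three steps above can be modeled as a small state machine. The following is a minimal in-memory sketch, not the operator's implementation: the `Slot` and `CachePool` names, the first-free selection order, and the short PVC names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Slot:
    pvc_name: str
    state: str = "free"              # "free" or "assigned"
    pod_name: Optional[str] = None   # recorded while assigned


@dataclass
class CachePool:
    """Hypothetical stand-in for the PVCs of one cache template."""
    slots: List[Slot] = field(default_factory=list)

    @classmethod
    def provision(cls, pvc_names: List[str]) -> "CachePool":
        # Step 1: create every slot up front, all marked free.
        return cls([Slot(name) for name in pvc_names])

    def assign(self, pod_name: str) -> Optional[str]:
        # Step 2: bind the first free slot exclusively to one pod.
        for slot in self.slots:
            if slot.state == "free":
                slot.state, slot.pod_name = "assigned", pod_name
                return slot.pvc_name
        return None  # pool exhausted

    def release(self, pod_name: str) -> None:
        # Step 3: return the slot to the free pool when the pod completes.
        for slot in self.slots:
            if slot.pod_name == pod_name:
                slot.state, slot.pod_name = "free", None


pool = CachePool.provision(["cache-00", "cache-01"])
first = pool.assign("agent-a")
second = pool.assign("agent-b")
third = pool.assign("agent-c")    # None: both slots are taken
pool.release("agent-a")
reused = pool.assign("agent-d")   # gets the slot that agent-a warmed
```

The last assignment shows why later pods start quickly: `agent-d` receives a volume that already holds whatever `agent-a` downloaded.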

Example configuration:

spec:
  minAgents: 0
  maxAgents: 10

  cacheVolumes:
    - name: agent-cache
      mountPath: /azp/_work
      size: "20Gi"
      # storageClassName: "fast-ssd"  # omit to use the cluster default

    - name: toolcache
      mountPath: /opt/hostedtoolcache
      size: "10Gi"

Each template creates maxAgents PVCs named <pool>-cache-<template>-<slot> (e.g., build-agents-cache-agent-cache-00).
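As an illustration of the naming scheme, the sketch below generates the names for one template. The function name is hypothetical, and the zero-based, two-digit slot suffix is inferred from the single example above rather than confirmed by the operator.

```python
from typing import List


def cache_pvc_names(pool: str, template: str, max_agents: int) -> List[str]:
    # Assumption: slots are numbered from 00 and padded to two digits,
    # as in the example name build-agents-cache-agent-cache-00.
    return [f"{pool}-cache-{template}-{slot:02d}" for slot in range(max_agents)]


names = cache_pvc_names("build-agents", "agent-cache", 10)
```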

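The access mode is ReadWriteOncePod (GA in Kubernetes 1.29), which restricts the volume to a single pod across the whole cluster, so two agents can never mount the same cache. Kubernetes supports this access mode only for CSI volumes.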
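For orientation, a PVC with this access mode has roughly the shape built below. This is a sketch using the standard PersistentVolumeClaim fields; the helper name is hypothetical, and the labels the operator attaches are omitted because their keys are not documented here.

```python
from typing import Optional


def cache_pvc_manifest(name: str, size: str,
                       storage_class: Optional[str] = None) -> dict:
    spec = {
        "accessModes": ["ReadWriteOncePod"],          # one pod, cluster-wide
        "resources": {"requests": {"storage": size}},
    }
    if storage_class is not None:
        # Leaving the field out lets the cluster default class apply.
        spec["storageClassName"] = storage_class
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": spec,
    }


manifest = cache_pvc_manifest("build-agents-cache-agent-cache-00", "20Gi")
```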

Sizing Recommendations¶

Content	Recommended Size
Agent binary + work directory	10-20 Gi
Tool cache (Node, Python, Go, etc.)	10-30 Gi
Docker layer cache	20-50 Gi
Combined (single volume)	30-80 Gi

Use a storage class backed by SSDs for best job throughput. Set storageClassName to select a specific class.
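Because every template is pre-provisioned maxAgents times, the storage requested up front is the sum of the template sizes multiplied by maxAgents. A quick check for the example configuration (10 agents, 20 Gi and 10 Gi templates), using a hypothetical helper:

```python
from typing import List


def total_cache_gi(max_agents: int, template_sizes_gi: List[int]) -> int:
    # Each agent slot gets one PVC per cache template.
    return max_agents * sum(template_sizes_gi)


total = total_cache_gi(10, [20, 10])  # 10 * (20 + 10) = 300 Gi
```

Confirm that the storage class can actually supply this total before raising maxAgents.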

One-Job-Per-Pod Behavior¶

Each agent pod runs with --once passed to the agent startup script. This causes the agent to:

Register with Azure DevOps
Pick up exactly one job
Deregister and exit cleanly

The operator detects completed pods (Succeeded or Failed phase), releases their PVCs back to the free pool, and deletes the pods. The reconcile loop then creates new pods to match minAgents or pending job demand.

This model prevents job queue pollution from long-lived agents and gives each job a clean environment.
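One way to picture the scaling target is as the pending job count, raised to minAgents and capped at maxAgents. The sketch below rests on that assumption; the operator's actual formula is not documented here, and the function names are hypothetical.

```python
def desired_agent_pods(min_agents: int, max_agents: int, pending_jobs: int) -> int:
    # Assumption: one pod per pending job, never below minAgents
    # and never above maxAgents.
    return min(max_agents, max(min_agents, pending_jobs))


def pods_to_create(desired: int, active: int) -> int:
    # Completed pods have already been deleted, so only active pods count.
    return max(0, desired - active)


idle = desired_agent_pods(0, 10, 0)       # nothing to run
burst = desired_agent_pods(0, 10, 25)     # capped by maxAgents
to_start = pods_to_create(burst, 4)
```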

Scale-to-Zero¶

Setting minAgents: 0 enables true scale-to-zero. When no jobs are pending:

All real agent pods are deleted
The operator registers an offline "dummy" agent in the pool

Azure DevOps queues jobs against the offline dummy agent rather than rejecting them as "no agents available". When a job is detected, the operator removes the dummy and scales up real agents.

The dummy agent is an Azure DevOps (ADO) agent record with no corresponding pod. It relies on ADO queuing jobs for offline agents instead of failing them. Verify this behavior in your ADO organization before relying on it.
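The transitions described in this section can be summarized as a small decision function. This is a sketch for a pool with minAgents: 0; the action strings and the function name are hypothetical, and jobs that are still running are ignored for simplicity.

```python
from typing import List


def scale_to_zero_actions(pending_jobs: int, real_agents: int,
                          dummy_registered: bool) -> List[str]:
    actions: List[str] = []
    if pending_jobs == 0:
        # Idle: remove real agents and keep the pool "alive" with the dummy.
        if real_agents > 0:
            actions.append("delete-real-agents")
        if not dummy_registered:
            actions.append("register-dummy")
    else:
        # Work arrived: swap the dummy for real agents.
        if dummy_registered:
            actions.append("remove-dummy")
        actions.append("scale-up")
    return actions


going_idle = scale_to_zero_actions(0, 2, False)
waking_up = scale_to_zero_actions(3, 0, True)
steady_idle = scale_to_zero_actions(0, 0, True)
```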

Resource Recommendations¶

Minimum per agent pod for a typical CI workload:

agentResources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "2"
    memory: "4Gi"

For Docker-in-Docker or heavy builds, increase memory limits to 8-16 Gi.
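To size build nodes, multiply the per-pod values by maxAgents. The sketch below does this for the figures above with maxAgents: 10; the helper names are hypothetical, and the parsers handle only the quantity formats used in this guide (millicores or whole cores, and Gi).

```python
def cpu_cores(quantity: str) -> float:
    # "500m" -> 0.5 cores, "2" -> 2.0 cores
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def memory_gi(quantity: str) -> float:
    # Only the Gi suffix is handled in this sketch.
    if not quantity.endswith("Gi"):
        raise ValueError(f"unsupported memory quantity: {quantity}")
    return float(quantity[:-2])


max_agents = 10
requested_cpu = max_agents * cpu_cores("500m")   # schedulable minimum
requested_mem = max_agents * memory_gi("1Gi")
limit_cpu = max_agents * cpu_cores("2")          # worst-case burst
limit_mem = max_agents * memory_gi("4Gi")
```

The request totals are what the scheduler reserves; the limit totals show how far the pool can burst if every agent is busy at once.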

Pod Placement¶

Use nodeSelector, tolerations, and affinity to place agents on dedicated build nodes:

spec:
  nodeSelector:
    kubernetes.io/os: linux
    node-role: build

  tolerations:
    - key: "build-only"
      operator: "Exists"
      effect: "NoSchedule"

Custom Agent Image¶

The operator uses the bundled agent image by default. To use a custom image:

spec:
  agentImage: "myregistry.azurecr.io/azp-agent:latest"
  imagePullSecrets:
    - name: registry-credentials

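The image must include the Azure Pipelines agent startup script (start.sh or equivalent). The operator passes --once as a container argument, so the startup script must forward its arguments to run.sh, for example by invoking ./run.sh "$@". If the arguments are dropped, the agent keeps running after its first job instead of exiting.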

See the reference Dockerfile in docker/ for how the bundled image is built.

Init Containers¶

Use initContainers to run setup steps before the agent container starts. Init containers share the pod's volumes:

spec:
  initContainers:
    - name: install-tools
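      image: busybox  # placeholder; use an image that actually contains /tools
      command: ["sh", "-c", "cp /tools/* /shared/"]
      volumeMounts:
        - name: tools  # must match a volume available to the pod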
          mountPath: /shared

Service Account¶

Assign a specific service account if agents need cluster API access:

spec:
  serviceAccountName: "azp-agent"

The service account must exist in the same namespace.

Observability¶

The operator exposes Prometheus metrics on the controller pod:

Metric	Type	Description
`azp_active_agents`	Gauge	Currently running agent pods
`azp_pending_jobs`	Gauge	Jobs waiting in the ADO queue
`azp_available_pvc_slots`	Gauge	Free PVC slots available for scale-up
`azp_reconcile_errors_total`	Counter	Reconcile errors by reason

All metrics carry pool and namespace labels.

Use these alongside standard Kubernetes pod metrics for capacity planning and alerting.
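As an example of an alerting condition, jobs waiting while no PVC slot is free suggests the pool has hit maxAgents. The sketch below expresses that check over already-scraped values; the function name and the dictionary input are assumptions, and only the metric names come from the table above.

```python
from typing import Dict


def pool_is_starved(samples: Dict[str, float]) -> bool:
    # Jobs are queued, but scale-up has no free PVC slot to use.
    return (samples["azp_pending_jobs"] > 0
            and samples["azp_available_pvc_slots"] == 0)


starved = pool_is_starved({"azp_pending_jobs": 4, "azp_available_pvc_slots": 0})
healthy = pool_is_starved({"azp_pending_jobs": 4, "azp_available_pvc_slots": 3})
```

In practice the same condition would be evaluated per pool and namespace label and held for a few minutes before alerting, to avoid firing during normal bursts.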

High Availability¶

The operator supports leader election for multi-replica deployments:

# Enable in the controller Deployment
--leader-elect=true

Only one replica is active at a time; the others stand by and take over if the leader fails. Run at least two replicas so that a node failure does not stop reconciliation.

Troubleshooting¶

Agents not scaling up¶

Check the Available condition on the AgentPool: kubectl describe agentpool <name>
Look for errors in controller logs: kubectl logs deploy/azure-devops-agent-operator-controller-manager
Verify the PAT is valid and has the correct scope

PVC pending after scale-up¶

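The PVC pool is pre-provisioned with maxAgents PVCs per cache template. If PVCs remain Pending, the storage class may be out of capacity, or the cluster may lack a default StorageClass; in the latter case, set storageClassName explicitly in cacheVolumes. Storage classes with volumeBindingMode: WaitForFirstConsumer also leave PVCs Pending until a pod that uses them is scheduled, which is expected and not an error.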

Dummy agent not accepting jobs¶

The offline dummy agent behavior depends on Azure DevOps queuing semantics. If jobs fail immediately instead of queuing, verify that your ADO organization queues work against offline agents by checking under Agent pools > Jobs in the ADO UI.

Pods evicted or OOMKilled¶

Increase agentResources.limits.memory. For Docker-in-Docker workloads, also check the DinD sidecar container limits.