High Availability

High availability (HA) in NetBox Enterprise is delivered in phases. Phase 1, in NetBox Enterprise 2.2, provides single-cluster high availability: a multi-node cluster that recovers automatically from the loss of any single fault domain - a node or an entire availability zone - with NetBox staying available and the cluster repairing itself without manual intervention.

The recovery mechanism is the same whether the lost fault domain is a node or an availability zone. The difference is where the nodes sit - a single location or spread across availability zones - and which topology labels group them.

Phase 1 of HA support in NetBox Enterprise 2.2 requires customer-provided external services - object storage, PostgreSQL, and Redis - for data resiliency. Running these services in-cluster under NetBox Labs management is planned for a later phase. HA uses the Multi-Node Deployment installation procedure.

What Phase 1 covers

Phase 1 delivers high availability within a single cluster: automated failover when a node or an availability zone is lost. It is not cross-site disaster recovery - replicating to an independent deployment elsewhere, with one site taking over for another, is a separate capability. The practical boundary is latency: a single cluster can span nodes, availability zones, or low-latency regions, but not geographically distant high-latency regions, because etcd quorum needs low inter-node latency.

High availability on air-gapped installations is not yet supported.

What to expect on node or AZ loss

When a node fails - or an entire availability zone, which takes down every node in that zone - a correctly configured HA deployment recovers automatically through several mechanisms working together:

  • Faster failure detection. Node-down detection and pod eviction timeouts are tuned well below Kubernetes defaults, so recovery starts in tens of seconds rather than minutes.
  • Spread across fault domains. NetBox web replicas are forced onto distinct nodes (a hard spread) on multi-node installs that use external S3; other workloads use a softer best-effort spread. Once nodes carry zone labels, the same spread operates at the availability-zone level (see Multi-AZ placement). Either way, the loss of one fault domain never takes down every replica of a service.
  • Per-node scaling. Stateful and pipeline components scale their replica counts to match node count (up to 3) automatically as nodes join.
  • HA-readiness signal. If a cluster has 3 or more nodes but HA has not been enabled, NetBox Enterprise surfaces a warning condition and reports itself not-ready until HA setup is complete.
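The scaling and spread behavior above can be illustrated with a small sketch. It assumes the scaling rule is exactly "replica count equals node count, capped at 3" and that a hard spread places one replica per node. The function names are hypothetical and this is not the operator's actual code.

```python
def target_replicas(node_count: int, cap: int = 3) -> int:
    """Replica count that tracks node count up to a cap (illustrative rule only)."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    return min(node_count, cap)


def surviving_replicas(placement: dict[str, int], lost_node: str) -> int:
    """Count replicas left after one node disappears.

    `placement` maps a node name to the number of replicas running on it.
    """
    return sum(count for node, count in placement.items() if node != lost_node)


# Replica targets as nodes join the cluster.
for nodes in (1, 2, 3, 5):
    print(f"{nodes} node(s) -> {target_replicas(nodes)} replica(s)")

# Hard spread on three nodes: one replica each, so any single loss leaves two.
hard_spread = {"node-1": 1, "node-2": 1, "node-3": 1}
print(surviving_replicas(hard_spread, "node-2"))

# Without a spread, all replicas could share a node and be lost together.
stacked = {"node-1": 3, "node-2": 0, "node-3": 0}
print(surviving_replicas(stacked, "node-1"))
```

The last case is the one the spread rules exist to rule out: every replica on the failed node, leaving nothing to serve traffic while replacements start.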

After losing a node, a three-node cluster runs at two of three controllers - still functional, but with no remaining fault tolerance. Replace the failed node promptly.

Recovery time

| Scenario | Recovery |
| --- | --- |
| Fresh 2.2+ install | About one minute - up to ~28 seconds to detect the node is down, then ~30 seconds to evict and reschedule its pods |
| Upgraded from a pre-2.2 release | About five minutes - the faster eviction timeout can only be applied at fresh install, so upgraded clusters keep the slower default for that step (detection still improves) |

The one-minute figure is convergence - detecting the failure and rescheduling pods. Users stay served throughout, because a surviving replica on another fault domain keeps answering (this is what the hard spread buys). A replacement pod takes longer to fully start on its own - a NetBox web pod cold-starts in roughly 2-4 minutes through its dependency-gated init containers and app boot - but on a single node or zone loss no one sees that, because the survivor is already serving. The cold-start only becomes user-visible if every replica is lost at once, which the spread exists to prevent.
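The recovery figures quoted above are simple sums of a detection window and an eviction window. The sketch below uses the approximate numbers from this page and treats "about five minutes" as 300 seconds, which is an assumption. The upgraded case also assumes the faster detection settings are already in effect.

```python
# Approximate figures taken from this page; they are estimates, not guarantees.
DETECT_S = 28          # up to ~28 seconds to detect that the node is down
EVICT_TUNED_S = 30     # ~30 seconds to evict and reschedule on a fresh 2.2+ install
EVICT_DEFAULT_S = 300  # "about five minutes" on upgraded clusters (assumed 300 s)


def convergence_seconds(detect_s: int, evict_s: int) -> int:
    """Worst-case time from node loss until its pods have been rescheduled."""
    return detect_s + evict_s


fresh = convergence_seconds(DETECT_S, EVICT_TUNED_S)
upgraded = convergence_seconds(DETECT_S, EVICT_DEFAULT_S)

print(f"fresh 2.2+ install: ~{fresh} s")
print(f"upgraded cluster:   ~{upgraded} s (~{upgraded / 60:.1f} min)")
```

These sums cover convergence only. The 2-4 minute cold start of a replacement web pod comes on top, but it stays invisible to users while a surviving replica keeps serving.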

Network flap sensitivity

Node-down detection is tuned aggressively to favor fast recovery. A network interruption lasting more than ~20 seconds on a node is enough to trigger pod eviction, which can cause false-positive failovers on links with intermittent connectivity.

Restoring failover speed after an in-place upgrade

A cluster upgraded in place from a pre-2.2 release does not pick up all of the failover tuning automatically, so it recovers more slowly than a fresh 2.2+ install until you act:

  • Node-down detection: run sudo systemctl restart k0scontroller on each node after the upgrade to apply the faster detection settings.
  • Pod eviction: the faster eviction timeout is fixed when the cluster is first created and cannot be changed on an existing Embedded Cluster, so eviction stays at the Kubernetes default (about five minutes) until the cluster is reinstalled.

Fresh 2.2+ installs already have both settings applied and need no action.

Requirements

A deployment delivers the HA promise only when all of the following hold. Anything less should be treated as a non-HA configuration, equivalent to single-node for availability purposes, regardless of what the install UI indicates.

  1. Three nodes, all registered as controllers, so etcd holds quorum. Losing one leaves two of three - the cluster keeps running but has no remaining fault tolerance, so replace a lost node promptly.
  2. External S3-compatible object storage for NetBox media and scripts. Without it, media lives on a node-local volume that pins NetBox to one node and breaks recovery.
  3. Externally managed PostgreSQL.
  4. Externally managed Redis.

The external database, Redis, and object storage must themselves be highly available - that is your responsibility. NetBox Enterprise depends on them; it does not make them HA.

PostgreSQL:

  • AWS RDS, Google Cloud SQL, Azure Database for PostgreSQL, or self-hosted PostgreSQL with streaming replication (Patroni, pg_auto_failover)
  • Requires three databases: netbox, diode, hydra

Redis:

  • AWS ElastiCache (Redis OSS), Google Memorystore, Azure Cache for Redis, or self-hosted Redis with Sentinel
  • Sentinel-based Redis is required for durable workloads (the Diode ingestion queue)

Object storage: any S3-compatible service reachable from all nodes. See Storage Options and Storage Installation.

See the external database guide for PostgreSQL and Redis configuration details.

Placement options

Single-location: three nodes in one location. Survives the loss of any single node.

Multi-AZ: three nodes, one per availability zone, in a single region with low-latency inter-AZ networking. Survives the loss of any single availability zone. The deployment is identical to single-location - three nodes, all controllers - but each node is provisioned in a different zone. You provision the hosts across zones yourself; no cloud-provider API integration is required.

Why three availability zones (not two)

The Kubernetes control plane uses etcd, which requires a quorum majority. With two availability zones, one holds more etcd members than the other; if the larger zone fails, the survivors are in the minority and the control plane halts. With three zones (one node each), losing any single zone leaves two of three members - a majority - and the cluster keeps running.
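The quorum argument can be checked with a few lines of arithmetic. The sketch below uses the standard majority rule (more than half of the members). The zone names and function names are made up for illustration.

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with `members` voting members."""
    return members // 2 + 1


def survives_zone_loss(members_per_zone: dict[str, int]) -> dict[str, bool]:
    """For each zone, report whether the cluster keeps quorum if that zone is lost."""
    total = sum(members_per_zone.values())
    needed = quorum(total)
    return {
        zone: (total - count) >= needed
        for zone, count in members_per_zone.items()
    }


# Three members over two zones: losing the larger zone breaks quorum.
print(survives_zone_loss({"az-a": 2, "az-b": 1}))
# -> {'az-a': False, 'az-b': True}

# Three members over three zones: any single zone can be lost.
print(survives_zone_loss({"az-a": 1, "az-b": 1, "az-c": 1}))
# -> {'az-a': True, 'az-b': True, 'az-c': True}
```

With three members the quorum is two, so the only layouts that tolerate any single zone failure are those where no zone holds more than one member.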

Zone labels on Embedded Cluster

Embedded Cluster has no first-class way to apply per-node topology labels during the join process. To get availability-zone-level spread, label the nodes manually after they join, following Topology-Aware Node Labels to extend the cluster configuration. Until the nodes carry zone labels, spread is node-level only. In the standard one-node-per-zone topology that still places replicas in separate zones, but applying the labels keeps the guarantee correct if a zone ever holds more than one node.
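Once labels are applied, it can be worth checking that the result matches the intended layout. The sketch below is a hypothetical helper, not part of NetBox Enterprise. It assumes the zone is carried in the Kubernetes well-known label topology.kubernetes.io/zone (this page does not name the label, so confirm it against Topology-Aware Node Labels) and that node labels have already been collected into a dictionary.

```python
from collections import Counter

# Kubernetes well-known zone label; assumed here, confirm for your install.
ZONE_LABEL = "topology.kubernetes.io/zone"


def check_zone_layout(
    node_labels: dict[str, dict[str, str]], expected_zones: int = 3
) -> list[str]:
    """Return human-readable problems with the zone labelling (empty list = OK)."""
    problems = []

    # Every node needs a non-empty zone label for zone-level spread to apply.
    for node in sorted(node_labels):
        if not node_labels[node].get(ZONE_LABEL):
            problems.append(f"{node}: missing label {ZONE_LABEL}")

    zones = Counter(
        labels[ZONE_LABEL]
        for labels in node_labels.values()
        if labels.get(ZONE_LABEL)
    )
    if len(zones) != expected_zones:
        problems.append(f"found {len(zones)} zone(s), expected {expected_zones}")

    # The standard topology is one node per zone; flag anything denser.
    for zone, count in sorted(zones.items()):
        if count > 1:
            problems.append(f"{zone}: holds {count} nodes")
    return problems


# Example data with hypothetical node and zone names.
nodes = {
    "nbe-1": {ZONE_LABEL: "us-east-1a"},
    "nbe-2": {ZONE_LABEL: "us-east-1b"},
    "nbe-3": {},  # joined, but not yet labelled
}
for problem in check_zone_layout(nodes):
    print(problem)
```

The input could be assembled from the output of kubectl get nodes --show-labels. How you reach kubectl depends on your installation.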

Infrastructure requirements

Node topology

| Requirement | Detail |
| --- | --- |
| Minimum nodes | 3 |
| Node role | All three nodes must be control plane instances |
| Placement | Each node on a separate physical or virtual machine; for multi-AZ, one node per zone across exactly 3 availability zones in the same region |
| Resources per node | 8 vCPU, 24 GB RAM, 100 GB SSD - see System Requirements |
| Cluster capacity | Surviving nodes must have enough headroom to absorb the workloads from a failed node |
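The capacity requirement can be turned into a rough rule of thumb. The sketch below assumes identical nodes and a workload that can be redistributed freely across survivors. It ignores system-reserved resources and per-pod scheduling granularity, so treat the result as an upper bound rather than a sizing guarantee. The function names and usage figures are illustrative.

```python
def max_safe_utilization(nodes: int) -> float:
    """Highest average utilization at which N-1 nodes can still carry the full load."""
    if nodes < 2:
        raise ValueError("need at least two nodes to tolerate a node loss")
    return (nodes - 1) / nodes


def fits_after_node_loss(per_node_usage: list[float], node_capacity: float) -> bool:
    """True if the total load fits on the nodes that remain after losing one."""
    survivors = len(per_node_usage) - 1
    return sum(per_node_usage) <= survivors * node_capacity


# Three nodes: keep average utilization at or below roughly two thirds.
print(f"{max_safe_utilization(3):.0%}")

# Memory example using the 24 GB per-node figure from the table above.
print(fits_after_node_loss([14, 15, 13], node_capacity=24))  # 42 GB onto 48 GB
print(fits_after_node_loss([18, 17, 16], node_capacity=24))  # 51 GB onto 48 GB
```

The same check applies to CPU. Run it against whichever resource is closest to its limit.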

Network

Latency: low inter-node latency is required for etcd quorum and Kubernetes control-plane health. Co-locate nodes (same datacenter or rack) for single-location deployments. For multi-AZ, use zones within a single region - typical intra-region inter-AZ latency (about 1-5 ms) is within range, while distant high-latency regions are not supported (that is disaster recovery, not HA).

Bandwidth: 1 Gbps or faster NIC recommended for inter-node cluster traffic.

Required ports (inter-node): open these between all cluster nodes, in addition to the standard single-node ports:

| Port | Protocol | Purpose |
| --- | --- | --- |
| 6443 | TCP | Kubernetes API server |
| 2379-2380 | TCP | etcd (control plane state) |
| 10250 | TCP | Kubelet API |
| 4789 | UDP | VXLAN overlay (Calico) |
| 179 | TCP | BGP (if Calico uses BGP mode) |
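A quick way to confirm that the TCP ports are open is to attempt a connection from each node to every other node. The sketch below is a minimal, hypothetical helper that uses only the Python standard library. It covers the TCP ports only: UDP 4789 cannot be verified with a connect test, and a refused connection can also mean that nothing is listening yet rather than that a firewall is blocking the port.

```python
import socket

# TCP ports from the table above; 179 matters only if Calico uses BGP mode.
INTER_NODE_TCP_PORTS = [6443, 2379, 2380, 10250, 179]


def check_tcp_ports(host: str, ports: list[int], timeout: float = 2.0) -> dict[int, bool]:
    """Try a TCP connection to each port and report which ones accepted it."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or no route.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results


def format_report(host: str, results: dict[int, bool]) -> str:
    """Render one line per port for quick reading in a terminal."""
    lines = [
        f"{host}:{port} {'open' if ok else 'unreachable'}"
        for port, ok in sorted(results.items())
    ]
    return "\n".join(lines)
```

To use it, run something like print(format_report("10.0.0.11", check_tcp_ports("10.0.0.11", INTER_NODE_TCP_PORTS))) from each node against each peer, where 10.0.0.11 is a placeholder for a peer's IP address.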

Inter-node security

  • Control plane: etcd and the Kubernetes API server communicate over TLS, with certificates managed by k0s.
  • Data-plane (pod-to-pod) traffic: carried by the Calico overlay network. Overlay encryption (WireGuard) is not enabled by default, so run all cluster nodes on a trusted private network - for example a dedicated VLAN, VPC, or subnet with security groups - rather than across untrusted links.
  • Admin Console (port 30000): TLS; restrict access to administrators only.

Nodes must be reachable from each other by IP address. NAT between nodes is not supported.

Load balancer

A stateless load balancer in front of all nodes provides a single entry point and routes around failed nodes using health checks. The cluster handles failover internally - no failover logic is needed in the load balancer. See Load Balancer Configuration.

Operator high availability

The operator can run multiple replicas with leader election, so the loss of an operator pod does not pause reconciliation and operator upgrades can be zero-downtime. This is separate from the cluster failover described above - NetBox and node or zone recovery do not depend on it - and it is opt-in (off by default).

Pre-deployment checklist

  • 3 nodes, each on a separate physical or virtual machine
  • All 3 nodes registered as control plane instances
  • Each node meets hardware requirements (8 vCPU, 24 GB RAM, 100 GB SSD)
  • External, highly available PostgreSQL configured and reachable from all nodes
  • External, highly available Redis (with Sentinel) configured and reachable from all nodes
  • External S3-compatible storage configured and reachable from all nodes
  • Inter-node ports open (TCP 6443, 2379-2380, 10250; UDP 4789; TCP 179 if Calico uses BGP mode)
  • Low inter-node latency (co-located, or intra-region for multi-AZ)
  • Load balancer configured with health checks against all nodes
  • TLS certificate includes SANs for all node hostnames and the load balancer hostname
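The SAN item in the checklist above can be verified mechanically once you have the certificate's SAN list. The sketch below is a hypothetical helper. It takes the SAN entries as plain strings, leaving their extraction from the certificate to you, and treats a wildcard as covering exactly one leftmost label. All hostnames are placeholders.

```python
def san_covers(san: str, hostname: str) -> bool:
    """True if one SAN entry covers the hostname (exact or single-label wildcard)."""
    san, hostname = san.lower(), hostname.lower()
    if san == hostname:
        return True
    if san.startswith("*."):
        # "*.example.com" covers "a.example.com" but not "a.b.example.com".
        head, _, tail = hostname.partition(".")
        return bool(head) and tail == san[2:]
    return False


def missing_sans(
    cert_sans: list[str], node_hostnames: list[str], lb_hostname: str
) -> list[str]:
    """Return the required hostnames that no SAN entry covers."""
    required = [*node_hostnames, lb_hostname]
    return [
        host for host in required
        if not any(san_covers(san, host) for san in cert_sans)
    ]


nodes = ["nbe-1.example.com", "nbe-2.example.com", "nbe-3.example.com"]
sans = ["netbox.example.com", "nbe-1.example.com", "nbe-2.example.com"]
print(missing_sans(sans, nodes, "netbox.example.com"))
# -> ['nbe-3.example.com']
```

An empty result means every node hostname and the load balancer hostname is covered.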

For multi-AZ, also:

  • Nodes provisioned across exactly 3 availability zones (one per zone), same region
  • Zone topology labels applied to each node (see Topology-Aware Node Labels)