Scaling Best Practices

NetBox Validation uses three engines with different scaling characteristics. Understanding how each engine scales helps you design policies, schedule runs, and size infrastructure for deployments from hundreds to tens of thousands of devices.

Policy Design by Engine

The most important scaling decision is how you structure policies per engine.

Intent Engine

The intent engine queries the NetBox data model directly and scales linearly with device count. More devices means proportionally more time, with no fixed per-policy overhead.

Design principle: Segment freely by role. A spine policy, a leaf policy, and a server policy will each run faster than one combined policy covering all roles. Narrower policies mean faster individual runs and fewer irrelevant checks.

Approximate throughput:

  • Lightweight policies (5-6 checks): ~25 ms per device
  • Heavy policies (10+ checks with topology queries): ~200-250 ms per device

At 1,000 devices, a lightweight intent policy completes in under 30 seconds. At 2,000 devices, under a minute.
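
The figures above follow from simple multiplication. A minimal sketch, assuming run time is purely linear in device count as described; the function and constant names are illustrative and not part of NetBox Validation:

```python
# Per-device costs quoted above, in seconds (heavy policies use the 200-250 ms range).
LIGHTWEIGHT_S_PER_DEVICE = 0.025
HEAVY_S_PER_DEVICE = (0.200, 0.250)


def intent_run_seconds(device_count: int, seconds_per_device: float) -> float:
    """Linear estimate; no fixed per-policy overhead is modeled."""
    return device_count * seconds_per_device


for devices in (1_000, 2_000):
    light = intent_run_seconds(devices, LIGHTWEIGHT_S_PER_DEVICE)
    low = intent_run_seconds(devices, HEAVY_S_PER_DEVICE[0])
    high = intent_run_seconds(devices, HEAVY_S_PER_DEVICE[1])
    print(f"{devices} devices: lightweight ~{light:.0f}s, heavy ~{low:.0f}-{high:.0f}s")
```

Treat the result as a rough planning number; observed run times depend on the checks in each policy.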

Config Engine

The config engine renders device configurations and sends them to the config analysis engine for structural analysis. It has significant fixed overhead per policy (~30-40 seconds for snapshot construction) plus linear per-device cost for config rendering (~150-300 ms per device at scale).

Design principle: Consolidate rules into fewer policies. Adding more rules to an existing config policy is nearly free. Splitting the same rules across multiple policies multiplies the fixed overhead.

Config Policies   Rules Per Policy   Total Rules   Approximate Time
1                 30                 30            ~35-40s
3                 10                 30            ~105-120s
6                 5                  30            ~210-240s

Only split config policies when different sites need different config rule sets.
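
The times in the table above correspond to roughly 35-40 seconds of fixed overhead per policy, multiplied by the number of policies. A small sketch of that arithmetic, assuming no cross-policy cache and ignoring per-device rendering cost; the names are illustrative:

```python
# Per-policy fixed overhead range (seconds) implied by the table above.
FIXED_OVERHEAD_S = (35, 40)


def config_fixed_overhead(policy_count: int) -> tuple[int, int]:
    """Return the (low, high) total fixed overhead in seconds for N config policies."""
    low, high = FIXED_OVERHEAD_S
    return policy_count * low, policy_count * high


# The same 30 rules, split three different ways.
for policies, rules_each in ((1, 30), (3, 10), (6, 5)):
    low, high = config_fixed_overhead(policies)
    print(f"{policies} policies x {rules_each} rules: ~{low}-{high}s")
```

The rule count does not appear in the calculation at all, which is the point: only the number of policies drives the fixed cost.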

A cross-policy config cache reduces overhead when running multiple config-dependent policies in the same session. The first policy pays the snapshot cost; subsequent policies reuse the cached snapshot.

Graph Engine

The graph engine builds an in-memory infrastructure dependency graph from NetBox data models. It loads all active devices at the scoped sites to trace complete power, network, and circuit dependency chains -- regardless of which devices are targeted by the run.

Design principle: Keep graph policies site-scoped; don't role-segment them. Device targeting does not reduce graph engine time because the full site graph must be built regardless. Role-segmented runs are counterproductive -- three role-segmented runs each trigger a separate graph build and can take 5x longer than a single site-scoped run.

Run Scope               Engine Behavior
Full site               Builds graph once, evaluates all devices
Role-targeted           Builds full site graph anyway, filters results to targeted role
3 role-segmented runs   Builds full site graph 3 separate times

Scoping Policies for Scale

Scope Every Policy to a Site

This is the single most important scaling recommendation.

An unscoped policy at a 10,000-device organization forces every engine to process all 10,000 devices. Site-scoped policies process only the devices at each site, typically 100-1,000 devices.

For graph engine policies, the impact is especially significant: the graph builder loads all devices at scoped sites. An unscoped policy at a multi-site organization loads devices from every site into one graph.

Avoid: 1 policy x 10,000 devices = 1 very slow run
Better: 10 policies x 1,000 devices each = 10 fast runs (parallelizable)
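
To see what site scoping does to run size, the sketch below groups a hypothetical inventory by site. The device names, site slugs, and the `run_sizes_by_site` helper are made up for illustration and do not come from the NetBox API:

```python
from collections import Counter


def run_sizes_by_site(devices):
    """Return {site: device_count}, i.e. the size of each site-scoped run."""
    return Counter(site for _name, site in devices)


# Hypothetical inventory: 10 sites with 1,000 devices each.
inventory = [(f"dev-{s}-{n}", f"site-{s}") for s in range(10) for n in range(1000)]
sizes = run_sizes_by_site(inventory)

print(f"unscoped: 1 run x {len(inventory)} devices")
print(f"site-scoped: {len(sizes)} runs, largest {max(sizes.values())} devices")
```

The largest site, not the whole fleet, then bounds the cost of any single run.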

Role Segmentation: Intent Only

Role-based policies are excellent for intent checks. Different device roles have different check profiles -- spine BGP checks don't apply to servers. Creating role-specific intent policies:

  • Reduces per-run device count (linear speedup)
  • Avoids running irrelevant checks (fewer false positives)
  • Enables targeted scheduling (spine policies on every change, server policies weekly)

Do not role-segment graph or mixed-engine policies. Each role-segmented run triggers a separate graph build and config snapshot. The sum of role-segmented runs is significantly worse than a single site-scoped run.

Separate Graph Checks Into Dedicated Policies

Graph checks have a unique cost profile (roughly constant per site). Separating them from intent-only rules lets you:

  • Schedule graph audits less frequently (weekly vs. per-merge)
  • Avoid graph overhead when only intent checks are needed
  • Clearly attribute execution time to each engine

Targeted Runs and Change Validation

When Device Targeting Helps

Device targeting (target_devices, target_sites, target_tags on a run) narrows the scope to specific devices. Its benefit depends on the engine:

Engine   Benefit from Device Targeting
Intent   Linear savings proportional to device reduction
Config   Partial -- fewer configs to render, but fixed snapshot overhead remains
Graph    None -- loads all site devices for dependency analysis regardless

Pre-Change Validation

For branch-merge and change-request triggers, targeting only the devices affected by the change provides fast feedback for intent checks. A merge touching 10 devices at a 1,000-device site completes a lightweight intent policy in under a second (10 devices at ~25 ms each); a heavy policy at ~200-250 ms per device takes a few seconds.

Config and graph policies benefit less from targeting. Config validation still pays the fixed snapshot overhead. Graph validation loads the full site regardless.
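
A rough comparison of targeted and full-site runs, using the per-engine figures quoted earlier. This is a sketch under stated assumptions: 25 ms per device for a lightweight intent policy, and for config a 35-second fixed snapshot cost plus 150 ms per device (the low end of the quoted range). The graph engine is left out because targeting does not change its cost.

```python
def intent_seconds(devices: int, per_device_s: float = 0.025) -> float:
    """Intent cost is linear in the number of devices evaluated."""
    return devices * per_device_s


def config_seconds(devices: int, fixed_s: float = 35.0, per_device_s: float = 0.150) -> float:
    """Config cost is a fixed snapshot overhead plus per-device rendering."""
    return fixed_s + devices * per_device_s


for label, devices in (("full site", 1_000), ("targeted", 10)):
    print(
        f"{label:>9}: intent ~{intent_seconds(devices):.2f}s, "
        f"config ~{config_seconds(devices):.1f}s"
    )
```

Under these assumptions, targeting cuts the intent run by the same factor as the device count, while the config run never drops below its fixed overhead.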

When to Use Full-Site Runs

  • Graph-heavy policies: targeting adds scope-intersection overhead without reducing graph build time
  • Scheduled compliance audits: full-site runs ensure complete compliance score coverage
  • After infrastructure changes: when you need compliance scores updated for all devices

Scheduling Recommendations

Match check frequency to engine cost and urgency. The right schedule depends on your infrastructure size, complexity, and operational needs.

Tier 1: Every Change (Seconds)

  • What: Intent-only policies with device targeting
  • When: Branch merge trigger or change request trigger
  • Scope: Only changed devices
  • Typical time: 1-10 seconds per policy
  • Example policies: Data Quality, Addressing & IPAM, Naming & Standards

Tier 2: Daily (Minutes)

  • What: Intent + config baseline per site
  • When: Daily cron schedule
  • Scope: Full site
  • Typical time: 2-5 minutes per site (at ~250 devices)
  • Example policies: Config Analysis Baseline, Leaf/Spine Intent Baseline

# Cron schedule on policy
schedule: "0 6 * * *" # Daily at 6 AM

Tier 3: Weekly (10-30 Minutes)

  • What: Full compliance + graph resilience audit
  • When: Weekly cron schedule
  • Scope: Full site
  • Typical time: 10-30 minutes per site (at ~250 devices)
  • Example policies: CLOS Fabric Design, Power/Network Resilience, compliance framework packs

# Cron schedule on policy
schedule: "0 2 * * 0" # Weekly on Sunday at 2 AM

These tiers are approximate starting points. Larger sites or policies with more checks will take proportionally longer. Adjust based on your observed run times.
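
The schedule strings in Tier 2 and Tier 3 use the standard five-field cron format (minute, hour, day of month, month, day of week). The helper below is a hypothetical illustration that labels the fields; it does not validate values and is not part of the product:

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression: str) -> dict[str, str]:
    """Split a five-field cron expression into named fields."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}: {expression!r}")
    return dict(zip(CRON_FIELDS, parts))


print(describe_cron("0 6 * * *"))  # daily: hour 6, minute 0, every day
print(describe_cron("0 2 * * 0"))  # weekly: hour 2, minute 0, day_of_week 0 (Sunday)
```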


Scaling with Multiple Workers

Worker Parallelism

Site-scoped policies naturally distribute across RQ workers. Each worker processes one policy run at a time, with no shared state between workers.

Total wall-clock time ~ Total execution time / Number of workers

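If a full validation suite takes 20 minutes on one worker, adding a second worker brings it to ~10 minutes, and four workers bring it to ~5 minutes. This holds when the suite splits into enough similarly sized runs to keep every worker busy; a single long run sets a floor on wall-clock time.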
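
The division above is an idealized estimate. A slightly more realistic sketch assigns each run to the least-loaded worker, longest run first. The scheduling strategy is an assumption made for illustration and not a description of how RQ actually dispatches jobs:

```python
import heapq


def wall_clock_minutes(run_minutes, workers: int) -> float:
    """Estimate wall-clock time with greedy longest-first assignment.

    Each worker handles one run at a time, as described above.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    loads = [0.0] * workers  # total minutes assigned to each worker
    heapq.heapify(loads)
    for duration in sorted(run_minutes, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + duration)
    return max(loads)


even_suite = [1.0] * 20            # 20 one-minute runs, 20 minutes total
uneven_suite = [12.0, 2.0, 2.0, 2.0, 2.0]  # also 20 minutes total

for workers in (1, 2, 4):
    print(workers, wall_clock_minutes(even_suite, workers), wall_clock_minutes(uneven_suite, workers))
```

With evenly sized runs the estimate matches the simple division. With one 12-minute run in the mix, adding workers beyond two no longer helps, which is another reason to keep policies site-scoped and similar in size.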

Config Analysis Engine Capacity

The config analysis engine processes one snapshot at a time per instance. Scale by adding more instances:

  • 1 instance per 2-3 concurrent config-engine policies
  • Each instance: ~2 vCPU + 4 GB RAM
  • Workers x instances = maximum concurrent config engine throughput
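
The sizing guidance above can be turned into a quick capacity estimate. A minimal sketch, assuming the conservative end of the range (2 concurrent config-engine policies per instance) and the quoted ~2 vCPU and 4 GB RAM per instance; the function name and return shape are illustrative:

```python
import math


def config_engine_sizing(concurrent_policies: int, policies_per_instance: int = 2) -> dict[str, int]:
    """Estimate instance count and total resources for concurrent config-engine policies."""
    if concurrent_policies < 1 or policies_per_instance < 1:
        raise ValueError("both arguments must be at least 1")
    instances = math.ceil(concurrent_policies / policies_per_instance)
    return {
        "instances": instances,
        "vcpu": instances * 2,    # ~2 vCPU per instance
        "ram_gb": instances * 4,  # ~4 GB RAM per instance
    }


print(config_engine_sizing(6))     # conservative: 2 policies per instance
print(config_engine_sizing(6, 3))  # optimistic: 3 policies per instance
```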

Contact NetBox Labs for guidance on configuring multiple config analysis engine instances for your deployment.


Summary: Quick Reference

Guideline                                               Why
Scope every policy to a site                            Prevents any engine from processing your entire fleet at once
Role-segment intent policies                            Linear speedup, targeted checks
Don't role-segment graph policies                       Each segment rebuilds the full site graph
Consolidate config rules into fewer policies            Avoids multiplying fixed snapshot overhead
Separate graph checks from intent checks                Enables independent scheduling by cost tier
Use device targeting for pre-change intent validation   Sub-second feedback on changed devices
Use full-site runs for graph and compliance audits      Graph engine needs the full site regardless
Add workers for parallelism                             Linear wall-clock reduction
Add config engine instances for config throughput       Enables concurrent config analysis