Scaling Best Practices

NetBox Validation uses three engines (intent, config, and graph), each with different scaling characteristics. Understanding how each engine scales helps you design policies, schedule runs, and size infrastructure for deployments from hundreds to tens of thousands of devices.

Policy Design by Engine

The most important scaling decision is how you structure policies per engine.

Intent Engine

The intent engine queries the NetBox data model directly and scales linearly with device count. More devices means proportionally more time, with no fixed per-policy overhead.

Design principle: Segment freely by role. A spine policy, a leaf policy, and a server policy will each run faster than one combined policy covering all roles. More policies with narrower scope means faster individual runs and fewer irrelevant checks.

Approximate throughput:

Lightweight policies (5-6 checks): ~25 ms per device
Heavy policies (10+ checks with topology queries): ~200-250 ms per device

At 1,000 devices, a lightweight intent policy completes in under 30 seconds. At 2,000 devices, under a minute.
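The throughput figures above imply a simple linear estimate. The sketch below is illustrative arithmetic only: `intent_run_seconds` is a hypothetical helper, not part of NetBox Validation, and it assumes the cost is purely linear with no fixed per-policy overhead, as described above.

```python
def intent_run_seconds(device_count: int, ms_per_device: float = 25.0) -> float:
    """Estimate intent-engine run time, assuming purely linear per-device cost.

    The default of 25 ms per device is the approximate figure quoted for
    lightweight policies; pass 200-250 for heavy policies.
    """
    if device_count < 0:
        raise ValueError("device_count must be non-negative")
    return device_count * ms_per_device / 1000.0


print(intent_run_seconds(1_000))         # 25.0 (lightweight, 1,000 devices)
print(intent_run_seconds(2_000))         # 50.0 (lightweight, 2,000 devices)
print(intent_run_seconds(1_000, 250.0))  # 250.0 (heavy, upper bound)
```

Treat the result as a planning estimate; observed run times on your own data should take precedence.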

Config Engine

The config engine renders device configurations and sends them to the config analysis engine for structural analysis. It has significant fixed overhead per policy (~30-40 seconds for snapshot construction) plus linear per-device cost for config rendering (~150-300 ms per device at scale).

Design principle: Consolidate rules into fewer policies. Adding more rules to an existing config policy is nearly free. Splitting the same rules across multiple policies multiplies the fixed overhead.

Config Policies	Rules Per Policy	Total Rules	Approximate Time
1	30	30	~35-40s
3	10	30	~105-120s
6	5	30	~210-240s

Only split config policies when different sites need different config rule sets.

A cross-policy config cache reduces overhead when running multiple config-dependent policies in the same session. The first policy pays the snapshot cost; subsequent policies reuse the cached snapshot.
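The figures in the table above follow from multiplying the per-policy snapshot overhead by the number of policies. The sketch below reproduces that arithmetic. It is a hypothetical helper, not product code: it uses the 35-40 second range from the table, and it assumes every policy pays the snapshot cost in full (no cached snapshot is reused) and ignores per-device rendering time.

```python
def config_fixed_overhead_seconds(policy_count, snapshot_range_s=(35, 40)):
    """Return (low, high) fixed overhead in seconds for N config policies.

    Assumes each policy builds its own snapshot, so the overhead
    multiplies with the number of policies.
    """
    if policy_count < 0:
        raise ValueError("policy_count must be non-negative")
    low, high = snapshot_range_s
    return policy_count * low, policy_count * high


# Same 30 rules, split three different ways.
for policy_count, rules_per_policy in [(1, 30), (3, 10), (6, 5)]:
    low, high = config_fixed_overhead_seconds(policy_count)
    total_rules = policy_count * rules_per_policy
    print(f"{policy_count} policies, {total_rules} rules: ~{low}-{high}s")
```

The rule count never enters the calculation, which is the point: only the number of policies drives the fixed cost.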

Graph Engine

The graph engine builds an in-memory infrastructure dependency graph from NetBox data models. It loads all active devices at the scoped sites to trace complete power, network, and circuit dependency chains -- regardless of which devices are targeted by the run.

Design principle: Keep site-scoped, don't role-segment. Device targeting does not reduce graph engine time because the full site graph must be built regardless. Role-segmented runs are counterproductive -- three role-segmented runs each trigger a separate graph build and can take 5x longer than a single site-scoped run.

Run Scope	Engine Behavior
Full site	Builds graph once, evaluates all devices
Role-targeted	Builds full site graph anyway, filters results to targeted role
3 role-segmented runs	Builds full site graph 3 separate times

Scoping Policies for Scale

Scope Every Policy to a Site

This is the single most important scaling recommendation.

An unscoped policy at a 10,000-device organization forces every engine to process all 10,000 devices. Site-scoped policies process only the devices at each site, typically 100-1,000 devices.

For graph engine policies, the impact is especially significant: the graph builder loads all devices at scoped sites. An unscoped policy at a multi-site organization loads devices from every site into one graph.

Avoid: 1 policy x 10,000 devices = 1 very slow run
Better: 10 policies x 1,000 devices each = 10 fast runs (parallelizable)
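As a rough illustration of the difference, the sketch below groups a made-up inventory by site to show how many devices each site-scoped policy would process. The inventory, site names, and `group_by_site` helper are all hypothetical; in practice the scope is set on the policy itself.

```python
from collections import defaultdict

# Hypothetical inventory: (device name, site) pairs.
inventory = [
    ("leaf-01", "dc-east"),
    ("leaf-02", "dc-east"),
    ("spine-01", "dc-east"),
    ("leaf-01", "dc-west"),
    ("srv-01", "dc-west"),
]


def group_by_site(devices):
    """Group device names by site, one group per site-scoped policy."""
    groups = defaultdict(list)
    for name, site in devices:
        groups[site].append(name)
    return dict(groups)


for site, names in sorted(group_by_site(inventory).items()):
    print(f"{site}: {len(names)} devices in scope")
```

An unscoped policy would process all five devices in one run; the site-scoped policies process three and two, and can run in parallel.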

Role Segmentation: Intent Only

Role-based policies are excellent for intent checks. Different device roles have different check profiles -- spine BGP checks don't apply to servers. Creating role-specific intent policies:

Reduces per-run device count (linear speedup)
Avoids running irrelevant checks (fewer false positives)
Enables targeted scheduling (spine policies on every change, server policies weekly)

Do not role-segment graph or mixed-engine policies. Each role-segmented run triggers a separate graph build and config snapshot, so the role-segmented runs together take significantly longer than a single site-scoped run.

Separate Graph Checks Into Dedicated Policies

Graph checks have a distinct cost profile: the cost is roughly constant per site, because the full site graph is built no matter how many devices a run targets. Separating them from intent-only rules lets you:

Schedule graph audits less frequently (weekly vs. per-merge)
Avoid graph overhead when only intent checks are needed
Clearly attribute execution time to each engine

Targeted Runs and Change Validation

When Device Targeting Helps

Device targeting (target_devices, target_sites, target_tags on a run) narrows the scope to specific devices. Its benefit depends on the engine:

Engine	Benefit from Device Targeting
Intent	Linear savings proportional to device reduction
Config	Partial -- fewer configs to render, but fixed snapshot overhead remains
Graph	None -- loads all site devices for dependency analysis regardless
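The table above can be restated as a small model of what each engine still has to do when a run is targeted. This is a sketch under stated assumptions: `targeting_effect` and `TargetingEffect` are hypothetical names, and the 35-second snapshot figure is taken from the approximate config engine overhead quoted earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetingEffect:
    devices_processed: int   # devices the engine must still touch
    fixed_overhead_s: float  # per-run cost that targeting cannot remove


def targeting_effect(engine, site_devices, targeted_devices, config_snapshot_s=35.0):
    """Model how device targeting changes the work done by each engine."""
    if not 0 <= targeted_devices <= site_devices:
        raise ValueError("targeted_devices must be between 0 and site_devices")
    if engine == "intent":
        # Linear savings: only the targeted devices are checked.
        return TargetingEffect(targeted_devices, 0.0)
    if engine == "config":
        # Fewer configs to render, but the snapshot overhead remains.
        return TargetingEffect(targeted_devices, config_snapshot_s)
    if engine == "graph":
        # The full site graph is loaded regardless of targeting.
        return TargetingEffect(site_devices, 0.0)
    raise ValueError(f"unknown engine: {engine!r}")


for engine in ("intent", "config", "graph"):
    print(engine, targeting_effect(engine, site_devices=1_000, targeted_devices=10))
```

For a 10-device target at a 1,000-device site, only the graph engine still processes all 1,000 devices.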

Pre-Change Validation

For branch-merge and change-request triggers, targeting only the devices affected by the change provides fast feedback for intent checks. A merge touching 10 devices at a 1,000-device site completes a lightweight intent policy in well under a second (10 devices at ~25 ms each); a heavy policy at ~200-250 ms per device takes roughly 2-2.5 seconds.

Config and graph policies benefit less from targeting. Config validation still pays the fixed snapshot overhead. Graph validation loads the full site regardless.

When to Use Full-Site Runs

Graph-heavy policies: targeting adds scope-intersection overhead without reducing graph build time
Scheduled compliance audits: full-site runs ensure complete compliance score coverage
After infrastructure changes: when you need compliance scores updated for all devices

Scheduling Recommendations

Match check frequency to engine cost and urgency. The right schedule depends on your infrastructure size, complexity, and operational needs.

Tier 1: Every Change (Seconds)

What: Intent-only policies with device targeting
When: Branch merge trigger or change request trigger
Scope: Only changed devices
Typical time: 1-10 seconds per policy
Example policies: Data Quality, Addressing & IPAM, Naming & Standards

Tier 2: Daily (Minutes)

What: Intent + config baseline per site
When: Daily cron schedule
Scope: Full site
Typical time: 2-5 minutes per site (at ~250 devices)
Example policies: Config Analysis Baseline, Leaf/Spine Intent Baseline

# Cron schedule on policy
schedule: "0 6 * * *"   # Daily at 6 AM

Tier 3: Weekly (10-30 Minutes)

What: Full compliance + graph resilience audit
When: Weekly cron schedule
Scope: Full site
Typical time: 10-30 minutes per site (at ~250 devices)
Example policies: CLOS Fabric Design, Power/Network Resilience, compliance framework packs

# Cron schedule on policy
schedule: "0 2 * * 0"   # Weekly on Sunday at 2 AM

These tiers are approximate starting points. Larger sites or policies with more checks will take proportionally longer. Adjust based on your observed run times.
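One way to apply the tiers is to pick a schedule from the engines a policy uses. The sketch below encodes that rule of thumb; `recommended_schedule` is a hypothetical helper, the engine names are plain strings chosen for illustration, and the cron expressions are the ones shown in the tier examples.

```python
from typing import Optional, Set, Tuple

DAILY_CRON = "0 6 * * *"   # Daily at 6 AM
WEEKLY_CRON = "0 2 * * 0"  # Weekly on Sunday at 2 AM


def recommended_schedule(engines: Set[str]) -> Tuple[str, Optional[str]]:
    """Map the engines a policy uses to a tier and an optional cron schedule."""
    unknown = engines - {"intent", "config", "graph"}
    if unknown or not engines:
        raise ValueError(f"expected intent/config/graph, got {sorted(engines)}")
    if "graph" in engines:
        # Most expensive engine wins: audit weekly.
        return ("tier 3: weekly", WEEKLY_CRON)
    if "config" in engines:
        return ("tier 2: daily", DAILY_CRON)
    # Intent-only policies run on merge or change-request triggers, not cron.
    return ("tier 1: every change", None)


print(recommended_schedule({"intent"}))
print(recommended_schedule({"intent", "config"}))
print(recommended_schedule({"intent", "graph"}))
```

The costliest engine in a policy decides the tier, which is one more reason to keep graph checks in dedicated policies.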

Scaling with Multiple Workers

Worker Parallelism

Site-scoped policies naturally distribute across RQ workers. Each worker processes one policy run at a time, with no shared state between workers.

Total wall-clock time ~ Total execution time / Number of workers

If a full validation suite takes 20 minutes on one worker, adding a second worker brings it to ~10 minutes, and four workers bring it to ~5 minutes, provided the suite consists of enough independent policy runs to keep every worker busy.
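The approximation holds best when there are many runs of similar length. The sketch below simulates distributing runs across workers with a greedy longest-first assignment. This is an illustrative scheduling model, not a description of how RQ actually dispatches jobs, and the run durations are made up.

```python
import heapq


def wall_clock_minutes(run_minutes, workers):
    """Estimate wall-clock time by giving each run to the least-loaded worker.

    Runs are assigned longest first; each worker handles one run at a time.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    loads = [0.0] * workers  # a list of equal values is already a valid heap
    for duration in sorted(run_minutes, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + duration)
    return max(loads)


# Hypothetical suite: ten site-scoped runs of 2 minutes each (20 minutes total).
suite = [2.0] * 10
print(wall_clock_minutes(suite, 1))  # 20.0
print(wall_clock_minutes(suite, 2))  # 10.0
print(wall_clock_minutes(suite, 4))  # 6.0
```

With four workers the result is 6 minutes rather than 5, because ten indivisible runs cannot be split evenly across four workers. A single long run also sets a floor on wall-clock time regardless of worker count.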

Config Analysis Engine Capacity

The config analysis engine processes one snapshot at a time per instance. Scale by adding more instances:

1 instance per 2-3 concurrent config-engine policies
Each instance: ~2 vCPU + 4 GB RAM
Workers x instances = maximum concurrent config engine throughput
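The sizing guidance above reduces to a short calculation. The helper below is a hypothetical sketch of that arithmetic, using the per-instance figures from the list; it is not a substitute for deployment-specific guidance.

```python
import math

VCPU_PER_INSTANCE = 2    # approximate figure from the guidance above
RAM_GB_PER_INSTANCE = 4  # approximate figure from the guidance above


def config_engine_sizing(concurrent_policies, policies_per_instance=2):
    """Estimate instance count and resources for concurrent config policies.

    policies_per_instance should be 2 (conservative) or 3, per the guidance.
    """
    if concurrent_policies < 0 or policies_per_instance < 1:
        raise ValueError("invalid sizing inputs")
    instances = math.ceil(concurrent_policies / policies_per_instance)
    return {
        "instances": instances,
        "vcpu": instances * VCPU_PER_INSTANCE,
        "ram_gb": instances * RAM_GB_PER_INSTANCE,
    }


print(config_engine_sizing(6))     # {'instances': 3, 'vcpu': 6, 'ram_gb': 12}
print(config_engine_sizing(6, 3))  # {'instances': 2, 'vcpu': 4, 'ram_gb': 8}
```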

Contact NetBox Labs for guidance on configuring multiple config analysis engine instances for your deployment.

Summary: Quick Reference

Guideline	Why
Scope every policy to a site	Prevents any engine from processing your entire fleet at once
Role-segment intent policies	Linear speedup, targeted checks
Don't role-segment graph policies	Each segment rebuilds the full site graph
Consolidate config rules into fewer policies	Avoids multiplying fixed snapshot overhead
Separate graph checks from intent checks	Enables independent scheduling by cost tier
Use device targeting for pre-change intent validation	Sub-second feedback on changed devices
Use full-site runs for graph and compliance audits	Graph engine needs the full site regardless
Add workers for parallelism	Linear wall-clock reduction
Add config engine instances for config throughput	Enables concurrent config analysis
