This article is part of a 4-part series on Navigating Network Automation with NetBox. See the other parts covering: the Network Lifecycle Overview, the Design Stage, and the Build & Deploy Stages.
Today we’ve finally arrived at the Operate stage of the network lifecycle. As I mentioned in the introduction blog, the purpose of this series is to show how network automation supported by NetBox can be applied across the network lifecycle, following a workflow perspective. After kicking off the series with the Design stage, we then covered the Build and Deploy stages in the last blog. We’ve seen great variability in the approaches that teams take to designing, building and deploying their networks, and the Operate stage is no different.
In the introduction I also said that I wanted this to be a conversation and not a lecture, and I’m pleased to report that this has been another very collaborative effort! Not only did many of the initial conversations I had for the Design, Build and Deploy stages flow into this last stage, but I was also fortunate enough to add new co-conspirators to the list. I’d like to extend a big thank you to Michael Costello, board member of NANOG, Kevin Kamel at Selector.ai, Bill Borsari at WWT, Simon Kuhball at Akquinet, Steinzi at Advania, Alex Ankers at CACI, Arne Opdal at ATEA, Adam Kirchberger at Nomios, Kaon Thana at The New York Times, Roman Dodin and Walter De Smedt at Nokia, Carl Buchmann, Julio Perez & Fred Hsu at Arista, Mike Wessling at Area42, Alex Bortok at Netreplica, Otto Coster at Netpicker, all those who preferred to remain anonymous, and of course our very own Rich Bibby, Jordan Villarreal and Chris Russell!
To intent based network, or to not intent based network. Is that the question? Throughout this blog we will be assuming that operations teams are using NetBox as the “intended state” of their network, in which changes are made in NetBox first and then propagated to the network via the mechanisms that we’ll discuss later in this blog. However, despite many people passionately stating that this is the “one true way” to use NetBox, we see teams successfully using NetBox in what we call “Observed State Mode”. Observed State Mode is a perfectly valid pattern for using NetBox and often serves as a stepping stone to Intended State Mode. I’ll be covering all of this in more detail on our blog on May 28th, so stay tuned.
Getting started: Populating NetBox
In the previous blogs, we’ve assumed that you have already modeled your network in NetBox, but how does that data actually get in there in the first place? We see many approaches, but they can be abstracted into some useful patterns.
First we need to consider how we’re going to learn what is already in the network, and this is a question we often encounter when we’re onboarding NetBox Cloud and NetBox Enterprise customers who aren’t already using Open Source NetBox.
Some teams maintain accurate network documentation but find that their tooling no longer fits their situation. We like to joke internally that spreadsheets are NetBox’s main competition, but there’s truth in that statement. The fact is that spreadsheets are incredibly useful in many situations and are the go-to tool for many tasks in almost every discipline, but they often don’t scale very well.
If the existing documentation is creaking under the strain but otherwise up to date, then this is a great place to start, and the import typically involves one of two approaches: uploading CSV files or using the API.
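For the API route, a few lines of Python with the pynetbox client are often all it takes. The sketch below is illustrative only: the URL, token and object names are hypothetical, and the referenced site, device type and role must already exist in NetBox.

```python
import pynetbox

# Hypothetical NetBox URL and API token.
nb = pynetbox.api("https://netbox.example.com", token="0123456789abcdef")

# Look up the objects the new device will reference; these must already exist.
site = nb.dcim.sites.get(slug="dc1")
device_type = nb.dcim.device_types.get(model="EX4300-48T")
role = nb.dcim.device_roles.get(slug="access-switch")

# Create the device (older NetBox versions expect "device_role" instead of "role").
nb.dcim.devices.create(
    name="sw-dc1-01",
    site=site.id,
    device_type=device_type.id,
    role=role.id,
)
```

The CSV route follows the same shape: the columns map onto the same fields, and NetBox validates the references on import.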
Sometimes however, we find that either there is no existing documentation of the network, or the documentation that does exist is either incomplete or out of date. In these cases organizations often reach for a network discovery tool to help them understand what is in their network.
Network Discovery has been a major area of focus at NetBox Labs and just this year we’ve announced integrations with Forward Networks, IPFabric, Slurpit and SuzieQ for network discovery.
The initial population of NetBox is a great opportunity for many teams to go through a much-needed clean-up of the network and address any technical debt that they’ve built up over time. As one interviewee said, “you’d be surprised, but simple things like inconsistencies in device naming conventions can be a constant source of frustration!” Whether it’s something as seemingly simple as device naming, or something more involved like finding out that we have a bunch of devices in the network that we didn’t know about, this inventory exercise before the initial import to NetBox can already drive a huge amount of value for organizations.
Because it can involve moving from an insufficiently documented and understood network to a well understood one, this stage can often be a lot of work. The work pays off twice over, though: organizations come away with a thorough understanding of their network, and the exercise sets the precedent for the operations team staying in control of the intended state of the network through Reconciliation, which we’ll cover below.
Day to day operations
So we have a network and we have our representation of the network in NetBox. Let’s look at what the day to day of the operations team looks like in this situation. While the full gamut of how operations teams do their work is beyond the scope of this article, there are a couple of patterns that are relevant.
Broadly speaking there are two ways the operations team can be called upon: the network needs to be changed, or the network appears to be broken.
Changes to the network
As we saw in the Build and Deploy blog, it is incredibly useful to have an up-to-date view of the network when reasoning about changes that need to be made, and that’s exactly what most of the teams I spoke to use NetBox for in this situation.
For routine changes there is also a pattern of delegating changes deemed low risk and low impact to others outside of the networking team. For example, Dartmouth College successfully reduced load on their network operations team by granting service desks fine-grained permissions in their NetBox instance so that they could handle a subset of low-risk day to day tickets.
Problems in the network
Of course the other way we can be called upon is if something appears to be broken in the network. This could be from our Observability or Assurance tooling alerting us to an issue in the network, and we’ll cover that in more detail below. It could also be a ticket from a user who believes that the network is the source of an issue that they are facing. In this case it can be very helpful to be able to see what the network should be doing, and NetBox is a great place to start orienting oneself with the network before diving into solving the issue.
As part of diagnosing issues our monitoring can be a useful source of information, but sometimes it’s also useful to be able to run ad hoc commands to figure out what’s going on. While some teams are happy to log into the devices to do this, organizations who’ve been bitten by “quick fixes” in the past often prefer to lock down access to their devices and use automation tools to run troubleshooting commands and collect the output. We’ll see some of those in the following section.
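For teams that go the automated route, even a short script can collect diagnostic output without handing out device logins. Below is a minimal sketch using Netmiko; the device details and credentials are placeholders, and the target address could just as easily come from the device’s primary IP in NetBox.

```python
from netmiko import ConnectHandler

# Placeholder connection details; in practice the host could be looked up
# from the device's primary IP in NetBox.
switch = {
    "device_type": "cisco_ios",
    "host": "10.0.0.1",
    "username": "netops",
    "password": "********",
}

with ConnectHandler(**switch) as conn:
    # Read-only troubleshooting command; the output can be attached to the ticket.
    output = conn.send_command("show interfaces counters errors")

print(output)
```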
Reporting
Another crucial aspect of network operations is reporting. This involves generating and sharing insights on the state of the network, changes made, and any issues encountered either for ad-hoc internal reporting or as part of more formalized auditing processes.
Among the interviewees I spoke to, approaches to reporting varied greatly and were largely dependent on the industry the interviewee was working in and the associated compliance and regulatory frameworks that were in place. Typically reporting duties require data from many sources across the business, but NetBox helps in two areas. NetBox Reports, introduced in version 2 and combined with NetBox Custom Scripts in version 3, allows NetBox users to generate useful reports. More recently NetBox Event Streams entered the NetBox Labs Private Preview stage, including an integration for Splunk Enterprise that gives operators the ability to easily generate time-based reports of NetBox data.
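To give a sense of what that looks like in practice, here is a minimal sketch in the Report style mentioned above (as used in NetBox 2.x and 3.x); the specific check, that every active device has a primary IP assigned, is purely illustrative.

```python
# reports/operational.py — a minimal, illustrative report
from dcim.choices import DeviceStatusChoices
from dcim.models import Device
from extras.reports import Report


class DevicePrimaryIPReport(Report):
    description = "Check that every active device has a primary IP assigned"

    def test_primary_ip(self):
        # Every method whose name starts with "test_" runs when the report executes.
        for device in Device.objects.filter(status=DeviceStatusChoices.STATUS_ACTIVE):
            if device.primary_ip:
                self.log_success(device)
            else:
                self.log_failure(device, "No primary IP assigned")
```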
Assurance, Monitoring and Alerting
One of the key operational challenges that teams face is knowing that their network is functioning as intended, and most often this means that organizations reach for observability and assurance tooling.
Assurance and Reconciliation
Getting data into NetBox in the first place is a huge leap forward for many organizations. Once this is done, however, the focus is then on making sure that NetBox and the network stay in sync. So what could lead to NetBox and the network diverging? As we saw in the Design, Build and Deploy stages, many teams drive changes to the network through NetBox, and those changes can hit complications during the build stage, leading to divergence. Catching this divergence is a key part of the Network Ready For Use (NRFU) stage that we covered in the Build and Deploy stage blog.
It’s not uncommon for unplanned changes to happen in the network too: someone connects a switch (you did enable BPDU Guard right…), a device dies, an engineer SSHes into a device to make a quick change to a VLAN and doesn’t update NetBox. All of these events and more can mean that what we think is in the network, because it’s in NetBox, actually isn’t correct in the network itself.
This can also happen the other way around. Perhaps someone added some devices in NetBox in the wrong state too early. In a future deep-dive we’ll cover the exact steps that teams follow when planning network changes in NetBox to avoid false positive reconciliation alerts, but broadly teams either use status fields, or more advanced workflows involving staged changes.
In practice organizations may use different tools to catch divergence at the device and configuration levels. While network discovery tools can read parts of device configurations to find out what’s in the network, there are also specialized workflows that configuration assurance tools focus on which warrants separate discussion.
Most organizations have policies dictating what should and should not be in a device configuration. For example, these could include vendor-provided security best practices, safety checks like enforcing BPDU Guard on access ports, or other company-specific rules. Some organizations create their own scripting for this use case, relying on open source libraries like NAPALM, Nornir or Netmiko to pull information from the devices, but there are also commercial tools that typically combine config backup functionality with configuration assurance testing.
The general pattern is for these tools to periodically run config backups on the devices, and then run those backups through various policy tests like those mentioned above. One obvious test is to check that the config on a given device matches what was originally pushed to the device, and we’ll touch on generating device configs from NetBox in the Automation stage below.
For an example of configuration assurance testing, check out NetBox Automation Heroes: Proactive and Reactive configuration assurance with NetBox and Netpicker.
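To make the policy-testing pattern concrete, here is a deliberately simplified sketch of the kind of check such tools run against a config backup. It assumes an IOS-style configuration and a hypothetical backup file path, and the parsing is far cruder than what a real assurance tool does.

```python
import re


def access_ports_missing_bpdu_guard(config_text: str) -> list[str]:
    """Return access interfaces that lack BPDU Guard (very rough IOS-style parsing)."""
    offenders = []
    # Split the backup into per-interface blocks.
    for block in re.split(r"\n(?=interface )", config_text):
        if not block.startswith("interface "):
            continue
        if ("switchport mode access" in block
                and "spanning-tree bpduguard enable" not in block):
            offenders.append(block.splitlines()[0].removeprefix("interface ").strip())
    return offenders


# Hypothetical backup produced by the periodic config-backup job.
with open("backups/sw-dc1-01.cfg") as fh:
    for interface in access_ports_missing_bpdu_guard(fh.read()):
        print(f"Policy violation: {interface} has no BPDU Guard enabled")
```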
Monitoring
A detailed overview of everything included in observability is outside the scope of this blog, but most commonly what we mean when we say observability is monitoring. Approaches to monitoring vary, but the common pattern is to see devices grouped by type and have monitoring policies associated with those groups. For example, we may have one monitoring group called ‘Core Switches’ which is set to monitor devices for Uptime, Interface Errors and CPU Usage, and another group called ‘Access Points’ which is set to monitor Signal Strength, Client Count and Bandwidth Usage.
Organizations are always on the lookout for ways to avoid alert fatigue: the state of becoming desensitized to frequent alerts, leading to a decreased response rate and potentially missing critical warnings. One source of unnecessary alerts is when our monitoring gets out of sync with our network. To solve this, companies frequently generate their monitoring configurations from NetBox directly. This way, when a change is made in NetBox to signal that a change should be made in the network, for example removing a device from the network, the monitoring is updated to the new state automatically and the operations team doesn’t get alerts for a device that we already know should no longer be monitored.
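Here is a minimal sketch of that idea, assuming pynetbox, a hypothetical “core-switch” role slug, and a monitoring system that can consume a file of targets (a Prometheus-style file_sd list is used purely as an example):

```python
import json

import pynetbox

# Hypothetical NetBox URL and token.
nb = pynetbox.api("https://netbox.example.com", token="0123456789abcdef")

targets = []
# One scrape target per active core switch, labelled with its site.
for device in nb.dcim.devices.filter(role="core-switch", status="active"):
    if device.primary_ip is None:
        continue  # skip devices that don't have a management address yet
    address = str(device.primary_ip.address).split("/")[0]
    targets.append({
        "targets": [address],
        "labels": {"device": device.name, "site": device.site.slug},
    })

# Re-generating this file whenever NetBox changes keeps monitoring in sync.
with open("netbox_targets.json", "w") as fh:
    json.dump(targets, fh, indent=2)
```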
Alerting
When an assurance or monitoring tool notices that something isn’t right, most frequently it creates an alert for the operations team. There’s another aspect of alert fatigue here: it’s not just the frequency of the alerts that can cause desensitization, but also the usefulness of the alerts in pointing operators to the source of the issue. This usefulness generally depends on how specific the alert is, and how much context is provided in the alert.
Alert: Device performance issue detected.
Above is an example of an alert that is neither specific nor rich in context. This puts a lot of burden on the responding engineer to figure out what’s going on.
Alert: High CPU usage detected on core router R1.
The second alert is much more specific than alert 1, so that’s a step in the right direction, but we could also provide more context. For example, if we add a link from our alert to the affected object in NetBox, the operator can immediately jump in to see all the relevant context for the affected device.
Alert: High CPU usage detected on core router R1.
Check details here: [NetBox Link](http://netbox.example.com/device/R1)
That’s much better, but we can go a step further and actually pull the context from NetBox when creating the alert, so that the responding engineer has fewer swivel-chair operations to perform to figure out what’s going on.
Alert: High CPU usage detected on core router R1.
Device Details:
– Device Name: R1
– Location: Data Center 1, Rack 7
– IP Address: 10.0.0.1
– Uptime: 120 days
Check full details here: [NetBox Link](http://netbox.example.com/device/R1)
More advanced organizations take alert fatigue seriously and frequently enrich their alerts with data from NetBox because they know that as they increase the specificity and context of their alerting, their Mean Time to Resolve (MTTR) for issues decreases, meaning less downtime.
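Enrichment of this kind is usually wired into the alerting pipeline itself, but the lookup is simple enough to sketch with pynetbox; the base URL, token and device name below are placeholders.

```python
import pynetbox

NETBOX_URL = "https://netbox.example.com"  # placeholder
nb = pynetbox.api(NETBOX_URL, token="0123456789abcdef")


def enrich_alert(device_name: str, message: str) -> str:
    """Append NetBox context (site, rack, primary IP, link) to a raw alert message."""
    device = nb.dcim.devices.get(name=device_name)
    if device is None:
        return message  # fall back to the raw alert if the device isn't documented
    lines = [message, f"Device Name: {device.name}", f"Site: {device.site.name}"]
    if device.rack:
        lines.append(f"Rack: {device.rack.name}")
    if device.primary_ip:
        lines.append(f"IP Address: {device.primary_ip.address}")
    lines.append(f"Check full details here: {NETBOX_URL}/dcim/devices/{device.id}/")
    return "\n".join(lines)


print(enrich_alert("R1", "Alert: High CPU usage detected on core router R1."))
```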
Automation
On the lower half of the reference architecture we have “Automation”. This is primarily concerned with pushing the intended state of the network from NetBox to the network itself. We might revisit this naming over time, as there are examples of automation all over the reference architecture and we’ll touch on that in the Orchestration section below, but for now just think “deploying the configs.”
Generating Device Configs
How do we create configurations? NetBox 2.4.0 introduced Configuration Contexts for associating “arbitrary data with a group of devices to aid in their configuration”, and NetBox 3.5.0 introduced Configuration Template Rendering and Configuration Templates.
As we saw in the Build and Deploy blog, Jinja2 templates are pretty much ubiquitous at this point. NetBox Configuration Templates allow us to associate Jinja2 templates with devices or groups of devices, and those templates are populated with data from NetBox when using Configuration Template Rendering. Configuration Contexts can be used to add additional data, shared across devices, for use in device config generation. For example, we might use a Configuration Context to define the syslog servers that should be present in our rendered device configurations.
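The sketch below shows the same idea outside of NetBox, rendering a trimmed-down Jinja2 template locally with pynetbox and the device’s config context; the syslog_servers key is a hypothetical value you would define in a Configuration Context, and the URL, token and device name are placeholders.

```python
import pynetbox
from jinja2 import Template

nb = pynetbox.api("https://netbox.example.com", token="0123456789abcdef")  # placeholders

# A trimmed-down template; in NetBox this would normally live in a
# Configuration Template (possibly backed by a Git Remote Data Source).
TEMPLATE = Template(
    "hostname {{ device.name }}\n"
    "{% for server in config_context.syslog_servers %}"
    "logging host {{ server }}\n"
    "{% endfor %}"
)

device = nb.dcim.devices.get(name="R1")  # hypothetical device name
rendered = TEMPLATE.render(device=device, config_context=device.config_context)
print(rendered)
```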
NetBox 3.5.0 also introduced Remote Data Sources which “represent some external repository of data which NetBox can consume”, specifically Local Directories, S3 buckets, or Git repositories. Often organizations would like to keep their configuration templates and configuration contexts in Git so they can collaborate on them with GitHub-like workflows. NetBox Remote Data Sources allow users to do exactly that.
Sometimes organizations either already have config generation functionality built in other tooling, or they prefer to keep their config generation code outside of NetBox. A common example of this involves using the NetBox Ansible Inventory Collection and Ansible Playbooks to generate device configurations outside of NetBox.
Pre Change Testing
Andrew Lerner of Gartner reported that 65% of changes to the network are still manual and that 2% of all manual changes to the network result in an error. 2% might seem low, but at scale that can add up to a lot of errors, and that’s before we even take into account the impact of those errors.
Because of this, many of the organizations I spoke to are introducing “pre-change testing” into their change processes to reduce the impact of bad changes to the network. While this is a growing practice, a recent survey by Shamus McGillicuddy at Enterprise Management Associates showed that only 11% of teams run pre-change testing on 100% of their changes to the network.
Generating configurations automatically can both save time and reduce errors caused by manual changes, but we also need to check the configs that have been generated to get some certainty that they will not break things. The types of checks involved in pre-change testing vary in practice but can be best understood by splitting them into two buckets.
Compliance and Security
Ensuring that network changes comply with internal policies and security standards is crucial to maintaining a secure and reliable network. For example, we may want to check that the configs we’ve generated adhere to predefined security and operational policies, including hardening policies or regulatory frameworks like HIPAA. Netpicker.io is an example of a tool that is tightly integrated with NetBox and can be used in this space to run a set of tests, or policies, against generated configs before they are pushed to the network.
Impact Analysis
When performing impact analysis we’re asking our tooling to tell us if the changes we’re about to push to the network will break anything. Forward Networks and IPFabric are just two examples of integrated NetBox Labs partners who are often deployed in this situation. These tools simulate the proposed changes, analyzing the potential effects on connectivity, performance, and network stability and reporting on those for human validation before the changes are pushed to the network.
There are many other tools that can be used in this area including PyATS, SuzieQ, and Batfish but the general pattern holds across all of them.
Configuration Deployments
Once we have our tested configurations ready to be pushed to the network there are broadly two categories of options: pushing the configs, or using a controller based system.
Pushing the configs
There are many methods for interacting with network devices to push configs and they are best understood by thinking about how much they abstract the underlying device details.
At the most specific level we have vendor libraries like Juniper’s PyEZ, Cisco’s various tools, Arista’s eAPI, and HPE’s Comware Python SDK. Each of these allows the user to push the newly generated and tested configurations to the devices, but leaves much of the surrounding logic up to the user.
One step up we have vendor-neutral tools that sometimes use vendor-specific libraries under the hood, and which seek to provide a single interface for deploying configurations to devices from multiple vendors, with examples including NAPALM and Nornir. These tools provide an abstraction layer over the devices through generalized functions like NAPALM’s load_merge_candidate() and commit_config(), which are then implemented by device-specific drivers for each vendor. Nornir goes a step further and provides additional functionality for parallelizing tasks so that you can update multiple devices simultaneously.
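Here is a minimal sketch of that NAPALM workflow, assuming an Arista EOS device, placeholder credentials, and a config file that has already been rendered from NetBox:

```python
from napalm import get_network_driver

# Placeholder connection details; the driver name selects the vendor-specific backend.
driver = get_network_driver("eos")
device = driver(hostname="10.0.0.1", username="netops", password="********")

device.open()
with open("rendered/sw-dc1-01.cfg") as fh:          # config rendered from NetBox
    device.load_merge_candidate(config=fh.read())   # stage the change as a candidate

print(device.compare_config())                      # human-readable diff for review
device.commit_config()                              # apply (or discard_config() to abort)
device.close()
```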
A step further up in the abstraction layers we have generalized configuration management tools like Ansible and SaltStack. Tools like these are often also used to minimize the amount of change being applied to a device. For example, Ansible is commonly used to compare the running config on a device to the new config being pushed and only apply the differences, reducing the surface area for issues to occur and speeding up deployment times.
These vendor neutral and generalized configuration management tools are extremely popular in the NetBox community and as a result many well maintained integrations have emerged including the NetBox NAPALM plugin, the Nornir NetBox Inventory, and the Ansible NetBox Inventory Collection.
Controller Based Deployments
While the exact implementation details of controller based systems and zero touch provisioning systems vary, for the purposes of this article they are close enough to be considered together. Controller-based and zero touch provisioning (ZTP) systems add additional functionality into the config deployment stages, including in many cases offering rollback capabilities while providing centralized control and visibility.
In practice what this means is that instead of taking care of pushing the configs to the devices ourselves, we send them to an external system and let that system take care of making sure that the correct configurations land on the correct devices.
Post-Change Testing
In the Build and Deploy blog I introduced the concept of Network Ready for Use (NRFU) which is directly analogous to Post-Change Testing and refers to the series of activities and checks performed after network configurations have been applied to ensure that the network is functioning correctly and meets the intended design and operational criteria.
If you were thinking, “wait a minute, isn’t this the same as the Assurance, Monitoring and Alerting section we covered earlier?”, you’d be dead right, and this is why the network automation reference architecture is cyclical. In the post-change section we’re running all the same checks and balances that we’d run on a live network before concluding that the change was successful and moving on to the next job. It is however common to run some checks that are specific to the deployment, and these vary depending on the nature of the changes involved.
Another advantage of using NetBox as the intended state of the network is that once the configuration is correctly deployed, there’s no additional step to go back and update our network documentation, as it was already documented in our intended state.
Orchestration
Throughout this article we’ve seen many examples of tools interacting with one another, but we haven’t discussed how all this fits together. I’ll cover orchestration in more detail on an upcoming Network Automation Heroes episode where we’ll use StackStorm as an example, but as this was such a popular topic in the interviews for this series I wanted to give a quick overview to establish the mental model.
Scripting
In the cloud world the Go programming language, buoyed by the Kubernetes ecosystem, has become the de facto language, and while we’re seeing glimmers of Go adoption in the network automation space, Python is the language du jour, and because of that Python scripts are the most common approach to tying together disparate systems for automation.
Python scripts are a great way to get started quickly because they have a simple syntax and forgiving typing system, and enjoy a rich package ecosystem allowing users to stand on the shoulders of giants for nearly any task imaginable. Getting started quickly is particularly important for teams who might not have management buy-in for their automation activities, so showing something of value quickly is paramount.
Once we start writing a lot of Python scripts however, maintaining them can become problematic, especially in the absence of dedicated software engineers who have experience refactoring code to keep it clean as systems grow.
Workflow Engines
Workflow engines show up in many places for tying small individual steps together, and in fact those small steps are often Python scripts. I could have said CI/CD pipelines here, as they are just workflow engines underneath anyway, but the use of workflow engines is broader than configuration generation and deployment, where CI/CD tools are most often found. Just in this blog, for example, we’ve seen areas where workflow engines could be used to automatically generate and push monitoring configurations, where they could be used to automate exploratory diagnostic tasks, and more.
The tools used here vary a lot, but one way of thinking about their functionality is how much they give you out of the box when it comes to interacting with other systems. Jenkins is the perennial CI/CD tool that has been molded to run all sorts of workflows but offers little assistance out of the box; StackStorm is a little younger than Jenkins and includes community-maintained “Action Packs” to interact with NetBox and other tools; and then there are commercial offerings like our partner Itential, who have gone to great lengths to take a lot of the drudgery out of maintaining workflows and include advanced RBAC and reporting capabilities.
Event Driven Architectures
Another approach we see in advanced organizations is the use of Event Driven Architectures. This is covered in great detail in our webinar from late February, but is an area that is going to see a lot more traction for two reasons: simplicity and scalability.
In event driven architectures, events are passed around our orchestration system via a message bus: “publishers” put useful events on the message bus and “subscribers” listen for events that pertain to their function. This pattern means that the individual parts of our automation can remain simple and focused on a single purpose, making them easy to create, maintain and debug.
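As a minimal sketch of the pattern, assuming Redis as the message bus and a hypothetical channel name, a webhook receiver might publish NetBox device events while a monitoring worker subscribes and reacts to them:

```python
import json

import redis

bus = redis.Redis()  # assumes a Redis instance is acting as the message bus


def publish_device_update(device_name: str) -> None:
    """Publisher side: announce that a device changed in NetBox (e.g. from a webhook receiver)."""
    bus.publish("netbox.device.updated", json.dumps({"device": device_name}))


def run_monitoring_subscriber() -> None:
    """Subscriber side: react to device events, e.g. by regenerating monitoring config."""
    pubsub = bus.pubsub()
    pubsub.subscribe("netbox.device.updated")
    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        event = json.loads(message["data"])
        print(f"Regenerating monitoring config for {event['device']}")
```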
Conclusion
So that’s a wrap! This has been a hugely enjoyable process for me, especially because I got to chat with so many network automation co-conspirators along the way.
In the introduction blog we looked at the challenges facing network automation; in the Design stage we saw how companies are seeking to “shift left” parts of their processes in order to catch issues earlier, when they are less expensive to fix; in the Build and Deploy stage we saw how network operations teams have to collaborate with many parties to get physical changes to the network deployed; and today we saw how, for some more advanced organizations, the whole thing comes together as the intent-based cyclic whole we see in the Modern Network Automation Reference Architecture.
I hope you picked up something of value during this series, whatever stage of the network automation journey you are on. As I said in the Build and Deploy blog, I have been forced to leave out a huge number of very interesting details throughout for the sake of brevity, but fear not, now that this initial series is wrapped up we will be doing a ton of deep dives to get into the specifics and in fact they’ve already begun.
Check out our Network Automation Heroes playlist on YouTube to see the current episodes and subscribe for future updates, and subscribe to our newsletter to make sure you catch all the written deep dives to come.
Would you like to be a part of the conversation? Reach out on the NetDev Slack, on LinkedIn (markrobertcoleman), or on X (@mrmrcoleman).