This article is part of a 4-part series on Navigating Network Automation with NetBox. See the other parts covering: the Network Lifecycle Overview, the Design Stage, and the Operate Stage.
Today we’re focussing on the Build and Deploy stages of the network lifecycle. I had planned to make these separate blogs, but in practice they are fairly enmeshed and I will be going a layer deeper on deployments in the Operate blog, so it made more sense to present them in one go. As I mentioned in the introduction blog, the purpose of this series is to show how network automation supported by NetBox can be applied across the network lifecycle, following a workflow perspective. Last time round we explored the design stage and saw great variability in the approach that teams take to designing their networks. Today’s blog is no different, with organizations choosing different approaches based on the demands being placed on their teams, and we’ll also see that the patterns of shifting left and balancing different stakeholders’ needs are just as prevalent.
In the introduction I also said that I wanted this to be a conversation and not a lecture, and I’m pleased to report that this has been another very collaborative effort! Not only did many of the initial conversations I had for the Design stage flow into these later stages, but I was also fortunate enough to add a whole host of new collaborators to the list. I’d like to extend a big thank you to Michael Costello, board member of NANOG, Kevin Kamel at Selector.ai, Bill Borsari at WWT, Simon Kuhball at Akquinet, Steinzi at Advania, Alex Ankers at CACI, Arne Opdal at ATEA, Adam Kirchberger at Nomios, Kaon Thana at The New York Times, Roman Dodin, Timo Rasanen & Walter De Smedt at Nokia, Carl Buchmann, Julio Perez & Fred Hsu at Arista, Mike Wessling at Area42, Alex Bortok at Netreplica, Otto Coster at Netpicker, all those who preferred to remain anonymous, and of course our very own Rich Bibby, Jordan Villarreal and Chris Russell!
What happens in the Build stage?
In the design stage blog we saw that in practice the boundaries between the various stages are quite blurry, but it helps to establish the rule in order to explain the exceptions. In general the Build and Deploy stages are concerned with translating design into changes in the network that satisfy the requirements of the design, before “onboarding into operations”, or resuming Business as Usual (BAU) as it is sometimes referred to.
The teams involved in the Build and Deploy stages can vary greatly from one organization to the next. Some companies might have specialized datacenter operations teams, while smaller companies might have all stages done by the people in network operations. “Remoteness” is also a factor here: for example, a small network operations team might choose to use local contractors to make changes to a geographically distant network, rather than flying their people there.
The Build and Deploy Happy Path
As in the design stage blog, we’ll start with a simplified happy path and then see how various forces act to complicate matters. In researching this blog I found that naming conventions vary greatly depending on one’s background, so let’s cover some quick definitions: by Build I primarily mean the physical aspects of the network, and Deploy covers the logical network configuration.
In this simplified view our designs are handed to those doing the physical build for our networks, who then complete their work and hand the connected devices back to operations where the logical deployments can occur. Even in this simple description we start to get a sense of the collaborative nature of these stages.
In the design stage blog we saw how organizations strive for correctness in the design to try to overcome the exponentially increasing costs to fix issues later in the process. In practice however, there are many realities in the build process that are difficult to proactively address. We may find that the cable length we selected doesn’t work when someone tries to slide out a drawer, or the exact device model we specified isn’t available so another similar device is chosen. Some companies have entire racks pre-built and delivered fully cabled, but even here quality issues can creep in.
Sometimes our view of the actual state of the network when working in the design stage is incorrect. The teams doing the build out may arrive on site to find out that the intended state of the network in NetBox that was used for the design hasn’t been kept up to date with the actual state of the network. Cabling may have changed, or extra devices or patch panels have been added, etc.
The above diagram starts to capture the collaborative nature of these stages. While those doing the physical build out work will be primarily responsible for that stage, in practice the most effective organizations expect to have a constant line of communications open between the build teams and operations teams to overcome the challenges outlined above.
Inter-team coordination
In the Design blog we touched on greenfield and brownfield deployments and saw that in practice most network changes we’ll see are some form of brownfield. This distinction becomes very real during the physical build stage. Let’s use a couple of examples to get a handle on some of the variables.
The design requires that we upgrade the core switches in a site. This is clearly a brownfield situation: we’re changing the network that is currently being relied on by users. This is where change windows come into play, and while some architectures give us more freedom than others, we’re always looking to reduce interruptions to network functionality. What this means in practice is that the work described in the design stage may need to be broken up into safe chunks that can be carried out within pre-agreed change windows. Organizations that need fast throughput therefore constantly seek to keep their intended state of the network in NetBox as up-to-date as possible, so that inaccuracies carried into the design stage don’t surface as unforeseen issues during builds.
The scope of work assigned to a given change window can be a source of negotiation in many organizations. Risk-averse NetOps teams may prefer to pad the windows to create enough time for safe rollbacks, and progress-aware management may be inclined to push more in.
We can also find that the build stage tends to fan out across various work streams that are being completed by different people. The end result is a real project management task!
We saw how network homogeneity allows teams to take advantage of patterns like design templates in the previous blog, and those benefits manifest themselves at the build stage by allowing some organizations to order known sets of equipment all the way up to full preconfigured racks, or rows of racks. Even in this scenario however, issues do occur. These preconfigured packages have their own build processes with error bounds, although these error bounds do tend to get considerably tighter with repetition.
Translating the Source of Truth
So unforeseen issues in the build phase are close to inevitable, and teams address this with constant coordination during the build. This coordination is streamlined by having an objective source of truth like NetBox around which to base discussions, but commonly we see other tools used alongside NetBox at this stage, which introduces the need for a translation stage.
One purpose of this translation stage is to get the data into a format that the build team can use. For example, the way that cabling is presented in NetBox might not naturally lend itself to the process followed by those doing the cabling. For this reason build teams often generate their own spreadsheets to work from, although multiple interviewees told me that these spreadsheets are either already automatically generated from NetBox data, or there are plans to do that. Sometimes the teams doing the build out have broader responsibilities, such as DC operations teams, and these teams sometimes use separate tools to manage their Install, Move, Add, Change (IMAC) processes which also need to be populated with NetBox intended state data from the design stage.
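To make the idea of an automatically generated cabling sheet concrete, here is a minimal sketch. It assumes the intended cabling has already been exported from NetBox into a list of dictionaries; the field names, the sample records and the `cable_sheet` function are all hypothetical and chosen for illustration, not taken from any interviewee’s tooling.

```python
import csv
import io

# Hypothetical cable records, standing in for intended cabling exported from NetBox.
CABLES = [
    {"a_device": "leaf2", "a_port": "Ethernet1", "b_device": "spine1", "b_port": "Ethernet2", "length_m": 5},
    {"a_device": "leaf1", "a_port": "Ethernet2", "b_device": "spine2", "b_port": "Ethernet1", "length_m": 3},
    {"a_device": "leaf1", "a_port": "Ethernet1", "b_device": "spine1", "b_port": "Ethernet1", "length_m": 3},
]

FIELDS = ["a_device", "a_port", "b_device", "b_port", "length_m"]


def cable_sheet(cables):
    """Return CSV text with one row per cable, ordered by A-side device and port."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    # Sorting groups the work per device, which is closer to how a rack is cabled.
    for cable in sorted(cables, key=lambda c: (c["a_device"], c["a_port"])):
        writer.writerow(cable)
    return buffer.getvalue()


if __name__ == "__main__":
    print(cable_sheet(CABLES))
```

The sort order is the design choice worth discussing with the build team: a sheet ordered the way people physically work through a rack is usually more useful than one ordered the way the data happens to be stored.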
Build teams experience first hand the unforeseen issues mentioned above and have often introduced Pre Build Checks to make sure that what they’re being asked to build in the design can actually be achieved. These cautiously assume that the design does have issues and seek to catch those issues early.
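As an illustration of what a Pre Build Check might look like, the sketch below tests planned cabling against a device inventory and reports anything that cannot be built as designed. The data shapes, the `pre_build_check` function and the three checks it performs are my own assumptions; real checks will depend on what each build team has been bitten by before.

```python
from collections import Counter


def pre_build_check(cables, inventory):
    """Return a list of problems found in the planned cabling.

    cables:    list of dicts with a_device, a_port, b_device, b_port (hypothetical shape)
    inventory: dict mapping device name -> set of interface names on that device
    """
    problems = []
    endpoints = []
    for cable in cables:
        for device, port in (
            (cable["a_device"], cable["a_port"]),
            (cable["b_device"], cable["b_port"]),
        ):
            endpoints.append((device, port))
            if device not in inventory:
                problems.append(f"{device}: device not in inventory")
            elif port not in inventory[device]:
                problems.append(f"{device}:{port}: interface does not exist")
    # An interface can only terminate one cable.
    for (device, port), count in Counter(endpoints).items():
        if count > 1:
            problems.append(f"{device}:{port}: used by {count} cables")
    return problems


if __name__ == "__main__":
    inventory = {"leaf1": {"Ethernet1", "Ethernet2"}, "spine1": {"Ethernet1"}}
    cables = [
        {"a_device": "leaf1", "a_port": "Ethernet1", "b_device": "spine1", "b_port": "Ethernet1"},
        {"a_device": "leaf1", "a_port": "Ethernet2", "b_device": "spine1", "b_port": "Ethernet1"},
    ]
    for problem in pre_build_check(cables, inventory):
        print(problem)
```

An empty list means the design passed these particular checks, not that the build is guaranteed to go smoothly.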
All of this means that during the build stage we can end up with two views of the world that need to be reconciled: the operations team’s view in NetBox and the build team’s view in their own tooling. Commonly this is handled as a matter of course in the constant coordination efforts between both teams in this phase. For example, one interviewee told me that if their pre-cabled racks show up with cabling issues they have a decision to make: if the cabling issues are small they will physically fix them in the datacenter, and if the cabling issues are large they will update NetBox to represent the state of the delivered rack instead.
This is an area with a lot of promise for automation, and with good reason. Many organizations are already exploring how to integrate NetBox more deeply into the build stage, and at least one plugin is already out there. As one interviewee stated, “when differences creep in at this stage it can be a real pain in the butt to get back in sync!”
Handing Back
When is the build stage done? This varies somewhat by organization. In small shops there might be one person doing the designing, building and operating so while rigor is still advantageous, the boundary between the stages is somewhat artificial. In other situations those doing the build out might have some access to the devices and might run verifications like LLDP neighbor scans to check that the cabling is correct.
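The LLDP check mentioned above boils down to comparing two sets of links. The sketch below assumes the intended cabling and the LLDP neighbor data have both already been collected and reduced to pairs of (device, interface) endpoints; how that data is gathered, and the names `canonical` and `compare_cabling`, are assumptions made for the example.

```python
def canonical(link):
    """Order the two endpoints so that A-to-B and B-to-A compare equal."""
    return tuple(sorted(link))


def compare_cabling(intended, observed):
    """Compare intended links with links observed via LLDP.

    Each link is a pair of (device, interface) tuples.
    Returns links that are missing on site and links that were not in the design.
    """
    intended_links = {canonical(link) for link in intended}
    observed_links = {canonical(link) for link in observed}
    return {
        "missing": sorted(intended_links - observed_links),
        "unexpected": sorted(observed_links - intended_links),
    }


if __name__ == "__main__":
    intended = [
        (("leaf1", "Ethernet1"), ("spine1", "Ethernet1")),
        (("leaf1", "Ethernet2"), ("spine2", "Ethernet1")),
    ]
    # Hypothetical LLDP results: the second cable landed in the wrong spine port.
    observed = [
        (("spine1", "Ethernet1"), ("leaf1", "Ethernet1")),
        (("leaf1", "Ethernet2"), ("spine2", "Ethernet2")),
    ]
    print(compare_cabling(intended, observed))
```

A mis-patched cable shows up twice, once as a missing link and once as an unexpected one, which is usually enough to tell the person on site exactly which end to move.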
In most cases though, the build stage nears completion when the operations team has connectivity to the management plane. Most of the people I spoke to said it was more common for the operations teams to run post build verifications of the network, sometimes leaving the build teams standing around waiting for the all clear before they can go home! Post build verifications are often referred to as Network Ready For Use (NRFU), and are concerned with making sure that the build actually meets the requirements specified in the design. A deep dive into NRFU testing is out of scope for this blog, but we’ll touch on it again briefly at the end.
Typically the operations team takes over once they have connectivity to the management plane
What happens in the Deploy stage?
Once the build stage is complete and we’ve run our verification we can start deploying the logical network. Broadly speaking, there are four approaches to deploying our logical network: manual configuration, pushing configs, zero touch provisioning, and controller-based systems.
As stated at the beginning of this article I’ll be covering configuration deployments in more detail in the next blog on the Operate stage, so this is intended as an overview of common options. I won’t be covering manual configurations, as none of the interviewees I spoke to used them for their devices. At this level of detail, controller-based deployments aren’t different enough from zero touch provisioning to warrant a separate section either.
Pushing Configs
In order for us to push configs automatically to our devices we need a few things in place. We need management plane connectivity, we need a way to render our configs, and we need a process for applying the correct configs to the correct devices.
I want to use this juncture to touch on a useful NetBox usage pattern. In the Design stage blog I spoke about organizations sometimes leaving the exact details to the Build and Deploy stages. Organizations who order a lot of similar network designs often use NetBox Custom Scripts to populate the exact details before running their deployment scripts, and in this example I’ll focus on IPs. Strictly speaking, they could do this at any stage in the process as they are only affecting the intended state in NetBox, but in practice this is often performed pre-deploy in case they need to cater for unforeseen issues in the build stage.
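To give a feel for the logic such a script might contain, here is a small sketch of assigning the next free addresses in a prefix to a list of devices. It is deliberately not a NetBox Custom Script: it uses only Python’s standard `ipaddress` module, and the `assign_ips` function, the prefix and the device names are hypothetical. In a real script the prefix, the used addresses and the resulting assignments would be read from and written to NetBox.

```python
import ipaddress


def assign_ips(prefix, device_names, already_used=()):
    """Give each device the next free host address in the prefix.

    prefix:       e.g. "10.0.0.0/29"
    device_names: devices that still need an address, in assignment order
    already_used: addresses in the prefix that are taken (without prefix length)
    Returns a dict of device name -> "address/prefixlen".
    """
    network = ipaddress.ip_network(prefix)
    used = {ipaddress.ip_address(ip) for ip in already_used}
    free = (host for host in network.hosts() if host not in used)
    assignments = {}
    for name in device_names:
        try:
            address = next(free)
        except StopIteration:
            raise ValueError(f"{prefix} has no free address left for {name}") from None
        assignments[name] = f"{address}/{network.prefixlen}"
    return assignments


if __name__ == "__main__":
    print(assign_ips("10.0.0.0/29", ["leaf1", "leaf2"], already_used=["10.0.0.1"]))
```

Failing loudly when the prefix is exhausted is intentional: it is far cheaper to discover that at this point than after configs have been rendered and pushed.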
Once the devices have the correct IPs in NetBox, we can generate our configs. While the exact method of config generation varies per organization, Jinja2 came up in almost every conversation I had. There are broadly two ways Jinja2 templates are rendered before being pushed to the network. Either organizations use NetBox Config Templates (often alongside NetBox Remote Data Sources), or they use the Ansible NetBox Inventory plugin and run their config rendering in an Ansible Playbook.
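Whichever route is taken, the rendering step itself looks much the same. Below is a minimal sketch using the Jinja2 Python library directly, assuming it is installed. The template, the device dictionary and the `render_config` function are invented for the example and are not tied to any particular vendor syntax, NetBox Config Template or Ansible playbook.

```python
from jinja2 import Environment, StrictUndefined

# Hypothetical template; real templates are usually stored in files or a Git repository.
TEMPLATE = """\
hostname {{ device.name }}
{% for interface in device.interfaces %}
interface {{ interface.name }}
   description {{ interface.description }}
   ip address {{ interface.address }}
{% endfor %}
"""

# StrictUndefined makes rendering fail on missing data instead of emitting blanks.
env = Environment(undefined=StrictUndefined, trim_blocks=True, lstrip_blocks=True)


def render_config(device):
    """Render the device configuration from a dict of intended-state data."""
    return env.from_string(TEMPLATE).render(device=device)


if __name__ == "__main__":
    device = {
        "name": "leaf1",
        "interfaces": [
            {"name": "Ethernet1", "description": "uplink to spine1", "address": "10.0.0.2/31"},
        ],
    }
    print(render_config(device))
```

Using `StrictUndefined` is a design choice rather than a requirement: a render that fails because a field is missing from the source of truth is generally preferable to a config that is silently incomplete.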
Zero Touch Provisioning (ZTP)
At larger scales, organizations often turn to ZTP to have devices “autoconfigured” upon startup. There are many specific approaches, but a common approach uses a combination like DHCP, iPXE and TFTP. Cisco PnP is another option in this space. Normally organizations will configure their DHCP with the MAC addresses and IP addresses from NetBox for each device, after the MAC addresses have been uploaded to NetBox following the build. When the device wakes up and reaches out to DHCP it is given an IP address and also an iPXE script to run, which then fetches the appropriate device configuration and applies it to the device.
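The DHCP part of this flow is often just another rendering exercise. The sketch below turns device records into host reservations in the style of ISC dhcpd’s `dhcpd.conf`. It assumes the name, MAC address and management IP have already been pulled from NetBox; the record shape and the `dhcp_host_entries` function are hypothetical, and the boot script options that point the device at its configuration are left out because they vary by platform.

```python
import ipaddress
import re

MAC_PATTERN = re.compile(r"^([0-9a-f]{2}:){5}[0-9a-f]{2}$")


def dhcp_host_entries(devices):
    """Render one dhcpd-style host reservation per device.

    devices: list of dicts with name, mac and ip (hypothetical shape).
    Raises ValueError if a MAC or IP address is malformed.
    """
    blocks = []
    for device in sorted(devices, key=lambda d: d["name"]):
        mac = device["mac"].lower()
        if not MAC_PATTERN.match(mac):
            raise ValueError(f"{device['name']}: invalid MAC address {device['mac']!r}")
        ip = ipaddress.ip_address(device["ip"])  # raises ValueError if malformed
        blocks.append(
            f"host {device['name']} {{\n"
            f"  hardware ethernet {mac};\n"
            f"  fixed-address {ip};\n"
            f"}}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    devices = [
        {"name": "leaf1", "mac": "AA:BB:CC:00:00:01", "ip": "10.0.0.2"},
        {"name": "leaf2", "mac": "AA:BB:CC:00:00:02", "ip": "10.0.0.3"},
    ]
    print(dhcp_host_entries(devices))
```

Validating the MAC and IP before writing anything matters here because these values are often entered by hand after the build, and a typo means the device never receives its configuration.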
In some circumstances devices or even entire racks are pre-configured before they are shipped and arrive at site for installation. This is another example of the pattern we saw in the Design stage blog, where organizations are seeking to find issues as early in the process as possible, when they are easier to solve. This approach can also be used when management plane access to the devices post-build is restricted.
Network Ready For Use (NRFU)
Before we can wrap up this blog and hand our network to the Operate stage there’s one last step we need to cover. I briefly touched on NRFU above, and as part of that process there can be many auxiliary systems connected to our network that need to be checked off before we can say that the network is ready for use, including config backups, config assurance, and more. Here we’ll just focus on the most important one: monitoring.
While some teams are happy manually adding new devices to their monitoring setups, many teams have developed automated mechanisms to achieve this. As always, there is much variation in how this is achieved in the field, but there is a general pattern. Typically monitoring configurations are generated using a combination of NetBox device statuses and NetBox tags. NetBox device statuses allow us to say things like “only monitor active devices”, and tags allow us to specify things like “only monitor devices that are active and have a MONITOR tag” so that we don’t end up trying to monitor patch panels.
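The status-plus-tag pattern can be sketched in a few lines. The example assumes device records have already been fetched from NetBox and flattened into dictionaries with a status and a list of tag names; the record shape, the sample devices and the `devices_to_monitor` function are hypothetical.

```python
def devices_to_monitor(devices, required_tag="MONITOR"):
    """Return the names of devices that are active and carry the required tag."""
    return [
        device["name"]
        for device in devices
        if device["status"] == "active" and required_tag in device["tags"]
    ]


if __name__ == "__main__":
    # Hypothetical records standing in for devices fetched from NetBox.
    devices = [
        {"name": "leaf1", "status": "active", "tags": ["MONITOR"]},
        {"name": "leaf2", "status": "planned", "tags": ["MONITOR"]},
        {"name": "patch-panel-1", "status": "active", "tags": []},
    ]
    print(devices_to_monitor(devices))
```

The resulting list is what would then be fed into whatever generates the monitoring system’s configuration. Requiring both conditions keeps planned devices and passive equipment such as patch panels out of the monitoring system.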
Summary
That was a lot! So far in this series we’ve seen how NetBox is used across the Design, Build and Deploy stages of the network lifecycle, how it interacts with other tools and some of the process challenges that organizations are overcoming with various types of automation. By now I expect that you’re starting to get an idea for how the Network Automation Reference Architecture works and we’ll be bringing it all together in the next and last blog in this series, the Operate stage.
I’ve been forced to skim over many interesting details in these blogs to keep them from becoming too long, but fear not. One purpose of this series is to set a baseline for the lower level content to follow, and some of that content is already arriving. Be sure to head over to the NetBox Labs YouTube channel to follow along with our new Network Automation Heroes series, in which we’re starting to add a packed schedule of deep dives.
Would you like to be a part of the conversation? Reach out on the NetDev Slack or on X (@mrmrcoleman).