This article is part of a 4-part series on Navigating Network Automation with NetBox. See the other parts covering: the Network Lifecycle Overview, the Build & Deploy Stages, and the Operate Stage.
In part two of this series, we’re focussing on the Design stage of the network lifecycle. As I mentioned in the introduction blog, the purpose of this series is to show how network automation supported by NetBox can be applied across the network lifecycle, following a workflow perspective. Some teams have very simple processes which work for them; some teams have very sophisticated approaches which work for them. As network automation gathers pace we need to be constantly asking “what is an appropriate level of investment for our situation?” I hope to offer ideas for where you could focus if your current processes are no longer coping with the demands being placed on you and your teams.
In the introduction I also said that I wanted this to be a conversation and not a lecture, and what a conversation it has been! Over the past few weeks I’ve had the pleasure of interviewing network admins, engineers, IT Directors, architects, product managers and many more to help coalesce the findings presented in this blog. The results are far richer because of the humblingly positive response we’ve seen from across the community and I’d like to extend a big thank you to Steinzi at Advania, Alex Ankers at CACI, Arne Opdal at ATEA, Adam Kirchberger at Nomios, Kaon Thana at The New York Times, Roman Dodin, Timo Rasanen & Walter De Smedt at Nokia, Carl Buchmann, Julio Perez & Fred Hsu at Arista, Mike Wessling at Area42, Alex Bortok at Netreplica, Otto Coster at Netpicker, all those who preferred to remain anonymous, and of course our very own Rich Bibby, Jordan Villarreal and Chris Russell!
Following along with the series? Once you’re done reading about the Design stage, check out the Build and Deploy stage.
What happens in the Design stage?
Throughout this series we’ll see that in practice the boundaries between each of the stages are quite blurry, but it helps to establish the rule in order to explain the exceptions. In general the design stage is concerned with translating requirements into a set of buildable, deployable and operable design deliverables.
As mentioned in the introduction blog, not all changes follow a strict design process. Think of something like enabling a new service by adding an IP and a VLAN to an interface. These small jobs are more often captured in change management tickets in a tool like ServiceNow and handled directly in the NOC in the Operate stage. As we’ll see in a later blog, the size and repeatability of these tickets makes them low-hanging fruit for automation and delegation to others outside of the NOC, like Service Desks or even end users.
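To give a flavour of how small and scriptable these jobs are, here’s a minimal sketch using pynetbox that records such a change in NetBox. The NetBox URL, token, device, interface, VLAN and IP are all hypothetical:

```python
import pynetbox

# Hypothetical NetBox instance and token
nb = pynetbox.api("https://netbox.example.com", token="abcd1234")

# Look up the interface on the branch switch (names are illustrative)
iface = nb.dcim.interfaces.get(device="branch-sw-01", name="GigabitEthernet1/0/12")

# Tag the new service VLAN onto the interface
vlan = nb.ipam.vlans.get(vid=120, site="branch-01")
iface.mode = "tagged"
iface.tagged_vlans = [vlan.id]
iface.save()

# Record the service IP and bind it to the interface
nb.ipam.ip_addresses.create(
    address="10.20.30.5/24",
    assigned_object_type="dcim.interface",
    assigned_object_id=iface.id,
)
```

A change this mechanical is exactly the kind of thing that can be wrapped in a form and delegated safely outside the NOC.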
Requirements
The Design stage takes as input the intention, often but not always from someone outside the networking team, for a change to the network that delivers a desired outcome. These outcomes vary as much as the needs that any company can have: some changes are cost cutting exercises, some are about business expansion, others are about expanding the usage of existing networks, some are about improving processes in areas like compliance, the list goes on.
The way these intentions are expressed also varies greatly from “the people in our new branch office need to use the internet” all the way to “we need this much bandwidth, this much redundancy, we must use this vendor, we have X security requirements, we must be able to interact with the network through automation, etc”.
Sometimes the flexibility we have in designing a network can be further constrained from the outset. One example is government agencies who, in order to comply with anti-corruption regulations and demonstrate process fairness, will often put a finished design out to tender, limiting the bids mainly to implementation. This can produce suboptimal outcomes when the government agency’s network design capabilities are inferior to those of the bidders, but must be accommodated all the same. Similarly, we’ll sometimes find that a “solution” has already been sold to the customer by another team, reducing, but not totally eliminating, the scope we have for design improvisation.
Design Deliverables
When we’re talking about translating business requirements into network additions or alterations each company has its own processes and cultural norms, but the general patterns are as follows.
The High Level Design is intended to capture the first step of translation between business requirements and the network. It’s typical here to see “boxes and lines” style diagrams that capture L3 routing, WAN connectivity/circuits and often a simple “back of the napkin” IPAM. The High Level Design also often captures some of the physicality of the network, with branch offices, headquarters and datacenters captured to give an overall sense of shape. In the High Level Design we’re more interested in establishing a lingua franca that captures the general intent and can be iterated on by multiple stakeholders, some of whom will not have a deep networking background. High Level Designs commonly include visual rather than tabular data and omit many implementation details: for example, I may only care that I have enough IP space in each subnet but not what those IPs actually are, or I may only care that there is a device called “router” or “firewall” and not what the exact models are. High Level Designs can also be voluminous, frequently including pages and pages of written documentation including glossaries, technology overviews and more.
Low Level Designs fan out from the High Level Design to provide more specific details about different aspects of the network. Low Level Designs most commonly include designs for L3, L2, physical, and detailed IPAM. The Low Level Design can also encompass related documents like cost planning, security considerations, PCI-DSS and NIST details, and designs for auxiliary management systems like monitoring and automated configuration deployment systems, such as push-based config pipelines and pull-based zero touch provisioning systems.
Refining the requirements
The Network Design Happy Path diagram above hopefully appears as unrelated to reality as I intended it to. While at a high-level the diagram captures the intention of the Design stage, in practice it’s a little more complicated.
I’ll know it when I see it
Let’s start by thinking about our stakeholders. It’s rare that our stakeholders will be able to describe in full detail exactly what they want from the network. While we may wish our customers knew ahead of time everything that they want to see, optimizing for “perfect” customers is likely going to be a dead end for two reasons:
- Knowledge asymmetry – We hopefully know more about network design than our customers do, so we need to help them understand the available options that meet their needs
- “I’ll know it when I see it” – it’s well documented across all fields of design that customers often can’t know exactly what they want until they are given concrete options to choose from
That’s why in practice our design stages often end up being iterative.
Are we accounting for all our stakeholders?
It seems that too often when designing networks we’re only accounting for upstream stakeholders, and not thinking about those who will be doing the building, deploying and operating of the networks once the design is “done”. One interviewee noted that they are often asked to add automation to networks that weren’t designed to be automated in the first place, “often nobody has thought of automation during the design of the network, which makes it extremely difficult to bolt on later.”
The field of project management tells us that the later in a process we find out that we were wrong about an earlier assumption, the more expensive it is to fix. The concepts of “Cost of Change” and “Cost of Delay” capture the idea that problems caught after the design stage are exponentially more costly to fix the later they are caught. To combat this we see organizations seeking to include more expertise upstream in the design process so that problems with the design can be found when they are inexpensive to address, a move that is typically referred to as “shifting left”.
The pattern of shifting left can be found in all systems in which there is specific domain knowledge downstream which, if included upstream, could lead to better overall flow in the system. Consider the example of a software developer creating an application. They may have the domain expertise to create the application, but lack the expertise to test the security of the application. If the application is shown to be insecure in production this is more costly to address than if they had been able to fix it during development. If we “shift the security knowledge left” by providing the software developer with some way to confirm that their application is secure while they are developing it, the overall flow of correct software to production will be increased.
How do we know our designs are correct?
So we’ve seen how including stakeholders from the build, deploy and operate stages can help us to avoid costly mistakes when refining the requirements for our design, but even if we have the correct requirements, how do we know that our designs are correct?
All organizations accept that they just can’t catch everything in the design stage, but for some the risk of finding costly mistakes later in the process is unacceptable, so they take the concept of shift left a step further, and include elements of the downstream stages in the design process itself.
This pattern is broadly referred to as NetDevOps, and many organizations are applying aspects of it already. For example, imagine you’re designing a network and the requirements state that you must use a device vendor your team doesn’t have a lot of experience with. You might create a limited physical lab to test your core routing design to gain confidence that your proposed design will work when deployed. Some teams take this further with various types of virtual labs that can be deployed on demand to test the correctness of the design stage. This is covered in more detail below, and will also be revisited when we talk about the Deploy stage in a future blog, as investing in automation in the design stage often helps the teams operating the networks too.
The beatings will continue until morale improves… A common anti-pattern is to treat design correctness issues as a problem with our discipline. Think Method of Procedure documents and Change Control Boards. As the complexity and speed of change of our networks grow we’re being asked to manage an unsustainable number of repetitive tasks and decisions with great precision. If you find yourself thinking “if only our engineers were more disciplined, we wouldn’t have so many delays or outages”, automation might help more than additional checks and balances. Discipline and process are necessary but can only get us so far. Computers are better at this type of work, and engineers know this, leading to what one interviewee referred to as “waterfall theater”, in which detailed design documents are created to satisfy outdated process requirements while the real work happens out of sight to get the job done.
Further design considerations
Of course, there’s no such thing as a free lunch and as you may have noticed we’ve introduced conflicting requirements as we’ve evolved our design approach. On the one hand, we value rapid prototyping with our stakeholders so we can get to the “I’ll know it when I see it” design that we think satisfies all, but at the same time we’ve introduced more stages upfront in the design process to make sure that the design we’re proposing will actually work.
In the more chin-stroking and theoretical echelons of the Agile and NetDevOps movements this would be referred to as the “transaction cost” of exploring a given design. In simple terms, being able to explore designs that are both applicable to the requirements and correct, more quickly, gives us options: do we take the first correct design and move on to building, or do we spend the time we’ve saved exploring alternative designs that might be better?
Consider the other extreme which is far more common: our design process is so slow that in reality we get one shot at getting it right, so we are forced to overprovision for safety. Each organization’s sensitivity to this sort of overprovisioning waste differs, but few organizations wish their stakeholder satisfaction was lower!
Let’s look at some additional considerations through this lens.
Format
As we’re working with our stakeholders to refine their requirements and our design we often run into the challenge of translating our designs from their way of thinking to ours, which leads to us considering the format our designs should take. For example, it may be easier for us to reason from detailed designs because we’re able to understand the intricacies of the included details, but for some of our stakeholders, the amount of detail could make it harder for them to understand what we’re trying to convey.
For this reason we sometimes see visual representations in Visio, draw.io or LucidChart used to convey high-level designs as they are more intuitive for our stakeholders and easier to alter in real-time, but these high-level designs must then be accurately translated into low level designs later to please our other stakeholders.
It’s important to keep in mind however that the format we choose is also our first line of defense for ensuring the correctness of our design. For example, while I heard about many examples of complicated spreadsheets that have been created to track cabling, interfaces and IPAM, in general a spreadsheet will be perfectly happy for us to add a 49th connection to a 48-port device. That leads many organizations to insert NetBox into their design phase at the low level design stage.
This achieves one type of compromise by combining a visual and written representation for rapid prototyping in the high level design stage with the correctness guarantees provided by NetBox for the low level designs, but still requires translation between the two. The potential for errors in this translation stage has caused some organizations to go a step further still.
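As a small illustration of the kind of correctness guarantee this buys, here’s a hedged sketch (the instance, token and device name are hypothetical) of checking free port capacity in NetBox before a design adds another connection. Unlike a spreadsheet, NetBox only models the interfaces the device type actually has:

```python
import pynetbox

# Hypothetical NetBox instance and device name
nb = pynetbox.api("https://netbox.example.com", token="abcd1234")
device = nb.dcim.devices.get(name="access-sw-07")

# NetBox derives interfaces from the device type, so a "49th port" on a
# 48-port switch simply doesn't exist to be cabled.
free = [
    iface for iface in nb.dcim.interfaces.filter(device_id=device.id)
    if not iface.cable and not iface.mgmt_only
]

print(f"{device.name}: {len(free)} free ports")
if not free:
    raise SystemExit("No free ports: the design needs more hardware, not another spreadsheet row.")
```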
As mentioned above, design documentation can be voluminous and I’ve yet to see an example where every piece of what would normally be included in a high-level design would be put into NetBox. In researching this article a common theme was the appropriate level of investment for the desired outcomes. Many companies find their needs are met by using different formats for different stages and paying the translation tax; for others this is not acceptable, and they seek to get as much as possible into NetBox as early as possible so that they can take advantage of shift left and NetDevOps workflows.
Of course the process of going from customer requirements to a populated NetBox is itself an area of consideration, which some organizations have addressed through design reuse.
Reuse
If we’re designing a network which is similar to something we’ve done before, we can save time and reduce risk by reusing these designs and changing only those parts that differ. Reuse is also a precursor to automation as we can only scalably automate what we can abstract into patterns and we should only automate what we expect to repeat (although we’re not always very good at predicting what we’ll need to repeat!)
Repeatability and Network Homogeneity
The subject of design reuse is of special interest to organizations who are in the business of designing, building and operating many networks, such as service businesses, large enterprises, or hyperscalers who are constantly expanding and updating their networks. This is exactly why hyperscalers deploy rows and rows of homogenized racks of equipment, and why Clos networks are such popular examples in many network automation demonstrations; standardization and repeatability lend themselves to automation.
Having said that, you don’t need to be a service company, large enterprise or hyperscaler, or even run a datacenter, to enjoy the benefits of design reuse and network automation, and we’ll be looking at many examples to demonstrate that point throughout this series.
Validated Designs
Most vendors provide validated designs, which can be relied on because they have been tested by the vendor and shown to work, and which can also result in better support thanks to the vendor’s familiarity with the pattern. Vendor validated designs tend to be technology focussed, for example a validated design for a three tier Clos network, but are often supplemented with industry specific use cases and advice for those designs.
Some organizations maintain their own library of validated designs, often referred to as templates instead. Unlike vendor validated designs, these templates are often customer focused, commonly relying on T-shirt sizing as a primary variable. For example, an MSP might maintain a library of validated designs called “Small Hospital”, “Medium Datacenter”, “Large Office”, etc. These approaches are not mutually exclusive: a “Small Hospital” template could well include a vendor validated core network design.
Templated designs can also be used to overcome the problem of initially populating NetBox for organizations who would like to include their source of truth at the beginning of the design process.
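As a rough illustration only, here’s a minimal sketch of what instantiating such a template into NetBox might look like. The template contents, role and device type slugs, and naming convention are all hypothetical:

```python
import pynetbox

# Hypothetical NetBox instance
nb = pynetbox.api("https://netbox.example.com", token="abcd1234")

# Hypothetical template: (name prefix, device role slug, device type slug, count)
TEMPLATES = {
    "small-hospital": [
        ("core", "router", "isr4451", 2),
        ("access", "switch", "c9300-48p", 6),
        ("fw", "firewall", "pa-440", 2),
    ],
}

def instantiate(template, site_name, site_slug):
    site = nb.dcim.sites.create(name=site_name, slug=site_slug)
    for prefix, role_slug, type_slug, count in TEMPLATES[template]:
        role = nb.dcim.device_roles.get(slug=role_slug)
        dev_type = nb.dcim.device_types.get(slug=type_slug)
        for n in range(1, count + 1):
            # NetBox 4.x calls this field "role"; older releases use "device_role"
            nb.dcim.devices.create(
                name=f"{site_slug}-{prefix}-{n:02d}",
                role=role.id,
                device_type=dev_type.id,
                site=site.id,
            )

instantiate("small-hospital", "Northside Clinic", "northside-clinic")
```

From here the rest of the template (prefixes, VLANs, cabling) can be stamped out the same way, giving a populated source of truth on day one of the design.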
Testing our designs
As mentioned above there are various approaches to verifying the correctness of our design, with NetBox often serving as an early line of defense.
Some organizations might just “eyeball” the designs and then leave any testing to the post-deployment stage, especially if the design is deemed to be low risk. Going a step further, we might get some “loaner” equipment to create a limited physical lab if we’re working with an unfamiliar vendor, or we might already have a physical lab we use for testing new designs before we pass them along.
Network emulation is another approach which can provide more flexibility in testing, especially for control plane and management plane testing, and containerized device OSes are starting to rewrite some preconceived notions in this area around usability, repeatability and resource requirements. For most networks however we’re still not at the point where it would be realistic to create an entire digital twin of our production network, and in fact needing to do so would likely point to problems in the composability of the network design. There are also network simulation and other tools that can help here, and in practice a blended approach is necessary for the best results.
How much you invest in shifting your design validation left depends on your specific circumstances, but one advantage of using validated designs is that they often include sets of tests, which can be predefined, generated based on the variables provided to the validated design, or both. This gives us reuse at both the design and the test level!
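To make that concrete, here’s a minimal sketch (the site, role slug, interface naming convention and design variables are all hypothetical) of pytest tests generated from the variables supplied to a validated leaf-spine design, checking the intended state in NetBox before anything is deployed:

```python
import pynetbox
import pytest

# Hypothetical NetBox instance
nb = pynetbox.api("https://netbox.example.com", token="abcd1234")

# Variables supplied to a hypothetical validated leaf-spine design
DESIGN_VARS = {"site": "dc-east", "uplinks_per_leaf": 2}

leaves = list(nb.dcim.devices.filter(site=DESIGN_VARS["site"], role="leaf"))

@pytest.mark.parametrize("leaf", leaves, ids=lambda d: d.name)
def test_leaf_has_expected_uplinks(leaf):
    # Count cabled uplink interfaces (the naming convention is an assumption)
    cabled_uplinks = [
        iface for iface in nb.dcim.interfaces.filter(device_id=leaf.id)
        if iface.cable and iface.name.startswith("Ethernet1/")
    ]
    assert len(cabled_uplinks) == DESIGN_VARS["uplinks_per_leaf"]
```

Because the tests are derived from the same variables as the design, changing the design and changing its tests become a single step.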
How to approach creating a testing strategy is outside the scope of this blog, but I’ll be covering the topic in much more detail in a future article including the sorts of validation we might look for in our designs, the different ways we can test that validity, and how we can string it all together with testing pipelines.
For an excellent discussion around emulated networks, testing patterns and more, check out our Network Automation Heroes episode with Roman Dodin of ContainerLab and Alex Bortok of Netreplica to see how NetBox, nrx and ContainerLab can be combined to build containerized network labs with ease!
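To show the shape of that translation (this is what a tool like nrx automates properly), here’s a hedged sketch that reads cabled interfaces from NetBox and emits a ContainerLab topology file. The site slug is hypothetical, mapping every device to one kind/image is a simplifying assumption, and the link_peers field assumes NetBox 3.3+:

```python
import pynetbox
import yaml  # PyYAML

# Hypothetical NetBox instance and site slug
nb = pynetbox.api("https://netbox.example.com", token="abcd1234")
SITE = "lab-design"

nodes, links = {}, []
for device in nb.dcim.devices.filter(site=SITE):
    # A real tool would map NetBox platforms to ContainerLab kinds/images
    nodes[device.name] = {"kind": "nokia_srlinux", "image": "ghcr.io/nokia/srlinux"}

for iface in nb.dcim.interfaces.filter(site=SITE, cabled=True):
    peers = iface.link_peers  # far side of the cable (NetBox 3.3+ field)
    if not peers:
        continue
    ep_a = f"{iface.device.name}:{iface.name}"
    ep_b = f"{peers[0].device.name}:{peers[0].name}"
    if ep_a < ep_b:  # each cable is seen from both ends; keep one
        links.append({"endpoints": [ep_a, ep_b]})

topology = {"name": "design-lab", "topology": {"nodes": nodes, "links": links}}
print(yaml.safe_dump(topology, sort_keys=False))
```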
Automatability of the design
As I mentioned earlier, one advantage of shifting downstream requirements left is that we’re not leaving our downstream colleagues to find out the hard way that the network we’ve designed for them is not automatable. Covering all the aspects of the automatability of a network is a very large topic with many books already devoted to it, but there are some rules of thumb to be on the lookout for (a small capability-check sketch follows the list below).
Device interoperability – do the devices we’re choosing support protocols like NETCONF, RESTCONF or even more modern approaches like gRPC? Not considering this upfront can limit the tooling options our downstream colleagues are left with.
Platform homogeneity – Some vendors ship platforms which are the same across all editions, while others have offerings which appear very similar upon first inspection but can vary in functionality greatly depending on the specific version you’re running.
Config Updates – Older platforms may not provide features like candidate configs and in-place upgrades which can make automating them trickier.
Vendor Neutrality – Does our existing tooling support the platform included in our design? Our downstream colleagues may have invested in operational tooling which works fine with our existing fleet, but may require alterations to work with new vendors, or in some cases may not work at all. Vendor neutrality matters to many teams not only because of broader lock-in concerns but also because of the flexibility it can allow in design decisions, especially when combined with NetBox’s vendor neutral data model.
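Checks like these lend themselves to simple tooling. As a hedged sketch (the capability map, required set and site slug are all hypothetical; real data would come from vendor documentation or testing), here’s how a design review might flag devices whose platforms don’t support the management APIs our tooling needs:

```python
import pynetbox

# Hypothetical NetBox instance
nb = pynetbox.api("https://netbox.example.com", token="abcd1234")

# Hypothetical capability map: which management APIs each platform supports
CAPABILITIES = {
    "eos": {"netconf", "restconf", "grpc"},
    "ios-xe": {"netconf", "restconf"},
    "legacy-os": set(),  # CLI scraping only
}

REQUIRED = {"netconf"}  # what our automation tooling needs, per the requirements

for device in nb.dcim.devices.filter(site="new-branch"):
    platform = device.platform.slug if device.platform else None
    supported = CAPABILITIES.get(platform, set())
    missing = REQUIRED - supported
    if missing:
        print(f"{device.name}: platform {platform!r} lacks {sorted(missing)}")
```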
Greenfield vs Brownfield
In all the examples above we haven’t touched on a very important consideration when we’re designing changes to our networks. Is this a greenfield deployment or are we altering a brownfield network that already exists?
Most of the time we’re going to find ourselves in a brownfield environment, but the exact definition of greenfield vs brownfield is best thought of as a scale. On one end we might have what I’d call a “truly greenfield” environment in which we’re deploying a network for a brand new company with no existing infrastructure. That’s pretty rare though, and more often when we’re talking about a greenfield deployment we mean “mainly greenfield”: for example, we may be adding a new branch office for an existing company where everything in the network in that branch office is brand new, but it might still need to be connected to an existing WAN architecture. One step further along is a “mainly brownfield” deployment, which could be something like extending an existing site, where we may need to revisit some of our IPAM, L2 and L3 assumptions but we’re not touching existing equipment. Lastly there’s what I’d call a “truly brownfield” deployment, where we might be upgrading all the devices in a network, requiring us to revisit all of our initial assumptions about the existing infrastructure to some extent.
In practice then, we spend most of our time in a situation where the existing infrastructure will influence our design decisions.
This is one reason why good network documentation is a huge time saver when designing networks. Imagine the effect on the design transaction cost when the first step of any design process is to scour disparate, outdated sources of information across the organization just to figure out what the current network looks like!
Imagine running a network upgrade: how do you confirm that the upgraded network still performs as intended if you don’t have a clear view of the previous intended state in the first place? Here you can start to see the compounding value that using a network source of truth like NetBox across the network lifecycle provides to organizations.
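As a final hedged sketch (the device name, NAPALM driver and credentials are all hypothetical), this is the kind of before/after check a source of truth makes trivial: compare the interfaces NetBox says should be enabled with what the upgraded device actually reports:

```python
import pynetbox
from napalm import get_network_driver

# Hypothetical NetBox instance and device
nb = pynetbox.api("https://netbox.example.com", token="abcd1234")
device = nb.dcim.devices.get(name="core-rtr-01")

# Intended state: interfaces NetBox says should be enabled
intended = {i.name for i in nb.dcim.interfaces.filter(device_id=device.id, enabled=True)}

# Live state via NAPALM (driver and credentials are assumptions)
driver = get_network_driver("eos")
host = str(device.primary_ip4.address).split("/")[0]
with driver(host, "admin", "secret") as conn:
    live = {name for name, data in conn.get_interfaces().items() if data["is_enabled"]}

print("Enabled in NetBox but not up on the device:", sorted(intended - live))
print("Up on the device but not in NetBox:", sorted(live - intended))
```

Without the intended state recorded somewhere, the left-hand side of this comparison simply doesn’t exist.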
Summary
In this article we’ve come from a very simple view of the design process to a much more sophisticated approach, and I’ll be continuing that journey in the following articles about the Build, Deploy and Operate stages. We’ve also started to get a glimpse into the cyclic nature of network operations, and as we proceed through the following articles you’ll start to see how, when these practices are applied to all stages of the network lifecycle, you get something that looks like our Network Automation Reference Architecture.
Would you like to be a part of the conversation? Reach out on the NetDev Slack or on X (@mrmrcoleman).