What Network Ops Can Learn From IT Ops

Instead of finger-pointing and blaming each other, network and data center operators can learn a lot from each other, especially as the different types of infrastructure converge and the same concerns apply to both.

There is a sea change underway in the networking world. I come from the data center world, and I only know enough TCP/IP networking to be dangerous – but over the last decade or so, I have watched the two very different worlds of the network and the data center start to move closer together. I think at this point, we are close enough that we can usefully start sharing insights with each other.

I talk to telco NOC teams and to operators of large corporate networks fairly often, and it has been interesting to watch the change over the last few years, from the assumption that nothing that came from outside the networking world could possibly be relevant, to a gradual acceptance and even interest in what could be learned by looking beyond the traditional approaches to network operations.

A big part of this interest is driven by secular changes that are underway in the networking world itself. The typical conversation that I have with network teams goes something like this:

“We have this suite of tools, some commercial and some home-grown, that is pretty heavily customised to our network. We do some initial filtering of events out in the field to cut down on volume, and then we do some central processing to match events to their geographical location and so on. We also apply some heuristics, mainly time-based, that are the result of our experience operating this network for the past X years.”

At this point I ask the question: “Okay, so it sounds like everything is working pretty well. In that case, why did you agree to meet me?”

Why Did You Agree To Meet Me?

This is where the pained looks come out: “You’re right, this system is working well for us – today. If all we were looking at were an ongoing linear continuation of the past, we wouldn’t be looking at changing it. However, between some technological changes (SDN/NFV) and increased demand from the business side, we are looking at a step change in complexity. We expect a 10x increase in the number of devices we have to manage over the next few years, and many of those devices will be virtual and managed primarily by automated processes. We worry about losing control of the network. Can you recommend a different approach from what we are doing today?”

The interesting thing is, this conversation almost exactly mirrors the sorts of conversations that I was having maybe fifteen years ago in the data center world. I think network ops teams can learn a lot from what worked for the data center ops teams – and from what did not work.

Of course these two worlds are different enough that any lessons will not be perfectly applicable, but as the convergence continues, with increased virtualization and automation coming to the network world, the differences will come to seem less and less important.

Here are some thoughts around things that network people could usefully discuss with their data center colleagues.

Automation

Automation as such is not new. Every time you use a script to blow out a config change to a number of remote devices, of course that is automation. However, there is still a human in the loop, hitting the return key to trigger the automation. There is a step change that occurs when the human is removed from the loop, and automated actions are taking place in response to predetermined conditions without direct human intervention.

As the intelligence of the network grows, there is more and more scope for automated provisioning/decommissioning/change to occur in response to outside events, such as changes in patterns of user demand, or as a consequence of other actions, such as the rollout of a new over-the-top service which requires changes to the configuration of the network itself.

Understanding and troubleshooting a network where these sorts of changes occur constantly is a fundamentally different proposition from debugging actions triggered by human operators. Data center administrators have been dealing with this challenge for some time – both the increased levels of automation, and the increased variety of actors requiring configuration changes. They can offer a wealth of insight and experience to network operators just beginning to face these issues.

Abstraction

Abstraction is also not new to network teams. The OSI model describes seven layers of increasing abstraction from the physical layer on up, and at root, that model has remained valid since its introduction in the mid-Eighties. However, each of those layers has been getting more internally complex, and the details are starting to bleed over.

For instance, in a containerised world, there is significant overlap between the original OSI model and the added layers of abstraction and virtualisation that containerisation brings. This is all well and good as long as everything is working, but when something goes wrong and there is a need to peel back the layers of abstraction to debug the problem, it can be hard to tell which domain the problem lies in.

Again, data center people have been dealing with this problem for longer, and can offer experience – or at the very least, war stories and commiseration. The nice clean divisions between “operating system”, “middleware”, and “application” broke down a while ago for them, and many of the bits ended up being operated by different teams that did not always sit where they might have been expected to.

Acceleration

Either of the previous two trends would be survivable alone. The transition might be messy, but it could be managed, as every other transition has been in the past. The factor that makes this one different is the vastly increased rate of change, and its continuing further acceleration. New services are launched more and more frequently, and each one requires complex back-end configurations to support it. Those configurations need to be made automatically, without human intervention, because of the time scales involved. In turn, in order to make the changes easily comprehensible to the various teams and roles involved, various levels of abstraction are introduced, so that a “release” or a “service” can be discussed as homogeneous objects.

It is this acceleration in combination with increased automation and abstraction that makes old approaches no longer fit for purpose. It’s not just a question of making filters smarter or modelling the network better; filters and models simply cannot keep up any more. Data center teams realised some time ago that their old, CMDB-centric assumptions no longer held true, and have been actively exploring alternatives that work in their newly dynamic environment. Many of these can usefully be extended to the network world as well.

Of course there is also much that data center people could usefully learn from the experience of network people. Here are some ideas to get the conversation flowing both ways.

Observability

Network infrastructure has always been much easier to monitor than compute, because it mostly generates its own events, rather than requiring several agents to be installed on each piece of infrastructure in order to send information to a central point. Now, new forms of compute are also moving to this model of automatically instrumented infrastructure, where they no longer require agents to be installed, but rather offer an endpoint that can be queried or even automatically transmit events.

This requires a fundamental shift in assumptions. Instead of a monitoring event being something valuable and expensive that requires effort to capture, these days it’s more a question of sifting a few valuable events from a constant torrent of data that is being generated all the time. Network admins can offer their data center colleagues a lot of valuable perspective about what that means in practice.
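To make "sifting" concrete, here is a minimal sketch of one simple time-based heuristic: suppressing repeats of the same event from the same source until that source has gone quiet for a while. The event shape (timestamp, source, kind), the function name, and the 60-second window are all assumptions made for illustration, not a description of any particular tool.

```python
def suppress_repeats(events, window_seconds=60):
    """Keep an event only if the same (source, kind) pair has been
    quiet for at least window_seconds.

    events: iterable of (timestamp_seconds, source, kind) tuples,
    assumed to arrive in timestamp order (hypothetical format).
    """
    last_seen = {}  # (source, kind) -> timestamp of the latest occurrence
    kept = []
    for timestamp, source, kind in events:
        key = (source, kind)
        previous = last_seen.get(key)
        if previous is None or timestamp - previous >= window_seconds:
            kept.append((timestamp, source, kind))
        # Record every occurrence, kept or not, so a source that keeps
        # repeating stays suppressed until it has actually gone quiet.
        last_seen[key] = timestamp
    return kept


# Hypothetical sample stream: sw1 repeats itself, sw2 reports once.
SAMPLE_EVENTS = [
    (0, "sw1", "link_down"),
    (10, "sw1", "link_down"),
    (20, "sw2", "link_down"),
    (100, "sw1", "link_down"),
]

if __name__ == "__main__":
    for event in suppress_repeats(SAMPLE_EVENTS):
        print(event)
```

A fixed window like this is the kind of hand-tuned heuristic that works well on a stable network and needs revisiting as the rate of change goes up.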

Mapping

Networks have historically been easily mapped, with good and relevant location information. If anomalous events are coming from locations A and D, but not B and C in between, it’s a fairly safe bet that those two are not related. The discipline of network operations has historically had access to a variety of tools and techniques that took these mapping data and made use of them in various ways.
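The A-and-D reasoning above can be sketched in a few lines. This is an illustrative example only: the four-node linear topology, the adjacency-map format, and the function name are assumptions, and real tools use far richer models. The idea is to group anomalous locations into clusters that are connected to each other only through other anomalous locations.

```python
from collections import deque


def cluster_anomalies(adjacency, anomalous):
    """Group anomalous locations into clusters of direct neighbours.

    adjacency: dict mapping each location to its neighbouring locations.
    anomalous: iterable of locations currently reporting anomalies.
    Two anomalous locations share a cluster only if a path of anomalous
    locations connects them.
    """
    remaining = set(anomalous)
    clusters = []
    while remaining:
        start = min(remaining)  # deterministic starting point
        remaining.discard(start)
        cluster = {start}
        queue = deque([start])
        # Breadth-first walk that only steps onto anomalous neighbours.
        while queue:
            node = queue.popleft()
            for neighbour in adjacency.get(node, ()):
                if neighbour in remaining:
                    remaining.discard(neighbour)
                    cluster.add(neighbour)
                    queue.append(neighbour)
        clusters.append(sorted(cluster))
    return clusters


# Hypothetical linear topology: A - B - C - D
TOPOLOGY = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
}

if __name__ == "__main__":
    # A and D are anomalous while B and C are healthy: two separate clusters.
    print(cluster_anomalies(TOPOLOGY, ["A", "D"]))
```

With anomalies at A and D only, the sketch reports two unrelated clusters; if B were anomalous as well, A and B would be grouped together and D would still stand alone.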

On the data center side, location data was usually pretty much irrelevant. This has not really changed, but on the other hand, application architectures have become sufficiently complex to generate their own logical topology, which is comparable in depth to the sorts of architectures that network teams have been used to working with. Data center admins could really benefit from advice about what sorts of topological analysis are helpful and which are less so.

Visibility

Historically, the data center team(s) and the network team have not seen eye to eye, in no small part because of the finger-pointing that occurs when it is not immediately clear where the root cause of a problem lies. The network is often the first to be blamed for an outage, and the network ops team have to laboriously prove their innocence before further troubleshooting can begin.

These days, with services spanning the divide, new architectures that blur the differences, and business demand requiring that teams work together across the gap, it is time to set aside that historical suspicion and start working together and learning from each other.

One way to do that is to adopt new techniques that are built around the sorts of common concerns that apply to both teams. As both network and data center operations are dealing with increasingly dynamic and fast-changing environments, new approaches are needed that understand this new world and can satisfy the requirements of both worlds. Gartner’s model of AIOps has been developed specifically for this new environment, spanning all types of infrastructure, accommodating huge volumes of data and rapid pace of change through algorithms and machine learning, and fostering collaboration across team boundaries – including, yes, between the network and data center teams.

Imagine a war room that didn’t start with blaming the network, and instead enabled true cooperative troubleshooting and information sharing. If we all share our experience and knowledge, this is what we can get to.
