Nobl9: Leveraging SLOs for Scalable Reliability with Business Context

It’s that 3 a.m. notification that no site reliability engineer wants to receive: your monitoring tools are firing off alerts about constrained resources supporting your customer-facing application, and you need to fix it ASAP. But what if the issue is one that your customers won’t notice and that doesn’t impact your business? How could you know when an issue is critical and time sensitive versus one that can wait for the morning?

Welcome to one of the key challenges facing IT operations teams today — how to meet Service Level Agreements (SLAs) without creating alert fatigue or blowing up your cloud budget. Monitoring cloud stacks with an eye toward reliability, while balancing performance and cost, has become increasingly complex in a world of containers, microservices, third-party APIs, cloud databases, multiple cloud providers … the list goes on.

For Nobl9’s CEO, Marcin Kurc, and Chief Product Officer, Brian Singer, this challenge called for a fundamental pivot in the way we think about observability in the cloud. Drawing on their experiences at Google and AWS, they co-founded Nobl9, a software reliability platform that helps developers, IT operators, and SREs manage their IT infrastructure with business objectives in mind.

At the heart of the Nobl9 solution are Service Level Objectives (SLOs), key elements of an SLA mutually agreed upon by the software provider and its customer, whether the customer is external or internal. These measurable targets can be around application availability, latency, throughput, error rates, disaster recovery, and other variables — and putting these targets into an appropriate business context is critical to ensuring ITOps activities align with business needs.

Part of this involves “error budgets,” defined as the acceptable error rate for a cloud service. Whether the SLOs are around a mission-critical, customer-facing application with a “four nines” (99.99%) SLA or an internal application that just needs to function most of the time, there’s always some room for acceptable risk, and the error budget puts a number on it. If your SLOs dictate that an application can reasonably have 10 minutes of downtime per month, why wake your SREs in the middle of the night when the application is on track for just a few minutes of downtime? Why blindly deploy more cloud resources to increase availability, while materially increasing your cloud bill?
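To make the arithmetic concrete, here is a minimal sketch of how an availability target translates into a downtime budget, and how downtime so far might be projected against it. This is not Nobl9's method or API: the function names, the 30-day window, and the simple linear projection are all assumptions chosen for illustration.

```python
# Illustrative only: hypothetical helper names, not part of any Nobl9 API.

MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability target over a window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return (1.0 - slo_target) * window_days * MINUTES_PER_DAY


def projected_downtime_minutes(
    downtime_so_far: float, days_elapsed: float, window_days: int = 30
) -> float:
    """Assumed linear projection of downtime to the end of the window."""
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    return downtime_so_far * window_days / days_elapsed


def should_page(
    downtime_so_far: float,
    days_elapsed: float,
    budget_minutes: float,
    window_days: int = 30,
) -> bool:
    """Page only if the projected downtime would exceed the budget."""
    projected = projected_downtime_minutes(downtime_so_far, days_elapsed, window_days)
    return projected > budget_minutes


# A "four nines" target over 30 days leaves roughly 4.32 minutes of budget.
four_nines_budget = error_budget_minutes(0.9999)

# With a 10-minute monthly budget, 2 minutes of downtime by day 15 projects
# to 4 minutes for the month, so under this rule nobody is woken up.
wake_the_sre = should_page(downtime_so_far=2.0, days_elapsed=15, budget_minutes=10.0)
```

A real SLO platform would typically work from request-level error rates and burn-rate thresholds over several time windows rather than a single linear projection, but the underlying comparison of consumption against budget is the same.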

The answer is you don’t need to, nor should you. But you can only reach that conclusion if you understand how your error rates compare to your error budgets. And that’s where Nobl9’s SLO monitoring platform separates itself from traditional monitoring tools. By providing IT operators with the visibility needed to meet their SLOs, Nobl9 keeps services up and running, customers happy, cloud bills down, and SREs well-rested. That’s why we at Cisco Investments are thrilled to be investors in Nobl9.

Cloud-Native Complexity Necessitates a Modern Approach to IT Operations

Before launching Nobl9 in 2019, Kurc and Singer both worked at Google, and Kurc previously spent time at AWS. It was Google and AWS that pioneered SLOs to create a scalable relationship between resources (both human and technological) and availability, in an effort to maximize customer satisfaction while managing costs.

But the rise of Kubernetes and microservices ultimately was the impetus for Nobl9. Adopting a cloud-native stack speeds the creation of modern applications, but it also embodies an architectural shift that is orders of magnitude more complex than a traditional monolithic stack.

“We believed very strongly—we still do—that if Kubernetes keeps growing, then people cannot run those environments without SLOs,” says Kurc. “And we’re getting to the point where it’s very obvious for a number of organizations out there that what they’re missing is SLOs.”

Another tailwind driving demand for SLOs is a focus on customer satisfaction. The head of IT operations for a software company may, for example, request $5 million of additional budget to fund increased capacity in their public cloud. When the CFO asks what they will get in return, the obvious answer is, “we need it in order to serve our customers.” Far less obvious is how to measure this.

“SLOs are really, really good at doing this,” Kurc says. “They create this bridge between the infrastructure or the technical parts of the organization and the business context of customer satisfaction. And that’s very important to our customers.”

What's Next for Nobl9?

Kurc says his main focus is growth and fostering a company culture where employees are not afraid to fail, and where they learn from their mistakes and move forward.

“It’s a continuous challenge,” he says. “I’ve spent a lot of time on it.”

This much is certain: the immense complexity around monitoring cloud stacks will continue to drive demand for solutions that take the heavy lifting out of customers’ hands. With cloud providers likely to try to meet some of this demand, Kurc expects providers of on-premises solutions will be pushed to take an architectural approach and to move higher up the stack. In time, AI and automation could play a greater role in managing infrastructure.

“I think simplification of monitoring is definitely going to be a huge theme,” he says.
