By Harshjit Sethi, Mayank Porwal and Sidhant Goyal 

When apocalyptic events happen in the digital world, all eyes turn to the Site Reliability Engineer. Imagine that the login service of a global web application goes down, and millions of users can’t access their accounts. Or consider that the checkout service of a popular e-commerce website goes down, and the business temporarily stops generating revenue. It’s the SRE, the firefighter of the software world, to the rescue.

Site reliability engineering is no mean task. The world is migrating to a microservices architecture operating on a vast and dynamic mesh of connected infrastructure and managed services, both in the cloud and on on-premise servers. As complexity in software infrastructure compounds, new points of failure multiply, and the probability of an anomalous event bringing down the entire system increases. Add to that the pace of innovation, one of the few real moats in an uber-competitive digital world: modern developers wear multiple releases per day as a badge of honor. It is evident why reliability, or ensuring that your system is ‘up’ and performing per spec, is a challenge.

To peek into your ‘black box’ software systems and know the internal state from indicators such as metrics, logs and traces is the holy grail of observability, and the quest for this elusive goal has created a $100 billion market cap industry, featuring legendary companies such as Splunk, Elastic and Datadog. However, when a high-severity incident occurs, it is still human intelligence that saves the day. A war room of SREs, developers and engineers with deep context on the system works round the clock, analysing data across multiple tools to identify and resolve the root cause. In most cases this takes hours if not days, with massive adverse impact on the business.

Piyush Verma and Nishant Modak, two of the most recognized SREs in our region, faced this problem for more than a decade. They realized that what SREs lacked was a shared mental model of their infrastructure. In early 2020, they set out to build this.

Last9, which joined Surge, our program for early stage startups, in 2020, offers a novel approach to reliability. At its core is a knowledge graph of the microservices and infrastructure components that comprise software systems, and their interconnections. Each node of this knowledge graph is married to a time-series database that stores metrics for each component, which maps how services and components are interacting at any point in time. This interplay of the interconnections and interactions of physical and logical components leads to two significant outcomes.

  1. Rapid root cause isolation in the event of downtime: If any microservice goes down, the tool automatically traverses the knowledge graph in minutes to isolate the specific component or service that is malfunctioning, enabling rapid remediation.
  2. Ability to predict system downtime in advance: Decaying metrics for any component (e.g., a database instance or an API gateway) that are precursors of impending service downtime can be identified proactively, because the knowledge graph shows which dependent services a failing component would take down with it.
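
To make the two outcomes above concrete, here is a minimal Python sketch of the general idea, not Last9's actual implementation. Everything in it is an assumption made for illustration: the service names, the error-rate samples, the 0.10 threshold, and the helper functions `isolate_root_causes` and `is_decaying`. It models the knowledge graph as a dependency map, attaches a short metric series to each node, walks the graph from a failing service to find the deepest unhealthy component, and flags components whose metrics are steadily worsening.

```python
from collections import deque

# Hypothetical knowledge graph: each service maps to the components it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "payments": ["payments-db", "api-gateway"],
    "cart": ["cart-cache"],
    "payments-db": [],
    "api-gateway": [],
    "cart-cache": [],
}

# Hypothetical time series per node: recent error rates, oldest to newest.
ERROR_RATE = {
    "checkout": [0.01, 0.20, 0.45],
    "payments": [0.01, 0.25, 0.50],
    "cart": [0.01, 0.01, 0.01],
    "payments-db": [0.02, 0.30, 0.60],
    "api-gateway": [0.00, 0.01, 0.01],
    "cart-cache": [0.00, 0.00, 0.01],
}

THRESHOLD = 0.10  # assumed cut-off above which a node counts as unhealthy


def is_unhealthy(node):
    """A node is unhealthy if its latest error rate exceeds the threshold."""
    return ERROR_RATE[node][-1] > THRESHOLD


def isolate_root_causes(start):
    """Breadth-first walk from a failing service through unhealthy dependencies.

    Returns the unhealthy nodes that have no unhealthy dependencies of their
    own, i.e. the likeliest origins of the failure.
    """
    roots, seen, queue = [], {start}, deque([start])
    while queue:
        node = queue.popleft()
        if not is_unhealthy(node):
            continue
        bad_deps = [dep for dep in DEPENDS_ON[node] if is_unhealthy(dep)]
        if not bad_deps:
            roots.append(node)
        for dep in bad_deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return roots


def is_decaying(node, window=3):
    """Flag a node whose last `window` samples are strictly worsening."""
    samples = ERROR_RATE[node][-window:]
    return all(a < b for a, b in zip(samples, samples[1:]))


print(isolate_root_causes("checkout"))
print([node for node in DEPENDS_ON if is_decaying(node)])
```

With this sample data, the walk from `checkout` passes through `payments` and stops at `payments-db`, the only unhealthy node with no unhealthy dependencies. A production system would need far richer signals than a single threshold on one metric, but the shape of the traversal is the same.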

These twin effects solve the biggest pain point in an SRE’s life. No longer do they need war rooms, multiple stakeholders and an array of software tools to prevent system downtime: Last9 helps them isolate the root cause of events that occur, and proactively warns them of impending failures. In a world where the demand for SREs far exceeds supply, SRE automation tools like Last9 enable every company to incorporate reliability into their roadmap without hiring an entire SRE team.

We partnered with Last9 on Day One, working with Piyush and Nishant to flesh out the ideas even before they started Surge, and through the early days of product development with Disney+ Hotstar as an early customer and design partner. It was at Disney+ Hotstar that Last9’s technology was created and battle-tested; it is used there to ensure reliability during mega-events like the Indian Premier League, with as many as 5 million requests per second and 30 million concurrent viewers.

Nishant and Piyush believe that managing systems in production can be fun and embarrassingly easy. Their product has seen true web-scale from the very beginning and its footprint is now spreading across large enterprises in India and the US. We are glad to further strengthen our partnership with this special team by announcing Sequoia Capital India’s $11 million Series A investment in the company. The pace of innovation over the past two years has been breathtaking and we are excited to see what Last9 will achieve next on the global stage.
