Drawing a line between DevOps and Site Reliability Engineering does not make much sense, because one derives its purpose from the other. Each process evolved independently as a means to similar ends: ways to provision highly reliable systems. Interestingly, while Site Reliability Engineering has been rapidly gaining ground in recent years as a practice of value for tech-centric companies, it has been around as a named concept since the early 2000s. DevOps and Site Reliability Engineering should not be considered in terms of «one or the other». Instead, DevOps is the wider philosophy on building and releasing products, and SRE is a part of this philosophy with a focus on key business metrics.
What is DevOps culture in relation to Site Reliability Engineering?
Strip away tools and tech, and the core of DevOps is exposed as a way to bring development and operations teams together, a portmanteau in the most literal sense. To internalise a DevOps culture, teams must shift their thinking toward a shared vision that combines the best of Development and Operations: constant innovation in development, based on constant monitoring of metrics and the reliability lessons learned in operating solutions. DevOps has some implicit rules for success: teams must cooperate closely in development, then monitor, measure and improve, and finally drive efficiency in the development process based on collaboration with the wider business.
Site Reliability is a way of realising these rules in a specific manner depending on business domain and metrics. Companies of all shapes and sizes adopt DevOps in their own fashions — cloud vs. on-premise, differing domains, platform size and countless other aspects impacting the shape DevOps takes for each. Site Reliability Engineering works well for companies that have high-traffic systems that cause significant business impact when down.
What is a Site Reliability Engineer?
Google created the Site Reliability Engineer role in its teams to manage services like this: its search engine is the most understandable use case for the approach. Even in the early 2000s, the platform was experiencing huge and constant volumes of traffic in many formats from across the world. Google needed to ensure availability of its search engine (and many other behind-the-scenes services) to a near-perfect percentage, but it also needed to develop new features for the platform at a fast pace. To bring these two goals together, the SRE role was born.
DevOps engineers focus on implementing stability, optimisation and resiliency of systems so that they can be improved faster with less risk. Their mission is to make the delivery and execution of software products easier, automated and maintainable. This involves responsibilities from setting up logging and monitoring solutions to creating continuous deployment mechanisms. They are flexible by nature in relation to the needs of their development teams or groups. This is vital, as good DevOps engineers will guide teams toward better processes and implement optimisations along the way.
The Site Reliability Engineer role is centred on a specific way of working that is directed toward the same goals as DevOps but uses different approaches to achieve them. Primarily, SREs are laser-focused on business metrics and system performance. That’s not to say that DevOps engineers are not; business metrics are critical to DevOps. The difference is that Site Reliability Engineers focus on defining and upholding highly specific metrics for system performance in any way they can, as their primary function. The SRE role comes into its own in relation to systems which require a defined level of reliability and availability.
Key Aspects of Site Reliability Engineering
Upholding service level agreements (SLAs) for systems is what defines a Site Reliability Engineer’s main task. SLAs are baseline metrics for success — in essence, what is required to provide a great service to end-users. SREs must work closely with business stakeholders to create indicators for good performance — not technical jargon, but stats which directly affect business success, like traffic, error rate or user capacity.
Down the line, the SRE will use this information to communicate best practices surrounding releases with the engineering team. For example, if deploying a new version of the application increased error rate by 1%, which is above the defined maximum error rate, the SRE knows that this release needs to be rolled back and tested before redeployment. With a good monitoring setup, they can analyse multiple stats in unison to decide the best course of action depending on pre-defined SLAs.
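To make the rollback decision above concrete, here is a minimal sketch in Python. It assumes the team has already agreed a maximum error rate with business stakeholders; the names `ReleaseStats`, `should_roll_back` and `MAX_ERROR_RATE`, and all of the numbers, are hypothetical and chosen purely for illustration. A real setup would pull these figures from a monitoring system rather than hard-coding them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseStats:
    """Request counts observed for one release over a monitoring window."""
    total_requests: int
    failed_requests: int

    @property
    def error_rate(self) -> float:
        # Treat a window with no traffic as error-free rather than dividing by zero.
        if self.total_requests == 0:
            return 0.0
        return self.failed_requests / self.total_requests


def should_roll_back(stats: ReleaseStats, max_error_rate: float) -> bool:
    """Return True when the observed error rate exceeds the agreed maximum."""
    return stats.error_rate > max_error_rate


# Hypothetical figures: the agreed maximum error rate is 2%.
MAX_ERROR_RATE = 0.02
previous = ReleaseStats(total_requests=10_000, failed_requests=150)   # 1.5% errors
candidate = ReleaseStats(total_requests=10_000, failed_requests=250)  # 2.5% errors

print(should_roll_back(previous, MAX_ERROR_RATE))   # False
print(should_roll_back(candidate, MAX_ERROR_RATE))  # True
```

In this sketch the new release pushes the error rate one percentage point higher and over the agreed limit, so it is flagged for rollback. In practice an SRE would weigh this signal alongside other stats, such as traffic and capacity, before deciding.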
Possibly the most important technical quality of a Site Reliability Engineer is their expertise in finding the root cause of problems, which requires a good understanding of coding and debugging. This is prized because they can apply their detailed operational knowledge of their platform and its performance metrics to developing solutions that improve that platform. They are uniquely suited to building tools that improve system performance. In fact, they are incentivised to do so, because consulting with development teams makes up half of their job, with the other half being responsibility for performance. As a result, wider development teams benefit from the deployment efficiencies introduced by the SRE, and features can be delivered faster, with less operational risk.
The other side of SRE is all about the day-to-day health of systems. Taking a pragmatic approach to operational performance is vital. Good examples can be found in the retail domain, where some systems require 24/7 on-call monitoring due to global traffic. Godel teams share this responsibility between DevOps engineers and software engineers, which gives all team members shared ownership and visibility of system performance.
As you can tell from the scope of desirable skills outlined here, a Site Reliability Engineer must possess great skill and experience in fields beyond engineering — they must be able to interpret the business context of operations, too. They need to communicate with stakeholders about the expectations of system performance in terms of business availability and identify best practices with engineering teams to achieve those metrics in a reasonable time frame.
How does Godel approach Site Reliability Engineering?
Godel’s DevOps division is learning and growing every day for its clients. The team has implemented solutions across domains from retail to automotive and worked on systems as large-scale as comparethemarket.com’s. This is combined with a company-wide dedication to «product mindset» which upholds a focus on the client’s business goals, product vision and stakeholder needs as a top priority for every Godel team member.
Ultimately, Godel’s DevOps division understands the impact of system performance on its clients’ end-users and takes responsibility for introducing solutions that drive efficiency and resiliency across these systems. Site reliability engineering practices are often naturally applied as part of the product-led approach Godel teams take. This adds value across the entire engineering team. Knowledge is constantly shared between Godel and the client as the partnered teams learn best practices together and gain a better understanding of how to improve platforms in the context of end-user needs.
The best resource for understanding Site Reliability Engineering comes from its creator, Google. You can read the full book here: https://landing.google.com/sre/sre-book/chapters/foreword/