What is an SRE (Site Reliability Engineer)?

By Gologic with the collaboration of Alexandre Couëdelo.

Site Reliability Engineer (SRE) is a position that is gaining a lot of traction in the market, but for many people the concept still raises questions. What are the responsibilities of an SRE? Should your organization adopt the SRE model? How is SRE related to DevOps?

This article will give you a clear idea of this not-so-novel concept and show how important it is for frequently delivering reliable software and ensuring the best user experience without burning out your development teams.

SRE Definition

The SRE model was created at Google in 2003, when Ben Treynor Sloss joined the Silicon Valley company and set up the first SRE team. But what exactly is an SRE? SRE is a software engineering approach that uses software to manage systems, solve problems and automate operations-related tasks. SRE is an effective method for developing scalable and highly reliable software systems.

What is the role of SRE in a company or a team?

The one and only “official” responsibility of an SRE is to ensure that the system is reliable, meaning that the company meets the objectives written into its SLAs (service level agreements). SLAs are contractual objectives agreed between the customer and the supplier. They formalize SLOs as business objectives and lead to penalties if they are not met.

SLOs (service level objectives) are an internal measure of system performance. An example of an SLO would be “95% of users can log in under 100 ms” or “99.9% of HTTP requests are successful”. To know whether the SLOs are achieved, SLIs (service level indicators) are set up to measure them. An example of an SLI would be “Feature response time in ms”.
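To make the relationship between an SLI and an SLO concrete, here is a minimal Python sketch based on the login example above. The names (Slo, good_event_ratio, is_slo_met) and the sample latencies are hypothetical and chosen only for illustration. The sketch also assumes the SLO counts login events rather than distinct users.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Slo:
    """A target such as "95% of logins complete in under 100 ms"."""
    name: str
    target_ratio: float   # 0.95 means 95% of events must be "good"
    threshold_ms: float   # an event is "good" when its latency is below this


def good_event_ratio(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """SLI: fraction of measured events that were faster than the threshold."""
    if not latencies_ms:
        raise ValueError("no measurements to evaluate")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


def is_slo_met(slo: Slo, latencies_ms: Sequence[float]) -> bool:
    """The SLO is met when the measured SLI reaches the target ratio."""
    return good_event_ratio(latencies_ms, slo.threshold_ms) >= slo.target_ratio


login_slo = Slo(name="login latency", target_ratio=0.95, threshold_ms=100.0)

# Hypothetical sample: 19 fast logins and 1 slow one.
sample = [40.0] * 19 + [250.0]
print(good_event_ratio(sample, login_slo.threshold_ms))  # 0.95
print(is_slo_met(login_slo, sample))                     # True
```

The SLI is the measurement (the ratio of fast logins), while the SLO is the target that the measurement is compared against.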

SLOs are extremely effective tools that prevent alert fatigue and keep engineering teams focused on what matters to provide value to the end-user. Creating SLOs means adopting an error budget-based approach: the SLO defines how much unreliability is tolerated, and as long as you have not consumed that budget you should not let yourself be interrupted by a metric degradation.
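As an illustration of the error budget idea, the following Python sketch computes how much budget is left for a request-based SLO such as “99.9% of HTTP requests are successful”. The function names and the request counts are assumptions made for this example, and real teams may compute budgets over time windows rather than raw request counts.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # With a 99.9% target, 0.1% of all requests are allowed to fail.
    allowed_failures = (1.0 - slo_target) * total
    return 1.0 - failed / allowed_failures


def should_interrupt_team(budget_remaining: float) -> bool:
    """Only interrupt engineers once the budget is fully consumed."""
    return budget_remaining <= 0.0


# Hypothetical month: 1,000,000 requests, so about 1,000 failures are allowed.
healthy = error_budget_remaining(0.999, total=1_000_000, failed=250)
print(round(healthy, 3))               # 0.75
print(should_interrupt_team(healthy))  # False

overspent = error_budget_remaining(0.999, total=1_000_000, failed=1_200)
print(should_interrupt_team(overspent))  # True
```

In the first case a quarter of the budget is spent, so a temporary metric degradation is not a reason to stop planned work. In the second case the budget is exhausted and reliability work takes priority.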

As long as SLOs are respected (no firefighting required), an SRE will focus on bringing a concrete implementation of DevOps. This means making sure all the right tools are in place to create a DevOps (infinite) delivery cycle and collaborating with operations teams to make sure the underlying infrastructure and security are in place. SRE teams specialize first in tooling and CI/CD best practices. The challenge for a new SRE is to wrap their head around the numerous tools on the market.

What is the daily routine of an SRE?

An SRE will usually dedicate 50% of their time to supporting the tooling infrastructure and improving processes by coding pipelines and/or automation scripts. This also involves coaching developers and writing documentation to help them reach the quality level required to deploy to production.

The SRE is a gatekeeper for the production environment and as such has a responsibility to set the requirements before any deployment. Monitoring and operating the system in production takes the other 50% of the SRE's time. They must evaluate the risk of a release and can veto any deployment if the risk is too high. For the collaboration with development teams to go smoothly, it is essential that SREs participate in the early stages of product development to identify potential production hazards early on. This basically means taking the time to explain the rules of the game to the business by including them in the definition of SLOs and in the implementation of software monitoring once in production.

SRE vs DevOps: what is the difference?

Before talking about the role of SRE, let’s have a look at where in the organization you will find SREs. The best illustration is in the book DevOps Topologies by Matthew Skelton and Manuel Pais. If you don’t own the book yet, no worries: their website contains a summary of all the DevOps topologies. Yes, there are several ways to work in a DevOps mode, and SRE is one of them.

In the SRE model, development (Dev) and operations (Ops) are separated; however, Dev no longer throws software at Ops. In fact, both Dev and Ops are “hands-off” from running the software.

In this model, you have one or more dedicated teams facilitating collaboration and cooperation between the Dev and Ops teams. SRE’s main objective is reliability; as such, their goal is to set the baseline and the necessary automation to deploy and operate software in production. An SRE team usually brings together people with different backgrounds, such as system administrators, build masters, and developers. Operations automates and maintains the infrastructure that supports the organization’s needs. In most cases, the Ops contribution can be summarized as creating and maintaining a PaaS (Platform as a Service) for Devs to use in the delivery of the platform. The core idea is that Operations should be considered a service.

SRE & DevOps, recent models?

As written above, the SRE model was invented at Google: when Ben Treynor Sloss joined the organization in 2003, he founded an early SRE team. As a reference, the term DevOps was officially coined in 2008, which led to the first DevOpsDays conference in Belgium in 2009. As you can see, the SRE model was invented and put in place 5 years before we even started talking about DevOps. As a result, DevOps and SRE differ in many ways. For instance, SRE does not advocate the famous “You build it, you run it”. On the other hand, SRE implements some of the DevOps philosophy and even solves important issues of the basic DevOps model, as we will discuss in the next section.

“If you think of DevOps as a philosophy and an approach to working, you can argue that SRE implements some of the philosophy that DevOps describes, and is somewhat closer to a concrete definition of a job or role than, say, “DevOps engineer.” So, in a way, class SRE implements interface DevOps.”
The Site Reliability Workbook: Practical Ways to Implement SRE

Why might you need SREs?

What problem does SRE solve in the DevOps models? The simplest models, Type 1 and Type 2 from DevOps Topologies, have their limits when it comes to big companies. These models don’t scale well: with more and more products and technologies, creating a true Dev-Ops collaboration becomes difficult.

We have seen the same issue with Agile/Scrum: it works really well for a couple of teams, but once you have a dozen Scrum teams that, despite themselves, start to depend on one another, keeping all backlogs aligned with organizational needs can become hell. One of the solutions for big companies was to adopt SAFe (Scaled Agile Framework), a framework specifically designed to overcome these Agile drawbacks.

SRE solves a scalability problem for the DevOps model. When you have many teams working in Agile/DevOps and applying “You build it, you run it”, and the products they are developing are somewhat interconnected, Team A’s product is consumed by Team B’s services, which are in turn based on other teams’ APIs. You ultimately have intricate layers of services delivering an application to your clients.

But what prevents the castle from collapsing? How do we ensure the reliability of our system despite its complexity? How do we make sure that all the teams and microservices we are implementing best serve the interests of the end-user? Do all teams follow the same quality standard? Could a team with a high quality standard be affected by other teams’ lower expectations?

You can see where I am going: for DevOps to scale and guarantee the best service level for the end-user, you need people to coordinate all this complexity and ensure the quality of the delivery process. This is what the role of SRE is all about.

Conclusion

SRE is a software management model that was created at Google a few years before the term DevOps was officially coined. Since it implements some of the DevOps principles and is more scalable than the simplest DevOps models, it has quickly gained ground in medium to large companies.

The main responsibility of an SRE is to make sure the system always delivers the same quality of service to the end-user. SRE’s best weapons are SLOs, statistical indicators put in place to measure the reliability of the software and indicate when to intervene to fix issues.

The SRE also takes part in the business by informing the parties involved about the performance of their product. This is the major difference from a system administrator. The SRE plays a more front-stage role, directly translating technical data into business data!

Since SREs are responsible for the production environment, they have to set quality requirements for development teams to prevent risky deployments. Concretely, this translates into creating an automated pipeline and using and maintaining the numerous tools that make up the software supply chain.

Do you have any interest in the role of SRE? GO, join us! We have a role for you.
