What does it mean to be an SRE? Why do we need SREs? What do they work on in the product? Those questions are too vague for new SREs, especially for those who have never worked on sites with large traffic. To illustrate the function of SREs, we can use the analogy of building houses. If our team wants to build a house for a single family, we can build one similar to a house we built before. After a few years, the owner wants to add an AC system or a play area for the kids in the backyard. Maintenance of the house can also become an issue in the future, since winds, storms or hurricanes can damage plants in the yard, and pipes can leak as they age. On top of that, more and more people ask our team to build the same house! SREs would say: “Well, we can make it, and we should make sure that the quality stays good when introducing new features or scaling our service.” That’s who SREs are; they work on the side that external users cannot see to make the product run smoothly.
You could say SREs do not sound cool, because they work exclusively on internal systems and cannot show their work. That is not true. Internal users are also “users”, and SREs show their work through metrics, monitoring systems, automation systems, and the counters and logs accumulated daily. The performance of the code is as important as its functionality. Users have no desire to proceed if the site is down or slow.
Does this mean that SREs only need to know system administration plus some software engineering? What knowledge and skills should an SRE have before starting? That is a broad topic. Let’s break down their responsibilities first.
The first responsibility is to keep the system running and available. If you have a blog system that you visit to write your daily journal, do you care whether you can do it at any time? It looks fine, since you only deploy it on your own machine and store everything there. Wait, what if the hard drive crashes? What if other processes use too many resources and the blog system goes down? Making the system available to users is the first responsibility of SREs, who should solve those “what if” problems.
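One of those “what if” questions can be turned into a tiny check. The following is only a minimal sketch, assuming the blog stores its data on a local disk; the function names and the 10% threshold are hypothetical choices for illustration, not a recommended standard.

```python
import shutil


def free_fraction_ok(total_bytes, free_bytes, min_free_fraction=0.10):
    """Return True if at least min_free_fraction of the disk is still free."""
    if total_bytes <= 0:
        return False  # no usable disk information, treat as unhealthy
    return free_bytes / total_bytes >= min_free_fraction


def disk_is_healthy(path=".", min_free_fraction=0.10):
    """Check the disk that holds `path` (hypothetical location of the blog data)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return free_fraction_ok(usage.total, usage.free, min_free_fraction)
```

A real setup would run a check like this on a schedule and alert someone before the disk fills up, rather than after the blog has already gone down.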
The second one is to automate and scale. Let’s continue the story of our blog system. You find the blog system pretty nice and would like to share it with everyone around the world. But how? What about deploying it to multiple machines so that everyone can write their own journals, each on a dedicated machine’s blog system? Would it work? Wow, that might be too expensive. What if we make one machine accessible to multiple users so everyone can have their own blog system virtually? If one machine has reached its capacity, what can SREs do to keep other users happy? Is it challenging to automate the deployment on 1 million machines? Because the blog system is so beautiful and free, everyone on earth is trying to access it and sometimes finds it slow. Can we move some machines to the other side of the earth? Is that crazy? SREs would design a good load-balancing scheme so that everyone can access the blog system as fast as possible.
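To make the load-balancing idea a bit more concrete, here is a minimal sketch of one possible policy: send each new user to the machine with the lowest utilization, and refuse when every machine is full. The `Machine` class, its fields, and the capacity numbers are all assumptions made up for this illustration; real load balancers also weigh things like geography and latency.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    capacity: int    # maximum concurrent users (hypothetical number)
    active: int = 0  # users currently assigned to this machine

    def has_room(self) -> bool:
        return self.active < self.capacity


def pick_machine(machines):
    """Return the machine with the lowest utilization, or None if all are full."""
    candidates = [m for m in machines if m.has_room()]
    if not candidates:
        return None
    return min(candidates, key=lambda m: m.active / m.capacity)


def assign_user(machines):
    """Assign one user and return the chosen machine's name (None if no room)."""
    machine = pick_machine(machines)
    if machine is None:
        return None
    machine.active += 1
    return machine.name
```

With two machines of capacity 2, four users are spread evenly across both, and a fifth gets `None`. That `None` is the signal that it is time to add machines, which is exactly where deployment automation comes in.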
The third one is to respond to emergencies. Anything can happen. For example, one machine goes down completely while 1,000 users are accessing it, and the load balancer redirects them to other machines. The emergency can cascade and affect far more than those 1,000 users who like writing journals! The most challenging part is to trace the root cause and fix it. Calm down, take a deep breath, and look at the code and the monitoring metrics; SREs will figure it out. After fixing it, the best we can do is figure out how not to make the same mistake again. Otherwise, history will repeat itself and create another emergency.
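Why can one dead machine hurt more than its own 1,000 users? A toy simulation helps. The sketch below assumes a very simple model that is not from any real system: when a machine fails, its users are split evenly among the survivors, and any survivor pushed over its capacity fails too. The function name and all numbers are hypothetical.

```python
def simulate_cascade(loads, capacities, first_failure):
    """Fail one machine, redistribute its users, and repeat for overloaded ones.

    loads and capacities map machine name -> number of users.
    Returns (set of failed machines, loads of the surviving machines).
    """
    loads = dict(loads)  # work on a copy so the caller's data is untouched
    down = set()
    pending = [first_failure]
    while pending:
        name = pending.pop()
        if name in down:
            continue  # this machine was already processed
        down.add(name)
        orphaned = loads.pop(name)
        survivors = list(loads)
        if not survivors:
            break  # nobody left to take the users: a full outage
        share, extra = divmod(orphaned, len(survivors))
        for i, survivor in enumerate(survivors):
            loads[survivor] += share + (1 if i < extra else 0)
        # Any survivor now above its capacity is the next one to fail.
        pending.extend(s for s in survivors if loads[s] > capacities[s])
    return down, loads
```

With three machines carrying 1,000, 800 and 800 users and a capacity of 1,200 each, losing the first machine pushes the other two to 1,300 and takes down all three. With a capacity of 1,500 each, the same failure is absorbed. Spare capacity is what turns a potential cascade into a non-event.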
To make the lovely blog system available to everyone, SREs have to take care of design, deployment and operations. The scope of knowledge is broad, but the key is defining the problem and applying the corresponding knowledge and skills.
The mindset of SREs is different because the first goal is to deliver a reliable system. How do you know whether you would be the right person for SRE? It depends on what you care about most. For example, do you care about how to implement a feature supporting multiple users online? Or do you care about how the traffic from multiple online users can affect the site? Do you care that everyone gets a consistent experience at any time? When an emergency happens after the system launches, can you stay calm and fix it? Do you want to trace the root cause? If the last four answers are “yes”, you might consider being an SRE.