Site Reliability Engineering and the Art of Improvisation – The New Stack


Matt Davis

Matt is a Senior Infrastructure Engineer at Blameless. His expertise includes data center operations, storage hardware and distributed databases, IT security, site reliability, support services, observability systems and TechOps leadership. A graduate in musical interpretation and composition, he is passionate about exploring the relationships between the artistic spirit and the operation of distributed software architectures.

Site Reliability Engineering (SRE) is based on orchestration and improvisation. Developing a great SRE practice requires a deep understanding of the technical infrastructure, but also the confidence to trust your gut and just start improvising.

I host a weekly continuous learning session at Blameless which takes its title from the traditional Indonesian orchestra: the Gamelan (pronounced “gah-meh-lahn”). This orchestra is mainly composed of percussion, many tuned gongs and mallets, a few string and wind instruments, usually a male or female singer, all brought together by the rhythm and the writing of improvised songs.

You see, a key part of gamelan is that the music is written by the band as it is practiced, with the belief that the music must grow and change. As they meet, the members continually develop new versions of the songs each time they play. Practices start to look like performances and vice versa. It’s a bit like improvising jazz, where the concert is just another time to come together and play.

Here are some of the ways we transform and evolve our understanding during these sessions:

  • Presentations of observability toolsets, aka “Morning Vistas”: What do you observe when you open the laptop to start the day and examine the operational landscape? This offers new perspectives on how our colleagues approach their regular work.
  • Decision requirements table building, for example the most difficult decisions encountered during on-call or live maintenance of our Kubernetes clusters. These help us think about how we can make improvements to help stakeholders make decisions under duress.
  • Team knowledge elicitation, like deeper views of NGINX Ingress logging or attempting a dependency matrix for our critical path. It is very useful to extract some of this juicy knowledge from the brains of our experts.
  • Asking the question “Why do we have on-call?” to share mental models of how different people in the company perceive and engage with it. We learn about everyone’s expectations and how we might alleviate fears of being on call for the first time.
  • Spin the wheel of expertise!, aka “Who? What? Where?” Here we explore our tech stack and our services through gameplay, asking each person to spin the wheel and show us firsthand how they would come up with the answer, or how they would escalate if they just didn’t know.

Spin the wheel of expertise

What we have created at Blameless is a learning opportunity and a time to come together collaboratively to share mental models and tell stories about different areas of the system in a safe and pressure-free way so that we can continue to learn. In this way, incidents are also just another time we can apply our powers of intuition because we have practiced techniques to resolve them. Specifically, we call it “The practice of the practice,” which is the experience we take in when we actually practice our craft – improvisation, production, incidents.

My motto has sometimes been that it doesn’t matter what we do together as long as we do it together. Regardless of who participates, discussions always immerse us in shared perspectives and provide a safe space for participants to explore things without fear of the judgment or anxiety associated with an incident. It is impossible for one person to know all the complexity of networked software, so it becomes essential to know where to find the expertise and how to learn from the practice instead of trying to follow hastily revised prescriptions or runbooks.

One of my favorite things about managing these learning opportunities is seeing participants use aspects of their regular work as we answer questions or explore one user interface or another. This allows others to glimpse the mental models of their colleagues. What may seem like mundane tasks to one person can enlighten another’s understanding, or even lead others to embellish their own models and styles of work.

Socio-technical praxis

Our themes and agendas are somewhat loose but usually planned out, so we don’t just stare at each other. However, sometimes we have to adapt. One session fell on the same day as a major vendor outage that disabled our ability to use part of our own UI to support that day’s game. So we pivoted, and it became a session with two of our experts on the subject of the vendor failure, which in this case involved root CAs and SSL/TLS.

While the focus is on the operational parts of our complex system, the participants are far from being just infrastructure engineers and SREs. We have sessions that include people from technical writing, software development, customer service, strategy, marketing, and even management. We make the calendar invitation optional and company-wide, and we don’t call it a meeting: it’s a session, where we can share stories and have fun in a live setting.

A session with members of different teams

In all of these activities, we seek to open doors that people might be afraid to go through, learning by experiencing how our peers respond to questions about a service or technology. We pick up on the patterns and praxis of others, which enriches our own set of intuitive responses, creating new pathways and connections in our own mental models. This enriches our view of the system and provides the basis for adapting when responding to incidents.

Build to adapt

In the great sociotechnical scheme of things, “The Practice of the Practice” allows us to build on the resilience that flourishes like the harmony of well-trained jazz musicians. The magic and excitement found in discovery is food for our brains. Our synapses crave rewarding pattern recognition, combining new experiences with old ones and with other mental models to form new ones.

The superhero power to instantly pull solutions seemingly out of nowhere originates in bringing our scales, melodies, theories, rhythms and other practiced patterns together in inspiring combinations.

Instead of enduring the stressful common ground failures during incidents that result in a bad customer experience, we’re looking for new ways to choreograph our socio-technical systems with more confidence. We see as an organization that there is power in this type of collaboration; participants have hailed these sessions as some of the best on-the-job learning they have ever done.

So it’s true that a better grasp of how to deal with ambiguity, rather than avoiding it, comes directly from knowing how to do our job better in the end. But we are not alone. To do this, we draw on our rich network of humans in joint collaborative activities, recognizing how our regular work interrelates and fuels the very complexity we seek to understand.

It’s not much different from the way musicians influence and support each other through their playing. Imagine how extremely uncomfortable events can be alleviated by an unpretentious session about the choices you have when your very reliable servers fail. Incidents are unplanned and therefore can be intimidating, but the team is there to support you. It’s a situation you’ve all been in, so it’s just another time when you come together to make music.

Featured Image Via Pixabay.
