In today’s fast-moving market, consumers expect both a steady stream of new features and uninterrupted service, and meeting both demands at once increases reliability and security risks.
Modern technology-driven organizations need to build or expand their SRE teams to address those risks, drive innovation, and maintain competitiveness. Still, each company is different and has specific needs. Building an SRE team from scratch is not the same as integrating an already established group that requires someone to lead the cultural and organizational shift.
The role of a founding SRE is a challenging spot to fill. Given the complexity of the market, knowing what to look for in a candidate is difficult. From the level of experience to knowledge of modern tools and DevOps and SRE practices, what exactly makes an engineer a successful first hire?
As more businesses make SRE hiring a top priority, we asked five industry experts what they think it takes to become a stellar founding SRE. Interestingly, their opinions share common ground: a stellar first infrastructure hire is an experienced engineer with demonstrated critical thinking and a highly adaptive nature. A natural leader, the founding SRE embodies DevOps culture.
According to Tina Huang, Founder and CTO of Transposit, “The role of a founding SRE is as much about building out a reliability program as it is about fighting fires; this includes everything from hiring to choosing the right tools to helping get your company compliance certifications like SOC2. It can be helpful to select for your SRE team engineers who, even if they haven’t built the program from scratch themselves, have been early enough at a company’s SRE function that they’ve seen this early groundwork laid.
SRE at a large organization has very different constraints and challenges. While many of those skills are transferable, the first hire should be someone who has domain expertise in what your organization needs at that time and the early SRE pragmatism to balance the many different demands of starting this new function in your organization.”
Ray Myers, Principal Site Reliability Engineer at divvyDOSE, emphasizes the importance of critical thinking above technical skills and having experience in navigating the vast tool ecosystem. “First off, you want command of the literature but don’t expect to simply drop a particular SRE topology into an existing organization. Start by observing. What are the pain points? What SRE practices are most valuable to introduce first, and how can they hook into the current processes?
Social skills are key, especially persuasion and listening. In order to influence how teams work, you need to earn their trust. Learn their motivations. Demonstrate that your agenda makes their lives easier.
Of course, technical skills, like automation, cloud, monitoring, and security, are important. The specific technologies are less important than having applied a variety of solutions and being able to navigate the vast tool ecosystem.”
For Dai Shi, Founding Site Reliability Engineer at Rockset, the role of a founding SRE starts with laying the groundwork “for how SRE will interact with the rest of the company. Building relationships with other teams and forming how they view SRE is crucial to the success of the future SRE team.
In order to accomplish our mission of achieving maximum developer velocity while still maintaining our SLA, we have to rely on other teams to understand their role in achieving good reliability and to work together with SRE. This can come in the form of a product or sales team notifying SRE early on that a large high-touch customer is being onboarded, which can drastically increase the proportional load on the system (especially during the early days, when the overall load is still small). It can be developer teams building the instrumentation of crucial metrics into a microservice and building dashboards to monitor it before launch. Finally, it can be having all the engineering teams on call for their own services instead of relegating the SREs to be the only ones holding the pager.
The last example I’ll give is ensuring that management supports SRE’s ability to draw the line when reliability becomes poor and allows them to do things such as delay launches and suspend deploying new features until reliability issues are fixed. Getting these things right can be much more challenging than technical stuff and is much harder to change later if you don’t get it right.”
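One common way to make that “line” explicit is an error budget check against the SLA. The quote above does not name this technique, so the following is only a minimal sketch under stated assumptions: the names `ReliabilityWindow`, `error_budget_remaining`, and `should_freeze_launches` are hypothetical, as is the idea of measuring reliability by counting failed requests over a window.

```python
from dataclasses import dataclass


@dataclass
class ReliabilityWindow:
    """Request counts observed over one SLA window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def error_budget_remaining(window: ReliabilityWindow, sla_target: float) -> float:
    """Return the unspent fraction of allowed failures (negative if overspent)."""
    # With a 99.9% target, 0.1% of requests are allowed to fail.
    allowed_failures = window.total_requests * (1.0 - sla_target)
    if allowed_failures == 0:
        # No traffic or a 100% target: any failure exhausts the budget.
        return 0.0 if window.failed_requests else 1.0
    return 1.0 - window.failed_requests / allowed_failures


def should_freeze_launches(
    window: ReliabilityWindow, sla_target: float, freeze_below: float = 0.0
) -> bool:
    """Assumed policy: pause feature launches once the budget drops to the threshold."""
    return error_budget_remaining(window, sla_target) <= freeze_below


if __name__ == "__main__":
    healthy = ReliabilityWindow(total_requests=1_000_000, failed_requests=500)
    degraded = ReliabilityWindow(total_requests=1_000_000, failed_requests=1_500)
    print(should_freeze_launches(healthy, sla_target=0.999))
    print(should_freeze_launches(degraded, sla_target=0.999))
```

The threshold and the window length are exactly the kind of decision that needs management support up front; the code only records the agreement, it does not replace it.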
According to Geoff Howland, Principal Site Reliability Engineer at Observe, the work of a founding SRE begins with several simultaneous goals: “initial post-developer configuration, initial automation plan, initial SRE documentation, keeping development velocity high, an initial culture of responsibility and communication.
Depending on how well you do these things, your company can have a smoother or rougher launch into a larger customer base.
There will almost always be an initial developer configuration (they may have chosen Kubernetes, a container deployment, or Ansible/Chef).
Starting out, you will need some things to be automated: CI/CD (to a lesser or greater extent CD), tests, the deployment process, and getting all metrics/logs ingested somewhere. Then, you will need to audit these and figure out what can be fixed later and what you have to prioritize immediately.
You need to start documenting as soon as you can. You will be setting the standard for documentation for both new SRE hires and developers who help out with documentation. How clean you make your organization and layout will determine if you have a jungle or more of a farm of information.
Don’t try to be a 20-year company in the 2nd or 3rd year! You need to talk with the founders and determine how outage- and failure-sensitive you are as you progress because, in the early stages, development velocity must stay high, even though it may cause more outages. Always be looking for when the balance tips. When it does, start being more cautious and putting on more guard rails and restrictions, and communicate and listen to how this changes development velocity.
In terms of culture, you are setting it going forward. How you act will determine how people think about SRE. When you make a mistake, admit it immediately and treat it as a problem to be solved. Document problems clearly and assign owners, not blame, but describe what happened and who was involved so everyone can learn. As you grow, processes will grow, but you need to determine if that becomes restrictive or helpful in the nuance of your decisions. The earlier you start, the more impact you can have on long-term culture.”
Shane Bostick, Principal Owner at bstk studio, argues that, from a technical standpoint, a founding SRE should demonstrate a capability to design and bring up foundational infrastructure from scratch in a secure manner.
“Given the huge diversity of application stacks, infrastructure architectures, and tooling solutions available, a founding SRE should understand the breadth of the technical landscape and how to work through unknowns to make decisions. Many decisions will need to be made along the way, and reaching consensus is often an unproductive way to converge. Know when to leverage a reference architecture or turn-key solution, raise infra/operational concerns, and have foresight into cross-functional dependency chains and distributed ownership.
As a founding Site Reliability Engineer, you should be able to make strategic considerations, such as:
Enable self-serve for dev teams to own the code quality, coverage, and production readiness of their apps and services. Know when to leverage an opinionated framework or architecture, and learn how to stay tool- and cloud-agnostic. Design for any engineer to understand, inherit, troubleshoot, and extend.
Know when to break the rules. Experience enables one to bound unknowns, prototype, and evaluate trade-offs quickly; it makes one aware that sometimes a proven design or pattern pays much bigger dividends over time.
A founding Site Reliability Engineer should think in terms of:
Accurately estimating complexity and effort takes experience, and design-by-committee and consensus finding can be prohibitively slow and ineffective. So instead, solve at breadth when possible and distribute ownership where it makes sense.”