Senior Site Reliability Engineer (SRE)

Stacklok
Full-time | $156k-198k/year (USD) | Bellevue, United States

📍 Job Overview

  • Job Title: Senior Site Reliability Engineer (SRE)
  • Company: Stacklok
  • Location: Bellevue, Washington
  • Job Type: Full-time, hybrid (3 days in office: Tuesday, Wednesday, Thursday)
  • Category: DevOps, Infrastructure
  • Date Posted: June 25, 2025

🚀 Role Summary

  • Design, build, and operate the infrastructure that powers Stacklok's products and services, ensuring reliability, scalability, and security.
  • Collaborate with cross-functional teams to deliver secure, scalable infrastructure for real-world AI use cases.
  • Evolve cloud-native systems using Kubernetes, Terraform, and ArgoCD, automating deployments and incident response.
  • Define and maintain key metrics, using telemetry and observability tools to proactively detect issues and drive systemic improvements.
  • Champion operational excellence by establishing and iterating on SLOs, incident response, and on-call practices.

💻 Primary Responsibilities

🔧 Design and Operate Reliable Infrastructure

  • Contribute to the evolution of Stacklok's infrastructure by designing and managing production systems that support multiple engineering teams.
  • Continuously improve platform performance, availability, and operational robustness through well-engineered solutions.

🤖 Automate Operational Workflows

  • Apply an automation-first mindset to reduce manual processes in provisioning, deployment, and incident response.
  • Deliver resilient tooling and workflows that enable faster delivery and improve reliability.

📈 Monitor and Improve Service Health

  • Define and maintain key metrics that reflect system performance and reliability.
  • Use telemetry and observability tooling to proactively detect issues and drive systemic improvements.

🏆 Champion Operational Excellence

  • Establish and iterate on SLOs, incident response, and on-call practices that ensure reliable service delivery.
  • Promote a culture of accountability, preparedness, and continuous improvement.

👩‍🏫 Mentor and Enable Engineering Teams

  • Share production knowledge, write and maintain high-quality runbooks and system documentation, and support engineers in adopting sound operational practices.
  • Contribute to a healthy, inclusive engineering culture through mentorship and collaboration.

🎓 Skills & Qualifications

🎓 Education

  • Bachelor's degree in Computer Science, a related field, or equivalent experience.

🕒 Experience

  • 5+ years of proven experience in Site Reliability Engineering or a similar role.
  • Experience with programming languages such as Python, Go, Bash, or similar, with an emphasis on clear structure, testing, and operational reliability.

🛠 Required Skills

  • Strong foundation in Site Reliability Engineering (SRE).
  • Proficiency in Infrastructure as Code (IaC) tools, such as Terraform.
  • Hands-on experience with Kubernetes and Docker in production environments.
  • Familiarity with cloud-native architecture patterns and experience with at least one major cloud provider (AWS preferred).
  • Experience with GitOps practices and deployment tooling, such as ArgoCD.
  • Proficiency with log aggregation and telemetry tools, such as AWS CloudWatch, Prometheus, Grafana, or similar.
  • Experience defining and using SLOs and KPIs to guide reliability goals and improve service quality.
  • Strong written and verbal communication skills, with the ability to collaborate across technical and non-technical audiences.

🌟 Preferred Skills

  • Experience with incident response automation tools, such as PagerDuty.
  • Familiarity with operational and infrastructure security best practices.
  • Track record of delivering technical solutions that drive measurable business outcomes.
  • Experience working in a fast-paced, high-growth startup environment.

📊 Portfolio & Project Requirements

📂 Portfolio Essentials

  • A portfolio showcasing your experience in designing, implementing, and managing reliable infrastructure.
  • Examples of infrastructure as code (IaC) using Terraform or similar tools.
  • Demonstrations of cloud-native deployments and incident response automation.
  • Case studies highlighting your ability to improve service health, reduce toil, and drive systemic improvements.

📄 Technical Documentation

  • Well-documented runbooks, system diagrams, and process flows that demonstrate your ability to communicate complex technical concepts clearly and concisely.
  • Examples of code quality, testing, and operational best practices in your portfolio projects.

💵 Compensation & Benefits

💰 Salary Range

  • $156,000 - $198,000 per year

🎁 Benefits

  • Equity
  • Comprehensive healthcare
  • Flexible work environment
  • Flexible PTO

🎯 Team & Company Context

🏢 Company Culture

  • Industry: AI-first company focused on enterprise developer tools and agentic AI systems.
  • Company Size: Small to medium-sized, fast-growing startup.
  • Founded: 2021
  • Team Structure: Cross-functional teams with deep experience in open source, cloud-native technologies, security, and developer tools.
  • Development Methodology: Open, collaborative, and community-driven, with strong roots in open source and cloud-native technologies.

📈 Career & Growth Analysis

  • Career Level: Senior Site Reliability Engineer (SRE)
  • Reporting Structure: Reports directly to the Director of Engineering or a similar role, collaborating with multiple engineering teams.
  • Technical Impact: Own key production systems, lead reliability-focused engineering efforts, and drive systemic improvements in service health and reliability.

🌐 Work Environment

  • Office Type: Hybrid (3 days in office: Tuesday, Wednesday, Thursday)
  • Office Location(s): Bellevue, Washington (planning to relocate to a more central location)
  • Workspace Context: Collaborative office with three in-person days per week, balancing flexibility with the value of in-person collaboration and community.

🛠 Technology Stack & Web Infrastructure

💻 Backend & Server Technologies

  • Kubernetes: Proficiency in designing, deploying, and managing applications using Kubernetes.
  • Terraform: Experience with Infrastructure as Code (IaC) using Terraform or similar tools.
  • ArgoCD: Familiarity with GitOps practices and deployment tooling using ArgoCD or similar tools.
  • AWS: Proficiency with at least one major cloud provider, with AWS experience preferred.

🛠️ DevOps Tools

  • Git: Proficiency in using Git for version control and collaborative development.
  • PagerDuty: Experience with incident response automation tools like PagerDuty.
  • Prometheus & Grafana: Familiarity with log aggregation, telemetry, and monitoring tools such as Prometheus and Grafana.

👥 Team Culture & Values

🌟 Engineering Values

  • Reliability: Prioritize reliability and availability in all aspects of infrastructure design and operation.
  • Scalability: Build infrastructure that can scale with product adoption and user demand.
  • Security: Implement and maintain strong operational security practices, including secure software supply chain considerations.
  • Collaboration: Foster a culture of collaboration, mentorship, and knowledge sharing across engineering teams.
  • Continuous Improvement: Embrace a mindset of continuous improvement, driving systemic changes based on data-driven insights and user feedback.

🤝 Collaboration Style

  • Cross-functional Integration: Work closely with product, design, and other teams to deliver reliable, scalable infrastructure that meets business needs.
  • Code Review Culture: Participate in code reviews and contribute to a culture of shared responsibility and engineering excellence.
  • Mentorship and Knowledge Sharing: Share production knowledge, write and maintain high-quality runbooks, and support engineers in adopting sound operational practices.

🌱 Challenges & Growth Opportunities

🌱 Technical Challenges

  • Infrastructure Evolution: Design and implement scalable deployment ecosystems using Terraform and Kubernetes, embedding security and operational best practices from the outset.
  • Automation and Reliability: Deliver automation across provisioning, deployment, recovery, and operational workflows, significantly reducing manual effort and operational risk.
  • Service Health Optimization: Define and implement meaningful SLOs and KPIs tied to service health and business goals, driving optimizations and cost reduction.

🌱 Learning & Development Opportunities

  • Technical Skill Development: Stay up to date with emerging technologies, tools, and best practices in Site Reliability Engineering, cloud-native operations, and infrastructure as code.
  • Leadership and Mentorship: Develop leadership skills and mentor less experienced engineers, fostering a culture of shared responsibility and engineering excellence.
  • Architecture Decision-Making: Gain experience in architecture decision-making, contributing to the design and evolution of Stacklok's infrastructure and platform.

💡 Interview Preparation

💡 Technical Questions

  • Kubernetes and Terraform: Demonstrate your ability to design, deploy, and manage applications using Kubernetes and Terraform, with a focus on scalability, reliability, and security.
  • Incident Response: Describe your experience with incident response automation and proactively detecting issues using telemetry and observability tools.
  • SLOs and KPIs: Explain your approach to defining and using SLOs and KPIs to guide reliability goals and improve service quality.
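
For the SLO question above, a small worked calculation can make the discussion concrete. The sketch below is illustrative only and does not come from Stacklok: the function name, the 99.9% availability target, and the request counts are all assumptions chosen for the example. It shows one common way to express how much of an error budget remains in a measurement window.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget for a window.

    slo_target is the availability objective as a fraction (e.g. 0.999).
    A negative result means the budget has been overspent.
    All names and numbers here are hypothetical, for illustration only.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if failed_requests < 0:
        raise ValueError("failed_requests must not be negative")

    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


# Assumed example: 99.9% target, 1,000,000 requests, 400 failures.
# The SLO tolerates about 1,000 failures, so roughly 60% of the budget is left.
remaining = error_budget_remaining(0.999, 1_000_000, 400)
print(f"Error budget remaining: {remaining:.0%}")
```

In an interview answer, a number like this is typically tied to a decision, for example slowing feature releases when the remaining budget runs low.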

💡 Company & Culture Questions

  • AI-First Company: Demonstrate your understanding of Stacklok's AI-first mission and how your role as an SRE contributes to the company's goals.
  • Collaboration and Communication: Showcase your ability to collaborate effectively with cross-functional teams and communicate complex technical concepts clearly and concisely.

💡 Portfolio Presentation Strategy

  • Infrastructure Case Studies: Highlight your experience in designing, implementing, and managing reliable infrastructure, with a focus on automation, scalability, and security.
  • Code Quality and Documentation: Demonstrate your ability to write clean, well-documented code and maintain high-quality runbooks and system diagrams.

📌 Application Steps

To apply for this Senior Site Reliability Engineer (SRE) position at Stacklok:

  1. Submit your application through the application link provided.
  2. Customize your resume to highlight your relevant experience in Site Reliability Engineering, cloud-native operations, and infrastructure as code.
  3. Prepare a portfolio showcasing your experience in designing, implementing, and managing reliable infrastructure, with a focus on automation, scalability, and security.
  4. Research Stacklok's AI-first mission and company culture, and be prepared to discuss how your skills and experience align with the company's goals.

⚠️ Important Notice: This enhanced job description includes AI-generated insights and web technology industry-standard assumptions. All details should be verified directly with the hiring organization before making application decisions.


Application Requirements

Candidates should have a strong foundation in SRE and experience with programming, particularly in languages like Python or Go. Deep experience with Terraform, Kubernetes, and cloud-native operations is essential, along with a track record of delivering technical solutions that drive measurable business outcomes.