Senior Site Reliability Engineer (SRE)
Stacklok
Full-time • $156k-198k/year (USD) • Bellevue, United States
📍 Job Overview
- Job Title: Senior Site Reliability Engineer (SRE)
- Company: Stacklok
- Location: Bellevue, Washington
- Job Type: Hybrid (3 days in office: Tuesday, Wednesday, Thursday)
- Category: DevOps, Infrastructure
- Date Posted: June 25, 2025
🚀 Role Summary
- Design, build, and operate the infrastructure that powers Stacklok's products and services, ensuring reliability, scalability, and security.
- Collaborate with cross-functional teams to deliver secure, scalable infrastructure for real-world AI use cases.
- Evolve cloud-native systems using Kubernetes, Terraform, and ArgoCD, automating deployments and incident response.
- Define and maintain key metrics, using telemetry and observability tools to proactively detect issues and drive systemic improvements.
- Champion operational excellence by establishing and iterating on SLOs, incident response, and on-call practices.
💻 Primary Responsibilities
🔧 Design and Operate Reliable Infrastructure
- Contribute to the evolution of Stacklok's infrastructure by designing and managing production systems that support multiple engineering teams.
- Continuously improve platform performance, availability, and operational robustness through well-engineered solutions.
🤖 Automate Operational Workflows
- Apply an automation-first mindset to reduce manual processes in provisioning, deployment, and incident response.
- Deliver resilient tooling and workflows that enable faster delivery and improve reliability.
📈 Monitor and Improve Service Health
- Define and maintain key metrics that reflect system performance and reliability.
- Use telemetry and observability tooling to proactively detect issues and drive systemic improvements.
🏆 Champion Operational Excellence
- Establish and iterate on SLOs, incident response, and on-call practices that ensure reliable service delivery.
- Promote a culture of accountability, preparedness, and continuous improvement.
👩‍🏫 Mentor and Enable Engineering Teams
- Share production knowledge, write and maintain high-quality runbooks and system documentation, and support engineers in adopting sound operational practices.
- Contribute to a healthy, inclusive engineering culture through mentorship and collaboration.
🎓 Skills & Qualifications
🎓 Education
- Bachelor's degree in Computer Science, a related field, or equivalent experience.
🕒 Experience
- Proven experience (5+ years) in Site Reliability Engineering or a similar role.
- Experience with programming languages such as Python, Go, Bash, or similar, with an emphasis on clear structure, testing, and operational reliability.
🛠 Required Skills
- Strong foundation in Site Reliability Engineering (SRE).
- Proficiency in Infrastructure as Code (IaC) tools, such as Terraform.
- Hands-on experience with Kubernetes and Docker in production environments.
- Familiarity with cloud-native architecture patterns and cloud provider experience (AWS preferred).
- Experience with GitOps practices and deployment tooling, such as ArgoCD.
- Proficiency with log aggregation and telemetry tools, such as AWS CloudWatch, Prometheus, Grafana, or similar.
- Experience defining and using SLOs and KPIs to guide reliability goals and improve service quality.
- Strong written and verbal communication skills, with the ability to collaborate across technical and non-technical audiences.
🌟 Preferred Skills
- Experience with incident response automation tools, such as PagerDuty.
- Familiarity with operational and infrastructure security best practices.
- Track record of delivering technical solutions that drive measurable business outcomes.
- Experience working in a fast-paced, high-growth startup environment.
📊 Portfolio & Project Requirements
📂 Portfolio Essentials
- A portfolio showcasing your experience in designing, implementing, and managing reliable infrastructure.
- Examples of infrastructure as code (IaC) using Terraform or similar tools.
- Demonstrations of cloud-native deployments and incident response automation.
- Case studies highlighting your ability to improve service health, reduce toil, and drive systemic improvements.
📄 Technical Documentation
- Well-documented runbooks, system diagrams, and process flows that demonstrate your ability to communicate complex technical concepts clearly and concisely.
- Examples of code quality, testing, and operational best practices in your portfolio projects.
💵 Compensation & Benefits
💰 Salary Range
- $156,000 - $198,000 per year
🎁 Benefits
- Equity
- Comprehensive healthcare
- Flexible work environment
- Flexible PTO
🎯 Team & Company Context
🏢 Company Culture
- Industry: AI-first company focused on enterprise developer tools and agentic AI systems.
- Company Size: Small to medium-sized, fast-growing startup.
- Founded: 2021
- Team Structure: Cross-functional teams with deep experience in open source, cloud-native technologies, security, and developer tools.
- Development Methodology: Open, collaborative, and community-driven, with strong roots in open source and cloud-native technologies.
📈 Career & Growth Analysis
- Career Level: Senior Site Reliability Engineer (SRE)
- Reporting Structure: Reports directly to the Director of Engineering or a similar role, collaborating with multiple engineering teams.
- Technical Impact: Own key production systems, lead reliability-focused engineering efforts, and drive systemic improvements in service health and reliability.
🌐 Work Environment
- Office Type: Hybrid (3 days in office: Tuesday, Wednesday, Thursday)
- Office Location(s): Bellevue, Washington (planning to relocate to a more central location)
- Workspace Context: Collaborative workspace with three in-person days a week, balancing flexibility with the value of in-person collaboration and community.
🛠 Technology Stack & Infrastructure
💻 Backend & Server Technologies
- Kubernetes: Proficiency in designing, deploying, and managing applications using Kubernetes.
- Terraform: Experience with Infrastructure as Code (IaC) using Terraform or similar tools.
- ArgoCD: Familiarity with GitOps practices and deployment tooling using ArgoCD or similar tools.
- AWS: Proficiency with at least one major cloud provider, with AWS experience preferred.
🛠️ DevOps Tools
- Git: Proficiency in using Git for version control and collaborative development.
- PagerDuty: Experience with incident response automation tools like PagerDuty.
- Prometheus & Grafana: Familiarity with telemetry and monitoring tools such as Prometheus and Grafana, alongside log aggregation tooling such as AWS CloudWatch.
👥 Team Culture & Values
🌟 Engineering Values
- Reliability: Prioritize reliability and availability in all aspects of infrastructure design and operation.
- Scalability: Build infrastructure that can scale with product adoption and user demand.
- Security: Implement and maintain strong operational security practices, including secure software supply chain considerations.
- Collaboration: Foster a culture of collaboration, mentorship, and knowledge sharing across engineering teams.
- Continuous Improvement: Embrace a mindset of continuous improvement, driving systemic changes based on data-driven insights and user feedback.
🤝 Collaboration Style
- Cross-functional Integration: Work closely with product, design, and other teams to deliver reliable, scalable infrastructure that meets business needs.
- Code Review Culture: Participate in code reviews and contribute to a culture of shared responsibility and engineering excellence.
- Mentorship and Knowledge Sharing: Share production knowledge, write and maintain high-quality runbooks, and support engineers in adopting sound operational practices.
🌱 Challenges & Growth Opportunities
🌱 Technical Challenges
- Infrastructure Evolution: Design and implement scalable deployment ecosystems using Terraform and Kubernetes, embedding security and operational best practices from the outset.
- Automation and Reliability: Deliver automation across provisioning, deployment, recovery, and operational workflows, significantly reducing manual effort and operational risk.
- Service Health Optimization: Define and implement meaningful SLOs and KPIs tied to service health and business goals, driving optimizations and cost reduction.
🌱 Learning & Development Opportunities
- Technical Skill Development: Stay up-to-date with emerging technologies, tools, and best practices in Site Reliability Engineering, cloud-native operations, and infrastructure as code.
- Leadership and Mentorship: Develop leadership skills and mentor less experienced engineers, fostering a culture of shared responsibility and engineering excellence.
- Architecture Decision-Making: Gain experience in architecture decision-making, contributing to the design and evolution of Stacklok's infrastructure and platform.
💡 Interview Preparation
💡 Technical Questions
- Kubernetes and Terraform: Demonstrate your ability to design, deploy, and manage applications using Kubernetes and Terraform, with a focus on scalability, reliability, and security.
- Incident Response: Describe your experience with incident response automation and proactively detecting issues using telemetry and observability tools.
- SLOs and KPIs: Explain your approach to defining and using SLOs and KPIs to guide reliability goals and improve service quality.
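When preparing for the SLO question, it may help to have a small worked example of an error budget on hand. The following is a minimal sketch, assuming a request-based availability SLO; the function name, the 99.9% target, and the request counts are illustrative assumptions and are not taken from Stacklok's systems or this posting.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means no budget has been used, 0.0 means it is exhausted,
    and a negative value means the SLO has been breached.
    All inputs are hypothetical values for illustration.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if failed_requests < 0:
        raise ValueError("failed_requests must not be negative")

    # The error budget is the number of failures the SLO tolerates in the window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Assumed example: 99.9% target, 1,000,000 requests, 250 failures in the window.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.1%}")
```

With these assumed numbers, the SLO tolerates about 1,000 failed requests, so 250 failures leave roughly three quarters of the budget. In an interview, a calculation like this can anchor a discussion of when to slow releases or prioritize reliability work.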
💡 Company & Culture Questions
- AI-First Company: Demonstrate your understanding of Stacklok's AI-first mission and how your role as an SRE contributes to the company's goals.
- Collaboration and Communication: Showcase your ability to collaborate effectively with cross-functional teams and communicate complex technical concepts clearly and concisely.
💡 Portfolio Presentation Strategy
- Infrastructure Case Studies: Highlight your experience in designing, implementing, and managing reliable infrastructure, with a focus on automation, scalability, and security.
- Code Quality and Documentation: Demonstrate your ability to write clean, well-documented code and maintain high-quality runbooks and system diagrams.
📌 Application Steps
To apply for this Senior Site Reliability Engineer (SRE) position at Stacklok:
- Submit your application through the application link provided.
- Customize your resume to highlight your relevant experience in Site Reliability Engineering, cloud-native operations, and infrastructure as code.
- Prepare a portfolio showcasing your experience in designing, implementing, and managing reliable infrastructure, with a focus on automation, scalability, and security.
- Research Stacklok's AI-first mission and company culture, and be prepared to discuss how your skills and experience align with the company's goals.
⚠️ Important Notice: This enhanced job description includes AI-generated insights and web technology industry-standard assumptions. All details should be verified directly with the hiring organization before making application decisions.
Application Requirements
Candidates should have a strong foundation in SRE and experience with programming, particularly in languages like Python or Go. Deep experience with Terraform, Kubernetes, and cloud-native operations is essential, along with a track record of delivering technical solutions that drive measurable business outcomes.