📍 Job Overview

Job Title: Senior Site Reliability Engineer (SRE)
Company: Stacklok
Location: Bellevue, Washington
Job Type: Hybrid (3 days in office: Tuesday, Wednesday, Thursday)
Category: DevOps, Infrastructure
Date Posted: June 25, 2025

🚀 Role Summary

Design, build, and operate the infrastructure that powers Stacklok's products and services, ensuring reliability, scalability, and security.
Collaborate with cross-functional teams to deliver secure, scalable infrastructure for real-world AI use cases.
Evolve cloud-native systems using Kubernetes, Terraform, and ArgoCD, automating deployments and incident response.
Define and maintain key metrics, using telemetry and observability tools to proactively detect issues and drive systemic improvements.
Champion operational excellence by establishing and iterating on SLOs, incident response, and on-call practices.

💻 Primary Responsibilities

🔧 Design and Operate Reliable Infrastructure

Contribute to the evolution of Stacklok's infrastructure by designing and managing production systems that support multiple engineering teams.
Continuously improve platform performance, availability, and operational robustness through well-engineered solutions.

🤖 Automate Operational Workflows

Apply an automation-first mindset to reduce manual processes in provisioning, deployment, and incident response.
Deliver resilient tooling and workflows that enable faster delivery and improve reliability.

📈 Monitor and Improve Service Health

Define and maintain key metrics that reflect system performance and reliability.
Use telemetry and observability tooling to proactively detect issues and drive systemic improvements.

🏆 Champion Operational Excellence

Establish and iterate on SLOs, incident response, and on-call practices that ensure reliable service delivery.
Promote a culture of accountability, preparedness, and continuous improvement.

👩‍🏫 Mentor and Enable Engineering Teams

Share production knowledge, write and maintain high-quality runbooks and system documentation, and support engineers in adopting sound operational practices.
Contribute to a healthy, inclusive engineering culture through mentorship and collaboration.

🎓 Skills & Qualifications

🎓 Education

Bachelor's degree in Computer Science, a related field, or equivalent experience.

🕒 Experience

5+ years of proven experience in Site Reliability Engineering or a similar role.
Experience with programming languages such as Python, Go, Bash, or similar, with an emphasis on clear structure, testing, and operational reliability.

🛠 Required Skills

Strong foundation in Site Reliability Engineering (SRE).
Proficiency in Infrastructure as Code (IaC) tools, such as Terraform.
Hands-on experience with Kubernetes and Docker in production environments.
Familiarity with cloud-native architecture patterns and experience with at least one major cloud provider (AWS preferred).
Experience with GitOps practices and deployment tooling, such as ArgoCD.
Proficiency with log aggregation and telemetry tools, such as AWS CloudWatch, Prometheus, Grafana, or similar.
Experience defining and using SLOs and KPIs to guide reliability goals and improve service quality.
Strong written and verbal communication skills, with the ability to collaborate across technical and non-technical audiences.

🌟 Preferred Skills

Experience with incident response automation tools, such as PagerDuty.
Familiarity with operational and infrastructure security best practices.
Track record of delivering technical solutions that drive measurable business outcomes.
Experience working in a fast-paced, high-growth startup environment.

📊 Web Portfolio & Project Requirements

📂 Portfolio Essentials

A portfolio showcasing your experience in designing, implementing, and managing reliable infrastructure.
Examples of infrastructure as code (IaC) using Terraform or similar tools.
Demonstrations of cloud-native deployments and incident response automation.
Case studies highlighting your ability to improve service health, reduce toil, and drive systemic improvements.

📄 Technical Documentation

Well-documented runbooks, system diagrams, and process flows that demonstrate your ability to communicate complex technical concepts clearly and concisely.
Examples of code quality, testing, and operational best practices in your portfolio projects.

💵 Compensation & Benefits

💰 Salary Range

$156,000 - $198,000 per year

🎁 Benefits

Equity
Comprehensive healthcare
Flexible work environment
Flexible PTO

🎯 Team & Company Context

🏢 Company Culture

Industry: AI-first company focused on enterprise developer tools and agentic AI systems.
Company Size: Small to medium-sized, fast-growing startup.
Founded: 2021
Team Structure: Cross-functional teams with deep experience in open source, cloud-native technologies, security, and developer tools.
Development Methodology: Open, collaborative, and community-driven, with strong roots in open source and cloud-native technologies.

📈 Career & Growth Analysis

Career Level: Senior Site Reliability Engineer (SRE)
Reporting Structure: Reports directly to the Director of Engineering or a similar role, collaborating with multiple engineering teams.
Technical Impact: Own key production systems, lead reliability-focused engineering efforts, and drive systemic improvements in service health and reliability.

🌐 Work Environment

Office Type: Hybrid (3 days in office: Tuesday, Wednesday, Thursday)
Office Location(s): Bellevue, Washington (planning to relocate to a more central location)
Workspace Context: Collaborative workspace with three in-person days a week, balancing flexibility with the value of face-to-face collaboration and community.

🛠 Technology Stack & Web Infrastructure

🛠️‍💻 Frontend Technologies

Not applicable for this role

💻 Backend & Server Technologies

Kubernetes: Proficiency in designing, deploying, and managing applications using Kubernetes.
Terraform: Experience with Infrastructure as Code (IaC) using Terraform or similar tools.
ArgoCD: Familiarity with GitOps practices and deployment tooling using ArgoCD or similar tools.
AWS: Proficiency with at least one major cloud provider, with AWS experience preferred.

🛠️ DevOps Tools

Git: Proficiency in using Git for version control and collaborative development.
PagerDuty: Experience with incident response automation tools like PagerDuty.
Prometheus & Grafana: Familiarity with log aggregation, telemetry, and monitoring tools such as Prometheus and Grafana.

👥 Team Culture & Values

🌟 Web Development Values

Reliability: Prioritize reliability and availability in all aspects of infrastructure design and operation.
Scalability: Build infrastructure that can scale with product adoption and user demand.
Security: Implement and maintain strong operational security practices, including secure software supply chain considerations.
Collaboration: Foster a culture of collaboration, mentorship, and knowledge sharing across engineering teams.
Continuous Improvement: Embrace a mindset of continuous improvement, driving systemic changes based on data-driven insights and user feedback.

🤝 Collaboration Style

Cross-functional Integration: Work closely with product, design, and other teams to deliver reliable, scalable infrastructure that meets business needs.
Code Review Culture: Participate in code reviews and contribute to a culture of shared responsibility and engineering excellence.
Mentorship and Knowledge Sharing: Share production knowledge, write and maintain high-quality runbooks, and support engineers in adopting sound operational practices.

🌱 Challenges & Growth Opportunities

🌱 Technical Challenges

Infrastructure Evolution: Design and implement scalable deployment ecosystems using Terraform and Kubernetes, embedding security and operational best practices from the outset.
Automation and Reliability: Deliver automation across provisioning, deployment, recovery, and operational workflows, significantly reducing manual effort and operational risk.
Service Health Optimization: Define and implement meaningful SLOs and KPIs tied to service health and business goals, driving optimizations and cost reduction.

🌱 Learning & Development Opportunities

Technical Skill Development: Stay up to date with emerging technologies, tools, and best practices in Site Reliability Engineering, cloud-native operations, and infrastructure as code.
Leadership and Mentorship: Develop leadership skills and mentor less experienced engineers, fostering a culture of shared responsibility and engineering excellence.
Architecture Decision-Making: Gain experience in architecture decision-making, contributing to the design and evolution of Stacklok's infrastructure and platform.

💡 Interview Preparation

💡 Technical Questions

Kubernetes and Terraform: Demonstrate your ability to design, deploy, and manage applications using Kubernetes and Terraform, with a focus on scalability, reliability, and security.
Incident Response: Describe your experience with incident response automation and proactively detecting issues using telemetry and observability tools.
SLOs and KPIs: Explain your approach to defining and using SLOs and KPIs to guide reliability goals and improve service quality.

💡 Company & Culture Questions

AI-First Company: Demonstrate your understanding of Stacklok's AI-first mission and how your role as an SRE contributes to the company's goals.
Collaboration and Communication: Showcase your ability to collaborate effectively with cross-functional teams and communicate complex technical concepts clearly and concisely.

💡 Portfolio Presentation Strategy

Infrastructure Case Studies: Highlight your experience in designing, implementing, and managing reliable infrastructure, with a focus on automation, scalability, and security.
Code Quality and Documentation: Demonstrate your ability to write clean, well-documented code and maintain high-quality runbooks and system diagrams.

📌 Application Steps

To apply for this Senior Site Reliability Engineer (SRE) position at Stacklok:

Submit your application through the application link provided.
Customize your resume to highlight your relevant experience in Site Reliability Engineering, cloud-native operations, and infrastructure as code.
Prepare a portfolio showcasing your experience in designing, implementing, and managing reliable infrastructure, with a focus on automation, scalability, and security.
Research Stacklok's AI-first mission and company culture, and be prepared to discuss how your skills and experience align with the company's goals.

⚠️ Important Notice: This enhanced job description includes AI-generated insights and web technology industry-standard assumptions. All details should be verified directly with the hiring organization before making application decisions.
