Platform Site Reliability Engineer
📍 Job Overview
- Job Title: Platform Site Reliability Engineer
- Company: Nexthink
- Location: Phoenix, Arizona, United States
- Job Type: Full-time
- Category: DevOps, Site Reliability Engineering
- Date Posted: 2025-06-27
- Experience Level: 5-10 years
- Remote Status: On-site/Hybrid
🚀 Role Summary
- Design, build, and maintain reliable, secure, and scalable infrastructure for a multi-tenant SaaS platform.
- Monitor system health and application performance, and improve incident response practices.
- Collaborate with software engineers to embed reliability and observability into every service.
- Work closely with the team to ensure high availability and performance of the platform.
📝 Enhancement Note: This role requires a strong background in Site Reliability Engineering (SRE) and platform engineering, with a focus on cloud services and Kubernetes. Familiarity with chaos engineering and compliance standards is a plus.
💻 Primary Responsibilities
- Infrastructure Design & Maintenance: Design, build, and maintain the infrastructure powering the multi-tenant SaaS platform, ensuring reliability, security, and scalability.
- Cloud Services Management: Implement and manage cloud-native systems using best-in-class tools and automation on AWS.
- Kubernetes Cluster Management: Operate and enhance Kubernetes clusters, deployment pipelines, and service meshes to support continuous delivery.
- SLO & SLA Management: Establish and enforce SLOs, SLAs, and error budgets, and proactively address availability and performance issues.
- Incident Response & Resolution: Troubleshoot, isolate, and resolve incidents with minimal reliance on other teams. Participate in a shared on-call rotation and drive timely resolution and communication.
- Automation & Tooling: Develop infrastructure as code (Terraform or similar) for repeatable and auditable provisioning. Experience in programming solutions for platform tools is a plus.
- Security & Compliance: Contribute to security best practices, compliance automation, and cost optimization.
📝 Enhancement Note: This role requires a strong focus on incident prevention and proactive monitoring to ensure high availability and performance of the platform. A deep understanding of Linux systems, networking, and common troubleshooting practices is essential.
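For candidates less familiar with the error-budget responsibility listed above, the underlying arithmetic can be sketched in a few lines of Python. This is only an illustration: the 99.9% availability target, the 30-day window, and the function names are assumptions chosen for the example, not figures or tooling taken from Nexthink.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO allows over a window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, window_days: int, downtime_minutes: float) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # Assumed example: a 99.9% SLO over 30 days with 10.8 minutes of downtime so far.
    print(round(error_budget_minutes(0.999, 30), 1))
    print(round(budget_remaining(0.999, 30, 10.8), 2))
```

A team would typically slow or pause risky releases as the remaining fraction approaches zero, which is one way an error budget gets "enforced" in practice.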
🎓 Skills & Qualifications
Education: Bachelor's degree in Computer Science, Engineering, or a related field.
Experience: Minimum of 5 years in an SRE/platform engineering role supporting SaaS platforms.
Required Skills:
- Strong hands-on experience with public cloud services (AWS, GCP, Azure).
- Proficiency with Kubernetes, container-based deployment, and related ecosystems (Helm, etc.).
- Strong programming or scripting skills (Python, Go, Bash, etc.).
- Experience with CI/CD pipelines (e.g., GitHub Actions, GitLab CI, ArgoCD).
- Experience with observability stacks (Prometheus, ELK/EFK, Datadog, etc.).
- Comfort with being part of a rotating on-call schedule, including handling critical incidents and conducting post-incident reviews.
- Strong system-level troubleshooting skills and a proactive mindset toward incident prevention.
- Deep understanding of Linux systems, networking, and common troubleshooting practices.
- Experience supporting multi-tenant microservices architectures.
Preferred Skills:
- Familiarity with service mesh (e.g., Istio).
- Knowledge of zero-downtime deployment strategies, blue/green and canary releases.
- Exposure to compliance standards such as SOC 2, ISO 27001, or HIPAA. FedRAMP experience is a big plus.
- Experience with chaos engineering or resilience testing practices.
📊 Web Portfolio & Project Requirements
Portfolio Essentials:
- Demonstrate your experience with cloud services (AWS, GCP, Azure) and Kubernetes.
- Showcase your programming or scripting skills with examples of infrastructure as code (Terraform, etc.).
- Highlight your experience with CI/CD pipelines and observability stacks.
- Include examples of your system-level troubleshooting skills and incident response processes.
Technical Documentation:
- Provide documentation for your infrastructure as code (Terraform, etc.) projects, explaining the architecture, deployment processes, and server configuration.
- Include testing methodologies, performance metrics, and optimization techniques for your projects.
📝 Enhancement Note: This role requires a strong focus on infrastructure as code, automation, and documentation. Be prepared to discuss your approach to technical debt and refactoring.
💵 Compensation & Benefits
Salary Range: $120,000 - $180,000 per year (based on 5-10 years of experience in the Phoenix, AZ area)
Benefits:
- Flexible hours and unlimited vacation: unlimited paid time off on top of 15 days of holidays, plus 11 company-paid holidays and 3 extra days for volunteering.
- Hybrid work model that balances office and remote work, with structured onboarding to foster connections and team integration.
- Free access to professional training platforms to explore your interests and enhance your skills.
- Up to 16 weeks of paid leave for birthing parents/primary caregivers, 6 weeks for secondary caregivers.
- 401(k) plan with up to 4% company matching contributions that vest immediately.
- Bonuses for referring successful hires after three months of continuous employment.
Working Hours: 40 hours per week, with flexible hours and a hybrid work model.
📝 Enhancement Note: The salary range provided is based on market research for the Phoenix, AZ area and the experience level required for this role. Benefits are comprehensive and include health, dental, vision, and retirement plans.
🎯 Team & Company Context
🏢 Company Culture
Industry: Nexthink is the leader in digital employee experience management software, providing IT leaders with unprecedented insight into employee issues, allowing them to see, diagnose, and fix problems at scale before employees notice the issue.
Company Size: Nexthink has over 1,200 customers and 1,000 employees across 5 continents, operating as One Team and connecting, collaborating, and innovating to continuously grow. With over 75 nationalities working with them, Nexthink is committed to diversity, inclusion, and equity.
Founded: Nexthink was founded in 2007 and is dual-headquartered in Lausanne, Switzerland, and Boston, Massachusetts.
Team Structure:
- Nexthink operates as One Team, connecting, collaborating, and innovating to continuously grow.
- The Platform Engineering/SRE organization is responsible for designing, building, and maintaining the infrastructure powering the multi-tenant SaaS platform.
- The team works closely with software engineers to embed reliability and observability into every service.
Development Methodology:
- Nexthink uses Agile methodologies, with structured onboarding to foster connections and team integration.
- The company emphasizes collaboration, knowledge sharing, and continuous learning.
- Nexthink provides free access to professional training platforms to explore interests and enhance skills.
Company Website: Nexthink
📝 Enhancement Note: Nexthink's culture is centered around collaboration, innovation, and continuous learning. The company values diversity, inclusion, and equity, with over 75 nationalities working together across 5 continents.
📈 Career & Growth Analysis
Web Technology Career Level: This role is at the senior level, requiring a minimum of 5 years of experience in SRE/platform engineering. The role offers significant opportunities for growth and leadership within the Platform Engineering/SRE organization.
Reporting Structure: The Platform Site Reliability Engineer reports directly to the Head of Platform Engineering/SRE.
Technical Impact: This role has a significant impact on the reliability, security, and scalability of the multi-tenant SaaS platform. The Platform Site Reliability Engineer works closely with software engineers to embed reliability and observability into every service, ensuring high availability and performance of the platform.
Growth Opportunities:
- Technical Leadership: This role offers opportunities for technical leadership within the Platform Engineering/SRE organization, with the potential to manage teams and drive architecture decisions.
- Emerging Technology Adoption: Nexthink is at the forefront of digital employee experience management software, offering opportunities to work with emerging technologies and drive innovation.
- Career Progression: With over 1,000 employees across 5 continents, Nexthink offers significant opportunities for career progression within the company.
📝 Enhancement Note: This role offers significant opportunities for growth and leadership within the Platform Engineering/SRE organization. With Nexthink's commitment to innovation and emerging technology adoption, this role is ideal for candidates looking to advance their careers in SRE and platform engineering.
🌐 Work Environment
Office Type: Nexthink's Phoenix office is a hybrid work environment, balancing office and remote work with structured onboarding to foster connections and team integration.
Office Location(s): Phoenix, Arizona, United States
Workspace Context:
- Nexthink provides free access to professional training platforms to explore interests and enhance skills.
- The company offers flexible hours and unlimited vacation, allowing employees to balance work and personal responsibilities.
- Nexthink's culture emphasizes collaboration, knowledge sharing, and continuous learning, with regular team-building activities and social events.
Work Schedule: 40 hours per week, with flexible hours and a hybrid work model.
📝 Enhancement Note: Nexthink's hybrid work environment offers the best of both worlds, allowing employees to balance office and remote work while fostering connections and team integration. The company's commitment to collaboration, knowledge sharing, and continuous learning creates a dynamic and engaging work environment.
📄 Application & Technical Interview Process
Interview Process:
- Technical Phone Screen: A 30-minute phone screen to assess your technical skills and understanding of SRE and platform engineering concepts.
- Technical Deep Dive: A 2-hour deep dive into your technical skills, focusing on your experience with cloud services, Kubernetes, and infrastructure as code. You will be asked to discuss your approach to incident response, automation, and documentation.
- Behavioral Interview: A 1-hour behavioral interview to assess your cultural fit, problem-solving skills, and ability to work collaboratively in a team environment.
- Final Interview: A 1-hour final interview with the hiring manager to discuss your career goals, expectations, and any remaining questions you may have.
Portfolio Review Tips:
- Highlight your experience with cloud services (AWS, GCP, Azure) and Kubernetes.
- Include examples of your programming or scripting skills with infrastructure as code (Terraform, etc.) projects.
- Showcase your experience with CI/CD pipelines and observability stacks.
- Include examples of your system-level troubleshooting skills and incident response processes.
Technical Challenge Preparation:
- Brush up on your knowledge of cloud services (AWS, GCP, Azure), Kubernetes, and infrastructure as code (Terraform, etc.).
- Practice your problem-solving skills and be prepared to discuss your approach to incident response, automation, and documentation.
- Familiarize yourself with Nexthink's products and services, and be prepared to discuss how your technical skills align with the company's needs.
📝 Enhancement Note: Nexthink's interview process is designed to assess your technical skills, cultural fit, and problem-solving abilities. Be prepared to discuss your approach to incident response, automation, and documentation, and highlight your experience with cloud services, Kubernetes, and infrastructure as code.
🛠 Technology Stack & Web Infrastructure
Frontend Technologies: Not applicable for this role.
Backend & Server Technologies:
- Cloud Services: AWS (Amazon Web Services) – Nexthink's primary cloud provider, with experience in GCP (Google Cloud Platform) and Azure a plus.
- Kubernetes: Kubernetes is the container orchestration platform used by Nexthink to manage and deploy applications at scale.
- Infrastructure as Code: Terraform – Nexthink uses Terraform to provision and manage infrastructure in a repeatable and automated way.
- Observability Stacks: Prometheus, Grafana, Datadog – Nexthink uses these tools to monitor system health, application performance, and user-facing SLAs.
Development & DevOps Tools:
- CI/CD Pipelines: GitHub Actions, ArgoCD – Nexthink uses these tools to automate the deployment and testing of applications.
- Monitoring Tools: Datadog, Prometheus, Grafana – Nexthink uses these tools to monitor system health, application performance, and user-facing SLAs.
- Version Control: Git – Nexthink uses Git for version control and collaboration on code and infrastructure projects.
📝 Enhancement Note: Nexthink's technology stack is centered around cloud services, Kubernetes, and infrastructure as code. The company uses a combination of open-source and commercial tools to automate deployment, testing, and monitoring.
👥 Team Culture & Values
Web Development Values:
- Reliability: Nexthink values reliability above all else, ensuring high availability and performance of the multi-tenant SaaS platform.
- Security: Nexthink is committed to security best practices, compliance automation, and cost optimization.
- Innovation: Nexthink is at the forefront of digital employee experience management software, offering opportunities to work with emerging technologies and drive innovation.
- Collaboration: Nexthink operates as One Team, connecting, collaborating, and innovating to continuously grow. The company values collaboration, knowledge sharing, and continuous learning.
Collaboration Style:
- Cross-Functional Integration: Nexthink's Platform Engineering/SRE organization works closely with software engineers, designers, and stakeholders to embed reliability and observability into every service.
- Code Review Culture: Nexthink emphasizes code review and pair programming practices to ensure high-quality code and knowledge sharing.
- Knowledge Sharing: Nexthink provides free access to professional training platforms to explore interests and enhance skills, fostering a culture of continuous learning and growth.
📝 Enhancement Note: Nexthink's culture is centered around collaboration, innovation, and continuous learning. The company values reliability, security, and collaboration, with a strong focus on embedding reliability and observability into every service.
⚡ Challenges & Growth Opportunities
Technical Challenges:
- Cloud Services: Nexthink's primary cloud provider is AWS, with experience in GCP and Azure a plus. Candidates should be comfortable working with cloud services and managing multi-tenant SaaS platforms.
- Kubernetes: Nexthink uses Kubernetes to manage and deploy applications at scale. Candidates should have strong experience with Kubernetes and container-based deployment.
- Incident Response: Nexthink values proactive incident prevention and response. Candidates should be comfortable working in an on-call rotation and driving timely resolution and communication.
Learning & Development Opportunities:
- Technical Leadership: Nexthink offers opportunities for technical leadership within the Platform Engineering/SRE organization, with the potential to manage teams and drive architecture decisions.
- Emerging Technology Adoption: Nexthink is at the forefront of digital employee experience management software, offering opportunities to work with emerging technologies and drive innovation.
- Career Progression: With over 1,000 employees across 5 continents, Nexthink offers significant opportunities for career progression within the company.
📝 Enhancement Note: Nexthink's technical challenges and growth opportunities are centered around cloud services, Kubernetes, and incident response. The company offers significant opportunities for technical leadership, emerging technology adoption, and career progression.
💡 Interview Preparation
Technical Questions:
- Cloud Services: Questions related to AWS, GCP, or Azure, focusing on your experience with multi-tenant SaaS platforms and cloud services management.
- Kubernetes: Questions related to Kubernetes, container-based deployment, and related ecosystems (Helm, etc.).
- Incident Response: Questions related to incident response, automation, and documentation, focusing on your approach to proactive incident prevention and response.
Company & Culture Questions:
- Company Culture: Questions related to Nexthink's culture, values, and work environment, focusing on your understanding of the company's commitment to collaboration, innovation, and continuous learning.
- Team Dynamics: Questions related to team dynamics, cross-functional collaboration, and knowledge sharing, focusing on your ability to work collaboratively in a team environment.
- Career Goals: Questions related to your career goals, expectations, and long-term plans, focusing on your alignment with Nexthink's opportunities for growth and leadership.
Portfolio Presentation Strategy:
- Cloud Services: Highlight your experience with cloud services (AWS, GCP, Azure) and multi-tenant SaaS platforms.
- Kubernetes: Showcase your experience with Kubernetes, container-based deployment, and related ecosystems (Helm, etc.).
- Incident Response: Include examples of your system-level troubleshooting skills and incident response processes.
- Automation & Documentation: Highlight your experience with infrastructure as code (Terraform, etc.) and automation tools, focusing on your approach to incident prevention and response.
📌 Application Steps
To apply for this Platform Site Reliability Engineer position:
- Submit your application through the application link provided on the job posting.
- Prepare your portfolio by highlighting your experience with cloud services (AWS, GCP, Azure), Kubernetes, and infrastructure as code (Terraform, etc.). Include examples of your system-level troubleshooting skills and incident response processes.
- Tailor your resume to SRE and platform engineering roles, emphasizing your project highlights and technical skills relevant to this position.
- Prepare for the technical interview by brushing up on your knowledge of cloud services, Kubernetes, and infrastructure as code (Terraform, etc.). Practice your problem-solving skills and be prepared to discuss your approach to incident response, automation, and documentation.
- Research Nexthink and its products and services, focusing on the company's commitment to collaboration, innovation, and continuous learning. Be prepared to discuss how your technical skills align with the company's needs.
⚠️ Important Notice: This enhanced job description includes AI-generated insights and web technology industry-standard assumptions. All details should be verified directly with the hiring organization before making application decisions.
Application Requirements
Candidates should have a minimum of 5 years in an SRE/platform engineering role supporting SaaS platforms and strong hands-on experience with public cloud services. Proficiency in Kubernetes and programming or scripting skills is also required.