Senior Service Reliability Engineer (SRE)

Referrals Only
Full-time • Santiago, Chile

📝 Job Overview

  • Job Title: Senior Service Reliability Engineer (SRE)
  • Company: Thoughtworks (Referrals Only)
  • Location: Santiago, Región Metropolitana, Chile
  • Job Type: Full-time
  • Category: DevOps, Infrastructure
  • Date Posted: June 11, 2025
  • Experience Level: 5-10 years
  • Remote Status: Remote OK

🚀 Role Summary

  • Drive site reliability and resilience through strategic automation, monitoring, and incident response.
  • Collaborate cross-functionally with application development teams to improve system reliability.
  • Champion Site Reliability Engineering (SRE) principles to evolve traditional operations into a more agile and customer-focused approach.
  • Ensure technical excellence and operational efficiency within the infrastructure domain.

📝 Enhancement Note: This role requires a strong background in Site Reliability Engineering, with a focus on improving reliability, resilience, and system performance. The ideal candidate will have experience working with various cloud platforms and a deep understanding of modern design patterns.

💻 Primary Responsibilities

  • Reliability Engineering: Improve site reliability by building fault-tolerant mechanisms and reducing response times.
  • Observability Integration: Drive the integration of observability automation into the CI/CD pipeline.
  • Incident Management: Handle production incidents, manage incident communication, and draft root cause analysis documents.
  • Performance Scaling: Monitor and improve the performance of production systems to meet business goals within expected SLA and SLO metrics.
  • Reliability Consultation: Work closely with application development teams to advise on improving system reliability and assist in implementing reliability improvements.
  • Observability Improvement: Enhance system observability by improving logging and metrics and by reducing false alarms, which eliminates unnecessary toil and improves process efficiency.
  • Chaos Engineering: Implement chaos engineering practices to test system reliability and set up processes for regular testing.
  • Reliability Direction: Understand client goals and business needs to set the direction for site reliability, such as achieving 99.999% application availability with minimal or no disruption.
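
To make the 99.999% figure concrete, the sketch below converts an availability target into a downtime budget. It is an illustration only, not part of the original posting: the function name `allowed_downtime_seconds` and the 365-day and 30-day periods are assumptions chosen for the example. Under those assumptions, 99.999% leaves roughly 5.26 minutes of downtime per year, or about 26 seconds per 30-day month.

```python
def allowed_downtime_seconds(availability_pct: float, period_days: float) -> float:
    """Return the downtime budget, in seconds, for a target availability.

    availability_pct is a percentage such as 99.999; period_days is the
    length of the measurement window (hypothetical parameters for this sketch).
    """
    if not 0 < availability_pct <= 100:
        raise ValueError("availability_pct must be in the range (0, 100]")
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    period_seconds = period_days * 24 * 60 * 60
    # The unavailable fraction of the window is the downtime (error) budget.
    return period_seconds * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        yearly = allowed_downtime_seconds(target, 365)
        monthly = allowed_downtime_seconds(target, 30)
        print(f"{target}%: {yearly / 60:.2f} min/year, {monthly:.2f} s per 30 days")
```

Each additional "nine" cuts the budget by a factor of ten, which is why a five-nines target generally depends on automated failover rather than manual incident response.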

📝 Enhancement Note: This role requires a strong focus on problem-solving, incident management, and driving reliability improvements. The ideal candidate will have experience working in high-availability environments and a proven track record of improving system reliability.

🎓 Skills & Qualifications

Education: Bachelor's degree in Computer Science, Engineering, or a related field. Relevant experience may substitute for formal education.

Experience: 5-10 years of experience in Site Reliability Engineering, DevOps, or a related role.

Required Skills:

  • Hands-on experience in programming and scripting languages such as Python, Go, or Bash.
  • Strong understanding of Google Cloud Platform (GCP).
  • Proficiency in using observability tools (e.g., Grafana, Datadog, New Relic, ELK Stack, Dynatrace) to dissect and identify root causes of system and infrastructure issues.
  • Familiarity with DevOps and GitOps practices.
  • Strong knowledge of container-based architecture and orchestration tools (e.g., Kubernetes, AWS EKS, Docker Swarm, Nomad).
  • Understanding of technical architecture and modern design patterns, including microservices, serverless functions, NoSQL, and RESTful APIs.
  • Experience creating infrastructure resources that follow the principles of the cloud provider's Well-Architected Framework.

Preferred Skills:

  • Experience working with multiple cloud platforms (e.g., AWS, Azure, GCP).
  • Familiarity with chaos engineering practices and tools.
  • Knowledge of infrastructure as code (IaC) tools (e.g., Terraform, CloudFormation, Pulumi).

📝 Enhancement Note: This role requires a strong technical background in Site Reliability Engineering, with a focus on improving reliability, resilience, and system performance. The ideal candidate will have experience working with various cloud platforms and a deep understanding of modern design patterns.

📊 Web Portfolio & Project Requirements

Portfolio Essentials:

  • Demonstrate your ability to improve site reliability through case studies or projects showcasing fault-tolerant mechanisms, reduced response times, and improved system performance.
  • Showcase your experience with observability tools by presenting metrics, logs, and dashboards that helped identify and resolve system issues.
  • Highlight your incident management skills through examples of successfully handled production incidents, including communication with clients and root cause analysis documents.

Technical Documentation:

  • Provide code samples or snippets demonstrating your proficiency in programming and scripting languages (e.g., Python, Go, Bash).
  • Showcase your understanding of container-based architecture and orchestration tools (e.g., Kubernetes, Docker) through project documentation or deployment scripts.
  • Include any relevant certifications or training in Site Reliability Engineering, DevOps, or related fields.

📝 Enhancement Note: This role requires a strong focus on portfolio demonstration, with a particular emphasis on site reliability improvements, observability tools, and incident management. The ideal candidate will have a well-structured portfolio showcasing their technical skills and achievements in these areas.

💵 Compensation & Benefits

Salary Range: The estimated salary range for a Senior Service Reliability Engineer in Santiago, Chile, is between CLP 7,000,000 and CLP 9,000,000 per year (USD 9,500 - USD 12,200). This estimate is based on regional market data and industry standards for similar roles.

Benefits:

  • Competitive benefits package, including health insurance and retirement plans.
  • Learning and development opportunities, such as interactive tools, development programs, and cultivation culture support.
  • A dynamic work environment that values diversity, inclusion, and collaboration.

Working Hours: Full-time position with a standard workweek of 40 hours. This role may require participation in a 24/7 on-call rotation and availability for maintenance windows and project deadlines.

📝 Enhancement Note: The estimated salary range provided is based on regional market data and industry standards for similar roles. The actual salary may vary depending on factors such as experience, skills, and company-specific compensation structures.

🎯 Team & Company Context

🏒 Company Culture

Industry: Thoughtworks is a global technology consultancy that integrates strategy, design, and engineering to drive digital innovation. They specialize in working with clients to build solutions that look past the obvious.

Company Size: Thoughtworks is a mid-sized company with a strong cultivation culture that supports employee growth and development. This size allows for a balance between autonomy and support, enabling employees to take ownership of their careers while benefiting from the strength of the company's culture.

Founded: Thoughtworks was founded in 1993 and has since grown into a global organization with offices in multiple countries.

Team Structure:

  • The SRE team works closely with application development teams to improve system reliability and drive operational excellence.
  • The team is responsible for ensuring technical excellence and operational efficiency within the infrastructure domain.
  • The SRE role requires strong collaboration and communication skills, as well as the ability to work effectively with cross-functional teams.

Development Methodology:

  • Thoughtworks follows Agile development methodologies, with a focus on iterative development, continuous improvement, and customer collaboration.
  • The SRE team works closely with application development teams to integrate observability automation into the CI/CD pipeline and ensure reliable deployments.
  • The team uses tools such as Jira, Confluence, and Git to manage projects, track progress, and collaborate on code.

Company Website: Thoughtworks

📝 Enhancement Note: Thoughtworks is known for its strong cultivation culture, which focuses on empowering employees to take ownership of their careers and supporting their growth and development. This culture, combined with the company's focus on digital innovation and collaboration, makes Thoughtworks an attractive option for professionals looking to advance their careers in the tech industry.

📈 Career & Growth Analysis

Web Technology Career Level: Senior Service Reliability Engineer (SRE) roles require a deep understanding of Site Reliability Engineering principles, with a focus on improving reliability, resilience, and system performance. This role involves driving reliability improvements, collaborating with cross-functional teams, and championing SRE practices within the organization.

Reporting Structure: The Senior Service Reliability Engineer reports directly to the SRE team lead or manager and works closely with application development teams to improve system reliability.

Technical Impact: In this role, you will have a significant impact on the reliability and performance of the company's infrastructure. Your work will directly contribute to the achievement of business goals and the delivery of high-quality products and services to clients.

Growth Opportunities:

  • Technical Growth: Deepen your expertise in Site Reliability Engineering, cloud platforms, and modern design patterns. Explore opportunities to specialize in specific areas, such as chaos engineering, observability, or incident management.
  • Leadership Development: Develop your leadership skills by mentoring junior team members, driving process improvements, and contributing to the growth of the SRE team.
  • Architecture Decisions: As you gain experience and expertise, you may have the opportunity to make strategic architecture decisions that shape the direction of the company's infrastructure.

📝 Enhancement Note: This role offers significant opportunities for career growth and development, with a focus on technical expertise, leadership, and architecture decision-making. The ideal candidate will be proactive in seeking out new challenges and taking ownership of their career progression.

🌐 Work Environment

Office Type: Thoughtworks has a collaborative work environment that fosters cross-functional team interaction and knowledge sharing. The company's offices are designed to be comfortable, modern, and well-equipped with the tools and resources needed to support productivity and innovation.

Office Location(s): Santiago, Chile

Workspace Context:

  • Collaborative Environment: Thoughtworks' offices are designed to encourage collaboration and teamwork, with open-plan workspaces, meeting rooms, and breakout areas.
  • Development Tools: The company provides access to the latest development tools, multiple monitors, and testing devices to support productivity and efficiency.
  • Cross-Functional Interaction: The SRE team works closely with application development teams, designers, and other stakeholders to ensure reliable and high-quality products and services.

📝 Enhancement Note: Thoughtworks' collaborative work environment, combined with its focus on cross-functional team interaction and knowledge sharing, makes it an attractive option for professionals looking to advance their careers in the tech industry.

📄 Application & Technical Interview Process

Interview Process:

  1. Technical Assessment: A hands-on technical assessment focused on Site Reliability Engineering, cloud platforms, and modern design patterns. This assessment may include coding challenges, system design exercises, and incident management scenarios.
  2. Behavioral Interview: A behavioral interview focused on problem-solving, communication, and collaboration skills. This interview may include questions about your experience working with cross-functional teams, incident management, and driving reliability improvements.
  3. Final Evaluation: A final evaluation based on your technical assessment, behavioral interview, and cultural fit. This evaluation may include a discussion of your career goals, growth opportunities, and alignment with the company's values and culture.

Portfolio Review Tips:

  • Highlight your experience with Site Reliability Engineering, cloud platforms, and modern design patterns through case studies, projects, and technical documentation.
  • Showcase your incident management skills through examples of successfully handled production incidents, including communication with clients and root cause analysis documents.
  • Include any relevant certifications or training in Site Reliability Engineering, DevOps, or related fields.

Technical Challenge Preparation:

  • Brush up on your knowledge of Site Reliability Engineering principles, cloud platforms, and modern design patterns.
  • Practice incident management scenarios and develop your problem-solving skills through hands-on exercises and case studies.
  • Prepare for behavioral interview questions by reflecting on your experience working with cross-functional teams, incident management, and driving reliability improvements.

ATS Keywords: Site Reliability Engineering, SRE, Google Cloud Platform (GCP), Observability, Incident Management, Reliability, Resilience, Infrastructure, DevOps, GitOps, Kubernetes, Microservices, Serverless Functions, NoSQL, RESTful APIs, Chaos Engineering, Technical Leadership, Architecture Decisions, Agile, Collaboration, Cross-Functional Teams, Problem-Solving, Reliability Improvements.

📝 Enhancement Note: The interview process for this role is designed to assess the candidate's technical expertise in Site Reliability Engineering, as well as their problem-solving, communication, and collaboration skills. The ideal candidate will have a strong portfolio demonstrating their experience with cloud platforms, modern design patterns, and incident management.

🛠 Technology Stack & Web Infrastructure

Cloud Platform: GCP (Google Cloud Platform)

Observability Tools:

  • Grafana
  • Datadog
  • New Relic
  • ELK Stack (Elasticsearch, Logstash, Kibana)
  • Dynatrace

Containerization & Orchestration:

  • Kubernetes
  • AWS EKS (Elastic Kubernetes Service)
  • Docker Swarm
  • Nomad

Infrastructure as Code (IaC) Tools:

  • Terraform
  • CloudFormation
  • Pulumi

Monitoring & Alerting:

  • Prometheus
  • Grafana
  • AlertManager

CI/CD Pipeline:

  • Jenkins
  • GitLab CI/CD
  • CircleCI

Version Control:

  • Git
  • GitHub
  • GitLab

Collaboration & Communication:

  • Jira
  • Confluence
  • Slack
  • Microsoft Teams

📝 Enhancement Note: This role requires a strong understanding of the Thoughtworks technology stack, including cloud platforms, observability tools, containerization, and infrastructure as code. The ideal candidate will have experience working with these tools and a deep understanding of modern design patterns and Site Reliability Engineering principles.

👥 Team Culture & Values

Web Development Values:

  • Reliability: Thoughtworks values reliability as a critical component of delivering high-quality products and services to clients. The company is committed to driving reliability improvements and ensuring the availability and performance of its infrastructure.
  • Collaboration: Thoughtworks fosters a collaborative work environment that encourages cross-functional team interaction and knowledge sharing. The company values open communication, active listening, and a culture of mutual respect.
  • Continuous Improvement: Thoughtworks is committed to continuous improvement, with a focus on iterative development, customer collaboration, and learning from failure. The company encourages experimentation, innovation, and a growth mindset.
  • Customer Focus: Thoughtworks places a strong emphasis on understanding and meeting the needs of its clients. The company works closely with clients to build solutions that look past the obvious and drive digital innovation.

Collaboration Style:

  • Cross-Functional Integration: Thoughtworks encourages collaboration between developers, designers, and other stakeholders to ensure reliable and high-quality products and services.
  • Code Review Culture: The company fosters a code review culture that emphasizes knowledge sharing, pair programming, and continuous learning.
  • Knowledge Sharing: Thoughtworks encourages employees to share their knowledge and expertise with their colleagues, contributing to the growth and development of the team as a whole.

📝 Enhancement Note: Thoughtworks' culture is built on a foundation of collaboration, continuous improvement, and customer focus. The company values open communication, active listening, and a culture of mutual respect, making it an attractive option for professionals looking to advance their careers in the tech industry.

⚡ Challenges & Growth Opportunities

Technical Challenges:

  • Reliability Improvement: Develop and implement fault-tolerant mechanisms and reduce response times to improve site reliability and resilience.
  • Observability Integration: Drive the integration of observability automation into the CI/CD pipeline, reducing false alarms and improving process efficiency.
  • Incident Management: Handle production incidents, manage incident communication with clients, and draft root cause analysis documents to minimize downtime and ensure business continuity.
  • Performance Scaling: Monitor and improve the performance of production systems to meet business goals within expected SLA and SLO metrics.
  • Chaos Engineering: Implement chaos engineering practices to test system reliability and set up processes for regular testing.

Learning & Development Opportunities:

  • Technical Skill Development: Deepen your expertise in Site Reliability Engineering, cloud platforms, and modern design patterns through online courses, workshops, and mentorship opportunities.
  • Conference Attendance: Attend industry conferences, webinars, and meetups to stay up-to-date with the latest trends and best practices in Site Reliability Engineering and DevOps.
  • Certification & Community Involvement: Pursue relevant certifications and engage with online communities to connect with other professionals, share knowledge, and gain insights into emerging technologies and best practices.

📝 Enhancement Note: This role offers significant opportunities for technical growth and development, with a focus on improving reliability, resilience, and system performance. The ideal candidate will be proactive in seeking out new challenges and taking ownership of their career progression.

💡 Interview Preparation

Technical Questions:

  • Site Reliability Engineering: Describe your experience with Site Reliability Engineering principles, cloud platforms, and modern design patterns. Provide examples of reliability improvements you've implemented and the results you've achieved.
  • Incident Management: Walk through a scenario where you had to handle a production incident, including your approach to incident communication, root cause analysis, and resolution. Discuss any lessons learned and how you applied them to future incidents.
  • Observability Tools: Explain your experience with observability tools, such as Grafana, Datadog, or New Relic. Describe how you've used these tools to identify and resolve system issues, and how you've integrated them into the CI/CD pipeline.

Company & Culture Questions:

  • Thoughtworks Culture: Describe what you understand about Thoughtworks' culture and how you think you would fit in with the company's values and work environment.
  • Collaboration & Communication: Discuss your experience working with cross-functional teams, and how you've approached collaboration and communication in the past. Provide examples of successful team projects and any challenges you've faced.
  • Customer Focus: Explain how you've approached understanding and meeting the needs of clients in previous roles. Describe your experience working with clients to build solutions that look past the obvious and drive digital innovation.

Portfolio Presentation Strategy:

  • Case Studies: Highlight your experience with Site Reliability Engineering, cloud platforms, and modern design patterns through case studies demonstrating reliability improvements, incident management, and observability tool integration.
  • Technical Documentation: Include technical documentation, such as code samples, deployment scripts, and root cause analysis documents, to showcase your technical skills and achievements.
  • Presentation Style: Tailor your presentation style to Thoughtworks' collaborative work environment, emphasizing open communication, active listening, and a culture of mutual respect.

📌 Application Steps

To apply for this Senior Service Reliability Engineer (SRE) position at Thoughtworks:

  1. Tailor Your Resume: Highlight your experience with Site Reliability Engineering, cloud platforms, and modern design patterns. Include any relevant certifications or training in Site Reliability Engineering, DevOps, or related fields.
  2. Prepare Your Portfolio: Showcase your experience with Site Reliability Engineering, cloud platforms, and modern design patterns through case studies, projects, and technical documentation.
  3. Practice Technical Challenges: Brush up on your knowledge of Site Reliability Engineering principles, cloud platforms, and modern design patterns. Practice incident management scenarios and develop your problem-solving skills through hands-on exercises and case studies.
  4. Research Thoughtworks: Learn about Thoughtworks' culture, values, and work environment. Prepare for behavioral interview questions by reflecting on your experience working with cross-functional teams, incident management, and driving reliability improvements.

⚠️ Important Notice: This enhanced job description includes AI-generated insights and web technology industry-standard assumptions. All details should be verified directly with the hiring organization before making application decisions.

Application Requirements

You have hands-on experience in programming and scripting languages such as Python, Go, or Bash, and a good understanding of Google Cloud Platform (GCP). You are familiar with observability tools and have a strong knowledge of container-based architecture and orchestration tools.