Senior Service Reliability Engineer (SRE)

Thoughtworks
Full-time • Santiago, Chile

Job Overview

  • Job Title: Senior Service Reliability Engineer (SRE)
  • Company: Thoughtworks
  • Location: Santiago, Región Metropolitana, Chile
  • Job Type: Full-time
  • Category: DevOps, Infrastructure, Web Infrastructure
  • Date Posted: June 11, 2025
  • Experience Level: Mid-Senior level (5-10 years)
  • Remote Status: Remote OK

🚀 Role Summary

  • Enhancement Note: This role centers on technical excellence and operational efficiency in the infrastructure domain, with a specialization in reliability, resilience, and system performance. It integrates automation, monitoring, and incident response to support a more customer-focused, agile approach to operations.

  • As a Senior Service Reliability Engineer (SRE GCP), you will champion the principles of Site Reliability Engineering, strategically integrating automation, monitoring, and incident response to improve site reliability, resilience, and system performance. By fostering a collaborative culture and emphasizing shared responsibility, you will enable organizations to meet and exceed their reliability and business objectives.

💻 Primary Responsibilities

  • Enhancement Note: Responsibilities range from improving site reliability and handling production incidents to advising application development teams and improving system observability.

  • 🔑 Key Responsibilities:

    • Improve site reliability by building mechanisms and architectures that enable fault tolerance and shorten the median time to detect and median time to respond.
    • Drive the integration of observability automation into the CI/CD pipeline.
    • Handle production incidents, manage incident communication with clients, and draft root cause analysis documents.
    • Monitor performance of production systems and improve their scaling to ensure business goals are met within expected SLA and SLO metrics.
    • Work closely with application development teams as advisors on improving system reliability and assisting in implementation for reliability improvements.
    • Improve system observability across multiple facets such as logging and metrics, reducing false alarms to eliminate unnecessary toil and improving process efficiency.
    • Implement chaos engineering practices as necessary to test system reliability, setting up processes for such testing to be done regularly.
    • Understand client goals and business needs, and set direction for site reliability accordingly (e.g., achieving application availability with minimal or no disruption).
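The median time to detect and median time to respond named in the responsibilities can be made concrete with a small calculation. The following is a minimal Python sketch, not a Thoughtworks tool: the incident records, their field order, and the helper names `minutes_between` and `median_times` are all assumptions made for illustration.

```python
from datetime import datetime
from statistics import median

# Hypothetical incident records: (started, detected, responded) ISO timestamps.
INCIDENTS = [
    ("2025-01-03T10:00", "2025-01-03T10:04", "2025-01-03T10:15"),
    ("2025-02-11T22:30", "2025-02-11T22:32", "2025-02-11T22:50"),
    ("2025-03-20T07:10", "2025-03-20T07:25", "2025-03-20T07:40"),
]


def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def median_times(incidents):
    """Return (median minutes to detect, median minutes to respond)."""
    # Detection is measured from incident start; response from detection.
    detect = [minutes_between(started, detected) for started, detected, _ in incidents]
    respond = [minutes_between(detected, responded) for _, detected, responded in incidents]
    return median(detect), median(respond)


if __name__ == "__main__":
    to_detect, to_respond = median_times(INCIDENTS)
    print(f"median time to detect: {to_detect} min")
    print(f"median time to respond: {to_respond} min")
```

A median is less sensitive than a mean to a single unusually long incident, which is one reason a team might track it; which statistic a given client uses is not stated in this posting.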

🎓 Skills & Qualifications

Education: A bachelor's degree in Computer Science, Engineering, or a related field, with a strong focus on computer science and software engineering principles.

Experience: Proven experience (5-10 years) in Site Reliability Engineering, DevOps, or a similar role, with a solid background in programming, scripting, and cloud infrastructure.

Required Skills:

  • Hands-on experience in programming and scripting languages such as Python, Go, or Bash.
  • Good understanding of Google Cloud Platform (GCP).
  • Exposure to observability tools such as Grafana, Datadog, New Relic, ELK Stack, Dynatrace, or equivalent, with proficiency in using data from these tools to dissect and identify root causes of system and infrastructure issues.
  • Familiarity with DevOps and GitOps practices.
  • Good knowledge of container-based architecture and orchestration tools such as Kubernetes, AWS EKS, Docker Swarm, Nomad, etc.
  • Understanding of technical architecture and modern design patterns, including microservices, serverless functions, NoSQL, and RESTful APIs, with experience in fixing bugs, analyzing logs, building metrics, and operational dashboards.
  • Familiarity with creating infrastructure resources for improving the reliability of systems that follow Cloud's Well-Architected Framework principles: reliability, security, cost optimization, performance efficiency, and operational excellence.

Preferred Skills:

  • Experience with chaos engineering practices and tools.
  • Knowledge of infrastructure as code (IaC) tools such as Terraform, CloudFormation, or Pulumi.
  • Familiarity with Agile methodologies and CI/CD pipelines.
  • Experience working in a remote or global team environment.

Enhancement Note: Given the role's complexity and the company's size, candidates should have a strong technical background and a proven track record of improving site reliability and driving operational excellence in large-scale systems.

📊 Portfolio & Project Requirements

Portfolio Essentials:

  • Demonstrate a strong portfolio showcasing your experience in improving site reliability, with a focus on projects that highlight your ability to build mechanisms for fault tolerance, automate observability, and handle production incidents.
  • Include case studies that illustrate your understanding of client goals and business needs, and how you've set direction for site reliability accordingly.
  • Highlight your proficiency in using data from observability tools to dissect and identify root causes of system and infrastructure issues.

Technical Documentation:

  • Provide clear and concise documentation outlining your approach to improving system reliability, including any chaos engineering practices implemented and the results achieved.
  • Include any relevant code snippets or examples demonstrating your proficiency in programming and scripting languages, as well as your understanding of technical architecture and modern design patterns.
  • Showcase your ability to manage incident communication with clients and draft root cause analysis documents by including examples of how you've handled production incidents in the past.

💵 Compensation & Benefits

Salary Range: The estimated salary range for a Senior Service Reliability Engineer (SRE) in Santiago, Chile, is CLP 10,000,000 - CLP 15,000,000 per year (USD 13,500 - USD 20,300 per year). This estimate is based on regional market standards and considers the role's complexity and the company's size.

Benefits:

  • Competitive salary and benefits package.
  • Learning and development opportunities, including numerous development programs and interactive tools to support your career growth.
  • A cultivation culture that empowers employees in their career journeys and fosters collaboration and knowledge-sharing.
  • The strength of a global technology consultancy that integrates strategy, design, and engineering to drive digital innovation.

Working Hours: Full-time position with a standard workweek of 40 hours, with flexibility for on-call rotations and incident response as needed.

🎯 Team & Company Context

Company Culture:

  • Industry: Technology consulting. Thoughtworks is a global technology consultancy that integrates strategy, design, and engineering to drive digital innovation.
  • Company Size: With roughly 10,000 employees worldwide, Thoughtworks offers a large and diverse team environment, providing ample opportunities for collaboration and growth.
  • Founded: Established in 1993, Thoughtworks has a rich history of driving digital innovation and fostering a cultivation culture that empowers employees in their career journeys.

Team Structure:

  • The SRE team at Thoughtworks is responsible for ensuring the reliability, availability, and performance of the company's infrastructure and services.
  • The team consists of experienced SREs, DevOps engineers, and other infrastructure professionals, working collaboratively to drive operational excellence.
  • The SRE team works closely with application development teams, providing guidance and assistance on improving system reliability and implementing reliability improvements.

Development Methodology:

  • Thoughtworks follows Agile methodologies, with a focus on iterative development, continuous integration, and continuous delivery.
  • The company emphasizes a culture of continuous learning, improvement, and innovation, with a strong commitment to fostering a growth mindset.
  • Thoughtworks uses a combination of in-house and third-party tools for version control, CI/CD, monitoring, and incident management.

Company Website: Thoughtworks

Enhancement Note: Thoughtworks' culture emphasizes collaboration, knowledge-sharing, and a strong commitment to driving digital innovation. The company's global presence and diverse team environment provide ample opportunities for professional growth and development.

📈 Career & Growth Analysis

Career Level: This role is at the senior level, with a focus on driving operational excellence and ensuring the reliability, availability, and performance of large-scale systems. The ideal candidate will have a proven track record in Site Reliability Engineering, DevOps, or a similar role, with a strong background in programming, scripting, and cloud infrastructure.

Reporting Structure: The Senior Service Reliability Engineer (SRE) reports directly to the SRE Manager or a similar role within the infrastructure organization. The role may also involve working closely with application development teams and other infrastructure professionals, depending on the specific project or client requirements.

Technical Impact: In this role, you will have a significant impact on the reliability, availability, and performance of Thoughtworks' infrastructure and services. Your work will directly contribute to the company's ability to meet and exceed its reliability and business objectives, driving digital innovation and enabling clients to achieve their goals.

Growth Opportunities:

  • Technical Leadership: As a senior member of the SRE team, you will have the opportunity to mentor junior team members, guide application development teams, and drive technical decision-making. The role offers a path into technical leadership positions within the infrastructure organization.
  • Specialization: Thoughtworks encourages employees to develop expertise in specific areas of interest. As a Senior SRE, you may specialize in areas such as chaos engineering, infrastructure as code, or cloud architecture, depending on your interests and the company's needs.
  • Global Mobility: With a presence in over 40 countries, Thoughtworks offers opportunities for global mobility, enabling employees to work on diverse projects and gain international experience.

Enhancement Note: Thoughtworks' cultivation culture and commitment to employee growth provide ample opportunities for professional development and career advancement. The company's global presence and diverse project portfolio enable employees to gain experience in various technologies, industries, and cultural environments.

🌐 Work Environment

Office Type: Thoughtworks operates a hybrid work environment, with a combination of on-site and remote work options available, depending on the specific role and team requirements.

Office Location(s): Thoughtworks has offices in Santiago, Chile, as well as other locations throughout the Americas, Europe, Asia, and Australia.

Workspace Context:

  • Collaboration: Thoughtworks emphasizes a collaborative work environment, with open communication and knowledge-sharing encouraged across teams and disciplines. The company's global presence and diverse team environment foster a rich and dynamic workplace culture.
  • Tools and Resources: Thoughtworks provides its employees with access to a wide range of tools and resources, including modern development environments, version control systems, CI/CD pipelines, and monitoring tools. The company also encourages employees to stay up-to-date with the latest technologies and best practices through continuous learning and development opportunities.
  • Work-Life Balance: Thoughtworks is committed to promoting a healthy work-life balance, offering flexible work arrangements and encouraging employees to prioritize their well-being and personal growth.

Work Schedule: Thoughtworks operates on a standard business hours schedule, with some flexibility for on-call rotations and incident response as needed. The company also offers flexible work arrangements, depending on the specific role and team requirements.

Enhancement Note: Thoughtworks' hybrid work environment and commitment to collaboration and knowledge-sharing provide a dynamic and engaging workplace culture. The company's global presence and diverse project portfolio enable employees to gain experience in various technologies, industries, and cultural environments.

📄 Application & Technical Interview Process

Interview Process:

  • Technical Phone Screen: The interview process begins with a technical phone screen, focusing on your understanding of Site Reliability Engineering principles, programming and scripting languages, and cloud infrastructure. Be prepared to discuss your approach to improving site reliability, handling production incidents, and working with observability tools.
  • On-site or Virtual Technical Deep Dive: The next stage involves a more in-depth technical assessment, focusing on your ability to build mechanisms for fault tolerance, automate observability, and handle production incidents. You may be asked to provide examples of your past work, discuss your approach to specific technical challenges, and demonstrate your problem-solving skills.
  • Behavioral and Cultural Fit Assessment: In this stage, you will have the opportunity to discuss your approach to working in a collaborative team environment, managing incident communication with clients, and setting direction for site reliability in line with client goals and business needs. Be prepared to provide examples of your past experiences and how they have shaped your approach to these aspects of the role.
  • Final Evaluation and Offer: The final stage of the interview process involves a comprehensive evaluation of your technical skills, cultural fit, and alignment with Thoughtworks' values and mission. If successful, you will receive an offer for the Senior Service Reliability Engineer (SRE) position.

Portfolio Review Tips:

  • Showcase Your Technical Expertise: Highlight your proficiency in programming and scripting languages, as well as your understanding of technical architecture and modern design patterns. Include examples of your past work that demonstrate your ability to improve site reliability, automate observability, and handle production incidents.
  • Demonstrate Your Problem-Solving Skills: Provide examples of complex technical challenges you've faced in the past and how you've approached them. Be prepared to discuss your thought process, the tools and techniques you used, and the outcomes you achieved.
  • Emphasize Your Collaborative Approach: Thoughtworks places a strong emphasis on collaboration and knowledge-sharing. Highlight your experience working in a team environment, managing incident communication with clients, and setting direction for site reliability in line with client goals and business needs.

Technical Challenge Preparation:

  • Brush Up on Your Technical Skills: Review your proficiency in programming and scripting languages, as well as your understanding of technical architecture and modern design patterns. Familiarize yourself with the latest developments in Site Reliability Engineering, chaos engineering, and cloud infrastructure.
  • Practice Problem-Solving: Engage in practical exercises and case studies that focus on improving site reliability, automating observability, and handling production incidents. This will help you develop your problem-solving skills and prepare you for the technical deep dive stage of the interview process.
  • Prepare for Behavioral Questions: Reflect on your past experiences and how they have shaped your approach to working in a collaborative team environment, managing incident communication with clients, and setting direction for site reliability in line with client goals and business needs. Be prepared to provide specific examples and anecdotes that illustrate your approach to these aspects of the role.
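One practical exercise of this kind is working out the error budget implied by an availability SLO. The sketch below is only an illustration in Python: the 30-day window, the 99.9% target, and the function names `error_budget_minutes` and `budget_remaining` are assumptions, not details taken from Thoughtworks or its interview process.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO allows over the window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1 - slo_target) * total_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% target over 30 days allows about 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After 10.8 minutes of downtime, about 75% of the budget remains.
    print(round(budget_remaining(0.999, 10.8), 2))
```

Being able to explain how the remaining budget would influence a release or rollback decision is a reasonable way to connect this arithmetic to the SLA and SLO responsibilities described earlier.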

Enhancement Note: Thoughtworks' interview process focuses on assessing your technical skills, problem-solving abilities, and cultural fit. By preparing thoroughly and showcasing your expertise in Site Reliability Engineering, you will increase your chances of success in the interview process.

🛠 Technology Stack & Infrastructure

Frontend Technologies: Not applicable. This role focuses on infrastructure and does not involve frontend technologies.

Backend & Server Technologies:

  • Google Cloud Platform (GCP): Proficiency in GCP is required for this role, as it is the primary cloud platform used by Thoughtworks.
  • Kubernetes: Familiarity with container-based architecture and orchestration tools such as Kubernetes is essential for this role, as it is a core component of Thoughtworks' infrastructure.
  • Terraform: Experience with infrastructure as code (IaC) tools such as Terraform is preferred, as it enables automated provisioning of infrastructure resources.

Development & DevOps Tools:

  • Git: Thoughtworks uses Git for version control and collaborative development, enabling teams to work together on projects and share code.
  • Jenkins: Jenkins is used for continuous integration and continuous delivery, automating the build, test, and deployment processes.
  • Prometheus and Grafana: Thoughtworks uses Prometheus and Grafana for monitoring and alerting, enabling teams to track the performance of production systems and identify potential issues.

Enhancement Note: Thoughtworks' technology stack is designed to support the company's commitment to driving digital innovation and operational excellence. The company's use of modern development environments, version control systems, CI/CD pipelines, and monitoring tools enables teams to work collaboratively and efficiently on projects.

👥 Team Culture & Values

Values:

  • Customer Focus: Thoughtworks places a strong emphasis on understanding client goals and business needs, and on setting direction for site reliability accordingly. This value is essential for the Senior Service Reliability Engineer (SRE) role, as it involves working closely with clients to ensure their reliability and business objectives are met.
  • Collaboration: Thoughtworks fosters a collaborative work environment, encouraging open communication and knowledge-sharing across teams and disciplines. This value is crucial for the SRE role, as it involves working closely with application development teams and other infrastructure professionals to drive operational excellence.
  • Continuous Learning: Thoughtworks is committed to fostering a culture of continuous learning and improvement, encouraging employees to stay up-to-date with the latest technologies and best practices. This value is essential for the SRE role, as it involves staying current with the latest developments in Site Reliability Engineering, chaos engineering, and cloud infrastructure.
  • Innovation: Thoughtworks encourages employees to think creatively and challenge the status quo, driving digital innovation and pushing the boundaries of what's possible. This value is crucial for the SRE role, as it involves finding new ways to improve site reliability, automate observability, and handle production incidents.

Collaboration Style:

  • Cross-Functional Integration: Thoughtworks emphasizes cross-functional integration between developers, designers, and stakeholders, fostering a collaborative and inclusive work environment. The SRE role involves working closely with application development teams and other infrastructure professionals, driving operational excellence and ensuring client goals and business needs are met.
  • Code Review Culture: Thoughtworks encourages a code review culture, promoting knowledge-sharing and collective code ownership. This approach is essential for the SRE role, as it involves working collaboratively with application development teams to improve system reliability and implement reliability improvements.
  • Knowledge Sharing: Thoughtworks fosters a culture of knowledge-sharing, encouraging employees to share their expertise and learn from one another. This approach is crucial for the SRE role, as it involves working closely with other infrastructure professionals to drive operational excellence and ensure client goals and business needs are met.

⚑ Challenges & Growth Opportunities

Technical Challenges:

  • Improving Site Reliability: As a Senior Service Reliability Engineer (SRE), you will face the challenge of improving site reliability, building mechanisms for fault tolerance, and driving the integration of observability automation into the CI/CD pipeline. This requires a deep understanding of Site Reliability Engineering principles, programming and scripting languages, and cloud infrastructure.
  • Handling Production Incidents: The role involves handling production incidents, managing incident communication with clients, and drafting root cause analysis documents. This requires strong problem-solving skills, composure under pressure, and the ability to work collaboratively with application development teams and other infrastructure professionals.
  • Ensuring Business Objectives: The Senior SRE role involves understanding client goals and business needs, setting direction for site reliability accordingly, and ensuring business objectives are met within expected SLA and SLO metrics. This requires the ability to balance technical feasibility with business requirements.
  • Staying Current with Emerging Technologies: The field of Site Reliability Engineering is constantly evolving, with new tools, techniques, and best practices emerging regularly. As a Senior SRE, you will need to stay current with the latest developments in the field and adapt your approach to improving site reliability accordingly.

Learning & Development Opportunities:

  • Technical Skill Development: Thoughtworks encourages employees to develop specialized skills and expertise in specific areas of interest. As a Senior SRE, you may have the opportunity to specialize in areas such as chaos engineering, infrastructure as code, or cloud architecture, depending on your interests and the company's needs.
  • Conference Attendance and Certification: Thoughtworks supports employee attendance at industry conferences and events, as well as certification programs that align with their career goals and the company's needs. As a Senior SRE, you may have the opportunity to attend conferences focused on Site Reliability Engineering, chaos engineering, and cloud infrastructure.
  • Technical Mentorship and Leadership Development: Thoughtworks offers mentorship and leadership development programs, enabling employees to grow their technical and leadership skills. As a Senior SRE, you may have the opportunity to mentor junior team members, provide guidance to application development teams, and drive technical decision-making processes.

Enhancement Note: Thoughtworks' commitment to employee growth and development provides ample opportunities for professional advancement and career progression. The company's global presence and diverse project portfolio enable employees to gain experience in various technologies, industries, and cultural environments.

💡 Interview Preparation

Technical Questions:

  • Site Reliability Principles: Be prepared to discuss your understanding of Site Reliability Engineering principles, including your approach to improving site reliability, building mechanisms for fault tolerance, and driving the integration of observability automation into the CI/CD pipeline.
  • Production Incident Management: Demonstrate your ability to handle production incidents, manage incident communication with clients, and draft root cause analysis documents. Provide specific examples of how you've handled production incidents in the past and how you've approached incident communication and root cause analysis.
  • Technical Problem-Solving: Showcase your problem-solving skills by discussing your approach to technical challenges, including your thought process, the tools and techniques you use, and the outcomes you achieve. Provide specific examples of complex technical challenges you've faced in the past and how you've approached them.

Company & Culture Questions:

  • Client Goals and Business Needs: Demonstrate your understanding of client goals and business needs, and how you've set direction for site reliability accordingly. Provide specific examples of how you've worked with clients in the past to ensure their reliability and business objectives were met.
  • Collaboration and Knowledge-Sharing: Highlight your experience working in a collaborative team environment, managing incident communication with clients, and setting direction for site reliability in line with client goals and business needs. Provide specific examples of how you've collaborated with other infrastructure professionals and application development teams to drive operational excellence.
  • Innovation and Digital Transformation: Discuss your approach to driving digital innovation and transformation, including your experience with emerging technologies, chaos engineering, and cloud infrastructure. Provide specific examples of how you've used these approaches to improve site reliability and drive business value for clients.

Portfolio Presentation Strategy:

  • Technical Walkthrough: This role does not call for a live website demonstration. Instead, focus on showcasing your technical expertise, problem-solving skills, and collaborative approach to improving site reliability.
  • Code Explanation and Architecture Decision Reasoning: Prepare to discuss your code and architecture decisions, explaining your thought process and the rationale behind your approach to improving site reliability, automating observability, and handling production incidents.
  • Client Outcomes: A user experience showcase does not apply to this role either. Instead, focus on demonstrating your understanding of client goals and business needs, and how you've set direction for site reliability accordingly.

📌 Application Steps

To apply for this Senior Service Reliability Engineer (SRE) position at Thoughtworks, follow these steps:

  1. Tailor Your Resume: Customize your resume to highlight your technical skills, problem-solving abilities, and collaborative approach to improving site reliability. Include specific examples of your past work, demonstrating your proficiency in programming and scripting languages, as well as your understanding of technical architecture and modern design patterns.
  2. Prepare for the Technical Phone Screen: Review your understanding of Site Reliability Engineering principles, programming and scripting languages, and cloud infrastructure. Familiarize yourself with the latest developments in the field and prepare for a technical phone screen that focuses on your approach to improving site reliability, handling production incidents, and working with observability tools.
  3. Practice Problem-Solving Exercises: Engage in practical exercises and case studies that focus on improving site reliability, automating observability, and handling production incidents. This will help you develop your problem-solving skills and prepare you for the technical deep dive stage of the interview process.
  4. Research Thoughtworks: Familiarize yourself with Thoughtworks' mission, values, and culture. Understand the company's commitment to driving digital innovation and operational excellence, and how the Senior SRE role contributes to these goals. Prepare for behavioral questions that focus on your approach to collaboration, knowledge-sharing, and driving business value for clients.

Enhancement Note: By following these application steps and preparing thoroughly, you will increase your chances of success in the interview process. Thoughtworks' commitment to employee growth and development provides ample opportunities for professional advancement and career progression. The company's global presence and diverse project portfolio enable employees to gain experience in various technologies, industries, and cultural environments.

Application Requirements

You need hands-on experience in programming languages like Python and Go, and a good understanding of Cloud GCP. Familiarity with observability tools and container-based architecture is also required.