Site Reliability Engineer

Etihad
Full-time
Abu Dhabi, United Arab Emirates

📍 Job Overview

  • Job Title: Site Reliability Engineer
  • Company: Etihad
  • Location: Abu Dhabi, Abu Dhabi Emirate, United Arab Emirates
  • Job Type: On-site
  • Category: DevOps, Site Reliability Engineering
  • Date Posted: 2025-08-01
  • Experience Level: 5-10 years

🚀 Role Summary

  • Lead an SRE squad focused on enhancing service reliability, performance, and scalability
  • Drive automation to reduce toil, optimize system uptime, and manage incident resolution efforts
  • Build monitoring systems, optimize infrastructure, and implement safe deployment practices
  • Ensure alignment with SLAs/SLOs and contribute to system development and code reviews
  • Requires expertise in large-scale distributed systems, cloud infrastructure, and IT governance

💻 Primary Responsibilities

  • Team Leadership & Reporting: Lead an SRE squad handling operations and automation; represent the team in senior management briefings; produce dashboards and progress reports.
  • Toil Reduction & Automation: Identify and eliminate toil through automation of repetitive tasks, enhancing team efficiency and service reliability.
  • Service Reliability & Uptime: Maintain and improve service availability by aligning with SLAs/SLOs, designing failover strategies, and hardening systems.
  • Performance & Latency Optimization: Enhance service performance and reduce latency using profiling tools, distributed tracing, load testing, and bottleneck analysis.
  • Change & Deployment Management: Implement safe deployment practices (e.g., canary releases, blue-green deployments), ensuring minimal risk and rapid rollback options.
  • Monitoring & Observability: Build and manage real-time monitoring and alerting systems to ensure service health and proactively detect anomalies.
  • Incident Management & RCA: Lead incident resolution efforts, conduct root cause analyses (RCA), and develop response playbooks to reduce MTTR.
  • Capacity & Cost Optimization: Perform infrastructure capacity planning and cost-efficient scaling to meet service demands.
  • Development & Code Review: Contribute to system development, participate in design/code reviews, and ensure alignment with engineering best practices.
  • Governance, Compliance & Documentation: Enforce IT governance standards, maintain documentation, perform quality assessments, and contribute to architecture and risk committees.

🎓 Skills & Qualifications

Education: Bachelor's degree in Computer Science, Engineering, or a related field.

Experience: 7+ years of experience with data structures/algorithms and software development in two or more programming languages, plus experience operating and maintaining platforms, including 3+ years in a DevOps or SRE role.

Required Skills:

  • Experience working in computing, distributed systems, storage, or networking
  • Expertise in designing, analyzing, and troubleshooting large-scale distributed systems
  • Ability to debug, optimize code, and automate routine tasks
  • Systematic problem-solving approach
  • Strong verbal and written communication skills, with the ability to articulate technical issues in terms of business risk and opportunity
  • Knowledge of the technical aspects of cloud computing, data centers, networks, and virtual infrastructure
  • Strong analytical skills and familiarity with TSM processes and tools

Preferred Skills:

  • Experience with specific cloud platforms (e.g., AWS, GCP, Azure)
  • Familiarity with infrastructure as code (IaC) tools (e.g., Terraform, CloudFormation)
  • Knowledge of configuration management tools (e.g., Ansible, Puppet)
  • Experience with monitoring tools (e.g., Prometheus, Grafana, Datadog)
  • Familiarity with CI/CD pipelines and deployment automation

📊 Web Portfolio & Project Requirements

Portfolio Essentials:

  • Demonstrate experience with large-scale distributed systems, cloud infrastructure, and IT governance
  • Showcase projects that highlight your ability to drive automation, optimize performance, and manage incident resolution
  • Include examples of your problem-solving approach and communication skills in the context of business risk and opportunity

Technical Documentation:

  • Provide code samples showcasing your ability to debug, optimize, and automate routine tasks
  • Include documentation demonstrating your understanding of TSM processes and tools
  • Showcase your experience with monitoring systems, incident management, and root cause analysis

💵 Compensation & Benefits

Salary Range: AED 45,000 - 60,000 per month (Estimated based on market research for the role and location)

Benefits:

  • Competitive salary package
  • Generous annual leave and travel benefits
  • Comprehensive health insurance and wellness programs
  • Retirement savings plan and pension scheme
  • Employee discounts on Etihad flights and services
  • Opportunities for career progression and professional development

Working Hours: Full-time, 40 hours per week, with flexible working hours and on-call rotation for incident management

🎯 Team & Company Context

🏢 Company Culture

Industry: Aviation, Travel & Hospitality

Company Size: Large (Over 10,000 employees)

Founded: 2003

Team Structure:

  • The SRE team is part of the broader IT organization, working closely with development, operations, and business teams
  • The SRE squad consists of 8-10 engineers, led by the Site Reliability Engineer (this role)

Development Methodology:

  • Agile/Scrum methodologies for software development and project management
  • Infrastructure as code (IaC) and continuous integration/continuous deployment (CI/CD) pipelines for automated deployment and testing
  • Regular code reviews, pair programming, and knowledge sharing sessions

Company Website: Etihad Airways

📈 Career & Growth Analysis

Career Level: Senior Site Reliability Engineer, responsible for leading an SRE squad and driving operational excellence

Reporting Structure: Reports directly to the Head of Site Reliability Engineering, with dotted-line reporting to senior management for project-specific initiatives

Technical Impact: Ensures high system uptime, performance, and scalability; contributes to architecture and design decisions; drives automation and cost optimization

Growth Opportunities:

  • Technical Leadership: Transition to a Principal SRE role, leading multiple squads and driving strategic initiatives
  • Architecture & Design: Move into an architectural role, focusing on system design, scalability, and performance optimization
  • Management & Strategy: Progress into a management role, overseeing multiple teams and driving operational excellence across the organization

🌐 Work Environment

Office Type: Modern, open-plan office with collaborative workspaces and dedicated team areas

Office Location(s): Etihad Airways Headquarters, Abu Dhabi, United Arab Emirates

Workspace Context:

  • Ergonomic workstations with multiple monitors and high-speed connectivity
  • Access to specialized tools, software, and development environments
  • Collaborative workspaces with whiteboard walls, meeting rooms, and breakout areas

Work Schedule: Flexible working hours with core hours from 9:00 AM to 5:00 PM; on-call rotation for incident management and monitoring

📄 Application & Technical Interview Process

Interview Process:

  1. Phone Screen: A brief phone call to assess communication skills and cultural fit (30 minutes)
  2. Technical Deep Dive: A detailed discussion of your technical skills, experience, and problem-solving approach (60 minutes)
  3. Behavioral & Situational Interview: An in-depth assessment of your leadership, communication, and teamwork skills (60 minutes)
  4. Final Interview: A meeting with senior management to discuss your career aspirations and fit within the organization (30 minutes)

Portfolio Review Tips:

  • Highlight projects that demonstrate your ability to lead an SRE squad, drive automation, and optimize system performance
  • Showcase your problem-solving approach, communication skills, and understanding of business risk and opportunity
  • Include examples of your experience with large-scale distributed systems, cloud infrastructure, and IT governance

Technical Challenge Preparation:

  • Brush up on your knowledge of large-scale distributed systems, cloud infrastructure, and IT governance
  • Practice problem-solving exercises and algorithm challenges to sharpen your analytical skills
  • Familiarize yourself with the latest trends and best practices in site reliability engineering

ATS Keywords: Site Reliability Engineering, DevOps, Cloud Infrastructure, Distributed Systems, IT Governance, Large-Scale Systems, Automation, Incident Management, Performance Optimization, Team Leadership, Problem Solving, Communication, Agile, Scrum, CI/CD, IaC, Monitoring, Observability, RCA, MTTR, SLA, SLO

📌 Application Steps

To apply for this Site Reliability Engineer position:

  1. Update Your Resume: Highlight your relevant experience with large-scale distributed systems, cloud infrastructure, and IT governance; emphasize your problem-solving approach, communication skills, and leadership abilities.
  2. Tailor Your Portfolio: Showcase projects that demonstrate your ability to drive automation, optimize system performance, and manage incident resolution; include examples of your problem-solving approach and communication skills in the context of business risk and opportunity.
  3. Prepare for Phone Screen: Practice common interview questions and brush up on your technical skills and experience.
  4. Research Etihad Airways: Familiarize yourself with the company's mission, values, and culture; understand the aviation and travel industry and Etihad's role within it.
  5. Prepare for Technical Deep Dive & Behavioral Interview: Rehearse your responses to behavioral and situational interview questions, focusing on your leadership, communication, and teamwork skills.
  6. Finalize Your Application: Submit your application through the Etihad Airways careers portal, ensuring all required documents are included.

⚠️ Important Notice: This enhanced job description includes AI-generated insights and web technology industry-standard assumptions. All details should be verified directly with the hiring organization before making application decisions.

Application Requirements

Candidates should have 7+ years of experience in software development and 3+ years in a DevOps or SRE role. Expertise in large-scale distributed systems and cloud infrastructure is essential.