Site Reliability Engineer at Etihad

📍 Job Overview

Job Title: Site Reliability Engineer
Company: Etihad
Location: Abu Dhabi, Abu Dhabi Emirate, United Arab Emirates
Job Type: On-site
Category: DevOps, Site Reliability Engineering
Date Posted: 2025-08-01
Experience Level: 5-10 years

🚀 Role Summary

Lead an SRE squad focused on enhancing service reliability, performance, and scalability
Drive automation to reduce toil, improve system uptime, and lead incident resolution efforts
Build monitoring systems, optimize infrastructure, and implement safe deployment practices
Ensure alignment with SLAs/SLOs and contribute to system development and code reviews
Requires expertise in large-scale distributed systems, cloud infrastructure, and IT governance

💻 Primary Responsibilities

Team Leadership & Reporting: Lead an SRE squad handling operations and automation; represent team in senior management briefings; produce dashboards and progress reports.
Toil Reduction & Automation: Identify and eliminate toil through automation of repetitive tasks, enhancing team efficiency and service reliability.
Service Reliability & Uptime: Maintain and improve service availability by aligning with SLAs/SLOs, designing failover strategies, and hardening systems.
Performance & Latency Optimization: Enhance service performance and reduce latency using profiling tools, distributed tracing, load testing, and bottleneck analysis.
Change & Deployment Management: Implement safe deployment practices (e.g., canary releases, blue-green deployments), ensuring minimal risk and rapid rollback options.
Monitoring & Observability: Build and manage real-time monitoring and alerting systems to ensure service health and proactively detect anomalies.
Incident Management & RCA: Lead incident resolution efforts, conduct root cause analyses (RCA), and develop response playbooks to reduce MTTR.
Capacity & Cost Optimization: Perform infrastructure capacity planning and cost-efficient scaling to meet service demands.
Development & Code Review: Contribute to system development, participate in design/code reviews, and ensure alignment with engineering best practices.
Governance, Compliance & Documentation: Enforce IT governance standards, maintain documentation, perform quality assessments, and contribute to architecture and risk committees.

🎓 Skills & Qualifications

Education: Bachelor's degree in Computer Science, Engineering, or a related field.

Experience: 7+ years of experience with data structures/algorithms, software development in two or more programming languages, and operating and maintaining platforms, with 3+ years of experience in a DevOps or SRE role.

Required Skills:

Experience working in computing, distributed systems, storage, or networking
Expertise in designing, analyzing, and troubleshooting large-scale distributed systems
Ability to debug, optimize code, and automate routine tasks
Systematic problem-solving approach, coupled with effective verbal and written communication skills
Strong communication skills, with the ability to articulate technical issues in terms of business risk and opportunity
Knowledge of the technical aspects of cloud computing, data centers, networks, and virtual infrastructure
Strong analytical and problem-solving skills, with knowledge of TSM processes and tools

Preferred Skills:

Experience with specific cloud platforms (e.g., AWS, GCP, Azure)
Familiarity with infrastructure as code (IaC) tools (e.g., Terraform, CloudFormation)
Knowledge of configuration management tools (e.g., Ansible, Puppet)
Experience with monitoring tools (e.g., Prometheus, Grafana, Datadog)
Familiarity with CI/CD pipelines and deployment automation

📊 Portfolio & Project Requirements

Portfolio Essentials:

Demonstrate experience with large-scale distributed systems, cloud infrastructure, and IT governance
Showcase projects that highlight your ability to drive automation, optimize performance, and manage incident resolution
Include examples of your problem-solving approach and communication skills in the context of business risk and opportunity

Technical Documentation:

Provide code samples showcasing your ability to debug, optimize, and automate routine tasks
Include documentation demonstrating your understanding of TSM processes and tools
Showcase your experience with monitoring systems, incident management, and root cause analysis

💵 Compensation & Benefits

Salary Range: AED 45,000 - 60,000 per month (Estimated based on market research for the role and location)

Benefits:

Competitive salary package
Generous annual leave and travel benefits
Comprehensive health insurance and wellness programs
Retirement savings plan and pension scheme
Employee discounts on Etihad flights and services
Opportunities for career progression and professional development

Working Hours: Full-time, 40 hours per week, with flexible working hours and on-call rotation for incident management

🎯 Team & Company Context

🏢 Company Culture

Industry: Aviation, Travel & Hospitality

Company Size: Large (Over 10,000 employees)

Founded: 2003

Team Structure:

The SRE team is part of the broader IT organization, working closely with development, operations, and business teams
The SRE squad consists of 8-10 engineers, led by the Site Reliability Engineer (this role)

Development Methodology:

Agile/Scrum methodologies for software development and project management
Infrastructure as code (IaC) and continuous integration/continuous deployment (CI/CD) pipelines for automated deployment and testing
Regular code reviews, pair programming, and knowledge sharing sessions

Company Website: Etihad Airways

📈 Career & Growth Analysis

Career Level: Senior Site Reliability Engineer, responsible for leading an SRE squad and driving operational excellence

Reporting Structure: Reports directly to the Head of Site Reliability Engineering, with dotted-line reporting to senior management for project-specific initiatives

Technical Impact: Ensures high system uptime, performance, and scalability; contributes to architecture and design decisions; drives automation and cost optimization

Growth Opportunities:

Technical Leadership: Transition to a Principal SRE role, leading multiple squads and driving strategic initiatives
Architecture & Design: Move into an architectural role, focusing on system design, scalability, and performance optimization
Management & Strategy: Progress into a management role, overseeing multiple teams and driving operational excellence across the organization

🌐 Work Environment

Office Type: Modern, open-plan office with collaborative workspaces and dedicated team areas

Office Location(s): Etihad Airways Headquarters, Abu Dhabi, United Arab Emirates

Workspace Context:

Ergonomic workstations with multiple monitors and high-speed connectivity
Access to specialized tools, software, and development environments
Collaborative workspaces with whiteboard walls, meeting rooms, and breakout areas

Work Schedule: Flexible working hours with core hours from 9:00 AM to 5:00 PM; on-call rotation for incident management and monitoring

📄 Application & Technical Interview Process

Interview Process:

Phone Screen: A brief phone call to assess communication skills and cultural fit (30 minutes)
Technical Deep Dive: A detailed discussion of your technical skills, experience, and problem-solving approach (60 minutes)
Behavioral & Situational Interview: An in-depth assessment of your leadership, communication, and teamwork skills (60 minutes)
Final Interview: A meeting with senior management to discuss your career aspirations and fit within the organization (30 minutes)

Portfolio Review Tips:

Highlight projects that demonstrate your ability to lead an SRE squad, drive automation, and optimize system performance
Showcase your problem-solving approach, communication skills, and understanding of business risk and opportunity
Include examples of your experience with large-scale distributed systems, cloud infrastructure, and IT governance

Technical Challenge Preparation:

Brush up on your knowledge of large-scale distributed systems, cloud infrastructure, and IT governance
Practice problem-solving exercises and algorithm challenges to sharpen your analytical skills
Familiarize yourself with the latest trends and best practices in site reliability engineering

ATS Keywords: Site Reliability Engineering, DevOps, Cloud Infrastructure, Distributed Systems, IT Governance, Large-Scale Systems, Automation, Incident Management, Performance Optimization, Team Leadership, Problem Solving, Communication, Agile, Scrum, CI/CD, IaC, Monitoring, Observability, RCA, MTTR, SLA, SLO

📌 Application Steps

To apply for this Site Reliability Engineer position:

Update Your Resume: Highlight your relevant experience with large-scale distributed systems, cloud infrastructure, and IT governance; emphasize your problem-solving approach, communication skills, and leadership abilities.
Tailor Your Portfolio: Showcase projects that demonstrate your ability to drive automation, optimize system performance, and manage incident resolution; include examples of your problem-solving approach and communication skills in the context of business risk and opportunity.
Prepare for Phone Screen: Practice common interview questions and brush up on your technical skills and experience.
Research Etihad Airways: Familiarize yourself with the company's mission, values, and culture; understand the aviation and travel industry and Etihad's role within it.
Prepare for Technical Deep Dive & Behavioral Interview: Rehearse your responses to behavioral and situational interview questions, focusing on your leadership, communication, and teamwork skills.
Finalize Your Application: Submit your application through the Etihad Airways careers portal, ensuring all required documents are included.

⚠️ Important Notice: This enhanced job description includes AI-generated insights and web technology industry-standard assumptions. All details should be verified directly with the hiring organization before making application decisions.
