Site Reliability Engineer
📍 Job Overview
- Job Title: Site Reliability Engineer
- Company: Etihad
- Location: Abu Dhabi, Abu Dhabi Emirate, United Arab Emirates
- Job Type: On-site
- Category: DevOps, Site Reliability Engineering
- Date Posted: 2025-08-01
- Experience Level: 5-10 years
🚀 Role Summary
- Lead an SRE squad focused on enhancing service reliability, performance, and scalability
- Drive automation to reduce toil, optimize system uptime, and manage incident resolution efforts
- Build monitoring systems, optimize infrastructure, and implement safe deployment practices
- Ensure alignment with SLAs/SLOs and contribute to system development and code reviews
- Requires expertise in large-scale distributed systems, cloud infrastructure, and IT governance
💻 Primary Responsibilities
- Team Leadership & Reporting: Lead an SRE squad handling operations and automation; represent the team in senior management briefings; produce dashboards and progress reports.
- Toil Reduction & Automation: Identify and eliminate toil through automation of repetitive tasks, enhancing team efficiency and service reliability.
- Service Reliability & Uptime: Maintain and improve service availability by aligning with SLAs/SLOs, designing failover strategies, and hardening systems.
- Performance & Latency Optimization: Enhance service performance and reduce latency using profiling tools, distributed tracing, load testing, and bottleneck analysis.
- Change & Deployment Management: Implement safe deployment practices (e.g., canary releases, blue-green deployments), ensuring minimal risk and rapid rollback options.
- Monitoring & Observability: Build and manage real-time monitoring and alerting systems to ensure service health and proactively detect anomalies.
- Incident Management & RCA: Lead incident resolution efforts, conduct root cause analyses (RCA), and develop response playbooks to reduce MTTR.
- Capacity & Cost Optimization: Perform infrastructure capacity planning and cost-efficient scaling to meet service demands.
- Development & Code Review: Contribute to system development, participate in design/code reviews, and ensure alignment with engineering best practices.
- Governance, Compliance & Documentation: Enforce IT governance standards, maintain documentation, perform quality assessments, and contribute to architecture and risk committees.
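To make the SLA/SLO alignment mentioned above more concrete, here is a minimal sketch of how an availability SLO translates into an error budget. The 99.9% target and 30-day window used in the usage note are illustrative assumptions, not figures from Etihad, and the function names are hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction, e.g. 0.999 for "three nines" (an assumed value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget
```

Under these assumptions, a 99.9% target over 30 days allows roughly 43 minutes of downtime, so 21.6 minutes of recorded downtime would leave about half the budget. Teams often use the remaining budget to decide how aggressively to ship changes.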
🎓 Skills & Qualifications
Education: Bachelor's degree in Computer Science, Engineering, or a related field.
Experience: 7+ years of experience with data structures, algorithms, and software development in two or more programming languages, along with experience operating and maintaining platforms, including 3+ years in a DevOps or SRE role.
Required Skills:
- Experience working in computing, distributed systems, storage, or networking
- Expertise in designing, analyzing, and troubleshooting large-scale distributed systems
- Ability to debug, optimize code, and automate routine tasks
- Systematic problem-solving approach, coupled with effective verbal and written communication skills
- Strong communication skills, with the ability to articulate technical issues in terms of business risk and opportunity
- Knowledge of the technical aspects of cloud computing, data centers, networks, and virtual infrastructure
- Strong analytical and problem-solving skills, and familiarity with TSM processes and tools
Preferred Skills:
- Experience with specific cloud platforms (e.g., AWS, GCP, Azure)
- Familiarity with infrastructure as code (IaC) tools (e.g., Terraform, CloudFormation)
- Knowledge of configuration management tools (e.g., Ansible, Puppet)
- Experience with monitoring tools (e.g., Prometheus, Grafana, Datadog)
- Familiarity with CI/CD pipelines and deployment automation
📊 Portfolio & Project Requirements
Portfolio Essentials:
- Demonstrate experience with large-scale distributed systems, cloud infrastructure, and IT governance
- Showcase projects that highlight your ability to drive automation, optimize performance, and manage incident resolution
- Include examples of your problem-solving approach and communication skills in the context of business risk and opportunity
Technical Documentation:
- Provide code samples showcasing your ability to debug, optimize, and automate routine tasks
- Include documentation demonstrating your understanding of TSM processes and tools
- Showcase your experience with monitoring systems, incident management, and root cause analysis
💵 Compensation & Benefits
Salary Range: AED 45,000 - 60,000 per month (Estimated based on market research for the role and location)
Benefits:
- Competitive salary package
- Generous annual leave and travel benefits
- Comprehensive health insurance and wellness programs
- Retirement savings plan and pension scheme
- Employee discounts on Etihad flights and services
- Opportunities for career progression and professional development
Working Hours: Full-time, 40 hours per week, with flexible hours and an on-call rotation for incident management
🎯 Team & Company Context
🏢 Company Culture
Industry: Aviation, Travel & Hospitality
Company Size: Large (Over 10,000 employees)
Founded: 2003
Team Structure:
- The SRE team is part of the broader IT organization, working closely with development, operations, and business teams
- The SRE squad consists of 8-10 engineers, led by the Site Reliability Engineer (this role)
Development Methodology:
- Agile/Scrum methodologies for software development and project management
- Infrastructure as code (IaC) and continuous integration/continuous deployment (CI/CD) pipelines for automated deployment and testing
- Regular code reviews, pair programming, and knowledge sharing sessions
Company Website: Etihad Airways
📈 Career & Growth Analysis
Career Level: Senior Site Reliability Engineer, responsible for leading an SRE squad and driving operational excellence
Reporting Structure: Reports directly to the Head of Site Reliability Engineering, with dotted-line reporting to senior management for project-specific initiatives
Technical Impact: Ensures high system uptime, performance, and scalability; contributes to architecture and design decisions; drives automation and cost optimization
Growth Opportunities:
- Technical Leadership: Transition to a Principal SRE role, leading multiple squads and driving strategic initiatives
- Architecture & Design: Move into an architectural role, focusing on system design, scalability, and performance optimization
- Management & Strategy: Progress into a management role, overseeing multiple teams and driving operational excellence across the organization
🌐 Work Environment
Office Type: Modern, open-plan office with collaborative workspaces and dedicated team areas
Office Location(s): Etihad Airways Headquarters, Abu Dhabi, United Arab Emirates
Workspace Context:
- Ergonomic workstations with multiple monitors and high-speed connectivity
- Access to specialized tools, software, and development environments
- Collaborative workspaces with whiteboard walls, meeting rooms, and breakout areas
Work Schedule: Flexible working hours with core hours from 9:00 AM to 5:00 PM; on-call rotation for incident management and monitoring
📄 Application & Technical Interview Process
Interview Process:
- Phone Screen: A brief phone call to assess communication skills and cultural fit (30 minutes)
- Technical Deep Dive: A detailed discussion of your technical skills, experience, and problem-solving approach (60 minutes)
- Behavioral & Situational Interview: An in-depth assessment of your leadership, communication, and teamwork skills (60 minutes)
- Final Interview: A meeting with senior management to discuss your career aspirations and fit within the organization (30 minutes)
Portfolio Review Tips:
- Highlight projects that demonstrate your ability to lead an SRE squad, drive automation, and optimize system performance
- Showcase your problem-solving approach, communication skills, and understanding of business risk and opportunity
- Include examples of your experience with large-scale distributed systems, cloud infrastructure, and IT governance
Technical Challenge Preparation:
- Brush up on your knowledge of large-scale distributed systems, cloud infrastructure, and IT governance
- Practice problem-solving exercises and algorithm challenges to sharpen your analytical skills
- Familiarize yourself with the latest trends and best practices in site reliability engineering
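As one illustrative practice exercise, the sketch below computes mean time to restore (MTTR) from a list of incident records. The record shape (a pair of detection and resolution timestamps) and the function name are assumptions for the example; organizations define MTTR in different ways (repair, restore, resolve), so check which definition your interviewers use.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def mean_time_to_restore(
    incidents: Iterable[Tuple[datetime, datetime]]
) -> timedelta:
    """Average of (resolved_at - detected_at) across incidents.

    Each incident is a hypothetical (detected_at, resolved_at) pair.
    """
    durations = [resolved - detected for detected, resolved in incidents]
    if not durations:
        raise ValueError("at least one incident is required")
    if any(d < timedelta(0) for d in durations):
        raise ValueError("an incident was resolved before it was detected")
    # Start the sum from timedelta(0) so the values add as timedeltas.
    return sum(durations, timedelta(0)) / len(durations)


if __name__ == "__main__":
    sample = [
        (datetime(2025, 1, 1, 10, 0), datetime(2025, 1, 1, 10, 30)),
        (datetime(2025, 1, 2, 14, 0), datetime(2025, 1, 2, 15, 30)),
    ]
    print(mean_time_to_restore(sample))
```

Tracking this value before and after introducing a response playbook is one simple way to show whether the playbook actually shortened recovery.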
ATS Keywords: Site Reliability Engineering, DevOps, Cloud Infrastructure, Distributed Systems, IT Governance, Large-Scale Systems, Automation, Incident Management, Performance Optimization, Team Leadership, Problem Solving, Communication, Agile, Scrum, CI/CD, IaC, Monitoring, Observability, RCA, MTTR, SLA, SLO
📌 Application Steps
To apply for this Site Reliability Engineer position:
- Update Your Resume: Highlight your relevant experience with large-scale distributed systems, cloud infrastructure, and IT governance; emphasize your problem-solving approach, communication skills, and leadership abilities.
- Tailor Your Portfolio: Showcase projects that demonstrate your ability to drive automation, optimize system performance, and manage incident resolution; include examples of your problem-solving approach and communication skills in the context of business risk and opportunity.
- Prepare for Phone Screen: Practice common interview questions and be ready to summarize your technical skills and experience concisely.
- Research Etihad Airways: Familiarize yourself with the company's mission, values, and culture; understand the aviation and travel industry and Etihad's role within it.
- Prepare for Technical Deep Dive & Behavioral Interview: Review the technical areas listed in this posting, and rehearse your responses to behavioral and situational interview questions, focusing on your leadership, communication, and teamwork skills.
- Finalize Your Application: Submit your application through the Etihad Airways careers portal, ensuring all required documents are included.
⚠️ Important Notice: This enhanced job description includes AI-generated insights and web technology industry-standard assumptions. All details should be verified directly with the hiring organization before making application decisions.
Application Requirements
Candidates should have 7+ years of experience in software development and 3+ years in a DevOps or SRE role. Expertise in large-scale distributed systems and cloud infrastructure is essential.