📍 Job Overview

Job Title: Principal Site Reliability Engineer
Company: Commonwealth Bank
Location: Bangalore - Manyata Tech Park Road, India
Job Type: Full-Time
Category: DevOps, Site Reliability Engineering
Date Posted: June 25, 2025
Experience Level: 14+ years
Remote Status: On-site

🚀 Role Summary

Lead the adoption of Site Reliability Engineering (SRE) principles and practices across the Group, ensuring reliability is a first-class feature in product and service roadmaps.
Improve the observability of CBA's key products and services, enabling quick identification and resolution of reliability risks.
Continuously improve the operability of services by identifying and engineering strategic optimizations to architecture, infrastructure, release management, and observability processes and tooling.
Engage with operational teams to triage priority incidents, facilitate blameless postmortems, and implement strategic improvements to Group-wide culture, process, and tooling.
Support the 24x7 online environment through participation in on-call rotations and escalation workflows.
Influence executive stakeholders to modify Group processes, platforms, and systems to ensure reliability is a top priority.
Accelerate and standardize the adoption of SRE across the Group's services, infrastructure, systems, and processes by developing and supporting frameworks and tooling.
Mentor Engineering Chapter members, BU stakeholders, and executive leaders on SRE best practices.

📝 Enhancement Note: This role requires a high level of technical expertise and leadership skills to drive SRE adoption and improve reliability across the Group's complex environment. Candidates should have extensive experience in software engineering, observability tools, and modern software development practices.

💻 Primary Responsibilities

Drive SRE Adoption: Lead the adoption of SRE principles and practices across the Group by influencing product and service roadmaps, ensuring reliability is a first-class feature.
Improve Observability: Develop robust processes and strategic tooling to improve the observability of CBA's key products and services, enabling quick identification and resolution of reliability risks.
Optimize Service Operability: Continuously improve the operability of services by identifying and engineering strategic optimizations to architecture, infrastructure, release management, and observability processes and tooling.
Manage Incidents and Improve Processes: Engage with operational teams to triage priority incidents, facilitate blameless postmortems, and implement strategic improvements to Group-wide culture, process, and tooling for better reliability management.
Support 24x7 Environment: Participate in on-call rotations and escalation workflows to support the 24x7 online environment of critical services.
Influence Executive Stakeholders: Influence executive stakeholders (EM+) to modify Group processes, platforms, and systems so that reliability is a top priority.
Accelerate SRE Adoption: Develop and support frameworks and tooling to accelerate and standardize the adoption of SRE across the Group's services, infrastructure, systems, and processes.
Mentor and Develop Teams: Mentor Engineering Chapter members, BU stakeholders, and executive leaders (EM+) on SRE best practices to continue the development of the world-class engineering team.

📝 Enhancement Note: This role requires a strong focus on problem-solving, communication, and leadership skills to drive change and improve reliability across the Group's complex environment. Candidates should be experienced in working with multiple teams and stakeholders to deliver results.

🎓 Skills & Qualifications

Education: Bachelor’s degree in engineering, preferably in Computer Science/Information Technology.

Experience: 14+ years of experience in Software Engineering, with expertise in at least one programming language (e.g., Golang, Java, C/C++, .NET, or Python).

Required Skills:

Extensive experience with observability tools such as Prometheus, Grafana, AWS CloudWatch, Splunk, AppDynamics.
In-depth knowledge of Linux internals, networking, containers, and troubleshooting.
Strong experience with modern software development practices, using Git for source control and CI/CD tools such as TeamCity, Jenkins, Octopus Deploy, or similar.
Strong public cloud experience in AWS, GCP, or Azure.
Excellent communication and problem-solving skills.
Experience leading teams of engineers and driving outcomes using observability tools.

Preferred Skills:

Experience with inner-source community development and collaboration.
Familiarity with SRE best practices in large organizations.
Strong presentation and public speaking skills.

📝 Enhancement Note: Candidates should have a strong technical background in software engineering, with expertise in observability tools and modern software development practices. Experience in leading teams and driving SRE adoption in large organizations is highly desirable.

📊 Portfolio & Project Requirements

Portfolio Essentials:

Demonstrate experience with observability tools and their application in large-scale environments.
Showcase projects that highlight your ability to improve service operability and reliability.
Include examples of incident management and postmortem analysis.
Highlight your experience with public cloud platforms (AWS, GCP, or Azure) and their use in ensuring high availability and reliability.

Technical Documentation:

Provide detailed documentation of your approach to reliability engineering, including processes, tools, and best practices.
Include case studies of strategic optimizations you've implemented to improve service operability and reliability.
Demonstrate your understanding of SRE principles and their application in real-world scenarios.

📝 Enhancement Note: Candidates should focus on showcasing their technical expertise in reliability engineering, observability tools, and cloud platforms, and should include examples of their ability to drive change and improve reliability in complex environments.

💵 Compensation & Benefits

Salary Range: INR 2,500,000 - 3,500,000 per annum (Based on experience and market standards for Principal Site Reliability Engineers in Bangalore)

Benefits:

Competitive salary and performance-based bonuses.
Comprehensive health insurance and wellness programs.
Generous superannuation contributions.
Flexible work arrangements and leave entitlements.
Learning and development opportunities, including training, mentoring, and career progression programs.

Working Hours: Full-time (40 hours/week), with on-call rotation and escalation workflows for supporting the 24x7 online environment.

📝 Enhancement Note: Salary range is estimated based on market standards for Principal Site Reliability Engineers in Bangalore, considering the candidate's experience level and the role's complexity. Benefits are tailored to attract and retain top talent in the tech industry.

🎯 Team & Company Context

🏢 Company Culture

Industry: Financial Services

Company Size: Large (Over 50,000 employees)

Founded: 1911

Team Structure:

The role reports directly to the Head of Site Reliability Engineering, Group Technology.
The team consists of multiple Site Reliability Engineers, working closely with various Business Units (BUs) and operational teams to ensure reliability is a top priority.
The team collaborates with designers, marketers, and other stakeholders to deliver reliable and user-focused services.

Development Methodology:

Agile/Scrum methodologies for sprint planning and project management.
Code reviews, testing, and quality assurance practices to ensure code quality and reliability.
Deployment strategies, CI/CD pipelines, and automated testing for efficient and reliable releases.

Company Website: www.commbank.com.au

📝 Enhancement Note: The company's large size and complex environment present both challenges and opportunities for the Principal Site Reliability Engineer. The role requires strong collaboration and communication skills to work effectively with multiple teams and stakeholders.

📈 Career & Growth Analysis

Career Level: Principal Site Reliability Engineer. Leads the adoption of SRE principles and practices across the Group, ensuring reliability is a top priority in product and service roadmaps.

Reporting Structure: Reports directly to the Head of Site Reliability Engineering, Group Technology, and influences executive stakeholders (EM+) to modify Group processes, platforms, and systems.

Technical Impact: Drives the adoption of SRE principles and practices across the Group, improving the reliability of CBA's key products and services, and ensuring minimal friction in service delivery.

Growth Opportunities:

Technical Leadership: Grow into a more senior role, such as Head of Site Reliability Engineering or Chief Reliability Officer, driving SRE adoption and strategy across the Group.
Architecture & Design: Specialize in architecture and design, focusing on ensuring the reliability of large-scale, complex systems.
Emerging Technologies: Explore and adopt emerging technologies to improve reliability and service operability, staying at the forefront of SRE best practices.

📝 Enhancement Note: The role offers significant growth opportunities, both in technical leadership and specialization. Candidates should be eager to take on challenges and drive change in a large, complex environment.

🌐 Work Environment

Office Type: Modern, collaborative workspace with state-of-the-art technology and amenities.

Office Location(s): Manyata Tech Park, Bangalore

Workspace Context:

Collaborative workspaces with multiple monitors and testing devices available.
Cross-functional integration with designers, marketers, and other stakeholders to deliver user-focused services.
Flexible work arrangements and flexible hours (the role itself is listed as on-site).

Work Schedule: Full-time (40 hours/week), with on-call rotation and escalation workflows for supporting the 24x7 online environment.

📝 Enhancement Note: The work environment is designed to foster collaboration and innovation, with modern facilities and flexible work arrangements to support a healthy work-life balance.

📄 Application & Technical Interview Process

Interview Process:

Technical Assessment (1 hour): Demonstrate your expertise in observability tools, Linux internals, networking, and troubleshooting through hands-on exercises and case studies.
Architecture & Design Discussion (1 hour): Present your approach to ensuring the reliability of large-scale, complex systems, and discuss your experience with SRE best practices.
Behavioral & Cultural Fit Interview (1 hour): Discuss your problem-solving skills, communication style, and cultural fit with the team and organization.
Final Evaluation & Next Steps (30 minutes): Review your technical assessment and discuss your career aspirations and growth opportunities within the organization.

Portfolio Review Tips:

Demonstrate Your Expertise: Showcase your experience with observability tools, cloud platforms, and SRE best practices through real-world examples and case studies.
Highlight Your Leadership Skills: Include examples of your ability to lead teams, drive change, and influence stakeholders to improve reliability.
Focus on Results: Emphasize the tangible outcomes and improvements you've delivered in previous roles, quantifying the impact where possible.

Technical Challenge Preparation:

Brush Up on Your Technical Skills: Revisit your knowledge of observability tools, Linux internals, networking, and troubleshooting so that you are up to date with current best practices.
Practice Problem-Solving: Work through real-world scenarios and case studies to hone your problem-solving skills and ability to think critically under pressure.
Prepare for Behavioral Questions: Reflect on your past experiences and be ready to discuss your approach to problem-solving, communication, and leadership in a team environment.

ATS Keywords (organized by category):

Programming Languages: Golang, Java, C/C++, .NET, Python
Observability Tools: Prometheus, Grafana, AWS CloudWatch, Splunk, AppDynamics
Cloud Platforms: AWS, GCP, Azure
CI/CD Tools: TeamCity, Jenkins, Octopus Deploy
Version Control: Git
Troubleshooting: Linux internals, networking, containers
Soft Skills: Communication, problem-solving, leadership, teamwork, collaboration
Industry Terms: Site Reliability Engineering (SRE), observability, reliability, incident management, postmortem, cloud platforms, CI/CD, DevOps

📝 Enhancement Note: The interview process is designed to assess the candidate's technical expertise, problem-solving skills, and cultural fit within the organization. Candidates should be prepared to discuss their experience with SRE best practices, observability tools, and cloud platforms in detail.

🛠 Technology Stack & Web Infrastructure

Observability Tools:

Prometheus (time-series database and monitoring tool)
Grafana (visualization and alerting tool)
AWS CloudWatch (cloud-based monitoring and observability service)
Splunk (data analytics and monitoring platform)
AppDynamics (application performance monitoring and analytics)

Cloud Platforms:

AWS (Amazon Web Services)
GCP (Google Cloud Platform)
Azure (Microsoft Azure)

CI/CD Tools:

TeamCity (build management and continuous integration server)
Jenkins (open-source automation server)
Octopus Deploy (deployment automation tool)

📝 Enhancement Note: The technology stack includes industry-leading observability tools, cloud platforms, and CI/CD tools to ensure the reliability and scalability of CBA's services. Candidates should have experience with these tools and be comfortable working in a complex, large-scale environment.

👥 Team Culture & Values

Engineering Values:

Reliability: Prioritize reliability in all aspects of service design, development, and operation.
Simplicity: Focus on simplicity and ease of use in all user interactions and experiences.
Collaboration: Foster a culture of collaboration and knowledge-sharing across teams and disciplines.
Innovation: Encourage continuous learning and exploration of emerging technologies to improve reliability and service operability.

Collaboration Style:

Cross-functional Integration: Work closely with designers, marketers, and other stakeholders to deliver user-focused services and ensure reliability is a top priority.
Code Review Culture: Encourage peer review and knowledge-sharing to improve code quality and reliability.
Mentoring & Knowledge-Sharing: Foster a culture of mentoring and knowledge-sharing to develop the skills and expertise of team members.

📝 Enhancement Note: The team culture emphasizes reliability, simplicity, collaboration, and innovation. Candidates should be comfortable working in a collaborative, user-focused environment and be eager to drive change and improve reliability across the organization.

⚡ Challenges & Growth Opportunities

Technical Challenges:

Complex Environment: Work in a large, complex environment with multiple services, teams, and stakeholders, requiring strong communication and collaboration skills.
Scalability & Performance: Ensure the reliability and scalability of services under high load and traffic, requiring a deep understanding of system design and architecture.
Incident Management: Manage priority incidents and facilitate blameless postmortems, requiring strong problem-solving and communication skills.
Emerging Technologies: Stay up to date with emerging technologies and best practices in SRE, requiring continuous learning and adaptation.

Learning & Development Opportunities:

Technical Skill Development: Deepen your expertise in observability tools, cloud platforms, and SRE best practices through training, workshops, and online resources.
Leadership & Mentoring: Develop your leadership and mentoring skills by working with and guiding team members, as well as influencing executive stakeholders.
Architecture & Design: Specialize in architecture and design, focusing on ensuring the reliability of large-scale, complex systems.

📝 Enhancement Note: The role presents significant technical challenges and growth opportunities. Candidates should be eager to take on complex problems and drive change in a large, dynamic environment.

💡 Interview Preparation

Technical Questions:

Observability Tools: Describe your experience with Prometheus, Grafana, AWS CloudWatch, Splunk, and AppDynamics. Discuss their strengths, weaknesses, and use cases in a large-scale environment.
Cloud Platforms: Compare and contrast AWS, GCP, and Azure. Discuss their features, services, and best practices for ensuring reliability and scalability in a complex environment.
Incident Management: Walk through your approach to incident management, including triage, diagnosis, resolution, and postmortem analysis. Discuss your experience with blameless postmortems and continuous improvement.

Company & Culture Questions:

SRE Adoption: Discuss your experience driving SRE adoption in a large organization. What strategies and tactics did you use to influence product and service roadmaps and ensure reliability was a top priority?
Team Dynamics: Describe your experience working with multiple teams and stakeholders to deliver reliable services. How do you ensure effective communication and collaboration in a complex environment?
User Experience: Explain how you prioritize user experience in your approach to reliability engineering. How do you ensure that reliability improvements do not negatively impact the user experience?

📝 Enhancement Note: The interview process is designed to assess the candidate's technical expertise, problem-solving skills, and cultural fit within the organization. Candidates should be prepared to discuss their experience with SRE best practices, observability tools, and cloud platforms in detail, as well as their approach to incident management, team dynamics, and user experience.

📌 Application Steps

To apply for this Principal Site Reliability Engineer position:

Customize Your Portfolio: Tailor your portfolio to highlight your experience with observability tools, cloud platforms, and SRE best practices. Include real-world examples and case studies that demonstrate your ability to drive change and improve reliability in a complex environment.
Optimize Your Resume: Highlight your technical skills, experience, and achievements in reliability engineering, observability tools, and cloud platforms. Include relevant keywords and phrases so that your resume performs well in applicant tracking systems (ATS).
Prepare for Technical Challenges: Brush up on your technical skills, practice problem-solving, and be ready to discuss your approach to incident management, team dynamics, and user experience.
Research the Company: Learn about Commonwealth Bank's history, culture, and values. Understand the organization's approach to reliability engineering and be prepared to discuss how your experience and skills align with their goals and objectives.

⚠️ Important Notice: This enhanced job description includes AI-generated insights and web technology industry-standard assumptions. All details should be verified directly with the hiring organization before making application decisions.
