Cloud Site Reliability Engineer
Job Overview
- Job Title: Cloud Site Reliability Engineer
- Company: NICE
- Location: Pune, Mahārāshtra, India
- Job Type: Hybrid (2 days office, 3 days remote)
- Category: DevOps, Site Reliability Engineering
- Date Posted: June 25, 2025
- Experience Level: 2-5 years
- Remote Status: Hybrid (on-site with remote flexibility)
Role Summary
- Key Responsibilities: Ensure cloud platforms are observable, measurable, reliable, scalable, and maintainable. Lead root cause investigations into outages, performance problems, and cost issues. Develop automation for low-value tasks and provide technical leadership to the wider Cloud Operations and Support teams.
- Key Technologies: Azure, Kubernetes, Prometheus, Grafana, Bicep, Git, MS-SQL, Elasticsearch, YAML, JSON, XML, C#, PowerShell, Azure DevOps pipelines, NUnit, Jasmine, Selenium.
Enhancement Note: This role requires a strong background in Site Reliability Engineering (SRE) and a deep understanding of cloud platforms, databases, and monitoring tools. Experience with Azure and Kubernetes is particularly valuable for this position.
Primary Responsibilities
- Ensure Cloud Platform Reliability: Act as a 'gatekeeper' for production, managing the work backlog, and developing reliability improvements.
- Investigate Outages and Performance Issues: Lead root cause analysis for outages, performance, and cost issues, and drive reliability improvements.
- Develop Automation: Lead initiatives to automate low-value tasks, balancing project delivery demands.
- Provide Technical Leadership: Offer guidance and oversight to Cloud Operations and Support teams, as well as the products and services they support.
- Configure Monitoring Dashboards and Alerts: Develop and configure monitoring dashboards and alerts in tools like Grafana and Azure Monitor.
- Install and Configure Observability Platform: Install and configure observability platforms, including tools like Grafana, Prometheus, Azure Monitor, and OpenTelemetry.
- Develop Bicep Modules for Monitoring Infrastructure: Develop Bicep modules for monitoring infrastructure and deploy them.
Enhancement Note: This role requires a strong focus on problem-solving, troubleshooting, and driving reliability improvements. Experience with incident management and post-mortem analysis is essential for success in this role.
Skills & Qualifications
Education: Bachelor's degree in Computer Science, Engineering, or a related field. Relevant experience may be considered in lieu of a degree.
Experience: Proven experience (2+ years) in Site Reliability Engineering, Cloud Engineering, or a similar role. Experience with Azure and Kubernetes is a plus.
Required Skills:
- Excellent technical, analytical, and troubleshooting skills
- In-depth knowledge of databases and data handling (MS-SQL, Elasticsearch, YAML, JSON, XML)
- Strong programming or advanced scripting skills (C#, PowerShell)
- Experience with infrastructure/configuration as code and version control (ARM, Bicep, Git)
- Experience managing monitoring, alerting, and dashboarding platforms (Azure Monitor, Prometheus, Grafana, Elasticsearch)
- Demonstrated experience supporting live cloud services and platforms
- Production experience with Kubernetes and containerization
- Implementation and support of service level objectives (SLOs)
- Exposure to commercial cloud providers (ideally Azure; others considered)
- Exposure to Azure DevOps pipelines (CI/CD)
- Exposure to test frameworks (NUnit, Jasmine, Selenium)
Preferred Skills:
- Experience with incident management and post-mortem analysis
- Familiarity with commercial cloud providers other than Azure
- Experience with additional programming languages or scripting tools
Enhancement Note: This role requires a strong technical skillset with a focus on cloud platforms, databases, and monitoring tools. Experience with incident management and post-mortem analysis is a significant advantage.
Web Portfolio & Project Requirements
Portfolio Essentials:
- Cloud Platform Reliability Projects: Include projects demonstrating your ability to ensure cloud platform reliability, manage work backlogs, and drive reliability improvements.
- Incident Management Case Studies: Highlight your experience with incident management, root cause analysis, and post-mortem analysis.
- Automation and Scripting Examples: Showcase your automation and scripting skills, with a focus on infrastructure as code and version control.
Technical Documentation:
- Code Quality and Documentation: Demonstrate your commitment to code quality, commenting, and documentation standards.
- Version Control and Deployment Processes: Highlight your experience with version control, deployment processes, and server configuration.
- Testing Methodologies: Showcase your understanding of testing methodologies, performance metrics, and optimization techniques.
Enhancement Note: This role requires a strong focus on cloud platform reliability, incident management, and automation. Your portfolio should demonstrate your ability to drive reliability improvements and manage complex cloud environments.
Compensation & Benefits
Salary Range: INR 1,200,000 - 1,800,000 per annum (Based on experience and skills)
Benefits:
- Competitive salary and benefits package
- Flexible working hours and remote work options
- Opportunities for professional growth and development
- A dynamic and collaborative work environment
Working Hours: 40 hours per week, with flexibility for on-call services and critical issue resolution.
Enhancement Note: The salary range for this role is based on market research for Site Reliability Engineering roles in Pune, India, with consideration for the candidate's experience and skills.
Team & Company Context
Company Culture
Industry: Public Safety & Justice market, providing software as a service for multi-media evidence management and Emergency Contact Centers.
Company Size: Medium to large-sized organization with a global presence and a strong focus on innovation and growth.
Founded: 1986, with a rich history of providing state-of-the-art solutions to the Public Safety & Justice market.
Team Structure:
- Cloud Operations and Support Teams: Collaborate with these teams to provide technical leadership and oversight.
- Product Teams: Work closely with product teams to ensure cloud platforms meet reliability, scalability, and performance objectives.
- Cross-Functional Teams: Collaborate with designers, marketers, and other stakeholders to drive user-focused solutions.
Development Methodology:
- Agile/Scrum Methodologies: Utilize Agile/Scrum methodologies for sprint planning, code review, and quality assurance.
- CI/CD Pipelines: Implement CI/CD pipelines for automated deployment and testing.
- Infrastructure as Code (IaC): Employ IaC principles for version control, automation, and consistency.
Company Website: NICE
Enhancement Note: NICE is a global company with a strong focus on innovation and growth. This role offers the opportunity to work in a dynamic, collaborative environment with a global impact.
Career & Growth Analysis
Career Level: This role is suitable for experienced Site Reliability Engineers looking to drive reliability improvements, lead investigations, and provide technical leadership in a cloud-focused environment.
Reporting Structure: Report directly to the Manager, with close collaboration with Cloud Operations and Support teams, as well as product teams.
Technical Impact: This role has a significant impact on cloud platform reliability, performance, and user experience. The successful candidate will drive reliability improvements, lead investigations, and provide technical guidance to wider teams.
Growth Opportunities:
- Technical Leadership: Develop your technical leadership skills by providing guidance and oversight to Cloud Operations and Support teams.
- Architecture Decisions: Gain experience in making architecture decisions that drive reliability, scalability, and performance.
- Emerging Technologies: Stay up-to-date with emerging technologies and trends in cloud platforms, databases, and monitoring tools.
Enhancement Note: This role offers significant growth opportunities for experienced Site Reliability Engineers looking to develop their technical leadership skills and gain exposure to architecture decisions and emerging technologies.
Work Environment
Office Type: Modern, collaborative office space with a focus on face-to-face meetings and teamwork.
Office Location(s): Pune, India, with opportunities for remote work and hybrid work arrangements.
Workspace Context:
- Collaborative Work Environment: Work in a collaborative environment with a focus on teamwork and knowledge sharing.
- Development Tools and Resources: Utilize multiple monitors, testing devices, and other resources to support your work.
- Cross-Functional Collaboration: Collaborate with designers, marketers, and other stakeholders to drive user-focused solutions.
Work Schedule: 40 hours per week, with flexibility for on-call services and critical issue resolution. Work remotely for 3 days per week, with 2 days on-site for face-to-face meetings and collaborative work.
Enhancement Note: This role offers a modern, collaborative work environment with opportunities for remote work and hybrid work arrangements. The workspace is designed to support teamwork and knowledge sharing, with a focus on driving user-focused solutions.
Application & Technical Interview Process
Interview Process:
- Technical Screening: Demonstrate your technical skills and problem-solving abilities through coding challenges, system design discussions, and architecture reviews.
- Team Fit Assessment: Showcase your communication skills, cultural fit, and ability to work effectively within a team.
- Final Evaluation: Discuss your technical impact, career goals, and alignment with the role's requirements.
Portfolio Review Tips:
- Cloud Platform Reliability Projects: Highlight projects that demonstrate your ability to ensure cloud platform reliability, manage work backlogs, and drive reliability improvements.
- Incident Management Case Studies: Showcase your experience with incident management, root cause analysis, and post-mortem analysis.
- Automation and Scripting Examples: Emphasize your automation and scripting skills, with a focus on infrastructure as code and version control.
Technical Challenge Preparation:
- Cloud Platform Reliability: Brush up on your knowledge of cloud platforms, databases, and monitoring tools.
- Incident Management: Review incident management best practices, root cause analysis techniques, and post-mortem analysis methodologies.
- Automation and Scripting: Refresh your skills in infrastructure as code, version control, and scripting languages like PowerShell or Bash.
ATS Keywords: [Cloud Platform Reliability, Site Reliability Engineering, Azure, Kubernetes, Monitoring, Incident Management, Automation, Infrastructure as Code (IaC), Version Control, Technical Leadership, Cloud Services, Databases, Performance Optimization, User Experience, Agile Methodologies, CI/CD Pipelines]
Enhancement Note: This role requires a strong focus on technical skills, problem-solving, and incident management. Prepare for technical interviews by brushing up on your knowledge of cloud platforms, databases, and monitoring tools, as well as incident management best practices.
Technology Stack & Web Infrastructure
Cloud Platforms: Azure (Primary), with experience in other commercial cloud providers a plus.
Databases: MS-SQL, Elasticsearch, with experience in additional databases a plus.
Monitoring Tools: Azure Monitor, Prometheus, Grafana, Elasticsearch, with experience in additional monitoring tools a plus.
Infrastructure as Code (IaC) Tools: Bicep, ARM, with experience in additional IaC tools a plus.
Version Control: Git, with experience in additional version control systems a plus.
Programming and Scripting Languages: C#, PowerShell, with experience in additional languages a plus.
Containerization and Orchestration: Kubernetes, with experience in additional container platforms a plus.
Enhancement Note: This role requires a strong focus on cloud platforms, databases, and monitoring tools. Experience with Azure, Kubernetes, and relevant monitoring tools is particularly valuable for this position.
Team Culture & Values
Cloud Platform Reliability Values:
- Reliability: Prioritize cloud platform reliability, availability, and scalability.
- Performance: Optimize cloud platform performance, cost-efficiency, and user experience.
- Automation: Automate low-value tasks to drive efficiency and consistency.
- Collaboration: Work effectively within teams, fostering knowledge sharing and continuous learning.
Collaboration Style:
- Cross-Functional Integration: Collaborate with designers, marketers, and other stakeholders to drive user-focused solutions.
- Code Review Culture: Participate in code reviews to ensure quality, consistency, and knowledge sharing.
- Pair Programming: Engage in pair programming to drive technical excellence and continuous learning.
Enhancement Note: NICE fosters a collaborative, knowledge-sharing culture with a strong focus on cloud platform reliability, performance, and user experience. This role offers the opportunity to work in a dynamic, collaborative environment with a global impact.
Challenges & Growth Opportunities
Technical Challenges:
- Cloud Platform Reliability: Ensure cloud platforms are observable, measurable, reliable, scalable, and maintainable.
- Incident Management: Lead root cause investigations into outages, performance problems, and cost issues, and drive reliability improvements.
- Automation: Develop automation for low-value tasks, balancing project delivery demands.
- Emerging Technologies: Stay up-to-date with emerging technologies and trends in cloud platforms, databases, and monitoring tools.
Learning & Development Opportunities:
- Technical Leadership: Develop your technical leadership skills by providing guidance and oversight to Cloud Operations and Support teams.
- Architecture Decisions: Gain experience in making architecture decisions that drive reliability, scalability, and performance.
- Emerging Technologies: Stay up-to-date with emerging technologies and trends in cloud platforms, databases, and monitoring tools.
Enhancement Note: This role offers significant technical challenges and growth opportunities for experienced Site Reliability Engineers looking to drive reliability improvements, lead investigations, and provide technical leadership in a cloud-focused environment.
Interview Preparation
Technical Questions:
- Cloud Platform Reliability: Demonstrate your understanding of cloud platform reliability, availability, and scalability.
- Incident Management: Showcase your experience with incident management, root cause analysis, and post-mortem analysis.
- Automation: Highlight your automation and scripting skills, with a focus on infrastructure as code and version control.
Company & Culture Questions:
- Cloud Platform Reliability Values: Explain how you prioritize cloud platform reliability, availability, and scalability.
- Collaboration Style: Describe your experience working in a collaborative, knowledge-sharing environment.
- User Experience Impact: Discuss your approach to optimizing cloud platform performance, cost-efficiency, and user experience.
Portfolio Presentation Strategy:
- Cloud Platform Reliability Projects: Highlight projects that demonstrate your ability to ensure cloud platform reliability, manage work backlogs, and drive reliability improvements.
- Incident Management Case Studies: Showcase your experience with incident management, root cause analysis, and post-mortem analysis.
- Automation and Scripting Examples: Emphasize your automation and scripting skills, with a focus on infrastructure as code and version control.
Application Steps
To apply for this Cloud Site Reliability Engineer position:
- Customize Your Portfolio: Highlight projects that demonstrate your ability to ensure cloud platform reliability, manage work backlogs, and drive reliability improvements. Include incident management case studies and automation examples.
- Optimize Your Resume: Emphasize your technical skills, problem-solving abilities, and incident management experience. Tailor your resume to the role's requirements and include relevant keywords.
- Prepare for Technical Interviews: Brush up on your knowledge of cloud platforms, databases, and monitoring tools. Practice coding challenges, system design discussions, and architecture reviews.
- Research the Company: Familiarize yourself with NICE's products, services, and company culture. Understand their focus on cloud platform reliability, performance, and user experience.
Important Notice: This enhanced job description includes AI-generated insights and Site Reliability Engineering industry-standard assumptions. All details should be verified directly with the hiring organization before making application decisions.
Application Requirements
Candidates must have at least 2 years of experience in Site Reliability Engineering and possess excellent technical and troubleshooting skills. Experience with databases, programming, monitoring platforms, and cloud services is essential.