Cloud Site Reliability Engineer

NICE
Full-time • Pune, India

πŸ“ Job Overview

  • Job Title: Cloud Site Reliability Engineer
  • Company: NICE
  • Location: Pune, Mahārāshtra, India
  • Job Type: Hybrid (2 days office, 3 days remote)
  • Category: DevOps, Site Reliability Engineering
  • Date Posted: June 25, 2025
  • Experience Level: 2-5 years
  • Remote Status: On-site with remote flexibility

🚀 Role Summary

  • Key Responsibilities: Ensure cloud platforms are observable, measurable, reliable, scalable, and maintainable. Lead root cause investigations into outages, performance problems, and cost issues. Develop automation for low-value tasks and provide technical leadership to the wider Cloud Operations and Support teams.
  • Key Technologies: Azure, Kubernetes, Prometheus, Grafana, Bicep, Git, MS-SQL, Elasticsearch, YAML, JSON, XML, C#, PowerShell, Azure DevOps pipelines, NUnit, Jasmine, Selenium.

πŸ“ Enhancement Note: This role requires a strong background in Site Reliability Engineering (SRE) and a deep understanding of cloud platforms, databases, and monitoring tools. Experience with Azure and Kubernetes is particularly valuable for this position.

💻 Primary Responsibilities

  • Ensure Cloud Platform Reliability: Act as a 'gatekeeper' for production, manage the work backlog, and develop reliability improvements.
  • Investigate Outages and Performance Issues: Lead root cause analysis for outages, performance, and cost issues, and drive reliability improvements.
  • Develop Automation: Lead initiatives to automate low-value tasks, balancing project delivery demands.
  • Provide Technical Leadership: Offer guidance and oversight to Cloud Operations and Support teams, as well as the products and services they support.
  • Configure Monitoring Dashboards and Alerts: Develop and configure monitoring dashboards and alerts in tools like Grafana and Azure Monitor.
  • Install and Configure Observability Platforms: Install and configure observability tooling such as Grafana, Prometheus, Azure Monitor, and OpenTelemetry.
  • Develop Bicep Modules for Monitoring Infrastructure: Develop Bicep modules for monitoring infrastructure and deploy them.

πŸ“ Enhancement Note: This role requires a strong focus on problem-solving, troubleshooting, and driving reliability improvements. Experience with incident management and post-mortem analysis is essential for success in this role.

🎓 Skills & Qualifications

Education: Bachelor's degree in Computer Science, Engineering, or a related field. Relevant experience may be considered in lieu of a degree.

Experience: Proven experience (2+ years) in Site Reliability Engineering, Cloud Engineering, or a similar role. Experience with Azure and Kubernetes is a plus.

Required Skills:

  • Excellent technical, analytical, and troubleshooting skills
  • In-depth knowledge of databases and data handling (MS-SQL, Elasticsearch, YAML, JSON, XML)
  • Strong programming or advanced scripting skills (C#, PowerShell)
  • Experience with infrastructure/configuration as code and version control (ARM, Bicep, Git)
  • Experience managing monitoring, alerting, and dashboarding platforms (Azure Monitor, Prometheus, Grafana, Elasticsearch)
  • Demonstrated experience supporting live cloud services and platforms
  • Production experience with Kubernetes and containerization
  • Implementation and support of service level objectives (SLOs)
  • Exposure to commercial cloud providers (ideally Azure; others considered)
  • Exposure to Azure DevOps pipelines (CI/CD)
  • Exposure to test frameworks (NUnit, Jasmine, Selenium)

Preferred Skills:

  • Experience with incident management and post-mortem analysis
  • Familiarity with commercial cloud providers other than Azure
  • Experience with additional programming languages or scripting tools

πŸ“ Enhancement Note: This role requires a strong technical skillset with a focus on cloud platforms, databases, and monitoring tools. Experience with incident management and post-mortem analysis is a significant advantage.

📊 Web Portfolio & Project Requirements

Portfolio Essentials:

  • Cloud Platform Reliability Projects: Include projects demonstrating your ability to ensure cloud platform reliability, manage work backlogs, and drive reliability improvements.
  • Incident Management Case Studies: Highlight your experience with incident management, root cause analysis, and post-mortem analysis.
  • Automation and Scripting Examples: Showcase your automation and scripting skills, with a focus on infrastructure as code and version control.

Technical Documentation:

  • Code Quality and Documentation: Demonstrate your commitment to code quality, commenting, and documentation standards.
  • Version Control and Deployment Processes: Highlight your experience with version control, deployment processes, and server configuration.
  • Testing Methodologies: Showcase your understanding of testing methodologies, performance metrics, and optimization techniques.

πŸ“ Enhancement Note: This role requires a strong focus on cloud platform reliability, incident management, and automation. Your portfolio should demonstrate your ability to drive reliability improvements and manage complex cloud environments.

💵 Compensation & Benefits

Salary Range: INR 1,200,000 - 1,800,000 per annum (Based on experience and skills)

Benefits:

  • Competitive salary and benefits package
  • Flexible working hours and remote work options
  • Opportunities for professional growth and development
  • A dynamic and collaborative work environment

Working Hours: 40 hours per week, with flexibility for on-call services and critical issue resolution.

πŸ“ Enhancement Note: The salary range for this role is based on market research for Site Reliability Engineering roles in Pune, India, with consideration for the candidate's experience and skills.

🎯 Team & Company Context

🏢 Company Culture

Industry: Public Safety & Justice. NICE provides software as a service for multimedia evidence management and emergency contact centers.

Company Size: Medium to large-sized organization with a global presence and a strong focus on innovation and growth.

Founded: 1986, with a rich history of providing state-of-the-art solutions to the Public Safety & Justice market.

Team Structure:

  • Cloud Operations and Support Teams: Collaborate with these teams to provide technical leadership and oversight.
  • Product Teams: Work closely with product teams to ensure cloud platforms meet reliability, scalability, and performance objectives.
  • Cross-Functional Teams: Collaborate with designers, marketers, and other stakeholders to drive user-focused solutions.

Development Methodology:

  • Agile/Scrum Methodologies: Utilize Agile/Scrum methodologies for sprint planning, code review, and quality assurance.
  • CI/CD Pipelines: Implement CI/CD pipelines for automated deployment and testing.
  • Infrastructure as Code (IaC): Employ IaC principles for version control, automation, and consistency.

Company Website: NICE

πŸ“ Enhancement Note: NICE is a global company with a strong focus on innovation and growth. This role offers the opportunity to work in a dynamic, collaborative environment with a global impact.

📈 Career & Growth Analysis

Web Technology Career Level: This role is suitable for experienced Site Reliability Engineers looking to drive reliability improvements, lead investigations, and provide technical leadership in a cloud-focused environment.

Reporting Structure: Report directly to the Manager, with close collaboration with Cloud Operations and Support teams, as well as product teams.

Technical Impact: This role has a significant impact on cloud platform reliability, performance, and user experience. The successful candidate will drive reliability improvements, lead investigations, and provide technical guidance to wider teams.

Growth Opportunities:

  • Technical Leadership: Develop your technical leadership skills by providing guidance and oversight to Cloud Operations and Support teams.
  • Architecture Decisions: Gain experience in making architecture decisions that drive reliability, scalability, and performance.
  • Emerging Technologies: Stay up-to-date with emerging technologies and trends in cloud platforms, databases, and monitoring tools.

πŸ“ Enhancement Note: This role offers significant growth opportunities for experienced Site Reliability Engineers looking to develop their technical leadership skills and gain exposure to architecture decisions and emerging technologies.

🌐 Work Environment

Office Type: Modern, collaborative office space with a focus on face-to-face meetings and teamwork.

Office Location(s): Pune, India, with opportunities for remote work and hybrid work arrangements.

Workspace Context:

  • Collaborative Work Environment: Work in a collaborative environment with a focus on teamwork and knowledge sharing.
  • Development Tools and Resources: Utilize multiple monitors, testing devices, and other resources to support your work.
  • Cross-Functional Collaboration: Collaborate with designers, marketers, and other stakeholders to drive user-focused solutions.

Work Schedule: 40 hours per week, with flexibility for on-call services and critical issue resolution. Work remotely for 3 days per week, with 2 days on-site for face-to-face meetings and collaborative work.

πŸ“ Enhancement Note: This role offers a modern, collaborative work environment with opportunities for remote work and hybrid work arrangements. The workspace is designed to support teamwork and knowledge sharing, with a focus on driving user-focused solutions.

📄 Application & Technical Interview Process

Interview Process:

  1. Technical Screening: Demonstrate your technical skills and problem-solving abilities through coding challenges, system design discussions, and architecture reviews.
  2. Team Fit Assessment: Showcase your communication skills, cultural fit, and ability to work effectively within a team.
  3. Final Evaluation: Discuss your technical impact, career goals, and alignment with the role's requirements.

Portfolio Review Tips:

  • Cloud Platform Reliability Projects: Highlight projects that demonstrate your ability to ensure cloud platform reliability, manage work backlogs, and drive reliability improvements.
  • Incident Management Case Studies: Showcase your experience with incident management, root cause analysis, and post-mortem analysis.
  • Automation and Scripting Examples: Emphasize your automation and scripting skills, with a focus on infrastructure as code and version control.

Technical Challenge Preparation:

  • Cloud Platform Reliability: Brush up on your knowledge of cloud platforms, databases, and monitoring tools.
  • Incident Management: Review incident management best practices, root cause analysis techniques, and post-mortem analysis methodologies.
  • Automation and Scripting: Refresh your skills in infrastructure as code, version control, and scripting languages like PowerShell or Bash.

ATS Keywords: [Cloud Platform Reliability, Site Reliability Engineering, Azure, Kubernetes, Monitoring, Incident Management, Automation, Infrastructure as Code (IaC), Version Control, Technical Leadership, Cloud Services, Databases, Performance Optimization, User Experience, Agile Methodologies, CI/CD Pipelines]

πŸ“ Enhancement Note: This role requires a strong focus on technical skills, problem-solving, and incident management. Prepare for technical interviews by brushing up on your knowledge of cloud platforms, databases, and monitoring tools, as well as incident management best practices.

🛠 Technology Stack & Web Infrastructure

Cloud Platforms: Azure (Primary), with experience in other commercial cloud providers a plus.

Databases: MS-SQL, Elasticsearch, with experience in additional databases a plus.

Monitoring Tools: Azure Monitor, Prometheus, Grafana, Elasticsearch, with experience in additional monitoring tools a plus.

Infrastructure as Code (IaC) Tools: Bicep, ARM, with experience in additional IaC tools a plus.

Version Control: Git, with experience in additional version control systems a plus.

Programming and Scripting Languages: C#, PowerShell, with experience in additional languages a plus.

Containerization and Orchestration: Kubernetes, with experience in additional container platforms a plus.

πŸ“ Enhancement Note: This role requires a strong focus on cloud platforms, databases, and monitoring tools. Experience with Azure, Kubernetes, and relevant monitoring tools is particularly valuable for this position.

👥 Team Culture & Values

Cloud Platform Reliability Values:

  • Reliability: Prioritize cloud platform reliability, availability, and scalability.
  • Performance: Optimize cloud platform performance, cost-efficiency, and user experience.
  • Automation: Automate low-value tasks to drive efficiency and consistency.
  • Collaboration: Work effectively within teams, fostering knowledge sharing and continuous learning.

Collaboration Style:

  • Cross-Functional Integration: Collaborate with designers, marketers, and other stakeholders to drive user-focused solutions.
  • Code Review Culture: Participate in code reviews to ensure quality, consistency, and knowledge sharing.
  • Pair Programming: Engage in pair programming to drive technical excellence and continuous learning.

πŸ“ Enhancement Note: NICE fosters a collaborative, knowledge-sharing culture with a strong focus on cloud platform reliability, performance, and user experience. This role offers the opportunity to work in a dynamic, collaborative environment with a global impact.

⚡ Challenges & Growth Opportunities

Technical Challenges:

  • Cloud Platform Reliability: Ensure cloud platforms are observable, measurable, reliable, scalable, and maintainable.
  • Incident Management: Lead root cause investigations into outages, performance problems, and cost issues, and drive reliability improvements.
  • Automation: Develop automation for low-value tasks, balancing project delivery demands.
  • Emerging Technologies: Stay up-to-date with emerging technologies and trends in cloud platforms, databases, and monitoring tools.

Learning & Development Opportunities:

  • Technical Leadership: Develop your technical leadership skills by providing guidance and oversight to Cloud Operations and Support teams.
  • Architecture Decisions: Gain experience in making architecture decisions that drive reliability, scalability, and performance.
  • Emerging Technologies: Stay up-to-date with emerging technologies and trends in cloud platforms, databases, and monitoring tools.

πŸ“ Enhancement Note: This role offers significant technical challenges and growth opportunities for experienced Site Reliability Engineers looking to drive reliability improvements, lead investigations, and provide technical leadership in a cloud-focused environment.

💡 Interview Preparation

Technical Questions:

  • Cloud Platform Reliability: Demonstrate your understanding of cloud platform reliability, availability, and scalability.
  • Incident Management: Showcase your experience with incident management, root cause analysis, and post-mortem analysis.
  • Automation: Highlight your automation and scripting skills, with a focus on infrastructure as code and version control.

Company & Culture Questions:

  • Cloud Platform Reliability Values: Explain how you prioritize cloud platform reliability, availability, and scalability.
  • Collaboration Style: Describe your experience working in a collaborative, knowledge-sharing environment.
  • User Experience Impact: Discuss your approach to optimizing cloud platform performance, cost-efficiency, and user experience.

πŸ“ Enhancement Note: This role requires a strong focus on technical skills, problem-solving, and incident management. Prepare for technical interviews by brushing up on your knowledge of cloud platforms, databases, and monitoring tools, as well as incident management best practices.

📌 Application Steps

To apply for this Cloud Site Reliability Engineer position:

  1. Customize Your Portfolio: Highlight projects that demonstrate your ability to ensure cloud platform reliability, manage work backlogs, and drive reliability improvements. Include incident management case studies and automation examples.
  2. Optimize Your Resume: Emphasize your technical skills, problem-solving abilities, and incident management experience. Tailor your resume to the role's requirements and include relevant keywords.
  3. Prepare for Technical Interviews: Brush up on your knowledge of cloud platforms, databases, and monitoring tools. Practice coding challenges, system design discussions, and architecture reviews.
  4. Research the Company: Familiarize yourself with NICE's products, services, and company culture. Understand their focus on cloud platform reliability, performance, and user experience.

⚠️ Important Notice: This enhanced job description includes AI-generated insights and Site Reliability Engineering industry-standard assumptions. All details should be verified directly with the hiring organization before making application decisions.

Application Requirements

Candidates must have at least 2 years of experience in Site Reliability Engineering and possess excellent technical and troubleshooting skills. Experience with databases, programming, monitoring platforms, and cloud services is essential.