Senior Site Reliability Engineer (Observability & Resilience)

MagicSchool AI
Full-time • $130k-150k/year (USD)

📝 Job Overview

  • Job Title: Senior Site Reliability Engineer (Observability & Resilience)
  • Company: MagicSchool AI
  • Location: United States (Remote OK)
  • Job Type: Full-Time
  • Category: DevOps, Site Reliability Engineering
  • Date Posted: 2025-07-02
  • Experience Level: 5-10 years
  • Remote Status: Remote OK

🚀 Role Summary

  • 📝 Enhancement Note: This role focuses on driving observability and resilience across MagicSchool's generative AI platform for educators, with a strong emphasis on cross-functional collaboration and enabling product engineers.

  • Lead observability strategy and implementation to ensure clear, actionable visibility into platform behavior and performance.

  • Build and maintain internal tooling and dashboards to empower teams with real-time system insights.

  • Define and maintain Service Level Indicators (SLIs) and Service Level Objectives (SLOs) in partnership with product and engineering teams.

  • Architect and support infrastructure that prioritizes high availability, disaster recovery, and graceful degradation using Terraform and infrastructure-as-code principles across AWS and Google Cloud.

  • Collaborate with engineers across teams to embed resilient design and observability from the ground up, providing training and pairing support to product engineers.

💻 Primary Responsibilities

  • 📝 Enhancement Note: This role requires a balance of technical depth and breadth, with a strong focus on enabling and empowering other engineers to build and maintain observable, resilient systems.

  • Observability Leadership:

    • Design and implement observability patterns, including metrics, logging, tracing, and alerting.
    • Ensure clear, actionable visibility into platform behavior and performance.
  • Build Internal Tooling and Dashboards:

    • Empower teams with real-time system insights by creating intuitive, user-friendly dashboards.
    • Facilitate data-driven decision-making and incident response through effective visualization of platform data.
  • Operational Excellence:

    • Define and maintain SLIs and SLOs in collaboration with product and engineering teams.
    • Establish best practices for alert tuning and signal-to-noise balancing to reduce incident fatigue and improve response accuracy.
  • Platform Resilience:

    • Architect and support infrastructure that prioritizes high availability, disaster recovery, and graceful degradation.
    • Leverage Terraform and infrastructure-as-code workflows to ensure consistent, reliable deployments across AWS and Google Cloud.
  • Cross-Functional Enablement:

    • Collaborate with engineers across teams to embed resilient design and observability from the ground up.
    • Provide training and pairing support to product engineers, helping them build and maintain telemetry that supports the full software lifecycle.

🎓 Skills & Qualifications

Education: Bachelor's degree in Computer Science, Engineering, or a related field. Relevant experience may be considered in lieu of a degree.

Experience: At least 5 years in an SRE, DevOps, or observability-focused role, with a track record of success in fast-paced, high-growth environments.

Required Skills:

  • Proven experience in designing and operating systems for high availability and disaster recovery.
  • Deep expertise with observability tools such as Grafana, Prometheus, Loki, Datadog, and OpenTelemetry.
  • Strong proficiency with Terraform and infrastructure-as-code workflows.
  • Experience with multi-cloud deployments and operating resilient systems at scale.
  • Excellent communication skills, with the ability to explain complex infrastructure and observability concepts to both technical and non-technical audiences.

Preferred Skills:

  • Experience with Sentinel, Loki, or similar logging/metrics stacks.
  • Exposure to educational or compliance-heavy environments.
  • Strong debugging skills and a calm presence during incidents.

📊 Web Portfolio & Project Requirements

Portfolio Essentials:

  • Demonstrate a strong track record of driving observability and resilience in large-scale, complex systems.
  • Showcase your ability to build and maintain internal tooling and dashboards that empower teams with real-time system insights.
  • Highlight your experience in defining and maintaining SLIs and SLOs, and your ability to establish best practices for alert tuning and signal-to-noise balancing.

Technical Documentation:

  • Provide detailed documentation of your approach to observability, including metrics, logging, tracing, and alerting strategies.
  • Include examples of how you have defined and maintained SLIs and SLOs, and how you have established best practices for alert tuning and signal-to-noise balancing.
  • Demonstrate your ability to collaborate with product and engineering teams to plan for Resilience, Recovery, and Availability.

💵 Compensation & Benefits

Salary Range: $130,000 - $150,000 per year (based on regional market rates for senior SRE roles in the United States)

Benefits:

  • Unlimited time off to empower employees to manage their work-life balance.
  • Choice of employer-paid health insurance plans, including dental and vision, at very low premiums.
  • Generous stock options vested over 4 years.
  • 401k match and monthly wellness stipend.

Working Hours: 40 hours per week, with flexibility for deployment windows, maintenance, and project deadlines.

🎯 Team & Company Context

Company Culture:

  • Industry: Education technology, with a focus on generative AI for educators.
  • Company Size: Medium-sized, with a strong emphasis on collaboration, trust, communication, and flexibility.
  • Founded: 2023, with a mission to make education more efficient and equitable through AI technology.

Team Structure:

  • The SRE team works closely with product and engineering teams to ensure the platform's reliability, availability, and performance.
  • The team is responsible for driving observability, resilience, and operational excellence across the platform.

Development Methodology:

  • Agile development methodologies, with a focus on collaboration, iteration, and continuous improvement.
  • Code reviews, testing, and quality assurance practices to ensure high-quality, maintainable code.
  • Deployment strategies, CI/CD pipelines, and server management to support the platform's scalability and resilience.

Company Website: MagicSchool AI

📝 Enhancement Note: MagicSchool AI places a strong emphasis on fostering a unique culture built on relationships, trust, communication, and collaboration, regardless of team members' locations.

📈 Career & Growth Analysis

Web Technology Career Level: Senior Site Reliability Engineer, responsible for driving observability and resilience across the platform, with a strong focus on enabling and empowering other engineers.

Reporting Structure: This role reports directly to the Head of Site Reliability Engineering and collaborates closely with product and engineering teams.

Technical Impact: This role has a significant impact on the platform's reliability, availability, and performance, as well as the ability to empower other engineers to build and maintain observable, resilient systems.

Growth Opportunities:

  • Growth Opportunity 1: Expand your expertise in observability and resilience, driving best practices and standards across the organization.
  • Growth Opportunity 2: Develop your leadership skills by mentoring other engineers and contributing to the team's growth and development.
  • Growth Opportunity 3: Explore opportunities to specialize in specific areas of observability, resilience, or infrastructure, depending on your interests and the organization's needs.

📝 Enhancement Note: MagicSchool AI offers ample opportunities for growth and development, with a strong emphasis on enabling employees to take ownership of their careers and contribute to the organization's success.

🌐 Work Environment

Office Type: Remote-first, with a strong emphasis on collaboration, trust, communication, and flexibility.

Office Location(s): United States, with a diverse, global user base.

Workspace Context:

  • Workspace Aspect 1: Collaborative work environment, with a strong emphasis on cross-functional teamwork and communication.
  • Workspace Aspect 2: Access to modern development tools, multiple monitors, and testing devices to support effective observability and resilience work.
  • Workspace Aspect 3: Opportunities for knowledge sharing, technical mentoring, and continuous learning, with a strong emphasis on enabling and empowering other engineers.

Work Schedule: Flexible work schedule, with core hours and regular team meetings to facilitate collaboration and communication. Working hours may vary depending on project deadlines, maintenance windows, and incident response.

📝 Enhancement Note: MagicSchool AI's remote-first work environment fosters a unique culture built on relationships, trust, communication, and collaboration, with a strong emphasis on empowering employees to manage their work-life balance.

📄 Application & Technical Interview Process

Interview Process:

  • Process Step 1: Technical screening to assess your understanding of observability, resilience, and infrastructure-as-code principles. Prepare for coding and configuration assessment exercises related to these topics.
  • Process Step 2: Deep dive into your observability and resilience strategies, with a focus on system design and architecture. Be prepared to discuss your approach to alert tuning, signal-to-noise balancing, and incident response.
  • Process Step 3: Cultural fit assessment, with a focus on your ability to collaborate effectively with product and engineering teams. Prepare to discuss your approach to enabling and empowering other engineers.
  • Process Step 4: Final evaluation criteria, including your ability to drive observability and resilience across the platform and your potential for growth and development within the organization.

Portfolio Review Tips:

  • Portfolio Tip 1: Highlight your track record of driving observability and resilience in large-scale, complex systems.
  • Portfolio Tip 2: Showcase your ability to build and maintain internal tooling and dashboards that empower teams with real-time system insights.
  • Portfolio Tip 3: Demonstrate your experience in defining and maintaining SLIs and SLOs, and your ability to establish best practices for alert tuning and signal-to-noise balancing.
  • Portfolio Tip 4: Emphasize your ability to collaborate with product and engineering teams to plan for Resilience, Recovery, and Availability.

Technical Challenge Preparation:

  • Challenge Preparation 1: Familiarize yourself with MagicSchool's platform and user base, with a focus on the unique challenges and opportunities presented by the education technology industry.
  • Challenge Preparation 2: Brush up on your knowledge of observability tools, infrastructure-as-code workflows, and multi-cloud deployments.
  • Challenge Preparation 3: Prepare for scenario-based exercises that assess your ability to drive observability and resilience in a dynamic, fast-paced environment.

ATS Keywords: Site Reliability Engineering, Observability, Resilience, Infrastructure as Code, Terraform, AWS, Google Cloud, Incident Response, Alert Fatigue Reduction, Collaboration, Communication, Telemetry, Operational Excellence, High Availability, Disaster Recovery, Real-Time Insights, Training, Product Engineering, Agile Methodologies, Code Reviews, Testing, Quality Assurance, Deployment Strategies, CI/CD Pipelines, Server Management, Education Technology, Generative AI.

📝 Enhancement Note: MagicSchool AI's interview process focuses on assessing your technical expertise, cultural fit, and potential for growth and development within the organization. Prepare for a challenging, engaging, and insightful interview experience.

🛠 Technology Stack & Web Infrastructure

Observability Tools:

  • Grafana: For visualizing metrics, logs, and traces.
  • Prometheus: For monitoring and alerting based on custom metrics.
  • Loki: For logging and monitoring of structured and unstructured data.
  • Datadog: For cloud-based monitoring, alerting, and observability.
  • OpenTelemetry: For instrumenting, generating, collecting, and exporting telemetry data to help analyze software systems.

Infrastructure Tools:

  • Terraform: For infrastructure as code, enabling consistent, reliable deployments across AWS and Google Cloud.
  • AWS: For cloud-based infrastructure, including EC2, RDS, and S3 services.
  • Google Cloud: For cloud-based infrastructure, including Compute Engine, Cloud SQL, and Cloud Storage services.

Development & DevOps Tools:

  • Git: For version control and collaborative development.
  • GitHub: For remote repositories, code reviews, and project management.
  • Jenkins: For continuous integration and deployment pipelines.
  • Ansible: For configuration management and deployment automation.

📝 Enhancement Note: MagicSchool AI's technology stack is designed to support the platform's scalability, resilience, and observability, with a strong emphasis on enabling and empowering engineers to build and maintain high-quality, performant systems.

👥 Team Culture & Values

Web Development Values:

  • Value 1: Educators are the most important ingredient in the educational process - they are the magic, not the AI. Trust them, empower them, and put them at the center of leading change in service of students and families.
  • Value 2: Bring joy and magic into every learning experience - push the boundaries of what’s possible with AI.
  • Value 3: Foster community that supports one another during a time of rapid technological change. Listen to them and serve their needs.
  • Value 4: The education system is outdated and in need of innovation and change - AI is an opportunity to bring equity, access, and serve the individual needs of students better than we ever have before.
  • Value 5: Put responsibility and safety at the forefront of the technological change that AI is bringing to education.
  • Value 6: Diversity of thought, perspectives, and backgrounds helps us serve the wide audience of educators and students around the world.
  • Value 7: Educators and students deserve the best - and we strive for the highest quality in everything we do.

Collaboration Style:

  • Collaboration Approach 1: Cross-functional integration between developers, designers, and stakeholders, with a strong emphasis on user experience and user impact measurement.
  • Collaboration Approach 2: Code review culture and peer programming practices, with a focus on knowledge sharing and continuous learning.
  • Collaboration Approach 3: Regular team meetings and one-on-one check-ins to facilitate communication, collaboration, and growth.

📝 Enhancement Note: MagicSchool AI's team culture is built on a strong foundation of trust, communication, and collaboration, with a shared commitment to driving innovation and change in the education technology industry.

⚡ Challenges & Growth Opportunities

Technical Challenges:

  • Challenge 1: Design and implement observability patterns that ensure clear, actionable visibility into platform behavior and performance, while minimizing alert fatigue and maximizing signal-to-noise ratio.
  • Challenge 2: Establish and maintain SLIs and SLOs that balance the needs of the platform, users, and educators, while minimizing downtime and maximizing system availability.
  • Challenge 3: Build and maintain internal tooling and dashboards that empower teams with real-time system insights, while minimizing manual effort and maximizing user impact.
  • Challenge 4: Collaborate with product and engineering teams to plan for Resilience, Recovery, and Availability, while minimizing technical debt and maximizing system performance.

Learning & Development Opportunities:

  • Learning Opportunity 1: Expand your expertise in observability and resilience, with a focus on driving best practices and standards across the organization.
  • Learning Opportunity 2: Develop your leadership skills by mentoring other engineers and contributing to the team's growth and development.
  • Learning Opportunity 3: Explore opportunities to specialize in specific areas of observability, resilience, or infrastructure, depending on your interests and the organization's needs.

💡 Interview Preparation

Technical Questions:

  • Technical Question 1: Describe your approach to designing and implementing observability patterns, with a focus on minimizing alert fatigue and maximizing signal-to-noise ratio.
  • Technical Question 2: How do you establish and maintain SLIs and SLOs that balance the needs of the platform, users, and educators, while minimizing downtime and maximizing system availability?
  • Technical Question 3: Walk us through your process for building and maintaining internal tooling and dashboards that empower teams with real-time system insights, while minimizing manual effort and maximizing user impact.
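
When preparing for Technical Questions 1 and 2, it can help to have the SLI, error budget, and burn-rate arithmetic at your fingertips. The following is a minimal sketch, not MagicSchool's implementation: the class and function names are hypothetical, the 99.9% target is an example value, and the 14.4 paging threshold is an illustrative default (a burn rate that would consume about 2% of a 30-day budget in one hour).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySLO:
    """Hypothetical availability SLO: the fraction of requests that should succeed."""

    target: float          # e.g. 0.999 means 99.9% of requests should succeed
    window_days: int = 30  # rolling evaluation window (illustrative default)

    def __post_init__(self) -> None:
        # A target of exactly 1.0 would leave no error budget to divide by.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def availability_sli(good: int, total: int) -> float:
    """SLI: share of successful requests; no traffic is treated as fully available."""
    return 1.0 if total == 0 else good / total


def error_budget_remaining(slo: AvailabilitySLO, good: int, total: int) -> float:
    """Fraction of the error budget left: 1.0 untouched, 0.0 spent, negative overspent."""
    if total == 0:
        return 1.0
    allowed_failures = (1.0 - slo.target) * total
    return 1.0 - (total - good) / allowed_failures


def burn_rate(slo: AvailabilitySLO, good: int, total: int) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return (1.0 - availability_sli(good, total)) / (1.0 - slo.target)


def should_page(rate: float, threshold: float = 14.4) -> bool:
    """Page only on fast burn; slower burn can go to a ticket queue instead."""
    return rate >= threshold
```

With a 99.9% target, 500 failed requests out of 1,000,000 consume roughly half of the error budget. Alerting on burn rate rather than on raw error counts is one common way to improve signal-to-noise, because brief blips that barely touch the budget do not page anyone.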

Company & Culture Questions:

  • Culture Question 1: How do you approach collaborating with product and engineering teams to plan for Resilience, Recovery, and Availability, while minimizing technical debt and maximizing system performance?
  • Culture Question 2: Describe your experience with education technology and generative AI, and how you have leveraged these tools to drive innovation and change in the education industry.
  • Culture Question 3: How do you balance the needs of educators, students, and the platform when making technical decisions, and how do you ensure that your solutions are user-focused and impactful?

Portfolio Presentation Strategy:

  • Presentation Strategy 1: Highlight your track record of driving observability and resilience in large-scale, complex systems, with a focus on the unique challenges and opportunities presented by the education technology industry.
  • Presentation Strategy 2: Showcase your ability to build and maintain internal tooling and dashboards that empower teams with real-time system insights, with a focus on user experience and user impact measurement.
  • Presentation Strategy 3: Demonstrate your experience in defining and maintaining SLIs and SLOs, and your ability to establish best practices for alert tuning and signal-to-noise balancing, with a focus on driving innovation and change in the education technology industry.

📌 Application Steps

To apply for this Senior Site Reliability Engineer (Observability & Resilience) position at MagicSchool AI:

  1. Tailor your resume and portfolio to highlight your experience in driving observability and resilience in large-scale, complex systems.
  2. Research MagicSchool AI's platform, user base, and company culture, with a focus on the unique challenges and opportunities presented by the education technology industry.
  3. Prepare for technical interviews by brushing up on your knowledge of observability tools, infrastructure-as-code workflows, and multi-cloud deployments.
  4. Submit your application through the application link provided, and follow up with any additional information or clarification as needed.

📝 Enhancement Note: MagicSchool AI's application process is designed to assess your technical expertise, cultural fit, and potential for growth and development within the organization.


Application Requirements

At least 5 years of experience in an SRE, DevOps, or observability-focused role is required. Candidates should have expertise in observability tools and infrastructure skills, particularly with Terraform and multi-cloud deployments.