📍 Job Overview

Job Title: Senior Site Reliability Engineer (Observability & Resilience)
Company: MagicSchool AI
Location: United States (Remote OK)
Job Type: Full-Time
Category: DevOps, Site Reliability Engineering
Date Posted: 2025-07-02
Experience Level: 5-10 years
Remote Status: Remote OK

🚀 Role Summary

📝 Enhancement Note: This role focuses on driving observability and resilience across MagicSchool's generative AI platform for educators, with a strong emphasis on cross-functional collaboration and enabling product engineers.
Lead observability strategy and implementation to ensure clear, actionable visibility into platform behavior and performance.
Build and maintain internal tooling and dashboards to empower teams with real-time system insights.
Define and maintain Service Level Indicators (SLIs) and Service Level Objectives (SLOs) in partnership with product and engineering teams.
Architect and support infrastructure that prioritizes high availability, disaster recovery, and graceful degradation using Terraform and infrastructure-as-code principles across AWS and Google Cloud.
Collaborate with engineers across teams to embed resilient design and observability from the ground up, providing training and pairing support to product engineers.

💻 Primary Responsibilities

📝 Enhancement Note: This role requires a balance of technical depth and breadth, with a strong focus on enabling and empowering other engineers to build and maintain observable, resilient systems.
Observability Leadership:
- Design and implement observability patterns, including metrics, logging, tracing, and alerting.
- Ensure clear, actionable visibility into platform behavior and performance.
Build Internal Tooling and Dashboards:
- Empower teams with real-time system insights by creating intuitive, user-friendly dashboards.
- Facilitate data-driven decision-making and incident response through effective visualization of platform data.
Operational Excellence:
- Define and maintain SLIs and SLOs in collaboration with product and engineering teams.
- Establish best practices for alert tuning and signal-to-noise balancing to reduce incident fatigue and improve response accuracy.
Platform Resilience:
- Architect and support infrastructure that prioritizes high availability, disaster recovery, and graceful degradation.
- Leverage Terraform and infrastructure-as-code workflows to ensure consistent, reliable deployments across AWS and Google Cloud.
Cross-Functional Enablement:
- Collaborate with engineers across teams to embed resilient design and observability from the ground up.
- Provide training and pairing support to product engineers, helping them build and maintain telemetry that supports the full software lifecycle.

🎓 Skills & Qualifications

Education: Bachelor's degree in Computer Science, Engineering, or a related field. Relevant experience may be considered in lieu of a degree.

Experience: At least 5 years in an SRE, DevOps, or observability-focused role, with a track record of success in fast-paced, high-growth environments.

Required Skills:

Proven experience in designing and operating systems for high availability and disaster recovery.
Deep expertise with observability tools such as Grafana, Prometheus, Loki, Datadog, and OpenTelemetry.
Strong proficiency with Terraform and infrastructure-as-code workflows.
Experience with multi-cloud deployments and operating resilient systems at scale.
Excellent communication skills, with the ability to explain complex infrastructure and observability concepts to both technical and non-technical audiences.

Preferred Skills:

Experience with Sentinel, Loki, or similar logging/metrics stacks.
Exposure to educational or compliance-heavy environments.
Strong debugging skills and a calm presence during incidents.

📊 Portfolio & Project Requirements

Portfolio Essentials:

Demonstrate a strong track record of driving observability and resilience in large-scale, complex systems.
Showcase your ability to build and maintain internal tooling and dashboards that empower teams with real-time system insights.
Highlight your experience in defining and maintaining SLIs and SLOs, and your ability to establish best practices for alert tuning and signal-to-noise balancing.

Technical Documentation:

Provide detailed documentation of your approach to observability, including metrics, logging, tracing, and alerting strategies.
Include examples of how you have defined and maintained SLIs and SLOs, and how you have established best practices for alert tuning and signal-to-noise balancing.
Demonstrate your ability to collaborate with product and engineering teams to plan for Resilience, Recovery, and Availability.

💵 Compensation & Benefits

Salary Range: $130,000 - $150,000 per year (based on regional market rates for senior SRE roles in the United States)

Benefits:

Unlimited time off to empower employees to manage their work-life balance.
Choice of employer-paid health insurance plans, including dental and vision, at very low premiums.
Generous stock options vesting over 4 years.
401(k) match and monthly wellness stipend.

Working Hours: 40 hours per week, with flexibility for deployment windows, maintenance, and project deadlines.

🎯 Team & Company Context

Company Culture:

Industry: Education technology, with a focus on generative AI for educators.
Company Size: Medium-sized, with a strong emphasis on collaboration, trust, communication, and flexibility.
Founded: 2023, with a mission to make education more efficient and equitable through AI technology.

Team Structure:

The SRE team works closely with product and engineering teams to ensure the platform's reliability, availability, and performance.
The team is responsible for driving observability, resilience, and operational excellence across the platform.

Development Methodology:

Agile development methodologies, with a focus on collaboration, iteration, and continuous improvement.
Code reviews, testing, and quality assurance practices to ensure high-quality, maintainable code.
Deployment strategies, CI/CD pipelines, and server management to support the platform's scalability and resilience.

Company Website: MagicSchool AI

📝 Enhancement Note: MagicSchool AI places a strong emphasis on fostering a unique culture built on relationships, trust, communication, and collaboration, regardless of team members' locations.

📈 Career & Growth Analysis

Career Level: Senior Site Reliability Engineer, responsible for driving observability and resilience across the platform, with a strong focus on enabling and empowering other engineers.

Reporting Structure: This role reports directly to the Head of Site Reliability Engineering and collaborates closely with product and engineering teams.

Technical Impact: This role has a significant impact on the platform's reliability, availability, and performance, as well as the ability to empower other engineers to build and maintain observable, resilient systems.

Growth Opportunities:

Growth Opportunity 1: Expand your expertise in observability and resilience, driving best practices and standards across the organization.
Growth Opportunity 2: Develop your leadership skills by mentoring other engineers and contributing to the team's growth and development.
Growth Opportunity 3: Explore opportunities to specialize in specific areas of observability, resilience, or infrastructure, depending on your interests and the organization's needs.

📝 Enhancement Note: MagicSchool AI offers ample opportunities for growth and development, with a strong emphasis on enabling employees to take ownership of their careers and contribute to the organization's success.

🌐 Work Environment

Office Type: Remote-first, with a strong emphasis on collaboration, trust, communication, and flexibility.

Office Location(s): United States (remote). The platform itself serves a diverse, global user base.

Workspace Context:

Workspace Aspect 1: Collaborative work environment, with a strong emphasis on cross-functional teamwork and communication.
Workspace Aspect 2: Access to modern development tools, multiple monitors, and testing devices to support effective observability and resilience work.
Workspace Aspect 3: Opportunities for knowledge sharing, technical mentoring, and continuous learning, with a strong emphasis on enabling and empowering other engineers.

Work Schedule: Flexible work schedule, with core hours and regular team meetings to facilitate collaboration and communication. Working hours may vary depending on project deadlines, maintenance windows, and incident response.

📝 Enhancement Note: MagicSchool AI's remote-first work environment fosters a unique culture built on relationships, trust, communication, and collaboration, with a strong emphasis on empowering employees to manage their work-life balance.

📄 Application & Technical Interview Process

Interview Process:

Process Step 1: Technical screening to assess your understanding of observability, resilience, and infrastructure-as-code principles. Prepare for coding and configuration assessment exercises related to these topics.
Process Step 2: Deep dive into your observability and resilience strategies, with a focus on system design and architecture. Be prepared to discuss your approach to alert tuning, signal-to-noise balancing, and incident response.
Process Step 3: Cultural fit assessment, with a focus on your ability to collaborate effectively with product and engineering teams. Prepare to discuss your approach to enabling and empowering other engineers.
Process Step 4: Final evaluation, covering your ability to drive observability and resilience across the platform and your potential for growth and development within the organization.

Portfolio Review Tips:

Portfolio Tip 1: Highlight your track record of driving observability and resilience in large-scale, complex systems.
Portfolio Tip 2: Showcase your ability to build and maintain internal tooling and dashboards that empower teams with real-time system insights.
Portfolio Tip 3: Demonstrate your experience in defining and maintaining SLIs and SLOs, and your ability to establish best practices for alert tuning and signal-to-noise balancing.
Portfolio Tip 4: Emphasize your ability to collaborate with product and engineering teams to plan for Resilience, Recovery, and Availability.

Technical Challenge Preparation:

Challenge Preparation 1: Familiarize yourself with MagicSchool's platform and user base, with a focus on the unique challenges and opportunities presented by the education technology industry.
Challenge Preparation 2: Brush up on your knowledge of observability tools, infrastructure-as-code workflows, and multi-cloud deployments.
Challenge Preparation 3: Prepare for scenario-based exercises that assess your ability to drive observability and resilience in a dynamic, fast-paced environment.

ATS Keywords: Site Reliability Engineering, Observability, Resilience, Infrastructure as Code, Terraform, AWS, Google Cloud, Incident Response, Alert Fatigue Reduction, Collaboration, Communication, Telemetry, Operational Excellence, High Availability, Disaster Recovery, Real-Time Insights, Training, Product Engineering, Agile Methodologies, Code Reviews, Testing, Quality Assurance, Deployment Strategies, CI/CD Pipelines, Server Management, Education Technology, Generative AI.

📝 Enhancement Note: MagicSchool AI's interview process focuses on assessing your technical expertise, cultural fit, and potential for growth and development within the organization. Prepare for a challenging, engaging, and insightful interview experience.

🛠 Technology Stack & Infrastructure

Observability Tools:

Grafana: For visualizing metrics, logs, and traces.
Prometheus: For monitoring and alerting based on custom metrics.
Loki: For aggregating and querying logs, both structured and unstructured.
Datadog: For cloud-based monitoring, alerting, and observability.
OpenTelemetry: For instrumenting, generating, collecting, and exporting telemetry data to help analyze software systems.

Infrastructure Tools:

Terraform: For infrastructure as code, enabling consistent, reliable deployments across AWS and Google Cloud.
AWS: For cloud-based infrastructure, including EC2, RDS, and S3 services.
Google Cloud: For cloud-based infrastructure, including Compute Engine, Cloud SQL, and Cloud Storage services.

Development & DevOps Tools:

Git: For version control and collaborative development.
GitHub: For remote repositories, code reviews, and project management.
Jenkins: For continuous integration and deployment pipelines.
Ansible: For configuration management and deployment automation.

📝 Enhancement Note: MagicSchool AI's technology stack is designed to support the platform's scalability, resilience, and observability, with a strong emphasis on enabling and empowering engineers to build and maintain high-quality, performant systems.

👥 Team Culture & Values

Company Values:

Value 1: Educators are the most important ingredient in the educational process - they are the magic, not the AI. Trust them, empower them, and put them at the center of leading change in service of students and families.
Value 2: Bring joy and magic into every learning experience - push the boundaries of what’s possible with AI.
Value 3: Foster community that supports one another during a time of rapid technological change. Listen to them and serve their needs.
Value 4: The education system is outdated and in need of innovation and change - AI is an opportunity to bring equity and access, and to serve the individual needs of students better than we ever have before.
Value 5: Put responsibility and safety at the forefront of the technological change that AI is bringing to education.
Value 6: Diversity of thought, perspectives, and backgrounds helps us serve the wide audience of educators and students around the world.
Value 7: Educators and students deserve the best - and we strive for the highest quality in everything we do.

Collaboration Style:

Collaboration Approach 1: Cross-functional integration between developers, designers, and stakeholders, with a strong emphasis on user experience and user impact measurement.
Collaboration Approach 2: Code review culture and peer programming practices, with a focus on knowledge sharing and continuous learning.
Collaboration Approach 3: Regular team meetings and one-on-one check-ins to facilitate communication, collaboration, and growth.

📝 Enhancement Note: MagicSchool AI's team culture is built on a strong foundation of trust, communication, and collaboration, with a shared commitment to driving innovation and change in the education technology industry.

⚡ Challenges & Growth Opportunities

Technical Challenges:

Challenge 1: Design and implement observability patterns that ensure clear, actionable visibility into platform behavior and performance, while minimizing alert fatigue and maximizing signal-to-noise ratio.
Challenge 2: Establish and maintain SLIs and SLOs that balance the needs of the platform, users, and educators, while minimizing downtime and maximizing system availability.
Challenge 3: Build and maintain internal tooling and dashboards that empower teams with real-time system insights, while minimizing manual effort and maximizing user impact.
Challenge 4: Collaborate with product and engineering teams to plan for Resilience, Recovery, and Availability, while minimizing technical debt and maximizing system performance.

Learning & Development Opportunities:

Learning Opportunity 1: Expand your expertise in observability and resilience, with a focus on driving best practices and standards across the organization.
Learning Opportunity 2: Develop your leadership skills by mentoring other engineers and contributing to the team's growth and development.
Learning Opportunity 3: Explore opportunities to specialize in specific areas of observability, resilience, or infrastructure, depending on your interests and the organization's needs.

💡 Interview Preparation

Technical Questions:

Technical Question 1: Describe your approach to designing and implementing observability patterns, with a focus on minimizing alert fatigue and maximizing signal-to-noise ratio.
Technical Question 2: How do you establish and maintain SLIs and SLOs that balance the needs of the platform, users, and educators, while minimizing downtime and maximizing system availability?
Technical Question 3: Walk us through your process for building and maintaining internal tooling and dashboards that empower teams with real-time system insights, while minimizing manual effort and maximizing user impact.
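
📝 Enhancement Note: When preparing for Technical Question 2, it can help to have the basic SLO arithmetic at your fingertips. The following is a minimal Python sketch, not MagicSchool's method: the `SloWindow` type, the function names, and the 99.9% target are all hypothetical and chosen only for illustration. It shows how an availability SLI, the remaining error budget, and a burn rate can be derived from request counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts observed over one SLO window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def _check_target(slo_target: float) -> None:
    # A target of exactly 0 or 1 leaves no meaningful error budget to reason about.
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")


def availability_sli(window: SloWindow) -> float:
    """Fraction of requests that succeeded; 1.0 when there was no traffic."""
    if window.total_requests == 0:
        return 1.0
    return 1.0 - window.failed_requests / window.total_requests


def error_budget_remaining(window: SloWindow, slo_target: float) -> float:
    """Share of the error budget still unspent; negative once the SLO is breached."""
    _check_target(slo_target)
    if window.total_requests == 0:
        return 1.0
    allowed_failures = (1.0 - slo_target) * window.total_requests
    return 1.0 - window.failed_requests / allowed_failures


def burn_rate(window: SloWindow, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value above 1.0 means the budget would run out before the window ends
    if the current error rate continued.
    """
    _check_target(slo_target)
    if window.total_requests == 0:
        return 0.0
    observed_error_rate = window.failed_requests / window.total_requests
    return observed_error_rate / (1.0 - slo_target)


if __name__ == "__main__":
    # Assumed numbers: 1,000,000 requests, 400 failures, 99.9% availability target.
    window = SloWindow(total_requests=1_000_000, failed_requests=400)
    print(f"SLI: {availability_sli(window):.4%}")
    print(f"Budget remaining: {error_budget_remaining(window, 0.999):.1%}")
    print(f"Burn rate: {burn_rate(window, 0.999):.2f}")
```

With the assumed numbers, roughly 60% of the error budget is still unspent and the burn rate is about 0.4. Alerting on burn rate rather than on raw error counts is one common way to approach the signal-to-noise concern raised in Technical Question 1, though the right thresholds depend on the service.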

Company & Culture Questions:

Company & Culture Question 1: How do you approach collaborating with product and engineering teams to plan for Resilience, Recovery, and Availability, while minimizing technical debt and maximizing system performance?
Company & Culture Question 2: Describe your experience with education technology and generative AI, and how you have leveraged these technologies to drive innovation and change in the education industry.
Company & Culture Question 3: How do you balance the needs of educators, students, and the platform when making technical decisions, and how do you ensure that your solutions are user-focused and impactful?

Portfolio Presentation Strategy:

Presentation Strategy 1: Highlight your track record of driving observability and resilience in large-scale, complex systems, with a focus on the unique challenges and opportunities presented by the education technology industry.
Presentation Strategy 2: Showcase your ability to build and maintain internal tooling and dashboards that empower teams with real-time system insights, with a focus on user experience and user impact measurement.
Presentation Strategy 3: Demonstrate your experience in defining and maintaining SLIs and SLOs, and your ability to establish best practices for alert tuning and signal-to-noise balancing, with a focus on driving innovation and change in the education technology industry.

📌 Application Steps

To apply for this Senior Site Reliability Engineer (Observability & Resilience) position at MagicSchool AI:

Concrete Preparation Step 1: Tailor your resume and portfolio to highlight your experience in driving observability and resilience in large-scale, complex systems, with a focus on the unique challenges and opportunities presented by the education technology industry.
Concrete Preparation Step 2: Research MagicSchool AI's platform, user base, and company culture.
Concrete Preparation Step 3: Prepare for technical interviews by brushing up on your knowledge of observability tools, infrastructure-as-code workflows, and multi-cloud deployments.
Concrete Preparation Step 4: Submit your application through the application link provided, and follow up with any additional information or clarification as needed.

📝 Enhancement Note: MagicSchool AI's application process is designed to be comprehensive, with a strong emphasis on assessing your technical expertise, cultural fit, and potential for growth and development within the organization.
