Cloud Observability and Performance Engineer

Halcyon
Full-time | $150k-185k/year (USD)

📍 Job Overview

  • Job Title: Cloud Observability and Performance Engineer
  • Company: Halcyon
  • Location: Remote
  • Job Type: Full-Time
  • Category: DevOps Engineer
  • Date Posted: 2025-07-08
  • Experience Level: 5-10 years
  • Remote Status: Fully Remote

🚀 Role Summary

  • Design and implement end-to-end observability for distributed cloud services, ensuring high performance, availability, and scalability of agent management systems in production.
  • Collaborate with development, SRE, and security teams to troubleshoot production issues using observability tooling.
  • Define and implement SLOs, SLIs, and performance benchmarks for cloud components and services.
  • Instrument code and services to expose business-relevant metrics and latency bottlenecks.
  • Automate performance regression testing and anomaly detection.

📝 Enhancement Note: This role requires a strong background in observability, site reliability, or cloud performance to ensure the reliability and performance of cloud-based security operations at scale.

💻 Primary Responsibilities

  • Observability & Monitoring: Design, build, and maintain end-to-end observability for distributed cloud services (telemetry, logging, tracing, alerting).
  • Metrics & Dashboards: Develop and optimize metrics pipelines and dashboards using tools like Prometheus, Grafana, OpenTelemetry, and Datadog.
  • Performance Optimization: Ensure high performance, availability, and scalability of agent management systems in production.
  • Collaboration: Collaborate with development, SRE, and security teams to troubleshoot production issues using observability tooling.
  • SLOs & Benchmarks: Define and implement SLOs, SLIs, and performance benchmarks for cloud components and services.
  • Instrumentation: Instrument code and services to expose business-relevant metrics and latency bottlenecks.
  • Automation: Automate performance regression testing and anomaly detection.
  • Incident Detection: Support proactive incident detection and real-time monitoring strategies across multi-cloud environments.
  • Root Cause Analysis: Provide root cause analysis and performance tuning recommendations.
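
📝 Enhancement Note: To make the SLO and SLI responsibility concrete, here is a minimal sketch of the underlying error-budget arithmetic. It is an illustration only, not Halcyon's method: the `SloWindow` class, its field names, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one SLO evaluation window (hypothetical model)."""

    target: float          # SLO target as a fraction, e.g. 0.999 for 99.9%
    total_requests: int    # all requests seen in the window
    failed_requests: int   # requests that violated the SLI definition

    @property
    def sli(self) -> float:
        """Measured success ratio; an empty window counts as fully healthy."""
        if self.total_requests == 0:
            return 1.0
        return 1 - self.failed_requests / self.total_requests

    @property
    def error_budget(self) -> float:
        """Number of failures the target tolerates in this window."""
        return (1 - self.target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        if self.error_budget == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1 - self.failed_requests / self.error_budget

    @property
    def is_met(self) -> bool:
        """True when the measured SLI is at or above the target."""
        return self.sli >= self.target


# Made-up numbers: a 99.9% target over one million requests with 250 failures.
window = SloWindow(target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"SLI={window.sli:.5f} budget_remaining={window.budget_remaining:.2%}")
```

In this sketch the SLI is the measured success ratio, the SLO is the target it is compared against, and the remaining budget is the kind of figure an alerting rule or dashboard panel might track.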

📝 Enhancement Note: This role involves a high level of technical responsibility, requiring strong problem-solving skills and a deep understanding of distributed systems, microservices, and performance debugging.

🎓 Skills & Qualifications

Education: Bachelor's degree in Computer Science, Engineering, or a related field. Relevant experience may be considered in lieu of a degree.

Experience: 5+ years of professional work experience in observability, site reliability, or cloud performance roles.

Required Skills:

  • Strong experience with monitoring and observability stacks (e.g., Prometheus, Grafana, ELK, OpenTelemetry, Datadog, AWS CloudWatch).
  • Proficiency in cloud platforms (e.g., AWS, GCP, Azure) and cloud-native services (e.g., ECS, EKS, Lambda).
  • Experience with containerization and orchestration tools (e.g., Docker, Kubernetes).
  • Solid knowledge of distributed systems, microservices, and performance debugging.
  • Proficiency in Python, Scala, or other language(s) for tooling and automation.
  • Familiarity with CI/CD pipelines, infrastructure as code (e.g., Terraform), and version control (Git).

Preferred Skills:

  • Experience with endpoint security platforms or agent-based systems.
  • Familiarity with SIEM, security analytics, or cloud threat detection pipelines.
  • Background in networking performance, TLS handshake optimization, or load balancing.
  • Experience with SLA/SLO-driven operational excellence in high-scale environments.
  • Knowledge of additional languages, such as Go.

📝 Enhancement Note: While not required, experience with endpoint security platforms or agent-based systems, as well as familiarity with SIEM and cloud threat detection pipelines, would be highly beneficial for this role.

📊 Web Portfolio & Project Requirements

Portfolio Essentials:

  • Demonstrate experience with monitoring and observability stacks, highlighting your ability to design, build, and maintain end-to-end observability for distributed cloud services.
  • Showcase your proficiency in cloud platforms and services with projects that demonstrate your ability to ensure high performance, availability, and scalability of agent management systems in production.
  • Highlight your collaboration skills by presenting projects where you worked with development, SRE, and security teams to troubleshoot production issues using observability tooling.
  • Display your ability to define and implement SLOs, SLIs, and performance benchmarks for cloud components and services by presenting relevant projects or case studies.

Technical Documentation:

  • Provide code samples or snippets that demonstrate your ability to instrument code and services to expose business-relevant metrics and latency bottlenecks.
  • Showcase your automation skills by presenting projects or case studies that demonstrate your ability to automate performance regression testing and anomaly detection.
  • Include any relevant documentation, such as architecture diagrams, system design documents, or performance testing results, that showcase your technical skills and problem-solving abilities.
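
📝 Enhancement Note: As one possible shape for an instrumentation code sample, the sketch below times a function and records its latency into histogram buckets using only the Python standard library. It deliberately avoids any vendor SDK, so it is not the Prometheus, OpenTelemetry, or Datadog API. The names `LatencyHistogram`, `timed`, and `handle_agent_checkin`, along with the bucket bounds, are hypothetical.

```python
import time
from bisect import bisect_left
from functools import wraps

# Upper bounds in seconds; the final infinite bucket catches everything slower.
DEFAULT_BOUNDS = (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, float("inf"))


class LatencyHistogram:
    """Counts observations per bucket (per-bucket counts, not cumulative)."""

    def __init__(self, bounds=DEFAULT_BOUNDS):
        # Assumption: bounds are sorted ascending and end with float("inf").
        self.bounds = tuple(bounds)
        self.counts = [0] * len(self.bounds)
        self.observations = 0
        self.total_seconds = 0.0

    def observe(self, seconds: float) -> None:
        # Index of the first bucket whose upper bound is >= the observation.
        index = bisect_left(self.bounds, seconds)
        self.counts[index] += 1
        self.observations += 1
        self.total_seconds += seconds


def timed(histogram: LatencyHistogram):
    """Decorator that records the wall-clock duration of each call."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record latency even when the wrapped call raises.
                histogram.observe(time.perf_counter() - start)

        return wrapper

    return decorator


checkin_latency = LatencyHistogram()


@timed(checkin_latency)
def handle_agent_checkin(agent_id: str) -> str:
    # Hypothetical request handler standing in for real service code.
    return f"ack:{agent_id}"
```

A portfolio sample along these lines is most convincing when it also shows where the recorded values go, for example how they are exported to the metrics backend you actually used.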

📝 Enhancement Note: While not required, including examples of how you've supported proactive incident detection and real-time monitoring strategies across multi-cloud environments, as well as providing root cause analysis and performance tuning recommendations, would strengthen your portfolio.

💵 Compensation & Benefits

Salary Range: $150,000 - $185,000 USD per year

  • Halcyon offers a competitive salary, with a base salary range of $150,000 - $185,000 USD per year, depending on experience and qualifications.

Bonus Target: 10% of base salary

  • Halcyon offers an annual bonus target of 10% of the base salary for eligible employees.

Benefits:

  • Comprehensive healthcare (medical, dental, and vision) with premiums paid in full for employees and dependents.
  • 401(k) plan with a generous employer contribution.
  • Short- and long-term disability coverage, basic life and AD&D insurance plans.
  • Medical and dependent care FSA options.
  • Flexible PTO policy.
  • Parental leave.
  • Generous equity offering.

📝 Enhancement Note: Halcyon's benefits package is designed to provide comprehensive support for employees and their families, with a focus on work-life balance and long-term financial security.

🎯 Team & Company Context

🏢 Company Culture

Industry: Cybersecurity

  • Halcyon operates in the cybersecurity industry, offering a dedicated, adaptive security platform focused specifically on stopping ransomware.

Company Size: Medium-sized company

  • Halcyon is a medium-sized company, providing a collaborative and dynamic work environment for its employees.

Founded: 2021

  • Halcyon was formed in 2021 by a team of cyber industry veterans after battling the scourge of ransomware and advanced threats for years at some of the largest global security vendors.

Team Structure:

  • Halcyon's Chaos Cloud Engineering team is responsible for designing and implementing observability, monitoring, and performance strategies for cloud-hosted microservices that manage and orchestrate endpoint security agents at scale.
  • The team works closely with development, SRE, and security teams to ensure the reliability, visibility, and performance optimization of backend systems that power cloud-based security operations for millions of endpoints worldwide.

Development Methodology:

  • Halcyon uses Agile methodologies, such as Scrum, to manage its development processes.
  • The company emphasizes code review, testing, and quality assurance practices to ensure the reliability and performance of its products.
  • Halcyon uses deployment strategies, CI/CD pipelines, and server management tools to automate and streamline its development processes.

Company Website: halcyon.ai

📝 Enhancement Note: Halcyon's company culture is characterized by its commitment to building products and solutions focused on stopping ransomware and advanced threats, with a strong emphasis on collaboration, innovation, and continuous learning.

📈 Career & Growth Analysis

Career Level: Senior DevOps Engineer

  • This role is at the senior level within the DevOps engineering career path, requiring a high level of technical expertise and experience in observability, site reliability, or cloud performance.

Reporting Structure:

  • The Cloud Observability and Performance Engineer reports directly to the Chaos Cloud Engineering Manager and works closely with development, SRE, and security teams.

Technical Impact:

  • This role has a significant impact on the reliability, visibility, and performance optimization of Halcyon's backend systems, which power cloud-based security operations for millions of endpoints worldwide.
  • The Cloud Observability and Performance Engineer's work directly influences the company's ability to detect and mitigate security threats, ensuring the safety and security of its customers' data.

Growth Opportunities:

  • Technical Growth: Halcyon offers opportunities for technical growth through its commitment to using cutting-edge cloud infrastructure and working on global security products.
  • Leadership Development: As a senior-level role, this position provides opportunities for technical leadership and mentoring within the Chaos Cloud Engineering team and across the broader organization.
  • Architecture Decisions: The Cloud Observability and Performance Engineer has the opportunity to make significant architecture decisions that impact the reliability, performance, and scalability of Halcyon's cloud-based security operations.

📝 Enhancement Note: Halcyon's commitment to using cutting-edge cloud infrastructure and working on global security products provides a unique opportunity for the Cloud Observability and Performance Engineer to grow both technically and professionally within the organization.

🌐 Work Environment

Office Type: Remote-first

  • Halcyon is a remote-native, completely distributed global team, recognizing that great talent can exist anywhere.

Office Location(s): Remote

  • As a remote-first company, Halcyon does not have physical office locations. Employees can work from anywhere in the world.

Workspace Context:

  • Collaboration: Halcyon's remote work environment emphasizes collaboration and communication, with regular team meetings and one-on-ones to ensure everyone is aligned and working towards the same goals.
  • Development Tools: Halcyon provides its employees with the necessary tools and resources to perform their jobs effectively, including multiple monitors, testing devices, and access to relevant software and platforms.
  • Cross-Functional Collaboration: Halcyon's remote work environment encourages cross-functional collaboration between teams, with regular check-ins and updates to ensure everyone is working together towards the company's goals.

Work Schedule:

  • Halcyon offers a flexible work schedule, with a focus on results and deliverables rather than strict hours.
  • The company understands that employees have different schedules and priorities, and it trusts its team members to manage their time effectively.

📝 Enhancement Note: Halcyon's remote-first work environment provides a high degree of flexibility and autonomy, allowing employees to balance their work and personal lives more effectively.

📄 Application & Technical Interview Process

Interview Process:

  1. Technical Phone Screen: A 30-minute phone screen to assess your technical skills and understanding of the role's requirements.
  2. Technical Deep Dive: A 90-minute deep dive into your technical skills, focusing on your experience with monitoring and observability stacks, cloud platforms, and containerization tools.
  3. Behavioral Interview: A 30-minute behavioral interview to assess your cultural fit, problem-solving skills, and ability to work effectively in a remote team environment.
  4. Final Review: A final review with the hiring manager to discuss your qualifications and fit for the role.

Portfolio Review Tips:

  • Portfolio Structure: Organize your portfolio by project, highlighting your experience with monitoring and observability stacks, cloud platforms, and containerization tools.
  • Case Studies: Include detailed case studies that demonstrate your ability to design, build, and maintain end-to-end observability for distributed cloud services, as well as your proficiency in cloud platforms and services.
  • Code Samples: Provide code samples or snippets that showcase your ability to instrument code and services to expose business-relevant metrics and latency bottlenecks.
  • Performance Testing: Include examples of performance testing results or case studies that demonstrate your ability to automate performance regression testing and anomaly detection.
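
📝 Enhancement Note: For the performance testing item, a regression check can be as simple as comparing a latency percentile between a baseline run and a candidate run. The sketch below is one hedged way to express that. The function names, the nearest-rank percentile method, and the 10% tolerance are illustrative assumptions, not a Halcyon requirement.

```python
def percentile(samples, pct: int) -> float:
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct/100 * n, which avoids float rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def is_regression(baseline, candidate, pct: int = 95, tolerance: float = 0.10) -> bool:
    """Flag the candidate if its percentile exceeds baseline by > tolerance."""
    base = percentile(baseline, pct)
    cand = percentile(candidate, pct)
    return cand > base * (1 + tolerance)


# Made-up latency samples in milliseconds.
baseline_ms = list(range(1, 101))
slower_ms = [value * 1.2 for value in baseline_ms]
print(is_regression(baseline_ms, slower_ms))
```

A check like this could run in a CI/CD pipeline after a load test. A real setup would likely also need to account for run-to-run noise, for example by repeating runs before failing a build.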

Technical Challenge Preparation:

  • Technical Phone Screen: Brush up on your knowledge of monitoring and observability stacks, cloud platforms, and containerization tools. Be prepared to discuss your experience and any relevant projects or case studies.
  • Technical Deep Dive: Review your portfolio and be prepared to discuss your technical skills and problem-solving abilities in detail. Familiarize yourself with Halcyon's products and services, and be prepared to discuss how your skills and experience align with the company's goals and objectives.
  • Behavioral Interview: Prepare for behavioral interview questions that focus on your problem-solving skills, ability to work effectively in a remote team environment, and cultural fit with Halcyon's values and mission.

ATS Keywords:

  • Programming Languages: Python, Scala, Go
  • Cloud Platforms: AWS, GCP, Azure
  • Cloud-Native Services: ECS, EKS, Lambda
  • Monitoring & Observability Stacks: Prometheus, Grafana, ELK, OpenTelemetry, Datadog, AWS CloudWatch
  • Containerization & Orchestration Tools: Docker, Kubernetes
  • Distributed Systems & Microservices: Distributed systems, microservices, performance debugging
  • CI/CD Pipelines & Infrastructure as Code: CI/CD pipelines, infrastructure as code (e.g., Terraform), version control (Git)
  • Endpoint Security: Endpoint security platforms, agent-based systems, SIEM, security analytics, cloud threat detection pipelines
  • Networking Performance: Networking performance, TLS handshake optimization, load balancing
  • SLA/SLO-Driven Operational Excellence: SLA/SLO-driven operational excellence, high-scale environments
  • Soft Skills: Problem-solving, collaboration, communication, adaptability, resilience

📝 Enhancement Note: Halcyon's technical interview process is designed to assess your technical skills, problem-solving abilities, and cultural fit with the company's values and mission. By preparing thoroughly and showcasing your relevant experience and skills, you can increase your chances of success in the interview process.

🛠 Technology Stack & Web Infrastructure

Frontend Technologies: N/A (This role focuses on backend and infrastructure technologies)

Backend & Server Technologies:

  • Monitoring & Observability Stacks: Prometheus, Grafana, ELK, OpenTelemetry, Datadog, AWS CloudWatch
  • Cloud Platforms: AWS, GCP, Azure
  • Cloud-Native Services: ECS, EKS, Lambda
  • Containerization & Orchestration Tools: Docker, Kubernetes
  • Infrastructure as Code: Terraform
  • Version Control: Git

Development & DevOps Tools:

  • CI/CD Pipelines: Jenkins, GitLab CI/CD
  • Automation Tools: Ansible, Puppet
  • Configuration Management: Chef, SaltStack
  • Infrastructure Provisioning: Terraform, CloudFormation
  • Server Management: Nagios, Zabbix
  • Log Aggregation & Analysis: ELK Stack (Elasticsearch, Logstash, Kibana), Datadog, AWS CloudWatch

📝 Enhancement Note: Halcyon's technology stack is designed to support the company's mission to build cutting-edge cloud infrastructure and provide reliable, high-performance security operations for its customers. The company uses a combination of open-source and proprietary tools to ensure the scalability, reliability, and performance of its products and services.

👥 Team Culture & Values

Engineering Values:

  • Innovation: Halcyon values innovation and encourages its team members to think creatively and push the boundaries of what's possible in cloud-based security operations.
  • Collaboration: Halcyon emphasizes collaboration and cross-functional teamwork, with a focus on working together to achieve the company's goals and objectives.
  • Continuous Learning: Halcyon fosters a culture of continuous learning and encourages its team members to stay up-to-date with the latest trends and best practices in cloud-based security operations.

Collaboration Style:

  • Cross-Functional Integration: Halcyon's teams work closely together, with regular check-ins and updates to ensure everyone is aligned and working towards the same goals.
  • Code Review Culture: Halcyon emphasizes code review and pair programming practices to ensure the quality and reliability of its products and services.
  • Knowledge Sharing: Halcyon encourages knowledge sharing and technical mentoring, with regular training and development opportunities for its team members.

📝 Enhancement Note: Halcyon's team culture is characterized by its commitment to innovation, collaboration, and continuous learning, with a strong emphasis on working together to achieve the company's goals and objectives.

⚡ Challenges & Growth Opportunities

Technical Challenges:

  • Observability & Monitoring: Designing, building, and maintaining end-to-end observability for distributed cloud services can be challenging, requiring a deep understanding of distributed systems, microservices, and performance debugging.
  • Performance Optimization: Ensuring high performance, availability, and scalability of agent management systems in production can be complex, requiring a strong understanding of cloud platforms, cloud-native services, and containerization tools.
  • Incident Detection & Real-Time Monitoring: Supporting proactive incident detection and real-time monitoring strategies across multi-cloud environments can be challenging, requiring a strong understanding of monitoring and observability stacks, as well as a deep understanding of distributed systems and microservices.
  • Root Cause Analysis: Providing root cause analysis and performance tuning recommendations can be challenging, requiring strong problem-solving skills and a deep understanding of distributed systems, microservices, and performance debugging.

Learning & Development Opportunities:

  • Technical Growth: Halcyon offers opportunities for technical growth through its commitment to using cutting-edge cloud infrastructure and working on global security products.
  • Leadership Development: As a senior-level role, this position provides opportunities for technical leadership and mentoring within the Chaos Cloud Engineering team and across the broader organization.
  • Architecture Decisions: The Cloud Observability and Performance Engineer has the opportunity to make significant architecture decisions that impact the reliability, performance, and scalability of Halcyon's cloud-based security operations.

💡 Interview Preparation

Technical Questions:

  • Monitoring & Observability Stacks: Questions related to your experience with monitoring and observability stacks, such as Prometheus, Grafana, ELK, OpenTelemetry, Datadog, and AWS CloudWatch.
  • Cloud Platforms: Questions related to your proficiency in cloud platforms, such as AWS, GCP, and Azure, as well as your experience with cloud-native services like ECS, EKS, and Lambda.
  • Containerization & Orchestration Tools: Questions related to your experience with containerization and orchestration tools, such as Docker and Kubernetes.
  • Distributed Systems & Microservices: Questions related to your understanding of distributed systems, microservices, and performance debugging.
  • SLA/SLO-Driven Operational Excellence: Questions related to your experience with SLA/SLO-driven operational excellence in high-scale environments.

Company & Culture Questions:

  • Company Values: Questions related to Halcyon's values and mission, as well as your understanding of the company's commitment to innovation, collaboration, and continuous learning.
  • Technical Challenges: Questions related to your approach to technical challenges and your ability to provide root cause analysis and performance tuning recommendations.
  • Team Dynamics: Questions related to your experience working in remote teams and your ability to collaborate effectively with development, SRE, and security teams.

📝 Enhancement Note: Halcyon's interview process is designed to assess your technical skills, problem-solving abilities, and cultural fit with the company's values and mission. By preparing thoroughly and showcasing your relevant experience and skills, you can increase your chances of success in the interview process.

📌 Application Steps

To apply for this Cloud Observability and Performance Engineer position at Halcyon:

  1. Update Your Portfolio: Tailor your portfolio to highlight your experience with monitoring and observability stacks, cloud platforms, and containerization tools. Include detailed case studies and code samples that demonstrate your ability to design, build, and maintain end-to-end observability for distributed cloud services, as well as your proficiency in cloud platforms and services.
  2. Optimize Your Resume: Highlight your relevant experience and skills in your resume, focusing on your experience with monitoring and observability stacks, cloud platforms, and containerization tools. Include any relevant projects or case studies that demonstrate your ability to ensure high performance, availability, and scalability of agent management systems in production.
  3. Prepare for Technical Interviews: Brush up on your knowledge of monitoring and observability stacks, cloud platforms, and containerization tools. Review your portfolio and be prepared to discuss your technical skills and problem-solving abilities in detail. Familiarize yourself with Halcyon's products and services, and be prepared to discuss how your skills and experience align with the company's goals and objectives.
  4. Research Halcyon: Learn about Halcyon's company culture, values, and mission. Understand the company's commitment to innovation, collaboration, and continuous learning, and be prepared to discuss how your skills and experience align with the company's goals and objectives.

⚠️ Important Notice: This enhanced job description includes AI-generated insights and web development/server administration industry-standard assumptions. All details should be verified directly with the hiring organization before making application decisions.

Application Requirements

5+ years of professional experience in observability, site reliability, or cloud performance roles is required. Strong experience with monitoring stacks and proficiency in cloud platforms and services is essential.