Reliability Engineering Architect - Private Cloud
Job Overview
- Job Title: Reliability Engineering Architect - Private Cloud
- Company: Carbon60
- Location: Remote (Eastern Standard Time)
- Job Type: Contract (3 months)
- Category: DevOps / Site Reliability Engineering
- Date Posted: June 11, 2025
Role Summary
- Architect and guide fault-tolerant, scalable systems that meet defined reliability targets, ensuring high availability and minimal downtime.
- Implement reliability engineering principles organization-wide, promoting a culture of resilience and stability.
- Define observability standards and partner with development teams to implement monitoring, alerting, and incident response protocols.
- Conduct system-wide failure mode and effects analysis (FMEA) to identify and mitigate reliability risks proactively.
- Promote and architect chaos testing frameworks to uncover weaknesses in systems and improve overall reliability.
- Mentor engineering teams on reliability engineering practices and act as a subject-matter expert, driving continuous improvement.
- Develop technical standards, architectural diagrams, and design documentation related to reliability, ensuring knowledge sharing and consistency across the organization.
Enhancement Note: This role requires a strong background in system design, distributed systems, and automation to drive reliability improvements across the entire system landscape.
Primary Responsibilities
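For readers unfamiliar with FMEA, the ranking step mentioned in the summary above can be sketched as follows. This is a minimal illustration, assuming the common risk priority number convention (severity x occurrence x detection, each scored 1 to 10). The failure modes and scores are hypothetical and do not come from Carbon60.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (always detected) .. 10 (almost never detected)

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def rank_failure_modes(modes):
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical failure modes with made-up scores, for illustration only.
modes = [
    FailureMode("primary database node loss", severity=9, occurrence=3, detection=2),
    FailureMode("expired TLS certificate", severity=7, occurrence=4, detection=6),
    FailureMode("log disk fills up", severity=5, occurrence=6, detection=3),
]

for mode in rank_failure_modes(modes):
    print(f"{mode.rpn:4d}  {mode.name}")
```

The highest-scoring entries are the ones a team would typically mitigate first; the scoring scales themselves are a convention that each organization tunes.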
- Design and guide the architecture of fault-tolerant, scalable systems that meet defined reliability targets (SLAs, SLOs, SLIs).
- Develop and implement reliability engineering principles and best practices organization-wide, promoting a culture of resilience and stability.
- Define observability standards and partner with development teams to implement monitoring, alerting, and incident response protocols, ensuring quick detection and resolution of issues.
- Conduct system-wide failure mode and effects analysis (FMEA) to identify and mitigate reliability risks proactively, minimizing the impact of failures on the business.
- Promote and architect chaos testing frameworks to proactively uncover weaknesses in systems, simulating real-world failures and improving overall reliability.
- Design and build tools to automate resilience testing, incident response, and postmortem analysis, streamlining processes and reducing manual effort.
- Mentor engineering teams on reliability engineering practices and act as a subject-matter expert, driving continuous improvement and knowledge sharing across the organization.
- Develop technical standards, architectural diagrams, and design documentation related to reliability, ensuring knowledge sharing, consistency, and easy onboarding of new team members.
Enhancement Note: This role requires a deep understanding of distributed systems, microservices architecture, and automation to effectively design and implement reliable systems at scale.
Skills & Qualifications
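The reliability targets named above (SLAs, SLOs, SLIs) are usually turned into an error budget. The sketch below shows the arithmetic under stated assumptions: an availability-style SLO, a 30-day window, and request counts as the SLI. The function names and the sample numbers are hypothetical, not taken from the posting.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0 or not 0 <= good_events <= total_events:
        raise ValueError("event counts are inconsistent")
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


# Assumed example: a 99.9% SLO over 30 days, with 500 failed requests
# out of one million observed so far.
print(round(error_budget_minutes(0.999), 1), "minutes of allowed downtime")
print(round(budget_remaining(0.999, 999_500, 1_000_000), 2), "of budget left")
```

A negative result from `budget_remaining` means the budget is overspent, which is the usual trigger for slowing releases in favor of reliability work.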
Education: A Bachelor's or Master's degree in Computer Science, Engineering, or a related field is preferred.
Experience: 5+ years of experience in site reliability engineering, systems architecture, or related roles.
Required Skills:
- Proven experience in site reliability engineering, systems architecture, or related roles.
- Deep knowledge of private cloud platforms, containerization, and container orchestration (e.g., Docker, Kubernetes).
- Strong background in system design, distributed systems, microservices architecture, and automation and orchestration.
- Experience with observability tools (e.g., Prometheus, Grafana, ELK Stack) and incident management platforms (e.g., PagerDuty, OpsGenie).
- Familiarity with compliance and security frameworks (e.g., SOC 2, ISO 27001, HIPAA) is a plus.
Preferred Skills:
- Experience with public cloud platforms (AWS, Azure) and multi-cloud environments.
- Knowledge of infrastructure as code (IaC) tools (e.g., Terraform, CloudFormation).
- Familiarity with chaos engineering principles and tools (e.g., Chaos Monkey, ChaosKube).
- Experience with chaos testing frameworks and tools (e.g., Litmus, Chaos Toolkit).
Enhancement Note: This role requires a strong technical background in reliability engineering, distributed systems, and automation to effectively design and implement reliable systems at scale.
Web Portfolio & Project Requirements
Portfolio Essentials:
- System Design Case Studies: Documented examples of designing and implementing fault-tolerant, scalable systems, highlighting your approach to reliability engineering, system design, and architecture decisions.
- Chaos Testing & Resilience Projects: Demonstrate your experience with chaos testing frameworks and tools, showcasing how you've proactively uncovered weaknesses in systems and improved overall reliability.
- Incident Response & Postmortem Analysis: Provide examples of how you've handled incidents, conducted postmortem analysis, and implemented improvements to prevent similar issues in the future.
Technical Documentation:
- Architectural Diagrams & Design Documentation: Showcase your ability to create clear, concise, and well-structured technical documentation, highlighting your understanding of system design, architecture, and reliability principles.
- Reliability Engineering Standards & Best Practices: Demonstrate your understanding of reliability engineering principles and best practices, highlighting your ability to implement and promote these standards within an organization.
Enhancement Note: This role requires a strong portfolio demonstrating your experience in reliability engineering, system design, and architecture, with a focus on chaos testing, incident response, and postmortem analysis.
Compensation & Benefits
Salary Range: $75/hr (This is an hourly contract role with a duration of 3 months)
Benefits:
- Flexible Work Hours: Carbon60 offers flexible work hours, allowing you to balance your career with other aspects of your life.
- Health and Wellness Incentives: The company provides health and wellness incentives to support the overall well-being of its employees.
- Fun Work Environment: Carbon60 fosters a fun and engaging work environment, encouraging team-building and camaraderie.
Working Hours: 8 AM - 5 PM EST (40 hours per week)
Enhancement Note: Although the salary range is provided, it is essential to research regional salary standards and cost of living for a more accurate representation of compensation for this role.
Team & Company Context
Company Culture
Industry: Carbon60 is a cloud solutions provider focused on helping companies securely manage their IT infrastructure in private and public cloud environments. They specialize in AWS and Azure solutions and have been consistently recognized by the industry and analysts for their growth, leadership, and exceptional service.
Company Size: Carbon60 is a mid-sized company with a focus on agility and innovation, thriving in a fast-paced environment. This size allows for a dynamic work environment where employees can make a significant impact on the company's success.
Founded: Carbon60 was founded in 2008 and has since grown to become a leader in cloud solutions, consistently recognized for its growth and industry accolades.
Team Structure:
- Reliability Engineering Team: This team is responsible for designing and implementing reliable systems, promoting a culture of resilience and stability across the organization. The team works closely with development teams to ensure quick detection and resolution of issues.
- Cloud Solutions Architects: These architects work with clients to design and implement secure, scalable, and reliable cloud solutions tailored to their specific needs.
- DevOps Engineers: The DevOps team focuses on automating and optimizing infrastructure, ensuring efficient and reliable deployment processes.
Development Methodology:
- Agile/Scrum Methodologies: Carbon60 employs Agile/Scrum methodologies for project management, ensuring quick iteration and continuous improvement.
- Code Review & Quality Assurance: The company emphasizes code review and quality assurance practices to maintain high coding standards and minimize technical debt.
- Deployment Strategies & CI/CD Pipelines: Carbon60 utilizes deployment strategies and CI/CD pipelines to automate and streamline the deployment process, ensuring quick and reliable releases.
Company Website: Carbon60 Website
Enhancement Note: Carbon60's company culture emphasizes agility, innovation, and a focus on customer success. The company's size and industry position allow for a dynamic work environment where employees can make a significant impact on the company's growth and success.
Career & Growth Analysis
Reliability Engineering Career Level: This role is a senior-level position, focusing on designing and implementing reliable systems at an organizational level. It requires a deep understanding of reliability engineering principles, distributed systems, and automation.
Reporting Structure: This role reports directly to the Director of Engineering, working closely with the Reliability Engineering team, Cloud Solutions Architects, and DevOps Engineers to ensure the reliability and stability of Carbon60's cloud solutions.
Technical Impact: As a Reliability Engineering Architect, you will have a significant impact on Carbon60's cloud solutions, ensuring high availability, minimal downtime, and quick detection and resolution of issues. Your work will directly contribute to the company's success and customer satisfaction.
Growth Opportunities:
- Technical Leadership: This role offers the opportunity to mentor engineering teams and act as a subject-matter expert, driving continuous improvement and knowledge sharing across the organization. As the company grows, there may be opportunities to take on more significant technical leadership roles.
- Architecture & Design: This role allows you to develop and refine your architecture and design skills, working on complex and challenging projects that push the boundaries of reliability engineering.
- Emerging Technologies: As Carbon60 continues to grow and expand its services, there will be opportunities to work with emerging technologies and stay at the forefront of the cloud solutions industry.
Enhancement Note: This role offers significant growth opportunities for technical leadership, architecture and design, and working with emerging technologies. The company's focus on agility and innovation creates a dynamic environment where employees can continuously learn and develop their skills.
Work Environment
Office Type: Carbon60 offers a remote work environment, allowing employees to work from the comfort of their own homes or a co-working space of their choice.
Office Location(s): As a remote company, Carbon60 does not have a physical office location. However, the company is based in Canada and primarily serves clients in North America.
Workspace Context:
- Remote Work: Carbon60's remote work environment allows for a flexible and balanced work-life schedule, with the ability to work from anywhere with an internet connection.
- Collaboration Tools: The company utilizes collaboration tools such as Slack, Microsoft Teams, and Google Workspace to facilitate communication and teamwork among remote employees.
- Cross-Functional Collaboration: Carbon60 encourages cross-functional collaboration between teams, ensuring that everyone's input is valued and considered in the decision-making process.
Work Schedule: The standard work schedule is 8 AM - 5 PM EST, with a 1-hour lunch break. However, Carbon60 offers flexible work hours to accommodate employees' personal schedules and needs.
Enhancement Note: Carbon60's remote work environment offers a high degree of flexibility and work-life balance, allowing employees to work from anywhere with an internet connection. The company's focus on collaboration and cross-functional teamwork ensures that everyone's input is valued and considered in the decision-making process.
Application & Technical Interview Process
Interview Process:
- Phone/Video Screen: A brief conversation to discuss your background, experience, and fit for the role. Be prepared to discuss your experience with reliability engineering, system design, and automation.
- Technical Deep Dive: A more in-depth discussion focused on your technical skills and experience. Be prepared to discuss your approach to reliability engineering, system design, and architecture decisions, as well as your experience with chaos testing, incident response, and postmortem analysis.
- Cultural Fit Interview: A conversation with a team member or hiring manager to assess your cultural fit with the company and team. Be prepared to discuss your work style, communication skills, and how you handle challenges and setbacks.
- Final Decision: The hiring team will review all candidates and make a final decision based on the interviews and your application materials.
Portfolio Review Tips:
- System Design Case Studies: Highlight your approach to designing and implementing fault-tolerant, scalable systems, emphasizing your understanding of reliability engineering principles, system design, and architecture decisions.
- Chaos Testing & Resilience Projects: Showcase your experience with chaos testing frameworks and tools, demonstrating how you've proactively uncovered weaknesses in systems and improved overall reliability.
- Incident Response & Postmortem Analysis: Provide examples of how you've handled incidents, conducted postmortem analysis, and implemented improvements to prevent similar issues in the future.
Technical Challenge Preparation:
- System Design & Architecture: Brush up on your system design and architecture skills, focusing on fault-tolerant, scalable systems and reliability engineering principles.
- Chaos Testing & Incident Response: Familiarize yourself with chaos testing frameworks and tools, as well as incident response and postmortem analysis best practices.
- Communication & Technical Explanation: Practice explaining complex technical concepts clearly and concisely, ensuring that you can articulate your ideas effectively during the interview process.
ATS Keywords: system design, reliability engineering, chaos testing, incident response, postmortem analysis, distributed systems, microservices architecture, automation, orchestration, observability tools, incident management platforms, compliance frameworks, security frameworks, technical standards, mentoring, documentation
Enhancement Note: Carbon60's interview process focuses on assessing your technical skills, cultural fit, and ability to work effectively in a remote environment. By preparing for the technical deep dive and cultural fit interview, you can demonstrate your qualifications and increase your chances of success in the interview process.
Technology Stack & Web Infrastructure
Reliability Engineering Tools:
- Chaos Testing Frameworks & Tools: Familiarity with chaos testing frameworks and tools such as Chaos Monkey, ChaosKube, Litmus, or Chaos Toolkit is essential for this role.
- Observability Tools: Experience with observability tools such as Prometheus, Grafana, ELK Stack, Datadog, or New Relic is required for monitoring, alerting, and incident response.
- Incident Management Platforms: Familiarity with incident management platforms such as PagerDuty, OpsGenie, or VictorOps is essential for quick detection and resolution of issues.
Private Cloud Platforms & Container Orchestration:
- Private Cloud Platforms: Experience with private cloud platforms such as VMware vSphere, Microsoft Hyper-V, or OpenStack is required for this role.
- Container Orchestration: Familiarity with container orchestration tools such as Kubernetes, Docker Swarm, or Amazon ECS is essential for managing and scaling containerized applications.
Enhancement Note: Carbon60's technology stack focuses on reliability engineering, chaos testing, incident response, and observability tools. Familiarity with these tools and private cloud platforms is essential for success in this role.
Team Culture & Values
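As a rough illustration of what the chaos testing tools listed above do, the sketch below injects faults into a function call and checks whether a simple retry policy absorbs them. It is plain Python and does not use the API of Chaos Monkey, ChaosKube, Litmus, or Chaos Toolkit. Every name in it (`with_fault_injection`, `call_with_retries`, `fetch_status`) is hypothetical, as are the 30% failure rate and the three-attempt policy.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real call when a fault is injected."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls raise InjectedFault."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected fault in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def call_with_retries(func, attempts):
    """Call func up to `attempts` times; re-raise the last injected fault."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except InjectedFault:
            if attempt == attempts:
                raise


def fetch_status():
    """Stand-in for a real dependency call (hypothetical)."""
    return "ok"


# Seeded generator so the experiment is repeatable.
rng = random.Random(42)
flaky = with_fault_injection(fetch_status, failure_rate=0.3, rng=rng)

successes = 0
for _ in range(1000):
    try:
        call_with_retries(flaky, attempts=3)
        successes += 1
    except InjectedFault:
        pass

print(f"{successes}/1000 calls succeeded with up to 3 attempts")
```

Real chaos experiments target infrastructure (pods, nodes, network links) rather than in-process calls, but the shape is the same: inject a controlled failure, then measure whether the system's safeguards hold.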
Reliability Engineering Values:
- Resilience & Stability: Carbon60 values resilience and stability, emphasizing the importance of designing and implementing reliable systems that can withstand failures and maintain high availability.
- Proactive Problem-Solving: The company encourages a proactive approach to problem-solving, focusing on identifying and mitigating reliability risks before they cause significant issues.
- Continuous Improvement: Carbon60 emphasizes continuous improvement, encouraging employees to learn from failures and implement improvements to prevent similar issues in the future.
Collaboration Style:
- Cross-Functional Collaboration: Carbon60 encourages cross-functional collaboration between teams, ensuring that everyone's input is valued and considered in the decision-making process.
- Mentoring & Knowledge Sharing: The company fosters a culture of mentoring and knowledge sharing, allowing employees to learn from one another and continuously develop their skills.
- Fun & Engaging Work Environment: Carbon60 fosters a fun and engaging work environment, encouraging team-building and camaraderie among employees.
Enhancement Note: Carbon60's team culture emphasizes resilience, stability, proactive problem-solving, and continuous improvement. The company's focus on cross-functional collaboration, mentoring, and knowledge sharing ensures that everyone's input is valued and considered in the decision-making process.
Challenges & Growth Opportunities
Technical Challenges:
- Chaos Testing & Incident Response: Designing and implementing chaos testing frameworks and incident response protocols can be complex and challenging, requiring a deep understanding of system design, architecture, and reliability engineering principles.
- Reliability Engineering at Scale: Implementing reliability engineering principles and best practices at an organizational level can be challenging, requiring a strong understanding of distributed systems, microservices architecture, and automation.
- Multi-Cloud Environments: Working with multi-cloud environments can present unique challenges, requiring a deep understanding of public cloud platforms (AWS, Azure) and their integration with private cloud environments.
Learning & Development Opportunities:
- Chaos Engineering Principles & Tools: Familiarize yourself with chaos engineering principles and tools to proactively uncover weaknesses in systems and improve overall reliability.
- Incident Response & Postmortem Analysis: Develop your incident response and postmortem analysis skills to ensure quick detection and resolution of issues, minimizing their impact on the business.
- Emerging Technologies: Stay up-to-date with emerging technologies and trends in the cloud solutions industry, ensuring that you remain at the forefront of the field.
Enhancement Note: Carbon60's technical challenges and growth opportunities focus on chaos testing, incident response, and reliability engineering at scale. By embracing these challenges and pursuing continuous learning and development, you can grow your skills and make a significant impact on the company's success.
Interview Preparation
Technical Questions:
- System Design & Architecture: Be prepared to discuss your approach to designing and implementing fault-tolerant, scalable systems, emphasizing your understanding of reliability engineering principles, system design, and architecture decisions.
- Chaos Testing & Incident Response: Demonstrate your experience with chaos testing frameworks and tools, as well as incident response and postmortem analysis best practices.
- Communication & Technical Explanation: Practice explaining complex technical concepts clearly and concisely, ensuring that you can articulate your ideas effectively during the interview process.
Company & Culture Questions:
- Carbon60's Company Culture: Research Carbon60's company culture, values, and mission, and be prepared to discuss how your work style and communication skills align with the company's goals and objectives.
- Remote Work Environment: Familiarize yourself with the challenges and benefits of working in a remote environment, and be prepared to discuss how you maintain productivity and focus outside of a traditional office setting.
- Work-Life Balance: Carbon60 emphasizes work-life balance, allowing employees to work from anywhere with an internet connection. Be prepared to discuss how you maintain a healthy work-life balance in a remote work environment.
Portfolio Presentation Strategy:
- System Design Case Studies: Highlight your approach to designing and implementing fault-tolerant, scalable systems, emphasizing your understanding of reliability engineering principles, system design, and architecture decisions.
- Chaos Testing & Resilience Projects: Showcase your experience with chaos testing frameworks and tools, demonstrating how you've proactively uncovered weaknesses in systems and improved overall reliability.
- Incident Response & Postmortem Analysis: Provide examples of how you've handled incidents, conducted postmortem analysis, and implemented improvements to prevent similar issues in the future.
Application Steps
To apply for this Reliability Engineering Architect - Private Cloud position:
- Submit your application through the application link provided in the job listing.
- Customize your resume to highlight your experience with reliability engineering, system design, automation, and orchestration, as well as your familiarity with chaos testing, incident response, and postmortem analysis tools.
- Prepare a portfolio showcasing your experience with system design case studies, chaos testing and resilience projects, and incident response and postmortem analysis examples.
- Research Carbon60's company culture, values, and mission to ensure that your work style and communication skills align with the company's goals and objectives.
- Practice explaining complex technical concepts clearly and concisely, ensuring that you can articulate your ideas effectively during the interview process.
Important Notice: This enhanced job description includes AI-generated insights and reliability engineering industry-standard assumptions. All details should be verified directly with Carbon60 before making application decisions.
Application Requirements
Candidates should have over 5 years of experience in site reliability engineering or related roles, with a strong background in private cloud platforms and system design. A Bachelor's or Master's degree in Computer Science or a related field is preferred.