Application Site Reliability Engineer
📍 Job Overview
- Job Title: Application Site Reliability Engineer
- Company: Capital on Tap
- Location: London, United Kingdom
- Job Type: Hybrid (2 days per week in the office)
- Category: DevOps, Site Reliability Engineering
- Date Posted: June 17, 2025
🚀 Role Summary
Key Responsibilities:
- Design, build, and monitor systems to maximize uptime and efficiency
- Collaborate with platform teams to build reliable, scalable applications
- Proactively address potential outages and performance issues
- Implement structured monitoring and alerting to prevent incidents
- Define service-level agreements (SLAs) and service-level indicators (SLIs) to ensure reliability
- Work closely with the product team to launch new features
💻 Primary Responsibilities
- Design and Implement Highly Available and Scalable Systems: Ensure the reliability and performance of the company's website or application by designing and implementing highly available and scalable systems.
- Collaborate with Cross-Functional Teams: Define and establish service-level objectives (SLOs) and service-level agreements (SLAs) for critical systems by collaborating with cross-functional teams.
- Monitor Systems and Applications: Proactively identify and resolve any performance bottlenecks or availability issues by monitoring systems and applications.
- Develop and Maintain Monitoring Tools: Provide visibility into system health and performance by developing and maintaining monitoring tools, alerts, and dashboards.
- Conduct Post-Incident Analyses: Identify root causes and implement preventive measures to avoid future incidents by conducting post-incident analyses.
- Automate Repetitive Tasks: Improve efficiency and reduce manual intervention by automating repetitive tasks and processes.
- Create and Maintain Documentation: Support day-to-day operations and troubleshooting by creating and maintaining documentation for system architecture, configuration, and troubleshooting procedures.
- Perform Capacity Planning: Ensure optimal system performance and scalability by performing capacity planning and resource allocation.
- Collaborate with Development Teams: Implement and deploy new features and enhancements while ensuring they meet reliability and performance standards by collaborating with development teams.
- Stay Up to Date with Industry Best Practices: Follow new technologies and emerging trends in site reliability engineering.
🎓 Skills & Qualifications
Education: A relevant degree or equivalent experience in a related field.
Experience: Proven experience in managing a public cloud, preferably Azure.
Required Skills:
- Experience in managing a public cloud (Azure advantageous)
- Experience in Azure DevOps, Octopus, Flux, GitHub, or other CI/CD tools
- Experience in Python, PowerShell, C#, or other programming and scripting languages
- Experience with Linux and Microsoft Systems
- Excellent communication skills and ability to collaborate with multiple teams in an agile environment
- Strong problem-solving and troubleshooting skills
- Expertise in monitoring and logging tools (Datadog advantageous)
- Experience with Kubernetes and containerization
- Experience with setting and adjusting SLOs working with product teams
Preferred Skills:
- Experience with IaC tools such as Terraform
- Knowledge of service mesh technologies such as Istio
- Experience with SQL databases
📊 Portfolio & Project Requirements
Portfolio Essentials:
- Demonstrate experience in managing a public cloud and implementing CI/CD pipelines
- Showcase problem-solving skills and incident management processes
- Highlight experience with monitoring tools and alerting systems
- Display expertise in Kubernetes and containerization
Technical Documentation:
- Provide code quality, commenting, and documentation standards
- Include version control, deployment processes, and server configuration
- Demonstrate testing methodologies, performance metrics, and optimization techniques
💵 Compensation & Benefits
Salary Range: Competitive salary based on experience and industry standards for site reliability engineers in London.
Benefits:
- Private healthcare, including dental and optician services, through Vitality
- Worldwide travel insurance through Vitality
- Anniversary rewards (£250, £500, £750, 4-week fully paid sabbatical)
- Salary sacrifice pension scheme with up to a 7% match
- 28 days holiday (plus bank holidays)
- Annual learning and wellbeing budget
- Enhanced parental leave
- Cycle to work scheme
- Season ticket loan
- 6 free therapy sessions per year
- Dog-friendly offices
- Free drinks and snacks in offices
🎯 Team & Company Context
🏢 Company Culture
Industry: Fintech, focusing on small business credit card and spend management.
Company Size: Medium-sized, with over 200,000 businesses served worldwide and a goal to help 1 million small businesses by 2030.
Founded: 2012, in London, United Kingdom.
Team Structure:
- Embedded SRE model working closely with platform teams
- Hybrid work environment with 1-2 days per week in the office
Development Methodology:
- Agile and scaling environment, empowering innovation and problem-solving
- Collaborative culture with cross-functional teams and continuous learning
Company Website: Capital on Tap
📝 Enhancement Note: Capital on Tap's mission and culture emphasize empowering small business owners and fostering innovation, collaboration, and continuous learning.
📈 Career & Growth Analysis
Career Level: Mid-level site reliability engineer role with a focus on system design, monitoring, and incident management.
Reporting Structure: Embedded within platform teams, working closely with team leads and other SREs.
Technical Impact: Significant influence on system reliability, performance, and availability, ensuring optimal user experience and business continuity.
Growth Opportunities:
- Technical leadership and architecture decision-making opportunities
- Specialization in emerging technologies and trends in site reliability engineering
- Career progression paths within the growing fintech company
📝 Enhancement Note: Capital on Tap's fast-growing and profitable nature presents numerous career growth opportunities for site reliability engineers.
🌐 Work Environment
Office Type: Hybrid, with 1-2 days per week in the office located in Shoreditch, London.
Office Location(s): London, United Kingdom.
Workspace Context:
- Collaborative workspace with a focus on team interaction and knowledge sharing
- Access to development tools, multiple monitors, and testing devices
- Dog-friendly offices with a relaxed and casual work environment
Work Schedule: Flexible working hours, subject to project deadlines and maintenance windows.
📝 Enhancement Note: Capital on Tap's hybrid work arrangement and flexible working hours emphasize work-life balance and employee well-being.
📄 Application & Technical Interview Process
Interview Process:
- First stage: 30-minute intro and values call with a talent partner (video call)
- Second stage: 45-minute CV overview with the head of the department, engineering team leads, and/or product managers (video call)
- Final stage: 60-minute questions and scenario-based interview with the SRE team lead (video call)
Portfolio Review Tips:
- Tailor the portfolio to showcase experience in managing public clouds, CI/CD pipelines, and incident management
- Highlight problem-solving skills and technical expertise in monitoring tools and alerting systems
- Include examples of Kubernetes and containerization experience
Technical Challenge Preparation:
- Familiarize yourself with Azure DevOps, Octopus, Flux, GitHub, or other CI/CD tools
- Brush up on Python, PowerShell, C#, or other programming and scripting languages
- Prepare for problem-solving and incident management scenarios
ATS Keywords:
- Programming Languages: Python, PowerShell, C#
- CI/CD Tools: Azure DevOps, Octopus, Flux, GitHub
- Server Technologies: Linux, Microsoft Systems, Kubernetes, Containerization
- Databases: SQL Databases
- Tools: Monitoring Tools, Datadog, IaC Tools, Terraform, Service Mesh Technologies, Istio
- Methodologies: Agile, Scrum, CI/CD, Site Reliability Engineering
- Soft Skills: Problem-solving, Troubleshooting, Communication, Collaboration, Incident Management
- Industry Terms: Public Cloud Management, Azure, SLOs, SLAs, IaC
📌 Application Steps
To apply for this site reliability engineer position at Capital on Tap:
- Submit your application through the application link provided
- Tailor your resume to highlight relevant skills and experience in managing public clouds, CI/CD pipelines, and incident management
- Prepare a portfolio showcasing your experience in monitoring tools, alerting systems, and Kubernetes/containerization
- Familiarize yourself with Capital on Tap's company culture, mission, and values
- Research the company's fintech industry context and small business focus
⚠️ Important Notice: This enhanced job description includes AI-generated insights and web development/server administration industry-standard assumptions. All details should be verified directly with the hiring organization before making application decisions.
Application Requirements
Required skills include experience in managing a public cloud, CI/CD tools, and scripting languages. Strong problem-solving skills and expertise in monitoring tools are also essential.