Lead Site Reliability Engineer, London

London

TN United Kingdom

Site reliability engineer

Posted: 17 February

Offer description


We are Vitesse – the treasury and payment partner of choice for insurance.

Formed in 2014 by a team of proven FinTech entrepreneurs, we are an FCA-regulated business providing global claim funds management and payment solutions. Operating one of the largest banking and payment settlement networks in the world, we give our customers direct access to 200 countries and currencies. Through a single integration, insurers can use this network to pay claims in as little as 45 seconds and deliver a superior claimant experience. Our market-leading treasury proposition provides insurers with transparency and control over their claim funds, even when delegated to third parties, allowing them to have their money in the right place, at the right time, to make that all-important payment when customers need it most.

With over 175 employees across our London headquarters, Europe, and the US, $93m Series C funding secured, and exceeding £10bn in processed transactions, we are only just getting started.

We are collaborative and customer-centric, and we work with integrity, partnering with some of the biggest insurance leaders, including Lloyd’s of London and Many Pets. We take huge pride in our company culture, ensuring that everyone has a part to play, an opportunity to be heard and involved, and the ability to make a real difference. As we continue to scale up, we want like-minded humans to join us on this exciting journey. Are you ready?

The Role:

The Lead SRE is responsible for ensuring the reliability, scalability, and operational excellence of our infrastructure and services. This is a hands-on engineering role, requiring deep technical expertise, leadership, and a commitment to continuous improvement. The Lead SRE must balance technical delivery with a strong focus on team productivity, performance measurement, and collaboration across squads and stakeholders.

The role requires close collaboration with engineering leads, business stakeholders, and the Head of Platform Operations to define and uphold SLAs, SLOs, and error budgets, ensuring alignment with business priorities. Communication is central, both in technical leadership within the team and in ensuring clear, proactive dialogue across teams and stakeholders.

A strong emphasis is placed on observability, ensuring systems are well-instrumented, reliable, and continuously improving in line with agile principles of Transparency, Inspection, and Adaptation (TIA).

Core responsibilities:

Hands-On Engineering & Technical Leadership:

* Design, develop, and maintain cloud infrastructure (Azure/AWS) using Terraform and automation.
* Lead troubleshooting, performance optimisation, and incident resolution to enhance reliability.
* Ensure best practices in CI/CD pipelines, observability, and infrastructure deployment.
* Set high engineering standards and provide mentorship to team members.
* Drive observability across all critical systems, ensuring real-time visibility into operations.
* Promote Transparency, Inspection, and Adaptation by making both system and team health data accessible and actionable.
* Continuously improve monitoring, logging, and tracing strategies to support data-driven decisions.
* Think strategically and see the bigger picture, ensuring solutions align with both immediate technical needs and long-term business objectives.
* Work with engineering leads, business stakeholders, and the Head of Platform Operations to define and enforce SLAs, SLOs, and engineering standards that support scalability, reliability, and operational efficiency.
* Design solutions with a systems-thinking approach, ensuring infrastructure, observability, and automation strategies support sustainable growth.
* Improve deployment pipelines, automation, and operational workflows across squads, fostering consistency and best practices.
* Support capacity planning, scalability, and security best practices, proactively identifying risks and opportunities to enhance platform resilience.
* Ensure clear visibility of ongoing work, technical debt, and team progress.
* Define and track key engineering health metrics to measure and improve team effectiveness.
* Foster a culture of continuous improvement, driving agile practices, backlog refinement, and retrospectives.
* Embed blameless learning to improve reliability and efficiency across the team.
* Participate in the incident response process, working with service management to ensure structured post-mortems and continuous learning.
* Improve detection, response times, and resolution processes to minimise downtime.
* Identify recurring failure patterns and implement proactive risk mitigation strategies.
* Define and enforce SLAs, SLOs, and error budgets, working closely with engineering leads, business stakeholders, and the Head of Platform Operations.

Requirements

* Proven leadership experience in technical teams, with a focus on mentoring, professional development, and fostering a culture of innovation, reliability, and engineering excellence.
* Strategic mindset, able to align technical initiatives with business goals, drive scalability and performance improvements, and proactively tackle complex challenges.
* Proven experience in Site Reliability Engineering, DevOps, or Systems Engineering, with hands-on experience in both Azure and AWS environments.
* Demonstrable expertise in high-performance, scalable, and highly available systems, with experience in optimising reliability, capacity planning, and system performance.
* Strong understanding of regulatory and security requirements, such as ISO 27001, PCI DSS, Cyber Essentials Plus (CE+), and SOX, with experience implementing compliance-driven engineering practices.
* Deep expertise in DevOps principles, including automation, infrastructure as code (Terraform, Ansible, or Chef), GitOps workflows, CI/CD best practices (GitHub Actions, GitLab CI/CD, Azure DevOps), and collaborative ways of working.
* Strong background in containerisation (Docker) and orchestration (Kubernetes), with a focus on scalability and resilience.
* Hands-on experience with monitoring, observability, and incident management tools (Prometheus, Grafana, ELK, Azure Monitor, Application Insights, Kusto) and a data-driven approach to improving system reliability.
* Strong networking and security knowledge, including cloud security best practices, identity management, and access controls.
* Experience in recruiting and scaling teams, driving engineering hiring decisions, shaping team culture, and mentoring engineers.
* Advocate for modern DevOps and SRE best practices, championing collaboration, transparency, automation, continuous learning, and continuous improvement across teams.
* Excellent communication skills, able to engage stakeholders, collaborate cross-functionally, and drive alignment on reliability and operational priorities.
* Adaptability and resilience, comfortable working in fast-paced environments, handling incidents, and participating in on-call rotations.

We are Vitesse – the payment provider of choice for the insurance and treasury industry.

We are an Equal Opportunity Employer. We are committed to creating an inclusive environment that enables everyone to perform at their best, where we recognise the rights of all individuals to mutual respect and where there is an unbiased acceptance of others. Our policies and practices aim to promote an environment that is free from all forms of unfair discrimination and values the diversity of all people. At the heart of our policy, we seek to treat people fairly and with dignity and respect.
