Client: Mentmore Recruitment
Location: London, United Kingdom
Job Category: Other
EU work permit required: Yes
Job Views: 5
Posted: 12.02.2025
Expiry Date: 29.03.2025

Job Description:

Lead Site Reliability Engineer - Azure/AWS - Terraform - London

My financial services client is looking for a Lead Site Reliability Engineer who will be responsible for ensuring the reliability and scalability of their infrastructure and services. This is a senior role requiring technical expertise, leadership, and a commitment to continuous improvement. You must have team lead/mentoring experience and be able to balance technical delivery, team productivity, performance measurement, and collaboration across teams and stakeholders.

Duties & Responsibilities:

Hands-On Engineering & Technical Leadership

Design, develop, and maintain cloud infrastructure (Azure/AWS) using Terraform and automation.
Lead troubleshooting, performance optimisation, and incident resolution to enhance reliability.
Ensure best practices in CI/CD pipelines, observability, and infrastructure deployment.
Promote transparency, inspection, and adaptation by making both system and team health data accessible and actionable.
Work with engineering leads, business stakeholders, and the Head of Platform Operations to define and enforce SLAs, SLOs, and engineering standards that support scalability, reliability, and operational efficiency.
Design solutions with a systems-thinking approach, ensuring infrastructure, observability, and automation strategies support sustainable growth.
Improve deployment pipelines, automation, and operational workflows across squads, fostering consistency and best practices.
Support capacity planning, scalability, and security best practices, proactively identifying risks and opportunities to enhance platform resilience.

Experience Required:

Proven leadership experience in technical teams, with a focus on mentoring, professional development, and fostering a culture of innovation, reliability, and engineering excellence.
Proven experience in Site Reliability Engineering, DevOps, or Systems Engineering, with hands-on experience in both Azure and AWS environments.
Demonstrable expertise in high-performance, scalable, and highly available systems, with experience in optimising reliability, capacity planning, and system performance.
Deep expertise in DevOps principles, including automation, infrastructure as code (Terraform, Ansible, or Chef), GitOps workflows, CI/CD best practices (GitHub Actions, GitLab CI/CD, Azure DevOps), and collaborative ways of working.
Strong background in containerisation (Docker) and orchestration (Kubernetes), with a focus on scalability and resilience.
Hands-on experience with monitoring, observability, and incident management tools (Prometheus, Grafana, ELK, Azure Monitor, Application Insights, Kusto) and a data-driven approach to improving system reliability.
Strategic mindset, able to align technical initiatives with business goals, drive scalability and performance improvements, and proactively tackle complex challenges.
Strong understanding of regulatory and security requirements, such as ISO 27001, PCI DSS, CE and SOX, with experience implementing compliance-driven engineering practices.
Advocate for modern DevOps and SRE best practices, championing collaboration, transparency, automation, continuous learning, and continuous improvement across teams.
Excellent communication skills, able to engage stakeholders, collaborate cross-functionally, and drive alignment on reliability and operational priorities.