Site Reliability Engineer
Permanent
Clapham Junction / Hybrid
This is a hybrid role with 2 days a week at our HQ in East Croydon and 3 days working from home. However, in March 2025 we will be moving to our new HQ in Clapham Junction with 3 days a week in the office.
As a Site Reliability Engineer at The Gym Group (TGG), you’ll ensure fast, reliable, and delightful experiences for every user by maintaining highly available, performant, and observable cloud infrastructure. You will collaborate across Development, DevOps, InfoSec, QA, and SRE teams to continuously improve system reliability, deployment strategies, and alerting infrastructure.
Key duties & responsibilities:
* Maintain and enhance monitoring, logging, and alerting systems to proactively detect and resolve potential issues across our digital channels.
* Collaborate with Development, Platform/DevOps, InfoSec, QA, and SRE teams, as well as with Technical Architects and the Digital Ops Manager, to ensure reliability and observability of infrastructure and applications.
* Optimise deployment strategies and streamline recovery processes to support high availability and performance in a cloud environment.
* Build resilient, observable systems using a modern stack that includes Terraform, Kubernetes, GitHub, Azure DevOps, Service Bus, Cosmos DB, Redis, and Cloudflare.
* Support the transition to a microservices-based architecture that leverages Microsoft Azure and Azure APM, while welcoming knowledge of other cloud providers and toolchains.
* Lead continuous improvement initiatives for deployment practices, monitoring, and alerting to ensure seamless user experiences.
* Contribute to incident response strategies, including detection, communication, and swift recovery processes, without on-call obligations outside of office hours.
Essential Skills:
* Can articulate core SRE principles (e.g. Golden Signals, SLIs and SLOs, SRE metrics, release engineering, blameless retrospectives, process capability) and apply them in practice
* Excellent log analysis and incident triage skills
* Performance monitoring
* Dashboard creation and alerting rules management
* Shell scripting and coding (e.g. Bash, PowerShell, Python)
* Understanding of Root Cause Analysis, Fault Tree Analysis, FMEA and/or similar safety engineering and reliability engineering methods
* Expertise with DevSecOps tools and methodologies and with Infrastructure-as-Code
* Deep experience with a major public cloud platform (e.g. Azure, AWS, GCP)
* Containerisation (Docker, Helm, etc.)
* Awareness of network security and networking protocols
* Strong general computing knowledge (e.g. hardware performance metrics, software fault modes, vulnerability patching and hardening)
Desirable Skills:
* Microsoft Azure (e.g. VNets, Storage Containers, Application Gateway, APIM, App Service)
* Kubernetes
* Azure DevOps (YAML Pipelines)
* Azure Monitor, Azure Application Insights
* Terraform
* FinOps and cloud infrastructure optimisation
* Cloudflare
* Azure Active Directory / Entra ID
* Deploying and supporting Node.js stack applications
* Deploying and supporting .NET stack applications
* GitOps / Policy-as-Code
* Design patterns for distributed systems (e.g. event-driven, microservices)
Benefits:
* 25 days' holiday plus bank holidays
* Pension: 5% employee contribution and either 3% or 4% employer contribution, depending on which scheme you opt for (auto-enrolment or staff pension)
* Share purchase plan eligibility
* Save as you earn
* Cycle to work
* Gym membership from day 1, spouse/friend membership after 6 months
* Private medical insurance, single cover (after 6 months)
* Up to 10% performance-related bonus
* Life assurance of 3x annual salary