Principal site reliability engineer

The Home

Bright Horizons

Site reliability engineer

Posted: 20 March

Offer description

Job: Principal Site Reliability Engineer
Type: Full Time / Remote / Occasional travel to our head office in Northampton
Salary: £DOE

Primary Purpose of the Role

The Principal Site Reliability Engineer (Principal SRE) plays a pivotal role in ensuring the seamless and reliable operation of the organization's digital infrastructure. This highly technical position will enhance the performance, scalability and reliability of the organization's complex systems and applications. It will reduce the time to detect and restore systems, increase uptime and improve incident response by applying best practices in automation, monitoring and incident management. The role requires a deep understanding of cloud technologies, distributed systems, automation and scripting, observability, software engineering and DevOps, along with a proactive approach to preventing and mitigating potential issues. The role reports to the Director of Site Reliability Engineering and will help foster a culture of innovation, continuous improvement and collaboration within the team to meet the organization's evolving needs and deliver a superior digital experience to users.

Our benefits include, but are not limited to:

Flexible working and holiday entitlements
Discounted childcare
Quarterly Employee Appreciation Weeks
Annual gala award evening
Fantastic range of discounts on high street retailers, grocery stores, cinema tickets, holidays and more
Wide range of wellbeing resources, supporting our teams for the ups and downs of daily life

Why Bright Horizons?

We’ve been voted Great Place to Work for the last 17 consecutive years, as well as being awarded the newly created Great Place for Wellbeing and Great Place for Women 2023. Our support functions enable our nurseries to deliver the best possible care and education to over 10,000 children across the UK. Through this support, our nurseries can deliver excellence, with 98% of our 300-nursery portfolio being rated Good or Outstanding by Ofsted.

We’re on a mission to change the future for children, families, and the people we work with, and are committed to progressive working values like flexibility, work-life balance, and wellbeing.

Essential Functions/Responsibilities

Reliability and Scalability: Contribute significantly to the reliability, scalability and availability of Bright Horizons' digital infrastructure by enforcing best practices of redundancy and resiliency across applications and infrastructure.

Observability: Implement robust infrastructure, application and digital-experience monitoring in Dynatrace, our enterprise-wide APM tool. Proactively identify potential issues, analyse system performance, and facilitate quick response to incidents. Create dashboards, alerts and automated workflows that other Operations or Application teams can use.

Incident Management: Drive troubleshooting of critical incidents by developing a deep and broad understanding of our enterprise architecture across all 7 OSI layers. Use monitoring and alerting to ensure timely incident resolution. Track KPIs like MTTD/MTTR and identify short-term and medium-term opportunities to improve. Conduct post-mortems to identify root causes and implement preventive measures.

Automation and Efficiency: Drive the development and implementation of automation solutions to streamline processes, reduce manual interventions, and enhance the overall efficiency of the Product, Engineering and SRE teams.

Tools Ownership: Besides owning Observability tools, create a roadmap to expand and consolidate the toolset. This should provide a 360-degree view of cross-functional areas like SRE, DevOps, Application Support, Monitoring, Incident Management, Infrastructure and Enterprise Architecture.

Collaboration: Collaborate with the above cross-functional teams to drive a unified approach to site reliability that optimizes their work and improves time-to-market for their respective objectives. Foster strong relationships with these delivery organizations to implement an SRE culture that delivers organizational goals.

Infrastructure Roadmap and System Capacity Planning: Work closely with Infrastructure and Architecture teams to design and implement roadmaps for scaling server and serverless architecture using containers as well as IaC tools like Ansible and Terraform. Conduct disaster recovery and controlled failure testing to improve resiliency. Conduct capacity planning to handle current and future demand.

Education and Essential Experience

Bachelor’s degree in Computer Science, Engineering, or related field - Required
Master’s degree in Computer Science, Engineering, or related field - Preferred
A minimum of 10 years of experience, including at least 5 years in the SRE field, with a proven track record of progressively increasing responsibilities - Required

Essential Requirements

Demonstrated ability to work with cross-functional Development, QE and Operations teams to understand the underlying architecture, and help improve its reliability and scalability.
Strong understanding of, and experience with, automation tools and programming/scripting languages (e.g., PowerShell, Python, Bash) to deliver improvements at both small and large scale.
Strong understanding of Observability tools (e.g., Dynatrace, Datadog, New Relic) and best practices, to implement effective monitoring of SLIs, SLOs and SLAs.
Strong experience and understanding of software engineering, Infrastructure as Code (Ansible or Terraform) and build/deployment pipelines.
Strong troubleshooting skills, coupled with data-driven decision-making during incidents, to improve time to detect and resolve issues.
Strong understanding of cloud computing platforms (Azure or Google Cloud) and cloud-native setups (AKS, serverless, etc.).
A "can do" attitude is necessary, combined with a deep belief that everything can be automated, and systems must always be functional.

Preference may be given to candidates with relevant certifications demonstrating cloud and reliability engineering expertise.

Bright Horizons are committed to creating inclusive environments where everyone has a sense of belonging and has the opportunity to contribute and thrive in meaningful and impactful ways. We are an inclusive employer and welcome people from all backgrounds to apply. We will consider reasonable adjustments required by applicants. If you share our passion and values, and have most of the skills listed, we encourage you to apply, as you may be just what we are looking for.

Please note, due to our sector all roles are subject to an Enhanced DBS. Some of our roles require specific qualifications by law; this will be highlighted as essential within the advert.

We look forward to receiving your application. If you experience any problems, please email europe.recruitmentbrighthorizons.com and we will be happy to help.
