What you’ll be doing
1. Champion SRE practices, including automation, monitoring, and incident response, to ensure the reliability and performance of our critical customer-facing platforms.
3. Collaborate with development, test, and operations teams to automate operational tasks, streamline CI/CD pipelines, and build robust monitoring systems that catch issues before they impact customers.
3. Drive continuous improvements in service reliability, availability, and scalability, using cloud-native technologies and infrastructure as code (IaC).
4. Manage incident response and post-incident reviews, ensuring that lessons learned are incorporated into improving our systems’ reliability.
5. Play a pivotal role in optimising platform performance and reducing toil through automation and scripting (Python, Bash, etc.).
6. Help set and maintain high standards for availability and latency through SLAs, SLOs, and SLIs.
7. Collaborate with product owners, scrum masters, and technical leads to ensure that reliability is built into the design and delivery of every project.
8. Contribute to the cultural and technical adoption of SRE best practices across the team, encouraging continuous improvement, learning, and adaptation to new technologies.
What you'll bring
9. A deep understanding of, and passion for, SRE, with experience in operating reliable services at scale across cloud and on-premises environments.
10. Strong Linux system administration skills, with a focus on performance optimisation, security, and automation.
11. Hands-on experience with monitoring and alerting tools (e.g., Prometheus, Grafana, ELK stack) and the ability to respond to incidents efficiently.
12. Experience with Agile methodologies, and a strong grasp of the DevOps and SRE mindsets.
13. Proficiency in building CI/CD pipelines, automating infrastructure with tools like Terraform or CloudFormation, and deploying in cloud environments (AWS, GCP).
14. Solid programming/scripting skills, with proficiency in languages such as Python or Bash.
15. A problem-solving mindset and the ability to tackle complex issues around scalability, reliability, and automation.
16. A passion for collaboration and learning, with a commitment to improving the resilience and reliability of services.
17. Experience with configuration management tools (Ansible, Chef, Puppet) and version control with Git.
What's in it for you
18. Tailored training and development opportunities to continue to build your career
19. 10% on-target bonus
20. 25 days’ annual leave (not including bank holidays), increasing with service
21. Life Assurance
22. Pension scheme: if you pay in a minimum of 5% of your pensionable salary every month, we will pay in 10%
23. Direct Share scheme
24. Option to join the Healthcare Cash Plan or other benefits such as dental insurance, gym memberships, etc.
25. Exclusive colleague discounts on our latest and greatest BT broadband packages and BT TV, including TNT Sports and NOW entertainment
26. Shared parental leave: the maximum amount of leave you can share with your partner is 50 weeks
27. 3 volunteering days per year
28. Access to, and involvement with, our 11 incredible People Networks, including the Able2 network, Careers network, Ethnic diversity network, Gender equality network and Pride network