Site Reliability Engineering is responsible for delivering continuous improvement, automation, and self-service offerings to operational teams across Bank EMEA and Securities International.
The role is responsible for the reliability and efficiency of infrastructure through the delivery of common, repeatable tools and processes that greatly reduce the toil operations teams must perform.
The role holder is a member of the L3 Engineering team, providing subject matter expertise and acting as the final point of escalation.
Key Responsibilities:
1. Develop software to make infrastructure services self-managing and self-service.
2. Deliver continuous service improvement by developing Infrastructure as Code.
3. Eliminate manual, repetitive, automatable, tactical tasks that are devoid of value.
4. Develop proactive monitoring solutions that alert on symptoms, not just on outages.
5. Perform detailed root cause analysis (RCA) on incidents and outages to prevent recurrence.
6. Identify technical debt and partner with application teams to build remediation plans.
7. Liaise with Infrastructure Control and IT Risk teams to satisfy internal and external audit requests.
8. Identify cost-saving and optimization opportunities across the group.
9. Define SLOs (Service Level Objectives) that capture availability and latency targets.
10. Maintain infrastructure in a highly available, reliable, secure, and performant manner.
Key Skills/Knowledge/Experience:
1. Exceptional skills in Microsoft Windows Server internals and related technologies.
2. Excellent skills in managing and maintaining Active Directory, DHCP, DNS, LDAP, and Kerberos.
3. Extensive experience in hardware performance monitoring and tuning complex low latency systems.
4. Agile, Site Reliability Engineering (SRE), and DevOps principles and practices.
5. Exceptional knowledge of scripting and programming languages such as PowerShell, Python, and C#.
6. Fluent in Backup and Recovery processes and procedures.
7. Advanced knowledge of Clustering, High-Availability, Replication, and Disaster Recovery techniques.
8. Excellent Performance Tuning skills, in-depth knowledge of system internals, performance counters, and performance measurement and analysis tools.
9. Infrastructure as Code principles and practices.
10. Continuous Integration (CI) and Continuous Delivery (CD) principles and practices.
11. Git, Ansible, Terraform, and TeamCity.
12. Excellent communication and interpersonal skills.
13. Ability to handle pressure during outages and systematically resolve issues.
14. Excellent problem-solving skills.
15. Results-driven, with a strong sense of accountability.
16. A proactive, motivated approach.
17. The ability to operate with urgency and prioritize work accordingly.
18. A structured and logical approach to work.
19. Attention to detail and accuracy.
20. Ability to perform well in a pressurized environment.
21. Ability to manage constructive conflict effectively.
22. The ability to manage large workloads and tight deadlines.
23. Able to communicate complex technical concepts to non-technical persons at all levels.