In this key role, you will improve, drive, and embed non-functional and operational characteristics such as availability, performance, efficiency, change management, observability, security, incident response, and capacity planning of our products and services.
You will enjoy significant stakeholder interaction, working in collaboration with engineers to ensure a principled approach to delivering change in a safe and secure way.
This is a chance to join an inclusive team with a collaborative ethos and a commitment to innovation and professional development.
What you will do
As a Site Reliability Engineer, you will lead the adoption of SRE practices as part of our SRE enablement team. You will work closely with our feature team and other colleagues to meet defined service level objectives and continually improve systems and environments. You will track and reduce toil and define SLIs, SLOs, and error budgets that support finding the right balance between risk and reliability.
You will also provide structure and help to our release process, suggesting and making improvements where possible. You will scale systems sustainably through mechanisms like automation, evolving them by pushing for changes that improve reliability and velocity. We will also look to you to coach and provide guidance to colleagues and the wider team, leading where required.
* Proactively contribute innovative ideas to meet short-term and longer-term goals.
* Continually balance and manage any potential risks.
* Be accountable for the day-to-day health of both production and non-production environments and respond to any incidents as required.
* Provide exceptional support to our internal and external customers by proactively managing internal and external production systems and pioneering streamlined solutions for them.
* Contribute to Site Reliability Operations (production support, incident response, on-call rota, toil reduction, observability, security, application performance, and codification).
* Balance feature development speed and reliability with well-defined service level objectives.
* Lead and coordinate major incidents in a complex multi-party environment.
* Proactively lead improvements to the quality of releases into production and provide highly available, performant, and secure production systems.
* Implement proactive monitoring and alerting to ensure an early response to outages.
* Be accountable for the performance of internal systems and third-party suppliers.
* Provide technical expertise and input to establish the risk tolerance of products and services.
* Communicate incident status updates clearly and frequently to other teams, customers, and stakeholders.
Key Skills and Experience:
* Strong knowledge of reliability systems thinking and experience of software engineering. You will need experience of using a data-driven, scientific approach to fact finding.
* Prior experience in establishing a Site Reliability Engineering function with 24/7 support.
* Coding experience, with the ability to demonstrate how to build, test, scan, and deploy .NET and JavaScript applications.
* Hands-on experience of Azure cloud, IaC, JSON, Azure Bicep, Azure Policy, Azure DevOps, OpenTelemetry, Azure Monitor, Azure Sentinel, Azure Defender, Grafana, Kusto queries, Kubernetes (AKS), Azure Arc, and Azure Functions.
* Excellent knowledge of DevOps, Security, and IT Service Management.
* Hands-on experience with Azure cloud and full-stack observability using tools such as Log Analytics and Application Insights.
* Deep knowledge of Kubernetes and Prometheus.
* Experience with GitOps practices.
* An understanding of shift-right approaches and experience with chaos engineering.
* A proactive approach to spotting problems, areas for improvement and performance bottlenecks.
* Knowledge of automating IT request fulfilment processes through orchestration (ServiceNow).
* Knowledge of cloud-native and microservices technologies, including containerisation and API management.
* Effective communication and presentation skills.
* Financial services knowledge, and the ability to identify wider business impact, risk, and opportunity, and make connections across key outputs and processes.