Web Operations Engineer - Digital SaaS Platform, .NET, Cloud, CI/CD, Infrastructure, Scalability - London/Hybrid - Permanent - 65k plus benefits
My client, a global e-commerce company, is looking to recruit an experienced Web Operations Engineer. The company is expanding, so this is an exciting time to join, with opportunities to progress.
In this role, you will be responsible for the reliability, scalability, and performance of the company's digital platform and infrastructure. You will lead a small team of engineers and help manage the company's external Azure Managed Service Providers (MSPs).
Reporting to the Head of Engineering, you will be responsible for improving and optimizing the company's Azure platform. Responsibilities include incident and problem management, disaster recovery, release management, managing observability tools (Datadog), and improving the developer experience and tooling. This role would be ideal for a .NET developer who wants to get more involved in cloud/DevOps.
Duties include:
* SRE Strategy and Vision: Define and implement the overall strategy for SRE to align with organizational goals, balancing reliability, scalability, and development velocity.
* Service Uptime: Ensure systems and services meet agreed service level agreements (SLAs) and service level objectives (SLOs) for uptime and performance.
* Incident Management: Lead efforts to establish effective incident response protocols, including detection, triage, resolution, and post-incident reviews.
* Disaster Recovery: Oversee the development and testing of disaster recovery plans and procedures.
* Infrastructure as Code: Drive adoption and best practices for automation, ensuring repeatability and consistency in infrastructure provisioning.
* CI/CD Pipeline Optimization: Ensure seamless integration and delivery pipelines to support development and deployment at scale.
* Observability: Ensure comprehensive monitoring, logging, and alerting systems are in place to track the health and performance of systems.
* Incident Resolution: Lead and coordinate major incident responses, ensuring swift recovery while minimizing impact.
* Root Cause Analysis: Oversee post-mortem processes to identify root causes, document lessons learned, and implement preventive measures.
We are looking for candidates with experience in the following:
* Ideally, a background in .NET development, as you will be resolving incidents and working with the developers to fix code problems.
* Web operations engineering experience.
* Experience working with third-party infrastructure management providers (Azure MSPs).
* Experience with .NET technology - ideally.
* Experience working with large-scale platforms and codebases.
* SRE Practices and Principles.
* Automation and Tooling.
* Monitoring & Observability.
* Performance Optimization.
* Incident & Disaster Recovery Management.
* Proven experience in scaling infrastructure.
* Excellent communication skills, both verbal and written.
* Strong organizational expertise and the ability to effectively multi-task.
* Strategic thinker.
* Data-driven decision maker.
* Security & Compliance - ideally.
* Cloud Native Architectures (e.g., Kubernetes, Docker) - ideally.
* Cloud Certifications - ideally.
Excellent benefits, training, and career progression.