Senior Site Reliability Engineer
A Senior Site Reliability Engineer specializing in observability is needed for a global pioneer in Cloud and Internet Intelligence. The company gives organizations visibility and insight into a borderless network, arming its clients with a precise understanding of how the network impacts their applications, users, and customers.
This role is a unique opportunity for an experienced SRE to provide the tools, services, and infrastructure used to monitor and observe the platform, leveraging cloud-native tools and enabling developers to instrument, analyze, and monitor the application.
Permanent position, hybrid working in London.
Responsibilities
1. Designing, deploying, and maintaining cloud-native monitoring services that are both elastic and resilient to failure across AWS.
2. Establishing standards and best practices for the instrumentation of container-based services and cloud-managed services.
3. Maintaining their alerting pipeline to ensure that notifications are timely, accurate, and directed to the appropriate channels.
4. Implementing automation as a priority, allowing the monitoring platforms to scale smoothly and promoting a self-service approach.
Requirements
* Strong Infrastructure as Code skills, ideally with Terraform and Kubernetes.
* Strong knowledge of modern logging tool sets, including Logstash or Fluentd.
* Understanding of Prometheus and its ecosystem, including Alertmanager.
* Good knowledge of Application Performance Monitoring tools and crash reporting tools, such as Sentry.
* Good knowledge of cloud provider managed services and how they can be leveraged in the company's context.
* Ability to write high-quality code in Python, Go, or equivalent languages.
This is an exciting opportunity for a Senior SRE to join an expanding global business. If you are interested, please apply with your CV.