SRE Expert (Full Onsite)
Location: London
Responsibilities:
1. Design and deliver end-to-end self-healing automation solutions to reduce manual effort and toil.
2. Collaborate with the production support team to identify existing manual activities and automate them.
3. Identify toil areas where automation can avoid manual intervention.
4. Build a monitoring system and observability platform that provides enhanced stack traces, alerts, and dashboards.
5. Define SLAs, SLOs, and SLIs, and implement them for better monitoring.
6. Focus on scalability, reliability, and observability to reduce MTTD and MTTR.
Technical Skills:
1. Ansible, Terraform, Python, DevOps, SRE, Docker, AWS (Atlas), ECS-based internal tooling.
2. Shell scripting, Linux, monitoring tools – Datadog, Splunk, Dynatrace, Grafana, ThousandEyes.
3. 5 to 9 years of experience with automation principles and tools.
4. Advanced working experience with Unix/Linux, Windows Server, Oracle, MSSQL, MongoDB.
5. Experience with Python, Java, curl scripting, or other scripting languages.
6. Experience with Jira, Confluence, Bitbucket, GitHub, Jenkins, Jules, Terraform.
7. Experience with observability tools: AppDynamics, Geneos, Dynatrace, CloudWatch, BigPanda, Elasticsearch (ELK), Google Cloud Logging, Prometheus.
8. Experience in creating dashboards for Infra/APM/E2E workflows.
9. Effective production management – Incident & Change Management, ITSM, ServiceNow.
10. Hands-on experience in SRE implementation of monitoring systems and dashboard development.
11. Experience working with Configuration as Code, Infrastructure as Code, AWS (Atlas).
Overall, we are looking for an Automation Engineer who can reduce toil and improve the reliability and scalability of our systems.