Referment are working closely with a top multi-strategy hedge fund looking to onboard a Senior SRE to develop and drive SRE practices, standards, and processes across its Platform group. In collaboration with the DevOps and Cloud teams, the successful candidate will ensure the reliability and scalability of trading systems and production environments on both cloud and on-prem platforms.
Responsibilities:
* Promote and implement SRE best practices and processes.
* Document systems, processes, and incident post-mortems.
* Mentor teams on SRE principles and automation.
* Implement observability and monitoring (Prometheus, Grafana, AWS CloudWatch).
* Define reliability standards for Kubernetes and cloud environments.
* Automate deployment pipelines and health checks.
* Monitor infrastructure metrics and proactively resolve issues.
* Foster a "reliability by default" approach to software delivery.
Tech Stack:
* Languages: Python, Java, Node.js, Shell
* Cloud: AWS
* CI/CD: TeamCity, Jenkins, Octopus
* Containers: Kubernetes, Docker
* Monitoring: Prometheus, Grafana, Sentry, CloudWatch
Requirements:
* 5+ years in SRE or related roles with complex distributed systems.
* Bachelor’s degree or equivalent in computer science/engineering.
* Proficiency with Kubernetes, cloud platforms (AWS), and SRE tools.
* Strong scripting skills (Python, Bash, Go).
* Experience with CI/CD, DevOps, and automation.
* Excellent troubleshooting, problem-solving, and communication skills.
Desirable:
* Experience in high-throughput/low-latency environments.
* AWS, Azure, or GCP certifications.
* Open-source contributions or Chaos Engineering experience.