About GSS
Hello. Welcome to GSS! We're transforming the global financial system with cutting-edge technology, including artificial intelligence and collaboration with top financial institutions. Our platform sets new standards in compliance screening for sanctions, making international payments faster, smoother, and friction-free. Join us in revolutionising the industry and making a real impact!
About the Role
This is an exciting opportunity to join our growing Operations team managing Kubernetes clusters in Production and, through a DevOps culture, empower development teams with observability insights they can use to innovate faster.
We are looking for a Site Reliability Engineer, or production-experienced DevOps Engineer, with hands-on experience building observability for cloud-native SaaS products and driving operational excellence.
You will be responsible for delivering our monitoring infrastructure, shaping observability, and responding to incidents as well as ensuring the platform is performant and reliable. You will be a key member of the team, liaising with product teams, embedding SRE principles and building the observability platform for the next stage of growth at GSS. You will have direct input into the direction of Technical Operations, solving problems, supporting developers and optimising the platform through code.
Plus, enjoy a collaborative, flexible, and innovative work culture where your ideas are valued.
What You’ll Do
Key responsibilities in this role will include (but not be limited to):
* Leveraging core SRE values - measuring (SLI/SLO/SLA), testing, and eliminating toil via automation with appropriate Disaster Recovery planning
* Refining KPIs to enable data-driven decision making for availability and reliability
* Proactively analysing monitoring data to ensure production services are running optimally and cost-efficiently
* Proactively tracking capacity, quotas and other performance indicators to plan for growth
* Working with development teams to ensure new features are maintainable, properly monitored and evaluated for failure scenarios, with well-defined SLIs and achievable SLOs
* Enabling development teams through DevOps culture and the effective use of observability tools: promoting best practice, presenting knowledge transfer (KT) sessions, and helping troubleshoot and resolve business-affecting issues
* Building on our existing monitoring tools to deliver a comprehensive, optimised observability platform for logging, metrics and tracing to ensure suitable alerting scope
* Writing maintainable code to augment operations, scaling, resilience and observability
* Debugging production issues, mitigating them swiftly and preventing recurrence
* Maintaining runbooks for manual tasks and replacing those runbooks with automation wherever viable
* Supporting junior members of the team to adopt best practice
* Participating in 24x7 on-call rotation, incident response, escalation, RCA and blameless post-mortems
Ideal Experience
What you’ll need:
* At least 3 years’ experience within a production SaaS company (preferably event-driven)
* A self-starter who relishes responsibility, takes strategic direction and owns end-to-end delivery of solutions
* Expert knowledge of SRE fundamentals and a commitment to best practice
* Fluency with common observability tooling like Prometheus, Grafana, OpenTelemetry (OTel) and CloudWatch
* Experience analysing and building data telemetry, querying (PromQL), modelling, pipelines and dashboards to provide concise, focused insights and alerts for distributed systems
* Strong experience with Python and/or Go
* Java (Spring Boot and Micrometer) useful
* Demonstrable experience working with AWS services like SQS, EKS, RDS, VPC, EC2, CloudWatch (X-Ray, Metrics and Logs), Lambda
* Solid knowledge of Linux systems and Bash scripting
* Strong knowledge of networking and common protocols (TCP, DNS, TLS, HTTP)
* Experience with DevOps principles and tooling such as Infrastructure as Code (Terraform) and CI/CD (GitHub Actions, Jenkins)
* Knowledge of stream processing technologies like Kafka would be useful
* Experience working with ITSM systems like JSM, Zendesk or ServiceNow
* Experience building/maintaining automated incident management workflows
* Experience developing with containers and container orchestration (Docker & Kubernetes)
* Working knowledge and experience with Agile software development practices
* Strong communication, collaboration and documentation skills with proven experience working cross-functionally
* Ability to think about distributed systems in terms of failure modes and bottlenecks
* BSc/MSc in Computer Science, a related technical discipline, or equivalent experience
* Financial Services experience (or similar regulated industry) a bonus, but not essential
* Experience participating in Incident Response
What You Get in Return:
Impactful Work: Be part of a growing startup where your contributions make a real difference.
Generous Leave: Enjoy 30 days of holiday (plus bank holidays).
Comprehensive Benefits: Including a generous pension scheme, private medical insurance, and life assurance.
Wellbeing Perks: Access to EAP, YuLife, holistic wellbeing programs, and a Virtual GP for your health and happiness.
Flexibility: Hybrid working environment (we are open to remote working for some roles; please check with us when you apply) with a ‘work abroad’ policy for up to 4 weeks a year.
Learning: Access to Udemy, a learning platform with thousands of top-rated courses to develop both tech and business skills.
Ready to revolutionise finance and have fun doing it? Join GSS where we live by our values: Be Respectful, Be Bold and Take Ownership. Come join us and take your career to new heights!
Diversity statement
We are an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to, among other things, race, religion, gender, sexual orientation, gender identity, national origin, age or disability.