Site reliability engineer

Posted: 13 April

Offer description

WELCOME LATAM CANDIDATES

Job Description: Site Reliability Engineer (SRE)

General Description

We are looking for a passionate and detail-oriented Site Reliability Engineer (SRE) to help design, build, and maintain reliable infrastructure and cloud-based services. In this role, you will adopt and promote SRE best practices, improve observability, and work closely with engineering, security, and product teams to ensure scalable and resilient systems.

You bring a strong background in cloud infrastructure, automation, and service ownership, and you thrive in dynamic environments where ambiguity means opportunity. You believe in automating everything, observing everything, and building systems that self-heal and scale.

Roles and Responsibilities

Apply SRE principles to the design, operation, and scaling of cloud services.
Take ownership of the reliability and performance of critical infrastructure and applications.
Participate in the on-call rotation, handling production incidents and driving root cause analysis.
Build and manage Infrastructure-as-Code (IaC) using Terraform, Pulumi, or similar tools.
Manage cloud environments (primarily AWS) and enterprise networking components like NGINX, load balancers, firewalls, VPCs, DNS, and security groups.
Work with Kubernetes, Helm, and Spinnaker to orchestrate and manage containerized workloads.
Develop tools and applications in Java, Python, or Go to improve system automation and observability.
Collaborate with cross-functional teams to ensure service-level objectives (SLOs) are met.
Continuously improve monitoring and alerting systems using Prometheus, Grafana, Splunk, or Datadog.
Communicate proactively with stakeholders and leadership through reports, updates, and postmortems.
Drive a culture of resilience, operational excellence, and continuous improvement.

Education

Bachelor’s degree in Computer Science, Engineering, or an equivalent combination of education and hands-on experience.
5 years of hands-on experience in infrastructure or site reliability engineering.
Proven experience working with cloud-native environments and distributed systems.
B2 English level, both written and spoken.

Skills

Soft Skills

Clear and concise communication with technical and non-technical audiences.
Strong analytical thinking and ability to manage complex systems.
Comfortable with ambiguity, able to define and lead initiatives proactively.
Thrives in fast-paced, high-stakes environments.
Strong sense of ownership and accountability.

Technical Skills

Expertise in Amazon Web Services (AWS) or other cloud platforms.
Proficiency in Infrastructure-as-Code tools like Terraform and Pulumi.
Deep experience in enterprise networking: NGINX, load balancers, firewalls, VPCs, DNS, ACLs.
Experience with containerized applications and Docker.
Production-grade usage of Kubernetes, Helm, and Spinnaker.
Programming/scripting in Python, Java, or Go.
In-depth knowledge of build/release pipelines and automation practices.
Advanced monitoring and observability with Prometheus, Grafana, Datadog, Splunk, or similar.
Familiarity with CI/CD workflows, incident response, and recovery strategies.
Experience leading or contributing to on-call rotations and incident response protocols.

Certifications are a plus

Cloud certifications (AWS Certified DevOps Engineer, GCP Professional SRE, etc.).
Certifications in Kubernetes administration or Terraform.
Contributions to open source or internal DevOps tooling.
Experience implementing SLOs/SLIs and measuring error budgets.

