About the team

Roku runs one of the largest data lakes in the world. We store over 70 PB of data, run 10 million queries per month, and scan over 100 PB of data per month. The Big Data team builds, runs, and supports the platform that makes this possible. We provide all the tooling needed to acquire, generate, process, monitor, validate, and access the data in the lake, for both streaming and batch workloads. We are also responsible for generating the foundational data. The systems we provide include Scribe, Kafka, Hive, Presto, Spark, Flink, Pinot, and others. The team is actively involved in open source, and we plan to increase our engagement over time.

About the role

We are seeking a skilled engineer with exceptional DevOps skills to join our team. Responsibilities include automating and scaling Big Data and analytics tech stacks on cloud infrastructure, building CI/CD pipelines, setting up monitoring and alerting for production infrastructure, and keeping our tech stacks up to date.

What you'll be doing

- Develop best practices around cloud infrastructure provisioning and disaster recovery, and guide developers on their adoption
- Scale Big Data and distributed systems
- Collaborate with developers on system architecture for optimal scaling, resource utilization, fault tolerance, reliability, and availability
- Conduct low-level systems debugging, performance measurement, and optimization on large production clusters and low-latency services
- Create scripts and automation that react quickly to infrastructure issues and take corrective action
- Participate in architecture discussions, influence the product roadmap, and take ownership of and responsibility for new projects
- Collaborate and communicate with a geographically distributed team

We're excited if you have

- 4 years of experience in DevOps or Site Reliability Engineering
- Experience with cloud infrastructure such as Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure, or other public cloud platforms; GCP is preferred
- Experience with at least 3 of the following technologies/tools: Big Data / Hadoop, Kafka, Spark, Airflow, Presto, Druid, OpenSearch, HAProxy, or Hive
- Experience with Kubernetes and Docker
- Experience with Terraform
- A strong background in Linux/Unix
- Experience with systems engineering around edge cases, failure modes, and disaster recovery
- Experience with shell scripting, or equivalent programming skills in Python
- Experience with monitoring and alerting tools such as Datadog and PagerDuty, and with participating in on-call rotations
- Experience with Chef, Puppet, or Ansible
- Experience with networking, network security, and data security
- A Bachelor's degree, or equivalent work experience