Site Reliability Engineer (SRE) - Object Storage
London, England, United Kingdom
Software and Services
Description
The Apple Services Engineering (ASE) organization builds and provides systems and infrastructure that fuel Apple’s services (such as iCloud, iTunes, Siri, and Maps). At ASE, we are building and scaling high-performance, resilient, and efficient storage and analytics platforms that power critical insights across the company. Our team sits at the heart of distributed systems, big data, and large-scale infrastructure, ensuring that petabyte-scale workloads run smoothly, efficiently, and reliably. ASE runs the majority of its systems on Linux. We run a mix of open source, vendor-licensed, and internally developed tools to perform functions such as system configuration management, provisioning, software deployment, logging, and monitoring. You'll be expected to learn these tools and to improve them.
Minimum Qualifications
* Subject-matter expertise in object storage, with experience leading large-scale migration and modernization initiatives in the data analytics domain and providing expert guidance to customers as they transition to cutting-edge systems.
* Hands-on experience running analytics storage solutions such as HDFS or S3-compatible systems.
* Proficiency in designing, authoring, and releasing code in languages like Go or Python.
* Experience in managing and scaling distributed systems in a public, private, or hybrid cloud environment.
Preferred Qualifications
* Knowledge of provisioning, data migration, disaster recovery, and capacity planning.
* Experience in automating repetitive tasks and processes to enhance reliability and efficiency.
* Good understanding of networking concepts, including the TCP/IP stack, DNS, DHCP, and other standard network protocols.
* Experience contributing to team and organizational strategy, including participation in architectural reviews and decision-making processes.
* Hands-on experience managing large numbers of diverse systems with configuration management or software delivery platforms (such as Puppet or Ansible).
* Willingness to participate in on-call rotations and incident management processes to ensure rapid resolution of critical issues.
* Experience with monitoring tools like Splunk and Prometheus.