Our client is a growing cloud service provider specialising in High-Performance Computing as a Service (HPCaaS) for enterprises worldwide. Their platform runs AI-native workloads and HPC environments on modern cloud-native technologies. This is an opportunity to work on the infrastructure behind AI, ML, and advanced computational workloads.
The Role
Infrastructure Design & Virtualisation
* Architect and implement virtualisation solutions optimised for AI and HPC workloads, with a focus on hypervisor performance tuning.
* Design dynamic, scalable infrastructure that meets evolving customer demands for storage and networking.
Bare-Metal and Operating System Management
* Lead provisioning, orchestration, and optimisation of bare-metal systems across global deployments.
* Ensure secure, high-performance configurations for Unix/Linux environments at scale.
Networking and High-Performance Storage
* Design and deploy cloud-native, high-performance storage and networking solutions tailored to demanding workloads.
* Leverage expertise in networking protocols (TCP, UDP, DNS, BGP) and software-defined networking (SDN) technologies.
Kubernetes and Cloud-Native Platforms
* Manage Kubernetes clusters across hybrid and multi-cloud environments, including Container Network Interface (CNI) plugins and service meshes.
* Develop CI/CD pipelines to automate infrastructure delivery and enhance operational reliability.
Observability and Automation
* Build observability pipelines integrating logging, metrics, and distributed tracing tools.
* Automate deployments and streamline operations with tools such as Terraform and Ansible and languages such as Python and Go.
Architecture & Solution Design
* Evaluate emerging technologies for scalability, security, and performance within the client’s platform.
* Create detailed technical and business-aligned architectural proposals.
* Collaborate with cross-functional teams to ensure successful solution delivery.
Collaboration and Leadership
* Foster a solution-driven mindset, championing innovative approaches to challenges.
* Mentor team members in infrastructure best practices and emerging technologies.
* Align infrastructure projects with broader organisational goals in partnership with engineering leaders.
Skills and Experience
* Expertise in AI/ML workloads, GPU-accelerated systems, or HPC infrastructure.
* Proven experience leading infrastructure architecture initiatives in agile environments.
* Advanced proficiency in Kubernetes, container networking (CNI), and service mesh technologies.
* Strong background in virtualisation technologies and hypervisor optimisation.
* Extensive experience in large-scale global deployments, especially in HPC or AI-native environments.
* In-depth knowledge of Infrastructure as Code (IaC) tools such as Terraform and Ansible.
* Proficiency in a programming language such as Go or Python for automation.
* Familiarity with observability tools (e.g., Prometheus, Grafana) and distributed tracing systems.
Preferred Qualifications
* Bachelor’s degree in Computer Science, Information Technology, or a related field.
* 8+ years of experience in global-scale infrastructure design and deployment.
* Strong communication skills with the ability to convey complex concepts to diverse teams.
* Commitment to creating clear documentation for infrastructure processes and designs.