The role
The Bristol AI Supercomputing Centre runs the Isambard-AI National Artificial Intelligence Research Resource, and the Isambard3 Tier-2 Supercomputer. Isambard-AI is expected to be the most powerful supercomputer in the UK and amongst the most powerful in Europe.
The AI Supercomputing team owns the entire process of developing and operating the centre’s compute and software infrastructure, which includes:
* The sourcing of hardware and system design.
* The deployment of huge software-defined infrastructure using tools such as Kubernetes and Terraform.
* Building and operating platforms to enable researchers to conduct leading-edge research using the systems.
* Optimising and refining software to ensure environmental and economic efficient use.
As one of the largest Open AI Research Resources internationally, we are committed to catalysing an AI transformation in the research and development community.
As the AI Supercomputing Infrastructure Support Engineer, you will work as part of the AI Supercomputing Team to build and operate primarily the infrastructure and compute platforms that researchers use for their work. You do not need to be an ML/DL or computational research domain expert to deliver world-class infrastructure, but you do need to quickly obtain a deep technical understanding of new domains. You should enjoy being self-directed and identifying the most important problems to solve as the team matures with standardized tools and processes around stability, robust service delivery and scaling.
This role supports hybrid working arrangements to provide flexibility. While the Centre's operational needs require successful candidates to start as soon as possible, we are able to accommodate individual notice periods to attract top talent.
What will you be doing?
As a member of the AI Supercomputing Team, you will:
* Use tools such as bash, Terraform, Kubernetes, and Python.
* Assist in operating large, highly available supercomputing services managed as software-defined infrastructures, and integrated as complete computational experiments using tools such as JupyterHub, Kubernetes, and CSM (Cray System Management).
* You will experience operating massive-scale GPU and combined CPU/GPU workloads across these services.
* You will debug platforms and work closely with researchers as you co-design solutions that will enable the development and operation of new algorithms & software to solve leading-edge research problems.
You should apply if
You will find this work exciting if you:
* Want to help maintain some of the largest, modern software-defined supercomputing systems.
* Would enjoy working with world-class domain & AI researchers as your primary workload.
* Have supported development or operation of small to large clusters or dabbled in building your own physical or software-defined systems.
* Love understanding large distributed, highly available systems, and want to see them used for truly open national-scale research.
For our AI Supercomputing Infrastructure Support Engineer role, you’ll need:
* Domain knowledge in 1 or more areas from SysOps, NetOps, DevOps, SecOps, MLOps, or Research Software Engineering.
* Degree (or equivalent practical experience) in computer science, computational or ML/AI research, or in a natural science with a high degree of competence in computer science or computational research.
* Ability to follow guidance from experienced members of the team & follow operational procedures.
Additional information
Shortlisted candidates for this role will be interviewed over a number of tranches.
For informal queries please contact: Emma Rose at Emma.rose@bristol.ac.uk
£37,999 to £43,878. Grade I, per annum
#J-18808-Ljbffr