Company Overview:
microTECH Global LTD is a leading provider of cutting-edge technology solutions, dedicated to delivering innovative and efficient systems for large-scale AI development and training infrastructure.
About the Role:
We are seeking an exceptional High Performance Systems Engineer or Cloud Infrastructure Specialist to manage our large-scale AI development and training infrastructure. As a key member of our team, you will oversee GPU servers, Kubernetes clusters (Rancher), and storage systems to ensure seamless operations and optimized performance.
Job Description:
* Kubernetes and Rancher Management: Configure, scale, and maintain Kubernetes clusters and Rancher for optimal performance and resource allocation.
* GPU Resource Management: Manage GPU servers and resources to ensure efficient scheduling, load balancing, and performance optimization of AI workloads.
* Storage Management: Maintain and optimize large storage systems for high availability, performance, and data persistence.
* DevOps and Automation: Implement CI/CD pipelines and automate infrastructure management using Terraform, Ansible, Jenkins, and GitLab CI.
* Monitoring and Troubleshooting: Set up and manage monitoring and logging systems (e.g., Prometheus, Grafana, ELK) for rapid issue resolution.
* AI Framework Optimization: Collaborate with data scientists and AI developers to optimize AI frameworks (e.g., TensorFlow, PyTorch) for GPU and cluster environments.
* Security and Access Management: Implement and manage role-based access control (RBAC) and ensure data security, encryption, and backup procedures are in place.
Key Requirements:
* Proven experience in managing large-scale Kubernetes clusters and containerization technologies (e.g., Docker).
* Strong understanding of GPU resource management and optimization for AI workloads.
* Expertise in managing large storage systems and implementing data persistence strategies.
* Proficiency in scripting and automation (Python, Bash, Go), with experience in infrastructure as code (IaC) using Terraform, Ansible, or similar tools.
* Familiarity with deep learning frameworks (e.g., TensorFlow, PyTorch) and experience optimizing them for large-scale environments.
* Experience with monitoring and logging tools such as Prometheus, Grafana, and ELK.
Salary and Benefits:
The estimated salary range for this position is $120,000 - $180,000 per year, depending on qualifications and experience. We also offer a comprehensive benefits package, including medical, dental, and vision insurance, 401(k) matching, and generous paid time off.
Why Join Us:
At microTECH Global LTD, we value innovation, collaboration, and professional growth. As a High Performance Systems Engineer or Cloud Infrastructure Specialist, you will have the opportunity to work with cutting-edge technology, collaborate with experienced professionals, and help shape the company's future.
How to Apply:
Please submit your resume and cover letter to apply for this exciting opportunity. We look forward to hearing from you!