RunPod offers GPU cloud computing for AI/ML, providing secure and community cloud options, on-demand and spot pods, and serverless GPU scaling.
The flexibility of remote work with an inclusive, collaborative team.
An opportunity to grow with a company that values innovation and user-centric design.
Generous vacation policy to ensure work-life harmony and well-being.
Contribute to a company with a global impact based in the US, Canada, and Europe.
Experience Requirements:
1. 5+ years of experience in Site Reliability Engineering or a similar role
2. 3+ years of experience in a technical leadership or management position
3. Deep understanding of Linux systems, containerization, virtualization, and networking technologies
4. Strong background in managing and monitoring large-scale distributed systems and bare-metal fleets
5. Expertise in infrastructure-as-code and configuration management tools
Responsibilities:
1. Lead and mentor a team of Site Reliability Engineers, fostering a culture of innovation, continuous learning, and technical excellence
2. Develop and implement strategic plans to enhance the reliability, scalability, and efficiency of our infrastructure
3. Collaborate with cross-functional teams to align SRE initiatives with broader organizational goals
4. Establish and maintain SLIs, SLOs, and SLAs for critical systems and services
5. Drive the adoption of best practices in automation, monitoring, and incident response
Software Engineer, Site Reliability Engineer.
Fireworks AI offers a fast and efficient platform for building and deploying generative AI applications with a focus on speed, value, and scalability.
Tyk AI Studio is an AI gateway and management solution that helps organizations harness AI's potential while ensuring governance, security, compliance, and control.
Experience Requirements:
1. Proven experience in a senior SRE role or similar.
2. Strong knowledge of cloud technologies and SLA, SLO, and SLI management.
3. Experience leading teams and implementing SCRUM processes.
4. Excellent communication and leadership skills.
5. Experience line managing, mentoring, and coaching.
Responsibilities:
1. Collaborate with the Principal SRE to shape and implement the SRE strategic plan.
2. Lead the SRE team in translating strategy into actionable plans, coordinating these through the SCRUM process.
3. Address wellbeing and performance concerns, fostering a positive and productive team environment.
4. Work with the Principal SRE and Scrum Master to analyze wellbeing survey outcomes and develop improvement plans.
Invisible AI is an on-premise computer vision platform for manufacturing that uses AI to improve worker productivity and safety by analyzing manual assembly work.
Education Requirements:
1. Bachelor’s degree in Computer Science, Information Technology, or a related field, or equivalent experience.
Experience Requirements:
1. 5+ years of experience building and managing infrastructure at scale, particularly on the edge.
2. Proficiency in Python, Docker, Linux systems, and scripting (Bash, Python).
3. Strong expertise with infrastructure automation tools (Terraform, Ansible).
4. Experience managing observability and monitoring systems, particularly Prometheus.
5. Deep understanding of networking concepts and protocols.
Responsibilities:
1. Design, build, and maintain scalable and resilient infrastructure on the edge.
2. Develop automation and infrastructure-as-code solutions using Terraform, Ansible, and scripting languages (Python, Bash).
3. Deploy and manage containerized applications using Docker and related technologies.
4. Ensure system observability by building and optimizing monitoring systems, particularly using Prometheus.
5. Troubleshoot and optimize Linux-based systems (e.g., Red Hat, CentOS, Ubuntu).
xAI's Grok is a powerful, multilingual large language model available on X and via API, focused on accelerating scientific discovery.
Experience Requirements:
1. Expert in at least one programming language that compiles to machine code such as Rust, C++, or Go.
2. Expert knowledge of monitoring technologies such as Prometheus, Grafana, and PagerDuty.
3. Expert knowledge of deployment technologies such as Pulumi or Terraform.
4. Expert knowledge of Kubernetes.
Responsibilities:
1. Improving our observability by adding/adjusting metrics.
2. Building easily parsable dashboards.
3. Designing and overseeing our on-call rotations.
4. Improving our deployment process to increase reliability.
Luminance is an AI-powered legal tech platform that streamlines contract lifecycle management with features including AI-powered negotiation and an intelligent contract repository.
Education Requirements:
1. Bachelor's or Master's degree with a First or 2:1, preferably in a technical subject.
Other Requirements:
1. Excellent problem-solving skills, including diagnosing issues within complex systems.
2. Ability and desire to identify root causes of issues, and propose and implement structural improvements.
3. Strong communication skills and the ability to perform in urgent situations.
4. Knowledge of the design and operation of web-based software applications, based on technologies such as node.js, PostgreSQL, or Elasticsearch.
5. Knowledge of modern infrastructure and operational tooling within cloud-based architectures, such as Linux, Python, AWS, Ansible, and Prometheus.
Senior Site Reliability Engineer (Remote)
Fathom is a free AI meeting assistant that records, transcribes, and summarizes your meetings, saving you time and improving productivity.
Experience Requirements:
1. 6+ years of experience.
Responsibilities:
1. Scaling existing tools.
2. Enhancing automation for scaling infrastructure.
3. Playing a key role in diversifying and scaling platform.
4. Evaluating options to replace existing real-time data pipeline.
5. Providing platform support to engineering.
AppTek.ai provides AI-powered speech and language solutions including ASR, NMT, NLP/U, LLMs, and TTS, serving diverse industries globally.
Education Requirements:
1. BS in a field related to Computational Linguistics, Computer/Data Science.
Experience Requirements:
1. 2+ years of industry experience (desirable for Site Reliability Engineer role).
Other Requirements:
1. Strong knowledge of Linux.
2. Strong knowledge of AWS.
3. Docker.
4. Scripting languages (Bash, Python).
5. Familiarity with load-testing tools.
6. Must be U.S. citizen capable of obtaining a Secret clearance (for Computational Linguist and Linguist roles).
Responsibilities:
1. On-call first-level response.
2. Respond to customer issue reports.
3. Troubleshoot problems to maintain service SLAs.
4. End-to-end monitoring across infrastructure and services for metrics/alerts/logs.
Linc's CX automation platform uses AI to streamline retail customer service, boosting efficiency and delighting customers.
Education Requirements:
1. B.S. in Computer Science or a related field.
Experience Requirements:
1. 1+ years of site reliability engineering experience.
Other Requirements:
1. Familiarity with at least one cloud service provider, preferably AWS.
2. Familiarity with basic SQL commands and Internet protocols.
3. Proficiency in cloud application orchestration tools such as Kubernetes and Helm.
4. Experience with monitoring stacks, preferably Datadog.
Responsibilities:
1. Collaborate with engineering teams to define and maintain service SLAs.
2. Monitor metrics, alerts, and logs across infrastructure and applications.
3. Create and maintain tools to monitor the platform.
4. Respond to incidents, troubleshoot, and investigate root causes.
5. Conduct post-incident investigation and report.
QED.ai provides AI-driven solutions for data scarcity in health and agriculture, offering tools for data digitization, geospatial mapping, and spectroscopy.
Travel to exotic places around the world.
Ask Sage is a versatile, secure Generative AI platform for government and commercial use, offering significant productivity improvements and LLM-agnostic support.
Experience Requirements:
1. 3+ years in site reliability engineering, Kubernetes administration, or related role.
2. Deep expertise in Kubernetes and containers.
3. Strong understanding of cloud infrastructure, automation tools, and best practices for high availability and performance.
Responsibilities:
1. Monitor system performance and reliability.
Hebbia is an enterprise-grade AI platform that empowers knowledge workers by automating complex tasks and providing insights from various data sources. It’s designed for seamless integration and high security.
Experience Requirements:
1. 4+ years software development experience at a venture-backed startup or top technology firm.
2. Proven experience as a Site Reliability Engineer, DevOps Engineer, or similar role.
3. Strong expertise in managing CI/CD pipelines and deployment automation.
4. Proficiency in cloud platforms such as AWS, Azure, or Google Cloud (we are an AWS shop).
5. Solid understanding of containerization and orchestration technologies such as Docker and Kubernetes.
Other Requirements:
1. Experience with monitoring and observability tools such as Datadog, Prometheus, Grafana, or similar.
2. Knowledge of infrastructure-as-code (IaC) tools such as Terraform or CloudFormation.
3. Familiarity with security best practices and tools for infrastructure and application security.
4. Excellent problem-solving skills and the ability to troubleshoot complex issues.
5. Strong communication skills and the ability to work effectively in a collaborative environment.
6. A proactive and self-motivated approach to learning and adopting new technologies.
7. Passion for continuous improvement and operational excellence.
Responsibilities:
1. Assist in managing deployment pipelines to facilitate smooth and efficient software releases.
2. Help implement and maintain observability solutions for monitoring system performance and reliability.
3. Support local development environments to optimize developer workflows.
4. Work with development teams to ensure infrastructure aligns with project requirements.
5. Contribute to improving the security of our infrastructure by assisting with proactive measures and audits.
6. Assist in developing and maintaining automation scripts and tools to enhance operational efficiency.
7. Help troubleshoot and resolve infrastructure and application issues to minimize downtime and maintain smooth operations.
8. Participate in evaluating and integrating new technologies to enhance the scalability, reliability, and security of our infrastructure.
Abacus.AI provides AI-powered solutions for both individual professionals and large enterprises. Their tools include ChatLLM, a versatile AI assistant, and Abacus.AI Enterprise, a platform for automating business processes and building custom AI models.
Experience Requirements:
1. 2+ years professional experience in hands-on engineering roles including operating production environments in public clouds: AWS, GCP, Azure.
Other Requirements:
1. Python programming experience in production environments.
2. Experience with modern cloud environments: containerization, infrastructure-as-code, DevOps, CI/CD pipelines, and automation.
Responsibilities:
1. Building, tuning, and operating the entire infrastructure that powers Abacus.AI's multi-cloud SaaS products.
Senior Site Reliability Engineer - SRE - 12 months rolling contract.
GoodNotes is an AI-powered note-taking app offering a seamless digital pen-and-paper experience across multiple platforms.
Budget for things like noise-cancelling headphones, setting up your home office, personal development, professional training, and health & wellness.
Sponsored visits to our Hong Kong or London office every 2 years.
Company-wide annual offsite.
Medical insurance for you and your dependents.
This is a 12-month renewable fixed-term contract. We expect 40 hours of work per week (adjusted according to local laws) across 5 days per week, covering day hours in American time zones during weekends and 3 weekdays.
Experience Requirements:
1. Strong experience working in AWS-hosted environments.
2. Strong experience in supporting production workloads and firefighting.
3. Strong knowledge of SRE best practices and common issues.
4. Strong experience working with system monitoring tools.
5. Strong understanding and experience with distributed databases.
6. Solid understanding of Linux and Networking fundamentals.
7. Solid background in back-end development, including API usage and creation.
8. Solid knowledge of network and container security.
9. Solid understanding of container orchestration, with a particular emphasis on Kubernetes.
10. Solid experience in managing Relational and Non-relational databases, including backup and restore operations.
11. Familiarity with automation/configuration management tools, preferably CDK and/or Terraform.
Responsibilities:
1. Design, build, and maintain the Goodnotes infrastructure, and ensure it adheres to Dickerson’s Hierarchy of Reliability.
2. Design, refine, and execute new and existing playbooks.
3. Educate the various teams in SRE best practices. Support them from design and capacity planning through to rolling out new features.
4. Be the go-to person for higher-level escalation for applications.
5. Improve existing SLAs, and optimise latency and error rates.
6. Improve the system monitoring, health reporting, and logging.
7. Design and implement security, and assist in maintaining information security practices and procedures.
8. Participate in the on-call rotation covering Americas time zones (UTC-8 to UTC-5).
9. Open to working 5 shifts a week which may include weekends.
Replicant's conversational AI platform automates customer service, resolving up to 80% of calls with natural, intelligent AI agents.
Benefits:
1. Health insurance (health, dental, eye care, retirement).
Software Engineer, Site Reliability Engineering.
Wayve's AI Driver is a data-driven, mapless, and universally compatible self-driving technology focusing on safety and advanced human-like capabilities.
Upstage AI provides powerful LLMs (Solar Pro/Mini) and Document AI tools for task automation and enhanced productivity, offering flexible pricing and a remote-friendly work environment.
Support for work environment setup expenses:
1. Support for office space usage fees.
2. Support for self-development expenses (seminars, workshops, books, software).
Site Reliability Engineer/Linux System and Database Administrator at AIPRM.
AIPRM is a prompt management tool and community-driven prompt library for ChatGPT and other AI models.
Experience Requirements:
1. Minimum of 5 years' experience engineering automated systems for extensive data processing, spanning development environments to production landscapes.
Other Requirements:
1. Profound understanding of operating systems, database systems, and networking fundamentals.
2. Ability to independently tackle and probe infrastructure issues in live production setups, inclusive of hardware complications and liaising with data centers.
3. Hands-on experience with bare-metal servers.
4. Active participation in on-call rosters.
Experience Requirements:
1. Comprehensive experience with Linux, Networking, Databases, and SQL.
2. Exceptional English communication abilities, both in writing and verbally.
3. Demonstrated experience in data modeling, ideally within an AI-focused environment.
4. Hands-on experience designing and implementing data migration tools.
Responsibilities:
1. Continuously analyze and enhance system performance metrics.
2. Spearhead the specification, design, and implementation of cutting-edge data components and systems.
3. Proactively monitor, optimize, and test ClickHouse clusters to ensure optimal performance and reliability.
4. Review, refine, and optimize database backup & recovery processes.
5. Investigate and harden security aspects across all layers.
hCaptcha is a privacy-focused AI-powered CAPTCHA service that protects against bots and fraud.
Fully remote position with flexible working hours.
An inspiring team of colleagues spread all over the world.
Pleasant, modern development and deployment workflows: ship early, ship often.
High impact: lots of users, happy customers, high growth, and cutting-edge R&D.
Flat organization, direct interaction with customer teams.
Experience Requirements:
1. Minimum of six years of hands-on experience in related roles (engineering, DevOps, SRE).
Other Requirements:
1. Background in software engineering with expertise in backend development within Kubernetes-based systems.
2. Hands-on experience in development and orchestration within high-scale, high-uptime, and high-reliability environments.
3. Familiarity with distributed systems, including queue-first architectures and sharding.
4. Demonstrated engineering expertise, including gathering requirements, problem-solving, and making recommendations.
5. Preferred: Familiarity with security frameworks, attack vectors, botnets, and impact analysis.
Responsibilities:
1. Work with large-scale systems (handling millions of requests per second, serving millions of users, across multiple cloud providers).
2. Develop solutions to enhance performance, availability, security, and cost-effectiveness.
3. Keep us up, keep us fast, and keep our dev teams productive, ensuring that every peer release improves performance across the spectrum, including quality, security, uptime, speed-to-deliver, threat detection, and customer engagement.
4. Source improvement ideas, priorities, and capabilities from customers, the internal community, and new and existing system metrics. Make decisions rapidly.
5. Be creative and desire an environment where you can directly create value and be a force to improve the experience for our customers.