About LangChain
LangChain was founded in early 2023 to help developers build context-aware reasoning applications. The open-source LangChain framework gives developers the building blocks to create production-ready applications with LLMs. LangSmith is our commercial, all-in-one SaaS platform that supports the end-to-end development workflow for building LangChain and LLM-powered apps. LangSmith is now trusted by the best teams building with LLMs at companies such as Airbnb, ByteDance, Klarna, Google, Meta, and Home Depot, as well as by our 50+ paying enterprise customers.
Backed by Benchmark and Sequoia, two of the best venture capital firms, we grew our revenue 10x last year. We have big ambitions and are set up to build an enduring business.
About the role
Location: Europe
1. Respond to incidents involving LangSmith and LangGraph Platform (Cloud SaaS) during GMT/CET on-call hours, diagnosing and resolving issues promptly.
2. Monitor, maintain, and improve the reliability and performance of our commercial products (LangSmith and LangGraph Platform).
3. Develop and automate solutions to reduce manual intervention in operations.
4. Continue to improve monitoring and alerting capabilities using observability solutions like Datadog.
How to be successful in this role
1. Bachelor’s degree in Computer Science or a related field.
2. 5+ years of experience in Site Reliability Engineering, Infrastructure, or a related field.
3. Strong knowledge of Kubernetes.
4. Strong knowledge of monitoring and observability tools (e.g., Datadog).
5. Strong knowledge of Redis and Postgres.
6. Programming experience, particularly in Python and Go.
7. Familiarity with ClickHouse and Google Cloud Platform is a strong plus.
8. Strong problem-solving skills and the ability to perform under pressure.
9. Excellent communication skills to collaborate with a distributed team.