Job Description
<p>As a Software Engineer on the Machine Learning Infrastructure team, you will build the "Operating System" for our large-scale GPU clusters. You will architect a high-performance training platform that handles the immense complexity of multi-thousand GPU workloads, ensuring every cycle is used efficiently. Your work directly determines the velocity at which our researchers can train and iterate on the world’s most advanced models.</p>
<p>The ideal candidate is a systems expert who thrives on solving the orchestration, networking, and reliability challenges that emerge at massive scale. You will partner closely with researchers to build a seamless, resilient environment that transforms raw compute into breakthrough AI.</p>
<h2>You will:</h2>
<ul>
<li>Architect and scale a multi-tenant orchestration layer that abstracts away the complexity of GPU clusters, ensuring high utilization and seamless job recovery.</li>
<li>Design and implement scheduling primitives to optimize the lifecycle of training jobs.</li>
<li>Develop deep observability and automated health-checking into the training stack to proactively identify and isolate hardware failures.</li>
<li>Evaluate and integrate emerging technologies in the CNCF and AI ecosystem (e.g. Ray, Kueue), making data-driven build vs. buy decisions that balance velocity with long-term maintainability.</li>
<li>Work closely with Finance and Procurement teams to drive our capacity planning process.</li>
<li>Participate in our team’s on-call rotation to ensure the availability of our services.</li>
<li>Own projects end-to-end, from requirements and scoping through design and implementation, in a highly collaborative and cross-functional environment.</li>
</ul>
<h2>Ideally you'd have:</h2>
<ul>
<li>5+ years of experience in backend or infrastructure engineering, with at least 2 years focused on orchestrating ML workloads at scale (100+ GPU nodes).</li>
<li>Strong programming skills in one or more languages (e.g. Python, Go, Rust, C++).</li>
</ul>
Required Skills
Python
PyTorch
CUDA
Kubernetes
AWS
GCP
Rust
C++