仕事内容
<div class="content-intro"><h2><strong>About Anthropic</strong></h2>
<p>Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</p></div><h2><strong>About the role</strong></h2>
<p>Anthropic runs some of the largest Kubernetes clusters in the industry. We have fleets of hundreds of thousands of nodes across multiple cloud providers and datacenters to train, research, and serve frontier AI models. The Kubernetes Platform team owns the Kubernetes control plane that makes those clusters work.</p>
<p>We are operating at a scale where the defaults stop working. We own the scheduler and extend it to place topology-sensitive ML workloads across thousands of accelerators at once. We scale the control plane itself — apiserver, etcd, controllers — so it stays responsive as object counts and node counts grow by orders of magnitude. And we build the core cluster services every workload depends on, like service discovery, so they hold up under the same pressure.</p>
<p>We make sure the control plane is fast, correct, and always available. Your work will directly determine whether Anthropic can keep reliably and safely training frontier models as our compute footprint continues to grow.</p>
<h2><strong>Key responsibilities</strong></h2>
<ul>
<li>Own, operate, and extend the Kubernetes scheduler for Anthropic's accelerator fleets, including custom scheduling plugins and policies for gang scheduling, topology awareness, and preemption</li>
<li>Scale the Kubernetes control plane (apiserver, etcd, controller-manager) to support clusters far beyond typical limits, and find the next bottleneck before it finds us</li>
<li>Design, build, and operate core cluster services such as service discovery that every workload in the fle
求めるスキル
Python
Kubernetes
AWS
GCP
Rust
C++