Job Description
<h3><strong>About the Role</strong></h3>
<p>In this role, you will operate, scale, and optimize multi-petabyte storage systems purpose-built for the world’s largest AI training and inference workloads. You’ll run high-performance parallel filesystems and object stores, evaluate and integrate cutting-edge technologies such as Vast, Weka, Ceph, and Lustre, and solve the complex engineering challenges of delivering extreme throughput and low-latency data paths at massive cluster scale.</p>
<p>You will also build Kubernetes-native storage operators and self-service platforms that provide automated provisioning, strict multi-tenancy, performance isolation, and quota enforcement at cluster scale. Day-to-day, you’ll optimize end-to-end data paths for 10-50 GB/s per node, design multi-tier caching architectures, implement intelligent prefetching and model-weight distribution, and tune parallel filesystems for AI workloads. </p>
<h3><strong>Responsibilities</strong></h3>
<ul>
<li>Architect and implement the technical strategy and storage roadmap for Together AI, driving high-performance architectural decisions as we scale our GPU fleet.</li>
<li>Engineer and scale multi-petabyte AI/ML storage systems by integrating Vast, Weka, and Ceph, while reducing cost through automated tiering and lifecycle policies.</li>
<li>Develop intelligent caching and tiered storage architectures to achieve extreme IOPS and cluster-wide throughput at GPU scale for training and inference workloads.</li>
<li>Tune storage isolation at the L2/L3 network layers to ensure secure, production-grade multi-tenancy for storage clients.</li>
<li>Build Kubernetes storage operators and controllers to enable automated provisioning, self-service abstractions, and quota enforcement.</li>
<li>Engineer end-to-end data paths to achieve 10+ GB/s per GPU node; architect multi-tier caching for model weights and datasets; tune parallel filesystems using advance