Job Description
<p>As an AI Infrastructure Engineer at Together, you are responsible for keeping all user-facing services and production systems running smoothly. You are a blend of pragmatic operator and software engineer who applies sound engineering principles, operational discipline, and mature automation to our operating environments and codebase.</p>
<p>You specialize in systems (operating systems, storage subsystems, networking) and implement best practices for availability, reliability, and scalability, with broad interests in algorithms and distributed systems.</p>
<p><strong>Responsibilities</strong></p>
<ul>
<li>Participate in the on-call rotation (PagerDuty) to respond to production incidents</li>
<li>Build and run our infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users</li>
<li>Build monitoring systems to ensure the highest quality service for our customers</li>
<li>Design and implement operational processes (such as deployments and upgrades)</li>
<li>Debug production issues across all services and levels of the stack</li>
<li>Identify improvements for the product architecture from the reliability, performance and availability perspectives</li>
<li>Plan the growth of Together AI's infrastructure</li>
</ul>
<p><strong>Requirements</strong></p>
<ul>
<li>5+ years of professional AI infrastructure or related experience</li>
<li>Bachelor's degree in Computer Science or a related field, or equivalent work experience</li>
<li>Knowledge of Ansible (roles, playbooks), Terraform, and Kubernetes</li>
<li>Proficiency in programming/scripting languages</li>
<li>Direct experience in monitoring and observability practices</li>
<li>Knowledge of cloud services</li>
<li>Ability to thrive in a collaborative environment involving different stakeholders and subject matter experts</li>
</ul>
<p><strong>About Together AI</strong></p>
<p>Together AI is a research-driven artificial intelligence company. We believe open and transparent A