Software Engineer: Network (C++)
Palo Alto, CA; Seattle, WA · Posted today
ai, infrastructure, ml
<div class="content-intro"><h3><strong><span style="font-family: arial, helvetica, sans-serif;">ABOUT xAI</span></strong></h3> <p><span style="font-family: arial, helvetica, sans-serif;">xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. </span><span style="font-family: arial, helvetica, sans-serif;">Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. </span><span style="font-family: arial, helvetica, sans-serif;">We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. </span><span style="font-family: arial, helvetica, sans-serif;">All employees are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.</span></p></div><p><strong>SOFTWARE ENGINEER, NETWORK C++ (COLOSSUS)</strong></p> <p>At xAI, we design, build, and operate Colossus from the ground up. This includes the massive GPU clusters, high-speed interconnect fabric, and the software that makes it all work at unprecedented scale. Colossus powers Grok and our frontier AI models with a custom, high-performance datacenter network that delivers ultra-low latency and massive bandwidth across hundreds of thousands of GPUs.</p> <p>As a Software Engineer on the Colossus Networking team, you will develop the core networking software that maximizes the performance and reliability of our datacenter fabric. 
Your work will directly impact training efficiency, model convergence, and the speed at which we can push the frontier of AI.</p> <p>Our engineers own the full lifecycle of their software — from design and implementation to deployment, monitoring, and iteration based on real-world performance at scale. You will solve hard problems in distributed systems, high-performance networking, and real-time control of one of the largest AI supercomputers on Earth.</p> <p><strong>RESPONSIBILITIES:</strong></p> <ul> <li>Develop routing and traffic-engineering algorithms for the Colossus high-performance datacenter network.</li> <li>Develop highly reliable, real-time software designed to run on the switches that form the backbone of our low-latency, high-bandwidth AI training fabric.</li> <li>Participate in and lead architecture, design, and code reviews.</li> <li>Develop prototypes and run experiments to validate key design decisions at both small and full-cluster scale.</li> <li>Build tools for software development, deployment, data analysis, visualization, and testing across virtualized environments, hardware-in-the-loop setups, and live production clusters.</li> <li>Deploy reliable software updates through continuous integration and release systems with rigorous testing and monitoring.</li> </ul> <p><strong>BASIC QUALIFICATIONS:</strong></p> <ul> <li>Bachelor’s degree in computer science, engineering, math, or a related technical discipline; OR 2+ years of professional software development experience in lieu of a degree.</li> <li>Strong development experience in C or C++.</li> </ul> <p><strong>PREFERRED SKILLS AND EXPERIENCE:</strong></p> <ul> <li>Strong professional experience writing high-performance C/C++ in production environments.</li> <li>Experience developing, debugging, and deploying software that runs at scale in real-world systems.</li> <li>Deep knowledge of networking protocols (UDP, TCP/IP, RDMA, etc.), distributed systems, and large-scale datacenter 
fabrics.</li> <li>Background in real-time systems, high-performance computing, low-latency networking, or resource-constrained environments.</li> <li>Creative problem-solving ability with exceptional analytical skills and strong engineering fundamentals.</li> <li>Excellent written and verbal communication skills.</li> <li>Ability to thrive in a fast-paced, dynamic environment with evolving requirements.</li> <li>Experience with security considerations in large-scale distributed systems.</li> </ul> <p><strong>ADDITIONAL REQUIREMENTS:</strong></p> <ul> <li>Must be willing to work extended hours and weekends as needed.</li> </ul><div class="content-conclusion"><p><em>xAI is an equal opportunity employer. For details on data processing, view our </em><em><a href="https://x.ai/legal/recruitment-privacy-notice" target="_blank">Recruitment Privacy Notice</a>.</em></p></div>