Microsoft Silicon, Cloud Hardware, and Infrastructure Engineering (SCHIE) is the team behind Microsoft's expanding cloud infrastructure and is responsible for powering Microsoft's "Intelligent Cloud" mission. SCHIE delivers the core infrastructure and foundational technologies for more than 200 of Microsoft's online businesses, including Bing, MSN, Office 365, Xbox Live, Teams, OneDrive, and the Microsoft Azure platform, through our global server and data center infrastructure, security and compliance, operations, globalization, and manageability solutions. Our focus is on smart growth, high efficiency, and delivering a trusted experience to customers and partners worldwide, and we are looking for passionate engineers to help achieve that mission.

As Microsoft's cloud business continues to grow, the ability to deploy new offerings and hardware infrastructure on time, in high volume, with high quality, and at the lowest cost is of paramount importance. To achieve this goal, the Cloud Hardware Systems Engineering (CHSE) team is instrumental in defining and delivering operational measures of success for hardware manufacturing and in improving the planning process, quality, delivery, scale, and sustainability of Microsoft cloud hardware. We are looking for seasoned engineers with a dedicated passion for customer-focused solutions, insight, and industry knowledge to envision and implement future technical solutions that will manage and optimize the cloud infrastructure. We are looking for a Principal AI Network Architect to join the team.
Locations
Redmond, Washington, United States
Austin, Texas, United States
San Diego, California, United States
Aliso Viejo, California, United States
Boise, Idaho, United States
Hillsboro, Oregon, United States
Raleigh, North Carolina, United States
Mountain View, California, United States
Santa Clara, California, United States
San Jose, California, United States
Salary
Salary not disclosed
Required Qualifications
Bachelor's Degree in Electrical Engineering, Computer Engineering, Mechanical Engineering, or related field AND 8+ years technical engineering experience
OR Master's Degree in Electrical Engineering, Computer Engineering, Mechanical Engineering, or related field AND 7+ years technical engineering experience
OR equivalent experience
5+ years of experience in designing AI backend networks and integrating them into large-scale GPU systems.
Preferred Qualifications
Proven expertise in system architecture across compute, networking, and accelerator domains.
Deep understanding of RDMA protocols (RoCE, InfiniBand), congestion control (DCQCN), and Layer 2/3 routing.
Experience with optical interconnects (e.g., PSM, WDM), link budget analysis, and transceiver integration.
Familiarity with signal integrity modeling, link training, and physical layer optimization.
Experience architecting backend networks for AI training and inference workloads, including Hamiltonian cycle traffic and collective operations (e.g., all-reduce, all-gather).
Hands-on design of high-radix switches (≥400 Gbps per port), orthogonal chassis, and cabled backplanes.
Knowledge of chip-to-chip and chip-to-module interfaces, including error correction and equalization techniques.
Experience with custom NIC IPs and transport layers for secure, reliable packet delivery.
Familiarity with AI model execution pipelines and their impact on pod-level network design and latency SLAs.
Prior contributions to hyperscale deployments or cloud-scale AI infrastructure programs.
Responsibilities
Technology Leadership: Spearhead architectural definition and innovation for next-generation GPU and AI accelerator platforms, with a focus on ultra-high-bandwidth, low-latency backend networks. Drive system-level integration across compute, storage, and interconnect domains to support scalable AI training workloads.
Cross-Functional Collaboration: Partner with silicon, firmware, and datacenter engineering teams to co-design infrastructure that meets performance, reliability, and deployment goals. Influence platform decisions across rack-, chassis-, and pod-level implementations.
Technology Partnerships: Cultivate deep technical relationships with silicon vendors, optics suppliers, and switch fabric providers to co-develop differentiated solutions. Represent Microsoft in joint architecture forums and technical workshops.
Architectural Clarity: Evaluate and articulate tradeoffs across electrical, mechanical, thermal, and signal integrity domains. Frame decisions in terms of TCO, performance, scalability, and deployment risk. Lead design reviews and contribute to PRDs and system specifications.
Industry Influence: Shape the direction of hyperscale AI infrastructure by engaging with standards bodies (e.g., IEEE 802.3), influencing component roadmaps, and driving adoption of novel interconnect protocols and topologies.