Senior AI Hardware Engineer

Microsoft

full-time

Posted: October 10, 2025

Number of Vacancies: 1

Job Description

Microsoft Silicon, Cloud Hardware, and Infrastructure Engineering (SCHIE) is the team behind Microsoft’s expanding cloud infrastructure and is responsible for powering Microsoft’s “Intelligent Cloud” mission. SCHIE delivers the core infrastructure and foundational technologies for Microsoft's more than 200 online businesses, including Bing, MSN, Office 365, Xbox Live, Teams, OneDrive, and the Microsoft Azure platform, globally, with our server and data center infrastructure, security and compliance, operations, globalization, and manageability solutions. Our focus is on smart growth, high efficiency, and delivering a trusted experience to customers and partners worldwide.

As Microsoft's cloud business continues to grow, the ability to deploy new offerings and hardware infrastructure on time, in high volume, with high quality, and at the lowest cost is of paramount importance. To achieve this goal, the Hardware, Infrastructure Management, and Fundamentals Engineering (HIFE) team is instrumental in defining and delivering operational measures of success for hardware manufacturing and in improving the planning process, quality, delivery, scale, and sustainability of Microsoft cloud hardware.

We are looking for a Senior AI Hardware Engineer with a passion for customer-focused solutions and the insight and industry knowledge to envision and implement future technical solutions that will manage and optimize the cloud infrastructure.

Locations

Redmond, Washington, United States
Austin, Texas, United States
Aliso Viejo, California, United States
Boise, Idaho, United States
Raleigh, North Carolina, United States
Mountain View, California, United States
San Jose, California, United States
Sunnyvale, California, United States
San Diego, California, United States

Salary

Salary not disclosed

Required Qualifications

Master's Degree in Electrical Engineering, Computer Engineering, or related field AND 3+ years of technical engineering experience OR Bachelor's Degree in Electrical Engineering, Computer Engineering, or related field AND 5+ years of technical engineering experience OR equivalent experience.
4+ years of work experience managing product quality in the electronics industry.
4+ years of direct engineering experience in hardware system issue resolution for GPU servers.
Versed in filtering through applicable debug data, such as telemetry and logs, to identify and investigate hardware failure signatures.

Preferred Qualifications

Bachelor's Degree in electrical and systems engineering, or related field AND 7+ years of experience in a large-scale manufacturing and/or data center environment/repair OR Master's Degree in electrical, systems engineering, or related field AND 6+ years of experience in a complex manufacturing environment OR 9+ years of equivalent experience.
Patent or track record of engineering excellence.
Experience with liquid cooling systems in data centers.
12+ years of experience working with modern server architectures, including an understanding of GPU and CPU methods for failure analysis, debugging, or validation.
8+ years of system-level server debugging with an understanding of platform, power, system, and network environments.
3+ years of direct GPU-related engineering experience in issue debug and test log review.
Leadership skills and the ability to collaborate with diverse teams and drive a call to action.
Proficiency in root cause analysis and corrective action methods to identify contributing factors of production defects.
Ability to analyze large data sets, extract key insights, and effectively present and communicate the results.

Responsibilities

Develop and implement a robust supplier quality management strategy to ensure data center hardware is manufactured to the highest quality standards.
Lead the quality issue and improvement task force to contain, mitigate, and resolve the top quality issues impacting global data centers.
Conduct debug and failure analysis for GPU subsystems in the Azure fleet and drive resolution with partners and suppliers.
Drive the continuous improvement process based on Root Cause Analysis (RCA) and identified opportunities.
Deliver quality readouts based on your telemetry data analysis to bring clarity on status, actions across the organization, and next steps for issue resolution.
Establish critical-to-quality performance metrics to measure and improve product quality.
Act as the voice of quality in the hardware change management process, ensuring quality requirements are considered, met, and improved.

Travel Requirements

3 days per week in office
