
Hunyuan Large Model Infra Stability Expert (Shenzhen/Beijing/Shanghai/Hangzhou)

Tencent

Software and Technology Jobs

Full-time, posted Nov 19, 2025

Job Description

📋 Job Overview

The Hunyuan Large Model Infra Stability Expert leads the construction of high-availability infrastructure for Tencent's Hunyuan large model, keeping the core pipeline stable so that large-scale training tasks run efficiently. Responsibilities include designing observability platforms, developing intelligent fault-detection systems, and optimizing automatic task-resumption mechanisms to improve fault tolerance and resource utilization. The role also requires tracking cutting-edge technologies and driving infrastructure evolution. The position is open in Shenzhen, Beijing, Shanghai, and Hangzhou.

📍 Location: Shenzhen, China

🏢 Business Unit: TEG

📄 Full Description

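1. Lead the construction of the high-availability system for Hunyuan large model infrastructure; own the strategic planning and implementation of core-pipeline stability; define stability SLAs and drive their achievement; support the continuous, efficient operation of large-scale training tasks.
2. Lead cross-module technical coordination across frameworks, compute, network, and storage; design and implement an end-to-end collection system for key metrics; build an observability platform covering the full training lifecycle so that problems are detected and located early.
3. Lead R&D of an intelligent detection platform for faulty and slow nodes; tackle node anomaly identification and root cause analysis in large-scale clusters; establish automated fault isolation and recovery mechanisms that significantly reduce the impact of faults on training tasks.
4. Own the architecture design and technical breakthroughs for automatic task resumption, a core capability of the Hunyuan one-stop platform; solve key problems such as distributed training state consistency and the efficiency of resuming from checkpoints; improve task fault tolerance and resource utilization.
5. As a technical expert, respond to and resolve complex faults and performance bottlenecks in large model training; distill fault-handling methodologies and best practices into technical assets that empower the team.
6. Track frontier industry technology (such as new accelerator chips, distributed training frameworks, and low-latency network technologies); lead technology pre-research and adoption; drive the continuous evolution of the infrastructure architecture.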

🎯 Key Responsibilities

  • Lead the build-out of a high-availability system for the Hunyuan large model infrastructure: plan and implement the core-pipeline stability strategy, define stability SLAs and drive their achievement, and keep large-scale training tasks running continuously and efficiently.
  • Lead cross-module collaboration across frameworks, compute, networking, and storage; design and implement end-to-end collection of key metrics; and build an observability platform covering the entire training lifecycle so that problems are found and located early.
  • Lead R&D of an intelligent platform for detecting faulty and slow nodes, address node anomaly identification and root cause analysis in large-scale clusters, and establish automated fault isolation and recovery that significantly reduces the impact of faults on training tasks.
  • Own the architecture and technical breakthroughs of the automatic task resumption system, a core capability of the Hunyuan one-stop platform; resolve key issues such as distributed training state consistency and checkpoint resumption efficiency; and improve task fault tolerance and resource utilization.
  • As a technical expert, respond to and resolve complex faults and performance bottlenecks in large model training, and distill fault-handling methodologies and best practices into technical assets that empower the team.
  • Track frontier industry technologies (such as new accelerator chips, distributed training frameworks, and low-latency networking), lead technical pre-research and adoption, and drive the continuous evolution of the infrastructure architecture.

🛠️ Required Skills

  • Expertise in large-scale distributed systems and infrastructure stability
  • Knowledge of computing power, networks, storage, and cross-module collaboration
  • Experience in metrics collection, observability platforms, and fault detection
  • Skills in automated fault isolation, recovery mechanisms, and root cause analysis
  • Proficiency in distributed training, state consistency, and efficiency optimization
  • Ability to resolve complex faults and performance bottlenecks and to distill best practices
  • Ability to track and adopt frontier technologies such as accelerator chips and low-latency networks

Salary

Estimated Salary Range (medium confidence)

400,000 - 800,000 CNY per year

Source: AI estimate

* This is an estimated range based on market data and may vary based on experience and qualifications.
