Hunyuan Large Model Infra Stability Expert (Shenzhen/Beijing/Shanghai/Hangzhou)

Tencent

Full-time | Posted: Nov 19, 2025

Job Description

📋 Job Overview

The Hunyuan Large Model Infra Stability Expert leads the construction of high-availability infrastructure for Tencent's Hunyuan large model, ensuring core pipeline stability and supporting efficient large-scale training tasks. Responsibilities include designing observability platforms, developing intelligent fault detection systems, and optimizing task resumption mechanisms to improve fault tolerance and resource utilization. The role also involves tracking cutting-edge technologies and driving infrastructure evolution. The position is listed for Shenzhen, Beijing, Shanghai, and Hangzhou.

📍 Location: Shenzhen, China

🏢 Business Unit: TEG

📄 Full Description

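1. Lead the high-availability program for Hunyuan large model infrastructure: own the stability strategy and its execution for the core pipeline, define stability SLAs and drive their achievement, and keep large-scale training tasks running continuously and efficiently.
2. Lead cross-module technical coordination across frameworks, compute, networking, and storage; design and deploy an end-to-end collection system for key metrics; and build an observability platform covering the full training lifecycle so that problems are detected and localized early.
3. Lead R&D of an intelligent platform for detecting faulty nodes and slow nodes; tackle node anomaly identification and root cause analysis in large-scale clusters; and establish automated fault isolation and recovery mechanisms that significantly reduce the impact of faults on training tasks.
4. Own the architecture design and technical breakthroughs for the automatic training resumption system, a core capability of the Hunyuan one-stop platform; solve key problems such as distributed training state consistency and efficient resumption from checkpoints; and improve task fault tolerance and resource utilization.
5. As a technical expert, respond to and resolve complex faults and performance bottlenecks in large model training; distill fault-handling methodologies and best practices into reusable technical assets that strengthen the team.
6. Track frontier industry developments (such as new accelerator chips, distributed training frameworks, and low-latency networking technologies), lead technology pre-research and adoption, and drive continuous evolution of the infrastructure architecture.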

🎯 Key Responsibilities

Lead the high-availability effort for Hunyuan large model infrastructure: plan and execute the stability strategy for the core pipeline, define stability SLAs and drive them to completion, and support continuous, efficient operation of large-scale training tasks.
Coordinate technical work across frameworks, compute, networking, and storage; design and implement end-to-end collection of key metrics; build an observability platform covering the entire training lifecycle to detect and localize problems early.
Lead R&D of an intelligent detection platform for faulty and slow nodes; address node anomaly identification and root cause analysis in large-scale clusters; establish automated fault isolation and recovery mechanisms that significantly reduce the impact of faults on training tasks.
Own the architecture design and technical breakthroughs of the automatic task resumption system, a core capability of the Hunyuan one-stop platform; solve key issues such as distributed training state consistency and checkpoint resumption efficiency; improve task fault tolerance and resource utilization.
As a technical expert, respond to and resolve complex faults and performance bottlenecks in large model training; capture fault-handling methodologies and best practices as technical assets that strengthen the team.
Track frontier industry developments (such as new accelerator chips, distributed training frameworks, and low-latency network technologies), lead technology pre-research and adoption, and drive continuous evolution of the infrastructure architecture.

🛠️ Required Skills

Expertise in large-scale distributed systems and infrastructure stability
Knowledge of computing power, networks, storage, and cross-module collaboration
Experience in metrics collection, observability platforms, and fault detection
Skills in automated fault isolation, recovery mechanisms, and root cause analysis
Proficiency in distributed training, state consistency, and efficiency optimization
Ability to handle complex faults, performance bottlenecks, and best practices
Tracking and implementing frontier technologies like acceleration chips and low-latency networks

Locations

Shenzhen, China

Salary

Estimated Salary Range (medium confidence)

400,000 - 800,000 CNY per year

Source: AI estimate

* This is an estimated range based on market data and may vary based on experience and qualifications.

Skills Required

Expertise in large-scale distributed systems and infrastructure stability (intermediate)
Knowledge of computing power, networks, storage, and cross-module collaboration (intermediate)
Experience in metrics collection, observability platforms, and fault detection (intermediate)
Skills in automated fault isolation, recovery mechanisms, and root cause analysis (intermediate)
Proficiency in distributed training, state consistency, and efficiency optimization (intermediate)
Ability to handle complex faults, performance bottlenecks, and best practices (intermediate)
Tracking and implementing frontier technologies like acceleration chips and low-latency networks (intermediate)

Tags & Categories

Tencent, Shenzhen, China, TEG
