
Machine Learning Platform Scheduling Engineer (Beijing/Shenzhen)

Tencent

Software and Technology Jobs


Full-time | Posted: Nov 26, 2025

Job Description


📋 Job Overview

The Machine Learning Platform Scheduling Engineer at Tencent leads global resource scheduling for large-scale GPU clusters, raising resource utilization and keeping offline and online tasks running efficiently. The role improves performance by optimizing RDMA networks, distributed storage, and compute resources, and builds high-availability scheduling frameworks on Kubernetes and Docker. It also covers exploring technologies such as hybrid cloud, virtualization, and ARM heterogeneous computing to drive platform innovation.

📍 Location: Beijing, China

🏢 Business Unit: TEG

📄 Full Description

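1. Lead global resource scheduling for GPU clusters at the ten-thousand-card scale, using fine-grained management and optimization strategies to significantly raise resource utilization and keep offline and online tasks running efficiently and stably;
2. Deeply optimize the coordinated scheduling of high-speed RDMA networks, distributed storage, and compute resources, resolving performance bottlenecks in large-scale training jobs and improving overall compute efficiency;
3. Build a high-availability scheduling framework on cloud-native technologies such as Kubernetes and Docker, with full support for distributed training frameworks, delivering task orchestration, disaster recovery, and workload co-location; work in depth on K8s scheduler, CSI plugin, and CRD development to bring large-scale training and inference technology into production;
4. Actively explore frontier areas such as hybrid cloud, virtualization, and ARM heterogeneous computing, continuously driving upgrades and innovation in technology and platform capabilities.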

🎯 Key Responsibilities

  • Lead global resource scheduling for 10,000-card GPU clusters, using fine-grained management and optimization strategies to significantly improve resource utilization and keep offline and online tasks running efficiently and stably.
  • Optimize the coordinated scheduling of RDMA high-speed networks, distributed storage, and compute resources to resolve performance bottlenecks in large-scale training tasks and improve overall compute efficiency.
  • Build high-availability scheduling frameworks on Kubernetes, Docker, and other cloud-native technologies that fully support distributed training frameworks and provide task orchestration, disaster recovery, and workload co-location (mixed deployment); develop K8s schedulers, CSI plugins, and CRDs to bring large-scale training and inference technologies into production.
  • Explore frontier directions such as hybrid cloud, virtualization, and ARM heterogeneous computing to keep upgrading technology and platform capabilities.

🛠️ Required Skills

  • Expertise in GPU cluster resource scheduling and optimization
  • Knowledge of RDMA high-speed networks and distributed storage
  • Proficiency in Kubernetes, Docker, and cloud-native technologies
  • Experience with distributed training frameworks and K8s components (schedulers, CSI plugins, CRDs)
  • Familiarity with hybrid cloud, virtualization, and ARM heterogeneous computing


Salary

Estimated Salary Range (medium confidence)

300,000 - 600,000 CNY per year

Source: AI-estimated

* This is an estimated range based on market data and may vary based on experience and qualifications.

Skills Required

  • Expertise in GPU cluster resource scheduling and optimization (intermediate)
  • Knowledge of RDMA high-speed networks and distributed storage (intermediate)
  • Proficiency in Kubernetes, Docker, and cloud-native technologies (intermediate)
  • Experience with distributed training frameworks and K8s components (schedulers, CSI plugins, CRDs) (intermediate)
  • Familiarity with hybrid cloud, virtualization, and ARM heterogeneous computing (intermediate)


