
Taiji GPU Intelligent Scheduling R&D Engineer (Shenzhen/Beijing/Shanghai/Hangzhou)

Tencent

Full-time | Posted: Nov 18, 2025

Job Description

📋 Job Overview

The Taiji GPU Intelligent Scheduling R&D Engineer role focuses on leading the development and optimization of large-scale GPU cluster scheduling systems to enhance resource utilization and efficiency in AI training tasks. Responsibilities include optimizing network, storage, and compute synergies, building high-availability frameworks using cloud-native technologies, and exploring advanced areas like hybrid cloud and heterogeneous computing. This position is based in Shenzhen, Beijing, Shanghai, or Hangzhou, supporting Tencent's cutting-edge distributed training infrastructure.

📍 Location: Shenzhen, China

🏢 Business Unit: TEG

📄 Full Description

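1. Lead global resource scheduling for GPU clusters at the ten-thousand-card scale; through fine-grained management and optimization strategies, significantly improve resource utilization and ensure efficient, stable operation of offline and online tasks.
2. Deeply optimize the coordinated scheduling of RDMA high-speed networks, distributed storage, and compute resources to resolve performance bottlenecks in large-scale training tasks and improve overall compute efficiency.
3. Build a high-availability scheduling framework on cloud-native technologies such as Kubernetes and Docker, with full support for distributed training frameworks, delivering task orchestration, disaster recovery, and co-location capabilities; work in depth on the development of K8s schedulers, CSI plugins, and CRDs to bring large-scale training and inference technology into production.
4. Actively explore frontier directions such as hybrid cloud, virtualization, and ARM heterogeneous computing, and continuously drive progress.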

🎯 Key Responsibilities

  • Lead global resource scheduling for 10,000-card GPU clusters through refined management and optimization strategies to significantly improve resource utilization and ensure efficient, stable operation of offline and online tasks.
  • Deeply optimize the collaborative scheduling of RDMA high-speed networks, distributed storage, and computing resources to effectively resolve performance bottlenecks in large-scale training tasks and enhance overall computing efficiency.
  • Build high-availability scheduling frameworks based on Kubernetes, Docker, and other cloud-native technologies to fully support distributed training frameworks, enabling task orchestration, disaster recovery, and co-location capabilities, while developing K8s schedulers, CSI plugins, and CRDs to drive the practical implementation of large-scale training and inference technologies.
  • Actively explore frontier directions such as hybrid cloud, virtualization, and ARM heterogeneous computing to continuously advance innovations.

🛠️ Required Skills

  • Expertise in GPU cluster resource scheduling and optimization
  • Knowledge of RDMA networks, distributed storage, and compute resource management
  • Proficiency in Kubernetes, Docker, and cloud-native technologies
  • Experience with distributed training frameworks, K8s schedulers, CSI plugins, and CRDs
  • Familiarity with hybrid cloud, virtualization, and ARM heterogeneous computing

Locations

  • Shenzhen, China

Salary

Estimated Salary Range (medium confidence)

300,000 - 600,000 CNY per year

Source: AI-estimated

* This is an estimated range based on market data and may vary based on experience and qualifications.

Skills Required

  • Expertise in GPU cluster resource scheduling and optimization (intermediate)
  • Knowledge of RDMA networks, distributed storage, and compute resource management (intermediate)
  • Proficiency in Kubernetes, Docker, and cloud-native technologies (intermediate)
  • Experience with distributed training frameworks, K8s schedulers, CSI plugins, and CRDs (intermediate)
  • Familiarity with hybrid cloud, virtualization, and ARM heterogeneous computing (intermediate)

Tags & Categories

Tencent, Shenzhen, China, TEG
