
High-Performance Computing Engineer (Shenzhen or Beijing)

Tencent

Software and Technology Jobs


Full-time · Posted: Dec 9, 2025

Job Description


📋 Job Overview

Tencent is seeking a High-Performance Computing Engineer to lead advanced optimizations for large-scale LLMs in Shenzhen or Beijing. The role focuses on developing extreme-performance techniques for hundred-billion-parameter models, including kernel-level optimizations for inference frameworks such as vLLM and TensorRT-LLM. Responsibilities also span low-bit quantization, unified multi-modal architectures, heterogeneous hardware adaptation, and innovative operator design to push LLM inference to hardware limits.

📍 Location: Shanghai, China

🏢 Business Unit: CSIG

📄 Full Description

1. Ultra-large-scale LLM performance engineering: lead and plan the extreme-performance optimization roadmap for models at the hundred-billion-parameter scale. Own the deep customization and production-grade architecture design of core scheduling strategies such as PagedAttention and continuous batching, and the kernel-level optimization and deployment of mainstream inference frameworks such as vLLM/TensorRT-LLM.
2. Low-bit and sparse model optimization: drive the industrial-grade, systematic deployment of cutting-edge low-bit quantization techniques such as INT4/FP8/AWQ, balancing accuracy against compute efficiency; design optimization schemes for distributed scheduling, routing, GPU memory management, and cross-device communication for MoE models.
3. Unified and multi-modal architecture: define and design a unified AI inference engine architecture with long-term extensibility to support autoregressive generation, and proactively solve the collaborative inference deployment challenges of multi-modal large models (e.g., vision-language models).
4. Heterogeneous compute and domestic-chip adaptation: lead the strategic porting, ecosystem adaptation, and performance optimization of the inference engine on domestic AI chip platforms (e.g., Ascend, Hygon, TianShu); deeply optimize and customize communication primitives such as HCCL/NCCL to achieve self-reliant, controllable compute across heterogeneous architectures.
5. Core operator optimization and instruction-architecture innovation (enhanced focus): work deep in the GPU/NPU hardware stack to lead the design and implementation of LLM-specific high-performance operators, focusing on high-performance attention kernels, deep customization and fusion of matrix multiplication (GEMM), and KV cache read/write optimization.
6. Deep understanding and use of hardware instruction set architecture (ISA) and microarchitecture: apply SIMD/SIMT instruction optimization, instruction-level parallelism (ILP), and register reuse via CUDA/Triton or low-level programming languages for domestic chips, pushing LLM inference performance toward the hardware's theoretical limit.
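To make the PagedAttention item above concrete: the core idea is that the KV cache is carved into fixed-size physical blocks, and each sequence keeps a "block table" mapping logical token positions to those blocks, so memory is allocated on demand rather than pre-reserved per sequence. The sketch below illustrates only that bookkeeping; the names (`BLOCK_SIZE`, `KVCachePool`) are illustrative and not vLLM's actual API.

```python
BLOCK_SIZE = 4  # tokens per physical block (illustrative; vLLM defaults to 16)

class KVCachePool:
    """Minimal block-table bookkeeping behind a paged KV cache."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # free physical block ids
        self.tables = {}                     # seq_id -> list of block ids
        self.lengths = {}                    # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve cache space for one new token of a sequence."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % BLOCK_SIZE == 0:              # current block is full: grab a new one
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def slot(self, seq_id, pos):
        """Physical (block_id, offset) for a logical token position."""
        block = self.tables[seq_id][pos // BLOCK_SIZE]
        return block, pos % BLOCK_SIZE

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because finished sequences return their blocks to a shared pool, many sequences of varying length can share one cache with little fragmentation, which is what makes continuous batching effective.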

🎯 Key Responsibilities

  • Lead and plan the extreme-performance optimization roadmap for large models at the hundred-billion-parameter scale
  • Own the deep customization and production-grade architecture design of core scheduling strategies such as PagedAttention and continuous batching
  • Handle kernel-level optimization and implementation of mainstream inference frameworks such as vLLM/TensorRT-LLM
  • Lead industrial-grade systematic implementation of cutting-edge low-bit quantization technologies like INT4/FP8/AWQ, balancing accuracy and computational efficiency
  • Design optimization schemes for distributed scheduling, routing, memory management, and cross-card communication for MoE models
  • Define and design a unified AI inference engine architecture with long-term scalability to support autoregressive generation tasks
  • Proactively address collaborative inference deployment challenges for multi-modal large models (e.g., vision-language models)
  • Lead strategic porting, ecosystem adaptation, and performance optimization of inference engines on domestic AI chips (e.g., Ascend, Hygon, TianShu)
  • Deeply optimize and customize communication primitives like HCCL/NCCL to achieve autonomous and controllable computing power across heterogeneous architectures
  • Work deep in the GPU/NPU hardware stack to lead the design and implementation of LLM-specific high-performance operators
  • Focus on high-performance Attention Kernel, deep customization and fusion of matrix multiplication (GEMM), and KV Cache read/write optimization
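The accuracy-versus-efficiency trade-off named in the quantization responsibility can be sketched in its simplest form: symmetric per-tensor INT4 quantization maps floats onto integers in [-8, 7] with one shared scale, bounding the round-trip error by half a quantization step. Production techniques (AWQ, FP8 paths) use per-channel or per-group scales and calibration; this is only the core idea, with hypothetical function names.

```python
def quantize_int4(weights):
    """Map floats to integers in [-8, 7] with one shared scale."""
    scale = max(abs(w) for w in weights) / 7.0 or 1.0  # guard against all-zero input
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int4(q, scale):
    return [x * scale for x in q]

weights = [0.12, -0.5, 0.33, 0.7, -0.07]
q, scale = quantize_int4(weights)
restored = dequantize_int4(q, scale)
# Round-trip error is bounded by half a quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The engineering work the role describes is in keeping that error bound tight where it matters (per-group scales, outlier-aware calibration) while the 4-bit representation cuts memory traffic roughly 4x versus FP16.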

✅ Required Qualifications

  • Deep understanding and utilization of hardware instruction set architecture (ISA) and microarchitecture
  • Ability to perform SIMD/SIMT instruction optimization, instruction-level parallelism (ILP), and register reuse through CUDA/Triton or domestic chip low-level programming languages

🛠️ Required Skills

  • Expertise in PagedAttention and continuous batching
  • Proficiency in vLLM and TensorRT-LLM inference frameworks
  • Knowledge of INT4/FP8/AWQ low-bit quantization techniques
  • Experience with MoE model distributed scheduling and optimization
  • Skills in multi-modal AI inference deployment
  • Adaptation to domestic AI chips like Ascend, Hygon, TianShu
  • Optimization of HCCL/NCCL communication primitives
  • Design of high-performance operators like Attention Kernel and GEMM
  • Hardware-level programming with CUDA/Triton or equivalent for domestic chips
  • SIMD/SIMT instruction optimization and ILP
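The MoE scheduling skill above rests on a small routing primitive: the router scores every expert per token, keeps the top-k, and renormalizes their weights. A minimal sketch of that gating step, assuming a plain softmax router (production systems add capacity limits, load balancing, and cross-device dispatch, all omitted here):

```python
import math

def top_k_route(logits, k=2):
    """Return [(expert_id, weight)] for the k highest-scoring experts."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]       # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]     # renormalize over the top-k
```

The distributed-scheduling problem the posting describes starts here: once each token picks its experts, tokens must be grouped and dispatched to whichever devices host those experts, and the results gathered back, which is where the HCCL/NCCL communication optimization also listed above comes in.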

Locations

  • Shanghai, China

Salary

Estimated Salary Range (medium confidence)

400,000 - 800,000 CNY / yearly

Source: AI-estimated

* This is an estimated range based on market data and may vary based on experience and qualifications.



Tags & Categories

Tencent · Shanghai · China · CSIG


