
High-Performance Computing Engineer (Shenzhen or Beijing)

Tencent

Software and Technology Jobs


Full-time · Posted: Dec 9, 2025

Job Description


📋 Job Overview

Tencent is seeking a High-Performance Computing Engineer to lead advanced optimizations for large-scale LLMs in Shenzhen or Beijing. The role focuses on developing extreme-performance techniques for hundred-billion-parameter models, including kernel-level optimizations for inference frameworks such as vLLM and TensorRT-LLM. Responsibilities also span low-bit quantization, unified multi-modal architectures, heterogeneous hardware adaptation, and innovative operator design to push LLM inference to hardware limits.

📍 Location: Shanghai, China

🏢 Business Unit: CSIG

📄 Full Description

1. Ultra-large-scale LLM performance engineering: lead and plan the extreme-performance optimization roadmap for models at the hundred-billion-parameter scale. Own the deep customization and production-grade architecture design of core scheduling strategies such as PagedAttention and continuous batching, and the kernel-level optimization and deployment of mainstream inference frameworks such as vLLM/TensorRT-LLM.
2. Low-bit and sparse model optimization: drive the industrial-grade, systematic deployment of cutting-edge low-bit quantization techniques such as INT4/FP8/AWQ, balancing accuracy against compute efficiency; design optimization schemes for distributed scheduling, routing, GPU memory management, and cross-device communication for MoE models.
3. Unified and multi-modal architecture: define and design a unified AI inference engine architecture with long-term extensibility to support autoregressive generation, and proactively solve the collaborative inference deployment challenges of multi-modal large models (e.g., vision-language models).
4. Heterogeneous compute and domestic-chip adaptation: lead the strategic porting, ecosystem adaptation, and performance optimization of the inference engine on domestic AI chip platforms (e.g., Ascend, Hygon, TianShu); deeply optimize and customize communication primitives such as HCCL/NCCL to achieve self-reliant, controllable compute across heterogeneous architectures.
5. Core operator optimization and instruction-architecture innovation (enhanced focus): work deep in the GPU/NPU hardware stack to lead the design and implementation of LLM-specific high-performance operators, focusing on high-performance attention kernels, deep customization and fusion of matrix multiplication (GEMM), and KV cache read/write optimization.
6. Deep understanding and use of hardware instruction set architecture (ISA) and microarchitecture: apply SIMD/SIMT instruction optimization, instruction-level parallelism (ILP), and register reuse via CUDA/Triton or low-level programming languages for domestic chips, pushing LLM inference performance toward the hardware's theoretical limit.
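To make the PagedAttention item above concrete: the core idea is that the KV cache is carved into fixed-size physical blocks, and each sequence keeps a "block table" mapping logical token positions to those blocks, so memory is allocated on demand rather than pre-reserved per sequence. The sketch below illustrates only that bookkeeping; the names (`BLOCK_SIZE`, `KVCachePool`) are illustrative and not vLLM's actual API.

```python
BLOCK_SIZE = 4  # tokens per physical block (illustrative; vLLM defaults to 16)

class KVCachePool:
    """Minimal block-table bookkeeping behind a paged KV cache."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # free physical block ids
        self.tables = {}                     # seq_id -> list of block ids
        self.lengths = {}                    # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve cache space for one new token of a sequence."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % BLOCK_SIZE == 0:              # current block is full: grab a new one
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def slot(self, seq_id, pos):
        """Physical (block_id, offset) for a logical token position."""
        block = self.tables[seq_id][pos // BLOCK_SIZE]
        return block, pos % BLOCK_SIZE

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because finished sequences return their blocks to a shared pool, many sequences of varying length can share one cache with little fragmentation, which is what makes continuous batching effective.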

🎯 Key Responsibilities

  • Lead and plan the extreme-performance optimization roadmap for large models at the hundred-billion-parameter scale
  • Own the deep customization and production-grade architecture design of core scheduling strategies such as PagedAttention and continuous batching
  • Handle kernel-level optimization and implementation of mainstream inference frameworks such as vLLM/TensorRT-LLM
  • Lead industrial-grade systematic implementation of cutting-edge low-bit quantization technologies like INT4/FP8/AWQ, balancing accuracy and computational efficiency
  • Design optimization schemes for distributed scheduling, routing, memory management, and cross-card communication for MoE models
  • Define and design a unified AI inference engine architecture with long-term scalability to support autoregressive generation tasks
  • Proactively address collaborative inference deployment challenges for multi-modal large models (e.g., vision-language models)
  • Lead strategic porting, ecosystem adaptation, and performance optimization of inference engines on domestic AI chips (e.g., Ascend, Hygon, TianShu)
  • Deeply optimize and customize communication primitives like HCCL/NCCL to achieve autonomous and controllable computing power across heterogeneous architectures
  • Work deep in the GPU/NPU hardware stack to lead the design and implementation of LLM-specific high-performance operators
  • Focus on high-performance Attention Kernel, deep customization and fusion of matrix multiplication (GEMM), and KV Cache read/write optimization
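The accuracy-versus-efficiency trade-off named in the quantization responsibility can be sketched in its simplest form: symmetric per-tensor INT4 quantization maps floats onto integers in [-8, 7] with one shared scale, bounding the round-trip error by half a quantization step. Production techniques (AWQ, FP8 paths) use per-channel or per-group scales and calibration; this is only the core idea, with hypothetical function names.

```python
def quantize_int4(weights):
    """Map floats to integers in [-8, 7] with one shared scale."""
    scale = max(abs(w) for w in weights) / 7.0 or 1.0  # guard against all-zero input
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int4(q, scale):
    return [x * scale for x in q]

weights = [0.12, -0.5, 0.33, 0.7, -0.07]
q, scale = quantize_int4(weights)
restored = dequantize_int4(q, scale)
# Round-trip error is bounded by half a quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The engineering work the role describes is in keeping that error bound tight where it matters (per-group scales, outlier-aware calibration) while the 4-bit representation cuts memory traffic roughly 4x versus FP16.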

✅ Required Qualifications

  • Deep understanding and utilization of hardware instruction set architecture (ISA) and microarchitecture
  • Ability to perform SIMD/SIMT instruction optimization, instruction-level parallelism (ILP), and register reuse through CUDA/Triton or domestic chip low-level programming languages

🛠️ Required Skills

  • Expertise in PagedAttention and continuous batching
  • Proficiency in vLLM and TensorRT-LLM inference frameworks
  • Knowledge of INT4/FP8/AWQ low-bit quantization techniques
  • Experience with MoE model distributed scheduling and optimization
  • Skills in multi-modal AI inference deployment
  • Adaptation to domestic AI chips like Ascend, Hygon, TianShu
  • Optimization of HCCL/NCCL communication primitives
  • Design of high-performance operators like Attention Kernel and GEMM
  • Hardware-level programming with CUDA/Triton or equivalent for domestic chips
  • SIMD/SIMT instruction optimization and ILP
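The MoE scheduling skill above rests on a small routing primitive: the router scores every expert per token, keeps the top-k, and renormalizes their weights. A minimal sketch of that gating step, assuming a plain softmax router (production systems add capacity limits, load balancing, and cross-device dispatch, all omitted here):

```python
import math

def top_k_route(logits, k=2):
    """Return [(expert_id, weight)] for the k highest-scoring experts."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]       # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]     # renormalize over the top-k
```

The distributed-scheduling problem the posting describes starts here: once each token picks its experts, tokens must be grouped and dispatched to whichever devices host those experts, and the results gathered back, which is where the HCCL/NCCL communication optimization also listed above comes in.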

Locations

  • Shanghai, China

Salary

Estimated Salary Range (medium confidence)

400,000 - 800,000 CNY / yearly

Source: AI-estimated

* This is an estimated range based on market data and may vary based on experience and qualifications.



Tags & Categories

Tencent · Shanghai · China · CSIG


