High-Performance Computing Engineer - Shanghai (高性能计算工程师-上海)

Tencent

Software and Technology Jobs

Full-time · Posted: Nov 30, 2025

Job Description

📋 Job Overview

Tencent is seeking a High-Performance Computing Engineer in Shanghai to lead advanced optimizations for large-scale LLMs and AI inference engines. The role involves designing and implementing cutting-edge techniques for model quantization, distributed scheduling, and hardware-specific adaptations to achieve peak performance. Responsibilities include kernel-level optimizations, heterogeneous chip support, and innovative operator designs to push AI inference to hardware limits.

📍 Location: Shenzhen, China

🏢 Business Unit: CSIG

📄 Full Description

1. Ultra-large-scale LLM performance engineering: Lead and plan the roadmap for extreme performance optimization of hundred-billion-parameter models. Own the deep customization and production-grade architecture design of core scheduling strategies such as PagedAttention and continuous batching (see the sketch after this list), and drive kernel-level optimization and rollout of mainstream inference frameworks such as vLLM/TensorRT-LLM.
2. Low-bit and sparse model optimization: Lead the industrial-grade, systematic adoption of frontier low-bit quantization techniques such as INT4/FP8/AWQ, balancing accuracy against compute efficiency, and design distributed scheduling, routing, GPU-memory management, and cross-GPU communication optimizations for MoE models.
3. Unified and multimodal architecture: Define and design a unified AI inference engine architecture with long-term extensibility that supports autoregressive generation tasks and proactively tackles the joint inference and deployment challenges of multimodal large models (e.g., vision-language models).
4. Heterogeneous compute and domestic-chip adaptation: Lead the strategic porting, ecosystem adaptation, and performance optimization of the inference engine on domestic AI chip platforms (e.g., Ascend, Hygon, TianShu). Deeply optimize and customize communication primitives such as HCCL/NCCL to keep compute self-reliant and controllable across heterogeneous architectures.
5. Core operator optimization and instruction-architecture innovation (enhanced focus): Work deep in the GPU/NPU hardware stack to lead the design and implementation of LLM-specific high-performance operators, focusing on high-performance attention kernels, deeply customized and fused matrix multiplication (GEMM), and KV-cache read/write optimization.
6. Deeply understand and exploit the hardware instruction set architecture (ISA) and microarchitecture: use CUDA/Triton or the low-level programming languages of domestic chips for SIMD/SIMT instruction optimization, instruction-level parallelism (ILP), and register reuse, pushing LLM inference performance toward the hardware's theoretical limit.
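
To make item 1 concrete: PagedAttention manages the KV cache much like an OS manages virtual memory. Each sequence keeps a block table mapping logical token positions to fixed-size physical cache blocks, so memory is committed on demand instead of reserved up front for the maximum sequence length. A minimal Python sketch of the bookkeeping, using illustrative names rather than vLLM's actual API:

```python
BLOCK_SIZE = 16  # tokens per physical KV-cache block (illustrative)

class BlockAllocator:
    """Free-list allocator over a fixed pool of physical blocks."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        if not self.free:
            raise MemoryError("KV cache exhausted; a real engine would preempt")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)

class Sequence:
    """Maps a sequence's logical token positions to physical blocks."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []   # block_table[i] holds tokens [i*16, (i+1)*16)
        self.num_tokens = 0

    def append_token(self):
        # Commit a new physical block only when the last one fills up.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def finish(self):
        self.allocator.release(self.block_table)
        self.block_table = []
```

Real engines layer copy-on-write block sharing and preemption on top of this indirection, but the block table is the core idea.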

🎯 Key Responsibilities

  • Lead and plan the technology roadmap for extreme performance optimization of hundred-billion-parameter models
  • Own deep customization and production-grade architecture design of core scheduling strategies such as PagedAttention and continuous batching
  • Drive kernel-level optimization and rollout of mainstream inference frameworks such as vLLM/TensorRT-LLM
  • Lead industrial-grade, systematic adoption of frontier low-bit quantization techniques such as INT4/FP8/AWQ, balancing accuracy and compute efficiency (see the quantization sketch after this list)
  • Design distributed scheduling, routing, memory-management, and cross-GPU communication optimizations for MoE models
  • Define and design a unified AI inference engine architecture with long-term extensibility to support autoregressive generation tasks
  • Proactively address joint inference and deployment challenges for multimodal large models (e.g., vision-language models)
  • Lead strategic porting, ecosystem adaptation, and performance optimization of inference engines on domestic AI chips (e.g., Ascend, Hygon, TianShu)
  • Deeply optimize and customize communication primitives such as HCCL/NCCL to keep compute self-reliant and controllable across heterogeneous architectures
  • Work deep in the GPU/NPU hardware stack to design and implement LLM-specific high-performance operators
  • Focus on high-performance attention kernels, deeply customized and fused matrix multiplication (GEMM), and KV-cache read/write optimization
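
As referenced in the quantization bullet above, low-bit weight quantization stores weights as 4-bit integers plus a higher-precision scale per small group of weights. The NumPy sketch below shows plain symmetric per-group INT4 quantization; it is a deliberate simplification, since AWQ additionally rescales salient channels using activation statistics before quantizing:

```python
import numpy as np

def quantize_int4_per_group(w: np.ndarray, group_size: int = 128):
    """Symmetric per-group INT4 quantization (simplified; not AWQ itself)."""
    groups = w.reshape(-1, group_size)           # one scale per group of weights
    scale = np.maximum(np.abs(groups).max(axis=1, keepdims=True), 1e-8) / 7.0
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)  # int4 range
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 256).astype(np.float32)
q, scale = quantize_int4_per_group(w)
w_hat = dequantize(q, scale).reshape(w.shape)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

The accuracy/efficiency trade-off named in that bullet shows up directly here: larger groups amortize scale storage but increase quantization error, and vice versa.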

✅ Required Qualifications

  • Deep understanding and utilization of hardware instruction set architecture (ISA) and microarchitecture
  • Experience with CUDA/Triton or the low-level programming languages of domestic chips (see the kernel sketch below)
  • Ability to perform SIMD/SIMT instruction optimization, instruction-level parallelism (ILP), and register reuse
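
To ground the CUDA/Triton qualification, here is the canonical Triton element-wise kernel pattern (essentially the upstream tutorial example, not anything role-specific): each program instance handles one BLOCK_SIZE-wide tile in SIMT fashion, with a mask guarding the ragged tail. Production attention and GEMM kernels build on the same primitives, adding tiling, software pipelining for ILP, and careful register reuse:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # one tile per program
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                    # enough programs to cover n
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```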

🛠️ Required Skills

  • Performance optimization for ultra-large-scale LLMs
  • Core scheduling strategies (PagedAttention, continuous batching; a toy scheduling loop follows this list)
  • Inference frameworks (vLLM, TensorRT-LLM)
  • Low-bit quantization (INT4/FP8/AWQ)
  • MoE model distributed scheduling and optimization
  • Unified and multimodal AI inference engine architecture
  • Heterogeneous computing adaptation for domestic AI chips (Ascend, Hygon, TianShu)
  • Communication primitives (HCCL/NCCL)
  • High-performance operators (Attention Kernel, GEMM, KV Cache)
  • Hardware ISA and microarchitecture utilization
  • CUDA/Triton or domestic chip programming
  • SIMD/SIMT optimization, ILP, register reuse
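
As a rough illustration of the continuous-batching skill listed above (a toy model, not any particular framework's scheduler): instead of waiting for an entire static batch to drain, finished sequences leave the running batch and queued requests are admitted between decode steps, keeping the accelerator saturated under mixed output lengths. Here `Seq` and `step_fn` are hypothetical stand-ins for real requests and one forward pass:

```python
from collections import deque

class Seq:
    def __init__(self, name: str, tokens_left: int):
        self.name, self.tokens_left = name, tokens_left

def step_fn(batch):
    """Fake decode step: every running sequence emits one token."""
    finished = []
    for seq in batch:
        seq.tokens_left -= 1
        if seq.tokens_left == 0:
            finished.append(seq)
    return finished

def continuous_batching_loop(waiting: deque, max_batch: int, step):
    running = []
    while waiting or running:
        # Admit queued requests up to the batch budget before every step,
        # not only when the whole batch drains (static batching).
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        finished = step(running)
        running = [s for s in running if s not in finished]

continuous_batching_loop(
    deque([Seq("a", 3), Seq("b", 1), Seq("c", 2)]), max_batch=2, step=step_fn)
```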

Locations

  • Shenzhen, China

Salary

Estimated Salary Range (medium confidence)

300,000 - 600,000 CNY per year

Source: AI-estimated

* This is an estimated range based on market data and may vary based on experience and qualifications.

Skills Required

  • Performance optimization for ultra-large-scale LLMs (intermediate)
  • Core scheduling strategies (PagedAttention, continuous batching) (intermediate)
  • Inference frameworks (vLLM, TensorRT-LLM) (intermediate)
  • Low-bit quantization (INT4/FP8/AWQ) (intermediate)
  • MoE model distributed scheduling and optimization (intermediate)
  • Unified and multimodal AI inference engine architecture (intermediate)
  • Heterogeneous computing adaptation for domestic AI chips (Ascend, Hygon, TianShu) (intermediate)
  • Communication primitives (HCCL/NCCL) (intermediate)
  • High-performance operators (attention kernels, GEMM, KV cache) (intermediate)
  • Hardware ISA and microarchitecture utilization (intermediate)
  • CUDA/Triton or domestic chip programming (intermediate)
  • SIMD/SIMT optimization, ILP, register reuse (intermediate)

Tags & Categories

Tencent · Shenzhen · China · CSIG
