
Research Scientist – Speech and Audio Understanding (Large Models & Multimodal Systems)

Tencent

Full-time · Posted: Oct 22, 2025

Job Description

📋 Job Overview

Tencent is seeking a Research Scientist to join its core research team in Bellevue, Washington, focusing on speech and audio understanding within large-scale multimodal model systems that integrate vision, audio, and text for comprehensive perception of the physical world. The role involves developing end-to-end large speech models for tasks such as multilingual ASR, speech translation, speech synthesis, and general audio understanding, as well as advancing speech representation learning, exploring multimodal alignment, and building high-quality multimodal speech datasets.

📍 Location: Bellevue, Washington, United States

🏢 Business Unit: TEG

📄 Full Description

What the Role Entails
Job Responsibilities:
We are building large-scale, native multimodal model systems that jointly support vision, audio, and text to enable comprehensive perception and understanding of the physical world. You will join the core research team focused on speech and audio, contributing to the following key research areas:

  • Develop general-purpose, end-to-end large speech models covering multilingual automatic speech recognition (ASR), speech translation, speech synthesis, paralinguistic understanding, and general audio understanding.
  • Advance research on speech representation learning and encoder/decoder architectures to build unified acoustic representations for multi-task and multimodal applications.
  • Explore representation alignment and fusion mechanisms between audio/speech and other modalities in large multimodal models, enabling joint modeling with image and text.
  • Build and maintain high-quality multimodal speech datasets, including automatic annotation and data synthesis technologies.

Who We Look For
  • Ph.D. in Computer Science, Electrical Engineering, Artificial Intelligence, Linguistics, or a related field; or a Master’s degree with several years of relevant experience.
  • Solid understanding of speech and audio signal processing, acoustic modeling, language modeling, and large model architectures.
  • Proficient in one or more core speech system development pipelines such as ASR, TTS, or speech translation; experience with multilingual, multitask, or end-to-end systems is a plus.
  • Candidates with in-depth research or practical experience in the following areas are strongly preferred:
      • Speech representation pretraining (e.g., HuBERT, Wav2Vec, Whisper)
      • Multimodal alignment and cross-modal modeling (e.g., audio-visual-text)
      • Driving state-of-the-art (SOTA) performance on audio understanding tasks with large models
  • Proficient in deep learning frameworks such as PyTorch or TensorFlow; experience with large-scale training and distributed systems is a plus.
  • Familiar with Transformer-based architectures and their applications in speech and multimodal training/inference.
Location State(s): US-Washington-Bellevue
The expected base pay range for this position in the location(s) listed above is $122,500.00 to $229,700.00 per year. Actual pay may vary depending on job-related knowledge, skills, and experience.
Employees hired for this position may be eligible for a sign-on payment, relocation package, and restricted stock units, which will be evaluated on a case-by-case basis.
Subject to the terms and conditions of the plans in effect, hired applicants are also eligible for medical, dental, vision, life, and disability benefits, and participation in the Company’s 401(k) plan. Employees are also eligible for 15 to 25 days of vacation per year (depending on tenure), up to 13 days of holidays throughout the calendar year, and up to 10 days of paid sick leave per year.
Your benefits may be adjusted to reflect your location, employment status, duration of employment with the company, and position level. Benefits may also be pro-rated for those who start working during the calendar year.

Equal Employment Opportunity at Tencent
As an equal opportunity employer, we firmly believe that diverse voices fuel our innovation and allow us to better serve our users and the community. We foster an environment where every employee of Tencent feels supported and inspired to achieve individual and common goals.
Work Location: US-Washington-Bellevue

🎯 Key Responsibilities

  • Develop general-purpose, end-to-end large speech models covering multilingual automatic speech recognition (ASR), speech translation, speech synthesis, paralinguistic understanding, and general audio understanding
  • Advance research on speech representation learning and encoder/decoder architectures to build unified acoustic representations for multi-task and multimodal applications
  • Explore representation alignment and fusion mechanisms between audio/speech and other modalities in large multimodal models, enabling joint modeling with image and text
  • Build and maintain high-quality multimodal speech datasets, including automatic annotation and data synthesis technologies

✅ Required Qualifications

  • Ph.D. in Computer Science, Electrical Engineering, Artificial Intelligence, Linguistics, or a related field, or a Master’s degree with several years of relevant experience
  • Solid understanding of speech and audio signal processing, acoustic modeling, language modeling, and large model architectures
  • Proficient in one or more core speech system development pipelines such as ASR, TTS, or speech translation

⭐ Preferred Qualifications

  • Experience with multilingual, multitask, or end-to-end systems
  • In-depth research or practical experience in speech representation pretraining (e.g., HuBERT, Wav2Vec, Whisper)
  • Multimodal alignment and cross-modal modeling (e.g., audio-visual-text)
  • Experience driving state-of-the-art (SOTA) performance on audio understanding tasks with large models
  • Experience with large-scale training and distributed systems

🛠️ Required Skills

  • Solid understanding of speech and audio signal processing, acoustic modeling, language modeling, and large model architectures
  • Proficiency in core speech system development pipelines such as ASR, TTS, or speech translation
  • Proficiency in deep learning frameworks such as PyTorch or TensorFlow
  • Familiarity with Transformer-based architectures and their applications in speech and multimodal training/inference

🎁 Benefits

  • Competitive base pay range of $122,500 to $229,700 per year
  • Eligibility for sign-on payment
  • Relocation package
  • Restricted stock units
  • Medical, dental, vision, life and disability benefits
  • Participation in the Company’s 401(k) plan
  • Up to 15 to 25 days of vacation per year (depending on tenure)
  • Up to 13 days of holidays per year
  • Up to 10 days of paid sick leave per year

