Research Internship- Multimodal LLM (Speech/Music/Audio/Vision/Language) 106334

Tencent

Internship

Posted: Nov 12, 2025

Job Description

📋 Job Overview

Tencent AI Lab in Bellevue, WA, is seeking research interns for 2026 to develop novel techniques in multimodal large language models focusing on speech, music, audio, vision, and language processing. Interns will collaborate with researchers on innovative projects aimed at advancing AI capabilities in perception, cognition, and creativity, with opportunities to publish results. The role involves working on multimodal pretraining, efficient architectures, memory enhancement, and advanced processing techniques for real-world applications.

📍 Location: Bellevue, Washington, United States

🏢 Business Unit: TEG

📄 Full Description

What the Role Entails

About Tencent AI Lab at Seattle Area
Tencent is a leading internet company in China. Tencent AI Lab at Seattle Area was established in May 2017. The lab strives to continuously improve AI's capability in perception, cognition, and creativity. Researchers there aim to solve challenging real-world problems with advanced technologies and publish extensively at top conferences and in journals.

Research Internship: Multimodal LLM (Speech/Music/Audio/Vision/Language)
Tencent AI Lab is dedicated to advancing cutting-edge AI technologies, with a particular focus on innovative breakthroughs in large foundation models. The lab's long-term ambition is to drive the development of Artificial General Intelligence (AGI) and, ultimately, Artificial Superintelligence (ASI). We are seeking research interns for the year 2026 who are interested in developing novel speech/music/audio/vision/language processing techniques and large multimodal models at our Seattle area office in Bellevue, WA.

Every research intern will work with researchers on a research project aimed at attacking one of the core problems by inventing cutting-edge techniques. We encourage discussions and collaborations between researchers and interns. Interns are also encouraged to publish the results from the internship.

Our projects span a wide range of areas, including developing more effective multimodal pretraining and post-training strategies for audio, speech, music, image, and video understanding and generation. We aim to enable fully duplex conversations, design more efficient large-model architectures, enhance multimodal memory and reasoning capabilities, and advance novel audio, speech, music, image, and video processing techniques (such as encoding, tokenization, and representation learning), with a focus on multimodal applications and end-to-end large models.

Who We Look For
Requirements & Qualifications
The ideal intern candidates are those who
are Ph.D. students in computer science, electrical engineering, mathematics, or a related field,
are self-motivated and excited about developing novel techniques,
have research experience in natural language processing; speech, audio, and music processing; computer vision; dialog systems; or machine learning,
have a good publication track record and a history of creativity and intellectual flexibility,
can program skillfully in Python and/or C++ and have experience using one of the leading deep learning toolkits.
Intern duration: 3 months (with the possibility of extension). Can start any time in the year 2026.
Location State(s)
US-Washington-Bellevue
The expected base pay range for this position in the location(s) listed above is $80,169.00 to $120,000.14 per year. Actual pay may vary depending on job-related knowledge, skills, and experience.
This position will be eligible for 1 hour of paid sick leave for every 30 hours worked and up to 13 paid holidays throughout the calendar year. Subject to the terms and conditions of the applicable plans then in effect, full-time interns are also eligible to enroll in the Company-sponsored medical plan.

Equal Employment Opportunity at Tencent
As an equal opportunity employer, we firmly believe that diverse voices fuel our innovation and allow us to better serve our users and the community. We foster an environment where every employee of Tencent feels supported and inspired to achieve individual and common goals.
Work Location: US-Washington-Bellevue

🎯 Key Responsibilities

Work with researchers on a research project aimed at attacking core problems by inventing cutting-edge techniques
Engage in discussions and collaborations between researchers and interns
Publish results from the internship
Develop more effective multimodal pretraining and post-training strategies for audio, speech, music, image, and video understanding and generation
Enable fully duplex conversations
Design more efficient large-model architectures
Enhance multimodal memory and reasoning capabilities
Advance novel audio, speech, music, image, and video processing techniques—such as encoding, tokenization, and representation learning—with a focus on multimodal applications and end-to-end large models

✅ Required Qualifications

Ph.D. students in computer science, electrical engineering, mathematics or a related field
Self-motivated and excited about developing novel techniques
Research experience in natural language processing; speech, audio, and music processing; computer vision; dialog systems; or machine learning
A good publication track record and a history of creativity and intellectual flexibility
Skilled programming in Python and/or C++ and experience using one of the leading deep learning toolkits
Intern duration: 3 months (with the possibility of extension). Can start any time in the year 2026

🛠️ Required Skills

Programming in Python and/or C++
Experience in using leading deep learning toolkits
Research experience in natural language processing, speech, audio, music processing, computer vision, dialog systems, or machine learning
Self-motivation and excitement for developing novel techniques
Creativity and intellectual flexibility

🎁 Benefits

Expected base pay range of $80,169.00 to $120,000.14 per year
1 hour of paid sick leave for every 30 hours worked
Up to 13 paid holidays throughout the calendar year
Eligibility to enroll in the Company-sponsored medical plan for full-time interns

Locations

Bellevue, Washington, United States

Salary

80,169 to 120,000.14 USD per year

Skills Required

Programming in Python and/or C++ (intermediate)
Experience in using leading deep learning toolkits (intermediate)
Research experience in natural language processing, speech, audio, music processing, computer vision, dialog systems, or machine learning (intermediate)
Self-motivation and excitement for developing novel techniques (intermediate)
Creativity and intellectual flexibility (intermediate)
