
Unifying Specialized Visual Encoders for Video Language Models

  • Jihoon Chung
  • Tyler Zhu
  • Max Gonzalez Saez-Diez
  • Juan Carlos Niebles
  • Honglu Zhou
  • Olga Russakovsky

Research output: Contribution to journal › Conference article › peer-review

Abstract

Recent advances in vision backbones have yielded powerful and diverse visual and video encoders. Yet, current Video Large Language Models encode visual inputs using an encoder from a single backbone family, limiting the amount and type of visual information they can process. We propose MERV, a Multi-Encoder Video Representation, which utilizes multiple encoders for a comprehensive video representation. To optimize heterogeneous features from a broad spectrum of encoders and ensure efficient and coherent feature integration, MERV first aligns encoder features spatio-temporally, then projects them into a unified structure, and finally fuses them through cross-attention. Under fair comparison, MERV achieves up to 4.62% higher accuracy than its base model, while introducing minimal extra parameters and training faster than equivalent single-encoder methods after parallelizing visual processing. Qualitative analysis shows MERV successfully captures and integrates domain knowledge from each encoder, opening new possibilities for scaling enhanced video understanding.
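
The align, project, fuse sequence named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`align`, `fuse`), the average-pooling alignment over a flat token sequence, the single shared query vector, and all tensor shapes are assumptions chosen only to make the three steps concrete.

```python
import numpy as np

def align(feats, t_out, n_out):
    """Average-pool a (T, N, D) feature grid down to (t_out, n_out, D).

    Assumption: T and N are exact multiples of t_out and n_out, and the
    N tokens per frame are treated as a flat sequence.
    """
    t, n, d = feats.shape
    assert t % t_out == 0 and n % n_out == 0
    grouped = feats.reshape(t_out, t // t_out, n_out, n // n_out, d)
    return grouped.mean(axis=(1, 3))

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def fuse(encoder_feats, projections, query, t_out, n_out):
    """Align, project, then fuse features from several encoders.

    encoder_feats: list of (T_e, N_e, D_e) arrays, one per encoder.
    projections:   list of (D_e, D) matrices mapping to a shared width D.
    query:         (D,) vector (hypothetical) scoring each encoder's token.
    """
    # Steps 1 and 2: bring every encoder to the same (t_out, n_out, D) shape.
    projected = [align(f, t_out, n_out) @ w
                 for f, w in zip(encoder_feats, projections)]
    stacked = np.stack(projected, axis=0)               # (E, T, N, D)

    # Step 3: scaled dot-product scores, normalized across encoders.
    scores = stacked @ query / np.sqrt(query.shape[0])  # (E, T, N)
    weights = softmax(scores, axis=0)                   # sums to 1 over E
    fused = (weights[..., None] * stacked).sum(axis=0)  # (T, N, D)
    return fused, weights

# Two hypothetical encoders with different temporal, spatial, and channel sizes.
rng = np.random.default_rng(0)
feats = [rng.normal(size=(8, 16, 32)), rng.normal(size=(16, 64, 48))]
projs = [rng.normal(size=(32, 24)), rng.normal(size=(48, 24))]
query = rng.normal(size=24)

fused, weights = fuse(feats, projs, query, t_out=4, n_out=8)
print(fused.shape)  # (4, 8, 24)
```

In this sketch the fused output keeps the token count of a single encoder regardless of how many encoders are combined, which is one plausible reading of how a multi-encoder representation could add few extra parameters; the paper itself should be consulted for the actual design.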

Original language: English (US)
Pages (from-to): 10879-10900
Number of pages: 22
Journal: Proceedings of Machine Learning Research
Volume: 267
State: Published - 2025
Externally published: Yes
Event: 42nd International Conference on Machine Learning, ICML 2025 - Vancouver, Canada
Duration: Jul 13 2025 → Jul 19 2025

All Science Journal Classification (ASJC) codes

  • Software
  • Control and Systems Engineering
  • Statistics and Probability
  • Artificial Intelligence
