Weakly Supervised Video Re-Localization Through Multi-Agent-Reinforced Switchable Network

Yuan Zhou, Axin Guo, Shuwei Huo, Yu Liu, Sun Yuan Kung

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

The objective of video re-localization (VRL) is to localize a contiguous sequence of frames, namely the target moment, in untrimmed reference videos that semantically corresponds to a given query video. During training, the weakly supervised setting of VRL provides only coarse-grained video-level annotations rather than frame-level annotations. For the weakly supervised VRL (WS-VRL) task, two challenges remain: obtaining effective video feature representations that can be used to evaluate the relevance between videos, and accurately localizing the temporal boundaries of the target moment. In this paper, a novel multi-agent-reinforced switchable network (MARS) is proposed to address these challenges. MARS adaptively guides video feature encoding and moment localization using multiple learned agents. Specifically, an agent-controlled switchable encoder is used to obtain effective video feature representations, and an agent-reinforced boundary localizer is used to determine accurate localized moments through progressive refinement. Furthermore, a relevance-oriented reward generator is designed to estimate the relevance of the localized moment to the query video and assign a reward to the multiple agents. The effectiveness of the proposed MARS model is verified through extensive experiments on the ActivityNet-VRL dataset.
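The idea of refining moment boundaries step by step under a relevance signal can be illustrated with a toy sketch. This is not the MARS model: the abstract does not specify the encoder, the agents, or the reward, so everything below is an assumption made for illustration. Frame features are plain lists of floats, relevance is the cosine similarity between mean-pooled features, and the learned agents are replaced by a greedy search over one-frame boundary shifts. The names `mean_pool`, `cosine`, `relevance`, and `refine` are hypothetical.

```python
import math


def mean_pool(frames):
    """Average per-frame feature vectors into one segment-level vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relevance(query, reference, start, end):
    """Assumed stand-in reward: similarity of reference[start:end] to the query."""
    return cosine(mean_pool(query), mean_pool(reference[start:end]))


def refine(query, reference, start, end, max_steps=50):
    """Greedy refinement (a stand-in for learned agents): at each step, move one
    boundary by one frame if that strictly increases the relevance score."""
    best = relevance(query, reference, start, end)
    for _ in range(max_steps):
        candidates = [
            (start - 1, end), (start + 1, end),
            (start, end - 1), (start, end + 1),
        ]
        # Keep only non-empty segments that stay inside the reference video.
        valid = [(s, e) for s, e in candidates if 0 <= s < e <= len(reference)]
        score, s, e = max((relevance(query, reference, s, e), s, e) for s, e in valid)
        if score <= best + 1e-12:
            break  # no boundary shift improves the score
        best, start, end = score, s, e
    return start, end, best


# Toy data: frames 3..5 of the reference match the query's feature direction.
reference = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3 + [[1.0, 0.0]] * 2
query = [[0.0, 1.0], [0.0, 1.0]]
start, end, score = refine(query, reference, 0, len(reference))
print(start, end)
```

Starting from the whole reference video, the search on this toy data shrinks the segment to the half-open frame range [3, 6), which holds the three frames that match the query. A learned policy trained from video-level rewards would replace the exhaustive one-step lookahead used here.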

Original language: English (US)
Pages (from-to): 6116-6127
Number of pages: 12
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 34
Issue number: 7
State: Published - 2024
Externally published: Yes

All Science Journal Classification (ASJC) codes

  • Media Technology
  • Electrical and Electronic Engineering

Keywords

  • Weakly supervised video re-localization
  • multi-agent
  • switchable network
