Weakly Supervised Video Re-localization Through Multi-agent-reinforced Switchable Network

Yuan Zhou, Axin Guo, Shuwei Huo, Yu Liu, Sun Yuan Kung

Research output: Contribution to journal › Article › peer-review


The objective of video re-localization (VRL) is to localize, within untrimmed reference videos, the contiguous sequence of frames (the target moment) that semantically corresponds to a given query video. In the weakly supervised setting of VRL, training provides only coarse-grained video-level annotations rather than frame-level annotations. For the weakly supervised VRL (WS-VRL) task, two challenges remain: obtaining effective video feature representations for evaluating the relevance between videos, and accurately localizing the temporal boundaries of the target moment. In this paper, a novel multi-agent-reinforced switchable network (MARS) is proposed to address these challenges. MARS adaptively guides video feature encoding and moment localization using multiple learned agents. Specifically, an agent-controlled switchable encoder is used to obtain effective video feature representations, and an agent-reinforced boundary localizer is used to determine accurate localized moments through progressive refinement. Furthermore, a relevance-oriented reward generator is designed to estimate the relevance of the localized moment to the query video and assign rewards to the agents. The effectiveness of the proposed MARS model is verified through extensive experiments on the ActivityNet-VRL dataset.
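To make the idea of progressive boundary refinement driven by a relevance reward more concrete, the following is a minimal toy sketch. It is not the MARS model: the learned agents are replaced by a plain greedy hill-climbing loop, the reward is assumed to be the cosine similarity between mean-pooled features, and all function names (`mean_pool`, `cosine`, `relevance`, `refine_boundaries`) are hypothetical. Frame features are assumed to be precomputed lists of floats.

```python
import math

def mean_pool(frames):
    """Average a list of equal-length feature vectors into one vector."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def relevance(reference, query, start, end):
    """Assumed reward: similarity of the pooled segment [start, end) to the pooled query."""
    return cosine(mean_pool(reference[start:end]), mean_pool(query))

def refine_boundaries(reference, query, max_steps=50):
    """Greedy stand-in for learned agents: shift one boundary by one frame
    per step, keeping the move only if the relevance reward increases."""
    start, end = 0, len(reference)
    best = relevance(reference, query, start, end)
    for _ in range(max_steps):
        candidates = [(start + 1, end), (start - 1, end),
                      (start, end - 1), (start, end + 1)]
        valid = [(s, e) for s, e in candidates if 0 <= s < e <= len(reference)]
        scored = [(relevance(reference, query, s, e), s, e) for s, e in valid]
        top, s, e = max(scored)
        if top <= best:
            break  # no single-frame move improves the reward
        best, start, end = top, s, e
    return start, end, best

# Toy data: frames 3..6 of the reference match the query's feature direction.
reference = [[1, 0]] * 3 + [[0, 1]] * 4 + [[1, 0]] * 3
query = [[0, 1]] * 2
start, end, score = refine_boundaries(reference, query)
```

Only video-level information (the query as a whole) enters the reward here, which loosely mirrors the weakly supervised setting. A greedy loop like this can stall in local optima on real features, which is one reason a learned policy may be preferable.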

Original language: English (US)
Pages (from-to): 1
Number of pages: 1
Journal: IEEE Transactions on Circuits and Systems for Video Technology
State: Accepted/In press - 2023
Externally published: Yes

All Science Journal Classification (ASJC) codes

  • Media Technology
  • Electrical and Electronic Engineering

Keywords

  • Annotations
  • Encoding
  • Generators
  • Location awareness
  • Switches
  • Task analysis
  • Training
  • multi-agent
  • switchable network
  • weakly supervised video re-localization

