Abstract
Video re-localization (VRL) aims to localize, in untrimmed reference videos, a contiguous sequence of frames (the target moment) that semantically corresponds to a given query video. In the weakly supervised setting of VRL, only coarse-grained video-level annotations, rather than frame-level annotations, are available during training. For the weakly supervised VRL (WS-VRL) task, two challenges remain: obtaining video feature representations that are effective for evaluating the relevance between videos, and accurately localizing the temporal boundaries of the target moment. In this paper, a novel multi-agent-reinforced switchable network (MARS) is proposed to address these challenges. MARS adaptively guides video feature encoding and moment localization using multiple learned agents. Specifically, an agent-controlled switchable encoder is used to obtain effective video feature representations, and an agent-reinforced boundary localizer is used to determine the localized moment accurately through progressive refinement. Furthermore, a relevance-oriented reward generator is designed to estimate the relevance of the localized moment to the query video and to assign a reward to the multiple agents. The effectiveness of the proposed MARS model is verified through extensive experiments on the ActivityNet-VRL dataset.
Original language | English (US) |
---|---|
Pages (from-to) | 6116-6127 |
Number of pages | 12 |
Journal | IEEE Transactions on Circuits and Systems for Video Technology |
Volume | 34 |
Issue number | 7 |
State | Published - 2024 |
Externally published | Yes |
All Science Journal Classification (ASJC) codes
- Media Technology
- Electrical and Electronic Engineering
Keywords
- Weakly supervised video re-localization
- multi-agent
- switchable network