Zeeshan Khan - Grounded VidSitu

Grounded Video Situation Recognition

GVSR

GVSR is a structured, dense video understanding task built on top of VidSitu, a large-scale dataset of 10-second videos from complex movie scenes. Each video is divided into multiple events of ~2 seconds each. Each event is associated with a salient action verb, and each verb is associated with a semantic role frame (e.g., agent, patient, tool, location, manner). Each role is annotated with a free-form text caption. All role entities are coreferenced across all events of a video.
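To make the annotation structure concrete, here is a minimal sketch of how one annotated video could be represented in Python. The class names, field names, and example captions are hypothetical and are not taken from the VidSitu release; coreference is approximated here by the simplifying assumption that the same entity carries the same caption string in every event.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Event:
    """One ~2 second event: a salient verb plus captions for its roles."""
    verb: str
    # Maps a role name (e.g. "agent") to a free-form text caption.
    roles: Dict[str, str] = field(default_factory=dict)


@dataclass
class VideoAnnotation:
    """A 10 second clip split into consecutive events (hypothetical schema)."""
    video_id: str
    events: List[Event] = field(default_factory=list)

    def coreferent_events(self, caption: str) -> List[int]:
        """Indices of events in which `caption` fills any role.

        Assumption: an entity is coreferenced by reusing the same caption.
        """
        return [
            i for i, event in enumerate(self.events)
            if caption in event.roles.values()
        ]


# Made-up example values, for illustration only.
example = VideoAnnotation(
    video_id="hypothetical_clip",
    events=[
        Event("open", {"agent": "man in a grey coat",
                       "patient": "door",
                       "location": "hallway"}),
        Event("walk", {"agent": "man in a grey coat",
                       "location": "hallway"}),
        Event("greet", {"agent": "woman at the desk",
                        "patient": "man in a grey coat"}),
    ],
)

print(example.coreferent_events("man in a grey coat"))  # [0, 1, 2]
```

The same entity can fill different roles in different events (agent in the first two, patient in the third), which is what coreferencing across events is meant to capture.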

VideoWhisperer

We propose a new 3-stage Transformer-based model for joint structured prediction of verbs, role captions, and groundings. Stage 1 learns contextualised object and event embeddings through a video-object transformer, which is used to predict the verb-role pairs for each event. Stage 2 models all the predicted roles by creating role queries contextualised by event embeddings; these queries attend to all the object proposals through the role-object transformer decoder to find the best entity that represents each role. The output embeddings of the roles are fed to the caption generation module. The cross-attention of the role-object decoder ranks all the object proposals, enabling localization of each role.
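The idea of ranking object proposals by cross-attention can be illustrated with a small, self-contained sketch. It is not the VideoWhisperer implementation: it assumes a single attention head, plain scaled dot-product scores with no learned projections, and made-up 4-dimensional embeddings. The function names are hypothetical.

```python
import math
from typing import List


def attention_weights(query: List[float], keys: List[List[float]]) -> List[float]:
    """Softmax over scaled dot products between one query and all keys."""
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    # Subtract the maximum score for numerical stability before exponentiating.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_proposals(query: List[float], keys: List[List[float]]) -> List[int]:
    """Proposal indices sorted from highest to lowest attention weight."""
    weights = attention_weights(query, keys)
    return sorted(range(len(keys)), key=lambda i: weights[i], reverse=True)


# Toy role query and three object-proposal embeddings (illustrative values).
role_query = [1.0, 0.0, 1.0, 0.0]
proposals = [
    [0.1, 0.9, 0.0, 0.2],  # dot product with the query: 0.1
    [0.9, 0.1, 0.8, 0.0],  # dot product with the query: 1.7
    [0.4, 0.4, 0.4, 0.4],  # dot product with the query: 0.8
]

weights = attention_weights(role_query, proposals)
ranking = rank_proposals(role_query, proposals)
print(ranking)  # [1, 2, 0]
```

In this toy setting the top-ranked proposal (index 1) would be taken as the grounding for the role; in a real decoder the queries and keys would come from learned projections of the role and object embeddings.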

VideoWhisperer in action

More Results

For more results on 10 videos, click here.

BibTeX

@inproceedings{
khan2022grounded,
title={Grounded Video Situation Recognition},
author={Zeeshan Khan and C.V. Jawahar and Makarand Tapaswi},
booktitle={Advances in Neural Information Processing Systems},
editor={Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
year={2022},
url={https://openreview.net/forum?id=yRhbHp_Vh8e}
}