Grounded Video Situation Recognition

Zeeshan Khan        C.V. Jawahar        Makarand Tapaswi       

CVIT, IIIT Hyderabad

NeurIPS 2022

Paper        Dataset        Code       

GVSR

GVSR is a structured dense video understanding task. It is built on top of VidSitu, a large-scale dataset of 10-second videos from complex movie scenes. Each video is divided into multiple events of ~2 seconds each. Each event is associated with a salient action verb, and each action verb with a semantic role frame, e.g. agent, patient, tool, location, manner. Each role is annotated with a free-form text caption, and all role entities are coreferenced across the events of a video.
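The annotation structure described above can be sketched as a simple record. This is an illustrative example only: the field names, verb, and captions below are made up and do not reflect the dataset's actual schema or role inventory.

```python
# Hypothetical sketch of one event annotation in a VidSitu-style video.
# Field names and values are illustrative, not the real dataset format.
event = {
    "event_id": "Ev1",              # one of several ~2-second events in a 10-second clip
    "verb": "chase",                # salient action verb for this event
    "roles": {                      # semantic role frame for the verb,
        "agent": "man in a black suit",       # each role is a free-form caption
        "patient": "woman with a red bag",
        "location": "crowded street",
    },
}

# Coreference: the same entity caption recurring in another event (e.g. Ev2)
# refers to the same person or object, linking entities across the video.
```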

 

Dense video understanding requires answering several questions, such as who is doing what to whom, with what, how, why, and where. GVSR addresses this by recognising the action verbs and their corresponding roles, and localising them in the spatio-temporal domain in a weakly supervised setting, i.e. the supervision for grounding is provided only in the form of role captions, without any ground-truth bounding boxes.


VideoWhisperer


 

We propose a new 3-stage Transformer-based model for joint structured prediction of verbs, role captions, and groundings. Stage 1 learns contextualised object and event embeddings through a video-object transformer; these are used to predict the verb-role pairs for each event. Stage 2 models all the predicted roles by creating role queries contextualised by the event embeddings, which attend to all the object proposals through a role-object transformer decoder to find the entity that best represents each role. The output embeddings of the roles are fed to the caption generation module, and the cross-attention of the role-object decoder ranks all the object proposals, enabling localisation of each role.
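The grounding step in Stage 2 can be illustrated with a minimal sketch: a role query scores object proposals via scaled dot-product attention, and the highest-weighted proposal serves as the grounding. This is a simplified stand-in (single head, no learned projections, NumPy instead of the actual model), not the paper's implementation.

```python
import numpy as np

def rank_proposals(role_query, proposals):
    """Score object proposals for one role query with scaled dot-product
    attention, mimicking the cross-attention of a role-object decoder.

    role_query: (d,) contextualised role embedding
    proposals:  (N, d) object-proposal embeddings
    Returns the softmax weights over proposals and the index of the best one.
    """
    d = role_query.shape[-1]
    scores = proposals @ role_query / np.sqrt(d)   # (N,) attention logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over proposals
    return weights, int(np.argmax(weights))        # top proposal = grounding

# Toy usage with random embeddings (illustrative only).
rng = np.random.default_rng(0)
role_query = rng.normal(size=64)        # one role query
proposals = rng.normal(size=(10, 64))   # 10 object proposals
weights, best = rank_proposals(role_query, proposals)
```

Because grounding supervision comes only from role captions, these attention weights are never trained against boxes directly; the proposal ranking emerges as a by-product of learning to caption each role.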


VideoWhisperer in action


 

 

 

More Results

For more results on 10 videos, Click Here

BibTeX

@inproceedings{khan2022grounded,
  title     = {Grounded Video Situation Recognition},
  author    = {Zeeshan Khan and C.V. Jawahar and Makarand Tapaswi},
  booktitle = {Advances in Neural Information Processing Systems},
  editor    = {Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
  year      = {2022},
  url       = {https://openreview.net/forum?id=yRhbHp_Vh8e}
}