Zeeshan Khan, C.V. Jawahar, Makarand Tapaswi
Grounded Video Situation Recognition (GVSR) is a structured, dense video understanding task. It is built on top of VidSitu, a large-scale dataset of 10-second videos from complex movie scenes. Each video is divided into multiple events of ~2 seconds each. Each event is associated with a salient action verb, and each action verb is associated with a semantic role frame, e.g., agent, patient, tool, location, manner. Each role is annotated with a free-form text caption. All role entities are coreferenced across all events of a video.
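To make the annotation structure described above concrete, here is a minimal Python sketch of one annotated video. It is an illustration only, not the dataset's actual file format or the authors' code: the class names `Event` and `Video`, the helpers `event_windows` and `coreferent_mentions`, the example captions, and the choice to model coreference as identical caption strings are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Event:
    """One ~2 second event: a salient verb plus captioned semantic roles."""
    start: float  # seconds from the start of the video
    end: float
    verb: str
    # Maps a role name (agent, patient, tool, ...) to its free-form caption.
    roles: Dict[str, str] = field(default_factory=dict)


@dataclass
class Video:
    """One 10 second video, divided into consecutive events."""
    video_id: str
    events: List[Event]


def event_windows(duration: float = 10.0, length: float = 2.0) -> List[Tuple[float, float]]:
    """Split a video into consecutive, equally long event windows."""
    count = int(duration // length)
    return [(i * length, (i + 1) * length) for i in range(count)]


def coreferent_mentions(video: Video, caption: str) -> List[Tuple[int, str]]:
    """Return (event index, role) pairs that mention the same entity.

    Assumption: coreference is modeled as exact caption equality.
    """
    return [
        (index, role)
        for index, event in enumerate(video.events)
        for role, text in event.roles.items()
        if text == caption
    ]


# Hypothetical annotations, invented for illustration only.
windows = event_windows()
video = Video(
    video_id="example_clip",
    events=[
        Event(*windows[0], verb="punch",
              roles={"agent": "man in black jacket",
                     "patient": "man in grey shirt",
                     "location": "alley"}),
        Event(*windows[1], verb="fall",
              roles={"patient": "man in grey shirt",
                     "manner": "backwards"}),
    ],
)

print(len(windows))
print(coreferent_mentions(video, "man in grey shirt"))
```

Keeping roles in a per-event dictionary reflects that each verb has its own role frame, so different events can carry different sets of roles.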
@inproceedings{khan2022grounded,
  title={Grounded Video Situation Recognition},
  author={Zeeshan Khan and C.V. Jawahar and Makarand Tapaswi},
  booktitle={Advances in Neural Information Processing Systems},
  editor={Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
  year={2022},
  url={https://openreview.net/forum?id=yRhbHp_Vh8e}
}