Recent progress in spatio-temporal feature learning has pushed the state of the art in action recognition to new levels. Building on the success of convolutional neural networks in image recognition, the current best-performing action recognition models introduce deep temporal modeling into spatial 2D CNNs to handle time in video, with minimal overhead in parameters and computation. They learn better features faster and with less supervision than full space-time 3D CNNs. However, such architectures lack the introspective means for grounded reasoning and decision making, yet video understanding depends on spatio-temporal reasoning. Visual explanation methods are used to inspect video models after training, but they have not yet been applied to guide the learning itself.
This research project sets out to advance video understanding along these lines by augmenting deep architectures for recognition and question answering with innate visual explanations. The visual explanations will be built in and will therefore take part in learning structured representations. In this way, the ‘descriptor bottleneck’ of existing architectures will be eliminated to facilitate visual grounding. The structural constraints injected via innate explanations will enable training with less data and improved generalization, since such constraints shape manifolds in the parameter space that are hard to discover from point-wise supervision alone.
This project is supported by Amazon AWS Machine Learning Research Awards and CINECA ISCRA.
S. Sudhakaran, S. Escalera, O. Lanz: Gate-Shift Networks for Video Action Recognition. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
S. Sudhakaran, S. Escalera, O. Lanz: FBK-HUPBA Submission to the EPIC-Kitchens Action Recognition 2020 Challenge. arXiv:2006.13725, 2020.
S. Sudhakaran, S. Escalera, O. Lanz: LSTA: Long Short-Term Attention for Egocentric Action Recognition. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
S. Sudhakaran, S. Escalera, O. Lanz: FBK-HUPBA Submission to the EPIC-Kitchens Action Recognition 2019 Challenge. arXiv:1906.08960, 2019.
S. Sudhakaran, O. Lanz: Attention is All We Need: Nailing Down Object-centric Attention for Egocentric Activity Recognition. British Machine Vision Conference (BMVC), 2018.
The results reported in the papers can be reproduced with the code published at https://github.com/swathikirans.