SRL4V - Structured Representation Learning for Video Understanding

Recent progress in spatio-temporal feature learning has pushed the state of the art in action recognition to new levels. Building upon the success of convolutional neural networks in image recognition, the currently best-performing action recognition models introduce deep temporal modeling into spatial 2D CNNs to handle time in video, with minimal overhead in parameters and computation. They learn better features faster and with less supervision than full space-time 3D CNNs. However, such architectures lack the introspective means for grounded reasoning and decision making, and video understanding depends on spatio-temporal reasoning. Visual explanation methods are used to inspect video models after training, but they are not yet applied to guide the learning itself.

This research project sets out to advance video understanding along these lines by equipping deep architectures for recognition and question answering with innate visual explanations. Visual explanations will be built in and will therefore take part in learning structured representations. In this way, the 'descriptor bottleneck' of existing architectures will be eliminated to facilitate visual grounding. The structural constraint injected via innate explanations will enable training with less data and with improved generalization, as constraints shape manifolds in the parameter space that are hard to discover from point-wise supervision alone.

This research is supported by the Amazon AWS Machine Learning Research Awards, the NVIDIA AI Technology Center, CINECA through the Italian SuperComputing Resource Allocation (ISCRA), and start-up grant IN2814 of the Free University of Bozen-Bolzano.

Publications

S. Sudhakaran, S. Escalera, O. Lanz: Gate-Shift-Fuse for Video Action Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023. [code]

A. Falcon, G. Serra, O. Lanz: Video Question Answering Supported by a Multi-task Learning Objective. Multimedia Tools and Applications, 2023. [code]

T.M. Tai, G. Fiameni, C.K. Lee, S. See, O. Lanz: Inductive Attention for Video Action Anticipation. arXiv:2212.08830, 2023.

A. Falcon, G. Serra, O. Lanz: A Feature-Space Multimodal Data Augmentation Technique for Text-Video Retrieval. ACM Multimedia, 2022. [code]

T.M. Tai, O. Lanz, G. Fiameni, Y.K. Wong, S.S. Poon, C.K. Lee, K.C. Cheung, S. See: NVIDIA-UNIBZ Submission for EPIC-KITCHENS-100 Action Anticipation Challenge 2022. EPIC-KITCHENS 2022 Challenges Report, CVPR Workshops 2022.

A. Falcon, G. Serra, S. Escalera, O. Lanz: UniUD-FBK-UB-UniBZ Submission to the EPIC-Kitchens-100 Multi-Instance Retrieval Challenge 2022. EPIC-KITCHENS 2022 Challenges Report, CVPR Workshops 2022.

T.M. Tai, G. Fiameni, C.K. Lee, S. See, O. Lanz: Unified Recurrence Modeling for Video Action Anticipation. International Conference on Pattern Recognition (ICPR), 2022. [code]

T.M. Tai, G. Fiameni, C.K. Lee, O. Lanz: Higher Order Recurrent Network with Space-Time Attention for Video Early Action Recognition. International Conference on Image Processing (ICIP), 2022. [code]

A. Falcon, S. Sudhakaran, G. Serra, S. Escalera, O. Lanz: Relevance-based Margin for Contrastively-trained Video Retrieval Models. International Conference on Multimedia Retrieval (ICMR), 2022. [code]

A. Falcon, G. Serra, O. Lanz: Learning Video Retrieval Models with Relevance-Aware Online Mining. International Conference on Image Analysis and Processing (ICIAP), 2022. [code]

S. Sudhakaran, A. Bulat, J.M. Perez-Rua, A. Falcon, S. Escalera, O. Lanz, B. Martinez, G. Tzimiropoulos: SAIC_Cambridge-HuPBA-FBK Submission to the EPIC-Kitchens-100 Action Recognition Challenge 2021. EPIC-KITCHENS-100 2021 Challenges Report, CVPR Workshops 2021.

T.M. Tai, G. Fiameni, C.K. Lee, O. Lanz: Higher Order Recurrent Space-Time Transformer for Video Action Prediction. arXiv:2104.08665, 2021.

S. Sudhakaran, S. Escalera, O. Lanz: Learning to Recognize Actions on Objects in Egocentric Video with Attention Dictionaries. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021.

S. Sudhakaran, S. Escalera, O. Lanz: Gate-Shift Networks for Video Action Recognition. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. [code]

A. Falcon, O. Lanz, G. Serra: Data Augmentation Techniques for the Video Question Answering Task. European Conference on Computer Vision (ECCV) Workshops, 2020.

S. Sudhakaran, S. Escalera, O. Lanz: FBK-HUPBA Submission to the EPIC-Kitchens Action Recognition 2020 Challenge. EPIC-KITCHENS-55 2020 Challenges Report, CVPR Workshops 2020.

S. Sudhakaran, S. Escalera, O. Lanz: LSTA: Long Short-Term Attention for Egocentric Action Recognition. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. [code]

S. Sudhakaran, S. Escalera, O. Lanz: FBK-HUPBA Submission to the EPIC-Kitchens Action Recognition 2019 Challenge. EPIC-KITCHENS 2019 Challenges Report, CVPR Workshops 2019.

S. Sudhakaran, O. Lanz: Attention is All We Need: Nailing Down Object-centric Attention for Egocentric Activity Recognition. British Machine Vision Conference (BMVC), 2018. [code]

News

NEW PhD opportunity with application deadline September 15, 2022; see here for more information.

PhD opportunity with application deadline July 1, 2022; see here for more information.