Welcome to the Vision Computing and Learning research lab at the Free University of Bozen-Bolzano

The group, led by Prof. Oswald Lanz, conducts research in computer vision and machine learning. We are interested in all aspects of image and video understanding – from low-level perception to high-level interpretation – with particular emphasis on scenes involving humans and on applications that support humans. Our research builds heavily on advances in modern machine learning, and we develop novel methods that incorporate inductive priors by design in order to learn efficiently. Our ultimate goal is to build machines that see, and to contribute to the advancement of AI in this context.

Keywords: detection and tracking, action recognition, text-video retrieval, anomaly classification and segmentation, neural rendering, neural architecture search, representation learning, computer vision, artificial intelligence

Research Topics

Video Understanding. We develop deep learning methods for action recognition and prediction, human tracking and object detection, text-video retrieval and video question answering.

Anomaly Detection. We develop data-efficient methods for anomaly detection in volumetric data, and for anomaly detection in appearance and geometry through neural reconstruction.

Generative AI. We focus on generation of realistic training data for vision tasks, on neural rendering for generative design, and on geometric deep learning of physics-informed simulations.

AutoML. We automate the design of neural networks for image and video understanding tasks.


Oswald Lanz
Head Professor
Simone Fabbrizzi
Research Assistant

Tsung-Ming Tai
PhD Student
Cynthia I. Ugwu
PhD Student

Sofia Casarin
PhD Student
Emanuele Caruso
PhD Student
Research Collaborations
Industry Collaborations

Inductive Attention for Video Action Anticipation
arXiv, 2023

Anticipating future actions from spatiotemporal observations is essential in video understanding and predictive computer vision. Moreover, a model capable of anticipating the future has important applications: it can enable precautionary systems to react before an event occurs. However, unlike in the action recognition task, future information is inaccessible at observation time – a model cannot directly map the observed video frames to the target action to solve the anticipation task. Instead, temporal inference is required to associate the relevant evidence with possible future actions. Consequently, existing solutions based on action recognition models are only suboptimal…
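The constraint described above – that future evidence is unavailable at observation time – is commonly enforced in attention-based models with a causal mask. The following is an illustrative sketch of such a mask, not the paper's inductive attention mechanism:

```python
import numpy as np

def causal_mask(num_frames):
    """Lower-triangular boolean mask: position t may attend only to
    frames <= t, so no attention weight can fall on future frames.
    Illustrative only; the paper's actual attention design differs."""
    return np.tril(np.ones((num_frames, num_frames), dtype=bool))

m = causal_mask(4)
print(m)  # True on and below the diagonal, False above it
```

Multiplying attention scores by such a mask (or setting masked positions to negative infinity before the softmax) guarantees the anticipation model never conditions on frames it has not yet observed.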

Gate-Shift-Fuse for Video Action Recognition
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023

Convolutional Neural Networks are the de facto models for image recognition. However, 3D CNNs, the straightforward extension of 2D CNNs to video recognition, have not achieved the same success on standard action recognition benchmarks. One of the main reasons for this reduced performance is their increased computational complexity, which requires large-scale annotated datasets to train them at scale. 3D kernel factorization approaches have been proposed to reduce the complexity of 3D CNNs, but existing factorization approaches follow hand-designed and hard-wired techniques. In this paper we propose Gate-Shift-Fuse (GSF), a novel spatio-temporal feature…
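To see why kernel factorization reduces complexity, the parameter counts can be compared directly. The sketch below contrasts a full 3D kernel with a simple spatial-plus-temporal factorization; it is background arithmetic only, not the GSF module itself, and the layer sizes are illustrative assumptions:

```python
# Parameter counts for a full 3D convolution vs. a factorized variant.
# Illustrative only: GSF is a learnable gate-shift-fuse module, not this
# plain (2+1)D-style factorization.

def conv3d_params(c_in, c_out, k):
    """Parameters of a full k x k x k spatio-temporal kernel."""
    return c_in * c_out * k * k * k

def factorized_params(c_in, c_out, k):
    """Parameters of a k x k spatial conv followed by a length-k temporal conv."""
    spatial = c_in * c_out * k * k
    temporal = c_out * c_out * k
    return spatial + temporal

full = conv3d_params(64, 64, 3)       # 110592 parameters
fact = factorized_params(64, 64, 3)   # 36864 + 12288 = 49152 parameters
print(full, fact)
```

For a 64-channel layer with 3x3x3 kernels the factorization cuts parameters by more than half, which is the kind of saving that motivates factorized designs for video models.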

Video Question Answering Supported by a Multi-task Learning Objective
Multimedia Tools and Applications, 2023

Video Question Answering (VideoQA) concerns building models that can analyze a video and produce a meaningful answer to questions about its visual content. To encode the given question, word embedding techniques are used to compute a representation of the tokens suitable for neural networks. Yet almost all works in the literature use the same technique, although recent advances in NLP have brought better solutions. This lack of analysis is a major shortcoming. To address it, in this paper we present a twofold contribution on this question and its relation with question encoding. First, we integrate four of the most popular word…

Learning to Recognise Actions on Objects in Egocentric Video with Attention Dictionaries
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023

We present EgoACO, a deep neural architecture for video action recognition that learns to pool action-context-object descriptors from frame-level features by leveraging the verb-noun structure of action labels in egocentric video datasets. The core component of EgoACO is class activation pooling (CAP), a differentiable pooling operation that combines ideas from bilinear pooling for fine-grained recognition and from feature learning for discriminative localization. CAP uses self-attention with a dictionary of learnable weights to pool from the most relevant feature regions. Through CAP, EgoACO learns to decode object and scene context descriptors from…
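The general idea of pooling with a learnable dictionary via self-attention can be sketched as follows. This is a simplified NumPy illustration under assumed shapes (a 7x7 feature grid, a 4-atom dictionary), not the paper's exact CAP operator, and the dictionary is random here rather than learned:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(features, dictionary):
    """features: (N, D) region-level features from one frame.
    dictionary: (K, D) query weights (learnable in a real model).
    Returns (K, D): one pooled descriptor per dictionary atom,
    each a weighted average over the most relevant regions."""
    scores = features @ dictionary.T / np.sqrt(features.shape[1])  # (N, K)
    attn = softmax(scores, axis=0)   # attention over the N regions
    return attn.T @ features         # (K, D)

feats = rng.standard_normal((49, 256))   # e.g. a 7x7 spatial grid, 256-dim
dict_w = rng.standard_normal((4, 256))   # 4 descriptor slots (hypothetical)
pooled = attention_pool(feats, dict_w)
print(pooled.shape)  # (4, 256)
```

Each dictionary atom attends to a different subset of regions, which is how a single feature map can yield separate action, context, and object descriptors.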

A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval
ACM MultiMedia, 2022

Every hour, huge amounts of visual content are posted on social media and user-generated content platforms. To find relevant videos by means of a natural language query, text-video retrieval methods have received increasing attention over the past few years. Data augmentation techniques have been introduced to improve performance on unseen test examples by creating new training samples through semantics-preserving transformations, such as color-space or geometric transformations on images. Yet these techniques are usually applied to raw data, leading to more resource-demanding solutions and also requiring…
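Working in feature space avoids reprocessing raw video. One common feature-space augmentation is convex mixing of precomputed embeddings; the sketch below illustrates that general idea and is not necessarily the technique proposed in the paper (embedding size and mixing distribution are assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

def feature_mixup(x1, x2, alpha=0.4):
    """Create a new training sample by convexly mixing two feature
    vectors. Operates on precomputed embeddings, so no raw video or
    text needs to be decoded or re-encoded."""
    lam = rng.beta(alpha, alpha)          # mixing coefficient in (0, 1)
    return lam * x1 + (1.0 - lam) * x2, lam

# Hypothetical precomputed video embeddings
video_emb_a = rng.standard_normal(512)
video_emb_b = rng.standard_normal(512)
mixed, lam = feature_mixup(video_emb_a, video_emb_b)
print(mixed.shape)  # (512,)
```

Because the mix happens on cached embeddings, the augmentation cost is a single weighted sum per sample, in contrast to raw-data transforms that require a full forward pass through the encoder.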