Leveraging triplet loss for unsupervised action segmentation
© 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses.

In this paper, we propose a novel, fully unsupervised framework that learns action representations suitable for the action segmentation task from the single input video itself, without requiring any training data. Our method is a deep metric learning approach rooted in a shallow network with a triplet loss operating on similarity distributions, together with a novel triplet selection strategy that effectively models temporal and semantic priors to discover actions in the new representational space. Under these circumstances, we successfully recover temporal boundaries in the learned action representations with higher quality than existing unsupervised approaches. The proposed method is evaluated on two widely used benchmark datasets for the action segmentation task and achieves competitive performance by applying a generic clustering algorithm to the learned representations.

This work was supported by project PID2019-110977GA-I00, funded by MCIN/AEI/10.13039/501100011033 and by "ESF Investing in your future". Peer reviewed. Postprint (author's final draft).
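To make the core idea concrete, the following is a minimal sketch of a standard margin-based triplet loss. It is illustrative only: the paper's variant operates on similarity distributions with a bespoke triplet selection strategy, neither of which is reproduced here; the function name and margin value are this sketch's own choices.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard margin-based triplet loss on Euclidean distances.

    Encourages the anchor-positive distance to be smaller than the
    anchor-negative distance by at least `margin`. A generic sketch,
    not the paper's similarity-distribution formulation.
    """
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# A triplet where the positive is already closer than the negative by
# more than the margin incurs zero loss; a violating triplet does not.
a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # close to the anchor
n = np.array([1.0, 0.0])   # far from the anchor
print(triplet_loss(a, p, n))  # -> 0.0
```

In the paper's setting, such a loss would be minimized over triplets chosen by the temporal and semantic priors mentioned in the abstract, shaping the representation space before clustering.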
Efficient Keyword Spotting by capturing long-range interactions with Temporal Lambda Networks
Models based on attention mechanisms have shown unprecedented speech recognition performance. However, they are computationally expensive and unnecessarily complex for keyword spotting, a task targeted at small-footprint devices. This work explores the application of Lambda networks, an alternative framework for capturing long-range interactions without attention, to the keyword spotting task. We propose a novel ResNet-based model in which the residual blocks are replaced by temporal Lambda layers. Furthermore, the proposed architecture is built upon one-dimensional temporal convolutions that further reduce its complexity. The presented model not only reaches state-of-the-art accuracy on the Google Speech Commands dataset, but is also 85% and 65% lighter than its Transformer-based (KWT) and convolutional (Res15) counterparts while being up to 100 times faster. To the best of our knowledge, this is the first attempt to explore the Lambda framework within the speech domain, and it therefore opens further research into new interfaces based on this architecture.

Comment: speech recognition, keyword spotting, lambda networks
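The key property of a lambda layer is that the sequence is summarised into a single small matrix that every query multiplies, giving cost linear in sequence length rather than the quadratic cost of pairwise attention. Below is a minimal content-only sketch over a 1-D (temporal) sequence; positional lambdas, which the full Lambda-network formulation also uses, are omitted, and the random weights serve only to demonstrate shapes.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def content_lambda_layer(x, Wq, Wk, Wv):
    """Content-only lambda layer over a temporal sequence.

    x: (n, d) input sequence; Wq, Wk: (d, k); Wv: (d, v).
    Keys are normalised with a softmax over positions, then summarised
    into a single k x v 'lambda' that every query multiplies -- linear
    in n, unlike pairwise attention. Positional lambdas are omitted.
    """
    q = x @ Wq                    # (n, k) queries
    k = softmax(x @ Wk, axis=0)   # (n, k) keys, normalised over time
    v = x @ Wv                    # (n, v) values
    lam = k.T @ v                 # (k, v) content lambda
    return q @ lam                # (n, v) output

rng = np.random.default_rng(0)
n, d, kdim, vdim = 8, 16, 4, 16
x = rng.standard_normal((n, d))
y = content_lambda_layer(x,
                         rng.standard_normal((d, kdim)),
                         rng.standard_normal((d, kdim)),
                         rng.standard_normal((d, vdim)))
print(y.shape)  # (8, 16)
```

In a ResNet-style model as described in the abstract, a layer like this would sit where a residual block's spatial convolution normally goes, operating along the time axis of the audio features.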
Efficient keyword spotting by capturing long-range interactions with temporal lambda networks
Peer reviewed. Postprint (author's final draft).
Learning graph-based event representations for unconstrained video segmentation
Institut de Robòtica i Informàtica IndustrialRecent research has shown that, in particular domains, unsupervised learning algorithms are achieving on par, or even better performance than fully supervised algorithms, avoiding the need of human labelled data. The division of a video into events has been an active research topic through unsupervised algorithms, exploiting relations in the video itself for a temporal segmentation task. In particular, self-supervised learning has shown to be very useful learning video representations without any annotations assigned to it. This thesis proposes a self-supervised method for learning event representations of unconstrained complex activity videos. These are sequences of images with high temporal resolution and with very small visual variance between events, with a clear semantic differentiation for humans. The assumption underlying the proposed model is that a video can be represented by a graph that encodes both semantic and temporal similarity between events. Our method follows two steps: first, meaningful initial features are extracted by a spatio-temporal backbone neural network trained on a self-supervised contrastive task. Then, starting with this initial embedding, low-dimensional graph-based event representation features are iteratively learned jointly with its underlying graph structure. The main contribution in this work is to learn a function parameterized by a graph neural network that learns graph-based event feature representations by exploiting the semantic and temporal relatedness through a fully end-to-end self-supervised trainable approach. Experiments were performed in the challenging \textit{Breakfast Action Dataset} and we show that the proposed approach leads to an effective low-dimensional feature representation of the input data, suitable for the downstream task of event segmentation. 
Moreover, we show that the presented method, followed by a downstream clustering task, achieves on par state-of-the-art metrics on video segmentation of complex activity videos
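One plausible instantiation of a graph that "encodes both semantic and temporal similarity between events" is an affinity matrix combining a Gaussian kernel on feature distance with one on temporal distance. The sketch below shows only this fixed construction; the thesis learns the graph structure jointly with the representations via a graph neural network, which is not reproduced here, and the bandwidth parameters are arbitrary choices for illustration.

```python
import numpy as np

def event_graph(features, sigma_sem=1.0, sigma_t=2.0):
    """Affinity graph mixing semantic and temporal similarity.

    features: (n, d) per-segment embeddings in temporal order.
    Edge weight = Gaussian kernel on feature distance times a
    Gaussian kernel on temporal-index distance. A fixed, hand-built
    graph -- unlike the thesis, which learns the structure jointly.
    """
    n = len(features)
    t = np.arange(n)
    d_sem = np.linalg.norm(features[:, None] - features[None, :], axis=-1)
    d_t = np.abs(t[:, None] - t[None, :])
    W = np.exp(-(d_sem / sigma_sem) ** 2) * np.exp(-(d_t / sigma_t) ** 2)
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

# Two synthetic "events": segments 0-2 share one embedding, 3-5 another.
feats = np.vstack([np.zeros((3, 4)), np.ones((3, 4))])
W = event_graph(feats)
# Within-event edges are stronger than cross-event edges.
print(W[0, 1] > W[0, 3])  # -> True
```

Feeding such an affinity matrix to a generic clustering algorithm (e.g. spectral clustering) mirrors the downstream segmentation step described in the abstract.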
Learning graph-based event representations for unconstrained video segmentation
Master's thesis presented at the Universitat Politècnica de Catalunya, Master of Science in Advanced Mathematics and Mathematical Engineering (MAMME), 2021-05-18.
Reconeixement facial 3D amb càmeres d'automoció estèreo (3D face recognition with stereo automotive cameras)
The automotive sector is experiencing a rapid change that will influence the way we use cars. One of the main concerns of this new wave of automotive research is the security of future vehicles, with driver authentication a key factor that has not yet been solved. This dissertation develops face recognition software that seeks to contribute to solving this problem. Face recognition has proved to be the preferred method of biometric verification in sectors such as the phone industry. Nowadays, most facial recognition software on the market can easily be fooled by a printed picture of a face due to the lack of depth information. The developed software creates a 3D map of the face with the help of a stereo setup, offering a higher level of security than traditional recognition software. The person's identity and the 3D model are analysed by deep convolutional neural networks, providing a secure real-time face authentication method for the future automotive industry.
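The geometric core of building a 3D face map from a rectified stereo pair is the disparity-to-depth relation Z = f·B/d, where f is the focal length in pixels, B the baseline between the cameras, and d the pixel disparity. The sketch below shows only this relation; the camera parameters are made-up example values, not those of any setup described in the dissertation.

```python
def stereo_depth(disparity_px, focal_px, baseline_m):
    """Depth of a point from a rectified stereo pair: Z = f * B / d.

    disparity_px: horizontal pixel offset of the point between the
    two views; focal_px: focal length in pixels; baseline_m: distance
    between the camera centres in metres. Returns depth in metres.
    """
    return focal_px * baseline_m / disparity_px

# Hypothetical rig: f = 800 px, 10 cm baseline, 20 px disparity -> 4 m.
print(stereo_depth(20.0, 800.0, 0.10))  # -> 4.0
```

Applying this per matched pixel yields the depth map that distinguishes a real face from a flat printed photograph, which is the anti-spoofing property the abstract relies on.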