10 research outputs found

    Leveraging triplet loss for unsupervised action segmentation

    Get PDF
    In this paper, we propose a novel fully unsupervised framework that learns action representations suitable for the action segmentation task from the single input video itself, without requiring any training data. Our method is a deep metric learning approach rooted in a shallow network with a triplet loss operating on similarity distributions, together with a novel triplet selection strategy that effectively models temporal and semantic priors to discover actions in the new representational space. With this approach, we recover temporal boundaries from the learned action representations with higher quality than existing unsupervised approaches. The proposed method is evaluated on two widely used benchmark datasets for the action segmentation task and achieves competitive performance by applying a generic clustering algorithm to the learned representations. This work was supported by the project PID2019-110977GA-I00 funded by MCIN/AEI/10.13039/501100011033 and by "ESF Investing in your future". © 2023 IEEE.
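    As a rough illustration of the core idea only, the sketch below trains a shallow embedding network with a triplet margin loss, using a temporal prior for triplet selection (positives are frames near the anchor, negatives are random and therefore usually distant). All names, dimensions and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only (assumed names and hyperparameters): a shallow
# embedding network trained with a triplet margin loss, where a temporal
# prior drives triplet selection over frame features of a single video.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShallowEmbedder(nn.Module):
    def __init__(self, in_dim=2048, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, emb_dim))

    def forward(self, x):
        # L2-normalised embeddings so distances are directly comparable
        return F.normalize(self.net(x), dim=-1)

def sample_triplets(num_frames, window=16, num=256):
    """Temporal prior: positives lie within `window` frames of the anchor;
    negatives are random frames, distant from the anchor with high probability."""
    anchors = torch.randint(0, num_frames, (num,))
    positives = (anchors + torch.randint(1, window, (num,))).clamp(max=num_frames - 1)
    negatives = torch.randint(0, num_frames, (num,))
    return anchors, positives, negatives

def train_step(frame_feats, model, optimizer, margin=0.2):
    a, p, n = sample_triplets(frame_feats.shape[0])
    emb = model(frame_feats)
    loss = F.triplet_margin_loss(emb[a], emb[p], emb[n], margin=margin)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```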

    Efficient Keyword Spotting by capturing long-range interactions with Temporal Lambda Networks

    Get PDF
    Models based on attention mechanisms have shown unprecedented speech recognition performance. However, they are computationally expensive and unnecessarily complex for keyword spotting, a task targeted at small-footprint devices. This work explores the application of Lambda networks, an alternative framework for capturing long-range interactions without attention, to the keyword spotting task. We propose a novel ResNet-based model obtained by replacing the residual blocks with temporal Lambda layers. Furthermore, the proposed architecture is built upon uni-dimensional temporal convolutions that further reduce its complexity. The presented model not only reaches state-of-the-art accuracies on the Google Speech Commands dataset, but is also 85% and 65% lighter than its Transformer-based (KWT) and convolutional (Res15) counterparts, respectively, while being up to 100 times faster. To the best of our knowledge, this is the first attempt to explore the Lambda framework within the speech domain, and we therefore expect it to open further research into new interfaces based on this architecture. Keywords: speech recognition, keyword spotting, lambda networks.
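    To make the layer swap concrete, here is a minimal single-head sketch of a 1D (temporal) lambda layer that combines a content lambda with local position lambdas produced by a "lambda convolution". The query/key/value dimensions and the receptive field are illustrative assumptions; the published model additionally uses multiple heads and normalisation, which are omitted here.

```python
# Minimal single-head 1D lambda layer sketch (assumed dimensions), combining
# a global content lambda with local position lambdas from a convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalLambdaLayer(nn.Module):
    def __init__(self, dim, dim_k=16, dim_v=None, receptive=23):
        super().__init__()
        dim_v = dim_v or dim
        self.to_q = nn.Conv1d(dim, dim_k, 1, bias=False)
        self.to_k = nn.Conv1d(dim, dim_k, 1, bias=False)
        self.to_v = nn.Conv1d(dim, dim_v, 1, bias=False)
        # "lambda convolution": yields a (dim_k x dim_v) matrix per time step
        self.pos_conv = nn.Conv2d(1, dim_k, (receptive, 1),
                                  padding=(receptive // 2, 0))

    def forward(self, x):                           # x: (batch, dim, time)
        q = self.to_q(x)                            # (b, k, n)
        k = F.softmax(self.to_k(x), dim=-1)         # (b, k, n), softmax over time
        v = self.to_v(x)                            # (b, v, n)
        lam_c = torch.einsum('bkn,bvn->bkv', k, v)  # global content lambda
        y_c = torch.einsum('bkn,bkv->bvn', q, lam_c)
        lam_p = self.pos_conv(v.transpose(1, 2).unsqueeze(1))   # (b, k, n, v)
        y_p = torch.einsum('bkn,bknv->bvn', q, lam_p)           # position term
        return y_c + y_p
```

    A layer like this could then stand in for the temporal convolution inside a ResNet-style residual block, along the lines the abstract describes.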

    Efficient keyword spotting by capturing long-range interactions with temporal lambda networks

    Get PDF
    Models based on attention mechanisms have shown unprecedented speech recognition performance. However, they are computationally expensive and unnecessarily complex for keyword spotting, a task targeted at small-footprint devices. This work explores the application of Lambda networks, an alternative framework for capturing long-range interactions without attention, to the keyword spotting task. We propose a novel ResNet-based model obtained by replacing the residual blocks with temporal Lambda layers. Furthermore, the proposed architecture is built upon uni-dimensional temporal convolutions that further reduce its complexity. The presented model not only reaches state-of-the-art accuracies on the Google Speech Commands dataset, but is also 85% and 65% lighter than its Transformer-based (KWT) and convolutional (ResNet15) counterparts, respectively, while being up to 100× faster. To the best of our knowledge, this is the first attempt to explore the Lambda framework within the speech domain, and we therefore expect it to open further research into new interfaces based on this architecture.

    Learning graph-based event representations for unconstrained video segmentation

    No full text
    Institut de Robòtica i Informàtica Industrial. Recent research has shown that, in particular domains, unsupervised learning algorithms achieve on-par or even better performance than fully supervised algorithms, avoiding the need for human-labelled data. Dividing a video into events with unsupervised algorithms has been an active research topic, exploiting relations within the video itself for the temporal segmentation task. In particular, self-supervised learning has proved very useful for learning video representations without any annotations. This thesis proposes a self-supervised method for learning event representations of unconstrained complex activity videos: sequences of images with high temporal resolution and very small visual variance between events, yet with a clear semantic differentiation for humans. The assumption underlying the proposed model is that a video can be represented by a graph that encodes both semantic and temporal similarity between events. Our method follows two steps: first, meaningful initial features are extracted by a spatio-temporal backbone neural network trained on a self-supervised contrastive task. Then, starting from this initial embedding, low-dimensional graph-based event representation features are iteratively learned jointly with the underlying graph structure. The main contribution of this work is a function, parameterized by a graph neural network, that learns graph-based event feature representations by exploiting semantic and temporal relatedness in a fully end-to-end, self-supervised trainable approach. Experiments were performed on the challenging Breakfast Action Dataset, and we show that the proposed approach leads to an effective low-dimensional feature representation of the input data, suitable for the downstream task of event segmentation. Moreover, we show that the presented method, followed by a downstream clustering task, achieves metrics on par with the state of the art on video segmentation of complex activity videos.
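    As a toy illustration of the modelling assumption, the sketch below builds a graph whose edge weights mix semantic (cosine) and temporal (Gaussian) similarity between events, then performs one round of feature propagation followed by a learned projection. The weighting scheme, dimensions and the single propagation step are assumptions made for illustration; the thesis learns the graph and the representations jointly, end to end, with a graph neural network.

```python
# Toy sketch (assumed parameters): adjacency mixing semantic and temporal
# similarity, one propagation step, then a learned low-dimensional projection.
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_event_graph(feats, sigma_t=5.0, tau=0.5):
    """Row-normalised adjacency combining cosine and temporal similarity."""
    n = feats.shape[0]
    sem = F.cosine_similarity(feats.unsqueeze(1), feats.unsqueeze(0), dim=-1)
    t = torch.arange(n, dtype=torch.float32)
    temp = torch.exp(-(t.unsqueeze(1) - t.unsqueeze(0)) ** 2 / (2 * sigma_t ** 2))
    adj = tau * sem + (1 - tau) * temp
    return adj / adj.sum(dim=1, keepdim=True)

class GraphEventEncoder(nn.Module):
    """One neighbourhood-aggregation step followed by a projection."""
    def __init__(self, in_dim=2048, out_dim=64):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, feats):                  # feats: (num_events, in_dim)
        adj = build_event_graph(feats)
        agg = adj @ feats                      # propagate features over the graph
        return F.normalize(self.proj(agg), dim=-1)
```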

    Learning graph-based event representations for unconstrained video segmentation

    No full text
    Master's thesis presented at the Universidad Politécnica de Cataluña, Master of Science in Advanced Mathematics and Mathematical Engineering (MAMME), 2021-05-18. Recent research has shown that, in particular domains, unsupervised learning algorithms achieve on-par or even better performance than fully supervised algorithms, avoiding the need for human-labelled data. Dividing a video into events with unsupervised algorithms has been an active research topic, exploiting relations within the video itself for the temporal segmentation task. In particular, self-supervised learning has proved very useful for learning video representations without any annotations. This thesis proposes a self-supervised method for learning event representations of unconstrained complex activity videos: sequences of images with high temporal resolution and very small visual variance between events, yet with a clear semantic differentiation for humans. The assumption underlying the proposed model is that a video can be represented by a graph that encodes both semantic and temporal similarity between events. Our method follows two steps: first, meaningful initial features are extracted by a spatio-temporal backbone neural network trained on a self-supervised contrastive task. Then, starting from this initial embedding, low-dimensional graph-based event representation features are iteratively learned jointly with the underlying graph structure. The main contribution of this work is a function, parameterized by a graph neural network, that learns graph-based event feature representations by exploiting semantic and temporal relatedness in a fully end-to-end, self-supervised trainable approach. Experiments were performed on the challenging Breakfast Action Dataset, and we show that the proposed approach leads to an effective low-dimensional feature representation of the input data, suitable for the downstream task of event segmentation. Moreover, we show that the presented method, followed by a downstream clustering task, achieves metrics on par with the state of the art on video segmentation of complex activity videos.

    3D face recognition with automotive stereo cameras

    No full text
    The automotive sector is experiencing a rapid change that will influence the way in which we use cars. One of the main concerns of this new wave of automotive research is the security of future vehicles, with driver authentication being a key factor that has not yet been solved. This dissertation develops face recognition software that seeks to contribute to resolving this problem. Face recognition has proved to be the preferred method of biometric verification in different sectors such as the phone industry. Nowadays, most facial recognition software on the market can easily be fooled by a printed picture of a face due to the lack of depth information. The developed software creates a 3D map of the face with the help of a stereo setup, offering a higher level of security than traditional recognition software. Analysis of the person's identity and the 3D model are processed through deep convolutional neural networks, providing a secure real-time face authentication method for the future automotive industry.
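    As a rough illustration of the stereo-depth idea only (not the dissertation's pipeline, which relies on deep convolutional neural networks), the sketch below computes a dense disparity map from a rectified stereo pair with OpenCV and applies a simple flatness check over a detected face region to reject printed photographs. The `min_relief` threshold and the function names are assumptions.

```python
# Illustrative sketch: stereo disparity with OpenCV SGBM plus a naive
# flatness test over a face bounding box. Thresholds are assumed values.
import cv2
import numpy as np

def face_depth_map(left_gray, right_gray, num_disp=64, block=9):
    """Dense disparity map from a rectified stereo pair."""
    matcher = cv2.StereoSGBM_create(minDisparity=0,
                                    numDisparities=num_disp,
                                    blockSize=block)
    # OpenCV returns fixed-point disparities scaled by 16
    return matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0

def looks_flat(disp, face_box, min_relief=3.0):
    """Reject a face region whose disparity range is too small to be 3D."""
    x, y, w, h = face_box
    region = disp[y:y + h, x:x + w]
    valid = region[region > 0]
    if valid.size == 0:
        return True
    return float(valid.max() - valid.min()) < min_relief
```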

    3D face recognition with automotive stereo cameras

    No full text
    The automotive sector is experiencing a rapid change that will influence the way in which we use cars. One of the main concerns of this new wave of automotive research is the security of future vehicles, with driver authentication being a key factor that has not yet been solved. This dissertation develops face recognition software that seeks to contribute to resolving this problem. Face recognition has proved to be the preferred method of biometric verification in different sectors such as the phone industry. Nowadays, most facial recognition software on the market can easily be fooled by a printed picture of a face due to the lack of depth information. The developed software creates a 3D map of the face with the help of a stereo setup, offering a higher level of security than traditional recognition software. Analysis of the person's identity and the 3D model are processed through deep convolutional neural networks, providing a secure real-time face authentication method for the future automotive industry.

    Discovery of drug–omics associations in type 2 diabetes with generative deep-learning models

    No full text

    Author Correction: Discovery of drug–omics associations in type 2 diabetes with generative deep-learning models (Nature Biotechnology, 2023, 41(3), 399–408, 10.1038/s41587-022-01520-x)

    No full text