
    Attention Mechanisms for Object Recognition with Event-Based Cameras

    Event-based cameras are neuromorphic sensors capable of efficiently encoding visual information in the form of sparse sequences of events. Being biologically inspired, they are commonly used to exploit some of the computational and power-consumption benefits of biological vision. In this paper we focus on a specific feature of vision: visual attention. We propose two attentive models for event-based vision: an algorithm that tracks event activity within the field of view to locate regions of interest, and a fully differentiable attention procedure based on the DRAW neural model. We highlight the strengths and weaknesses of the proposed methods on four datasets, the Shifted N-MNIST, Shifted MNIST-DVS, CIFAR10-DVS and N-Caltech101 collections, using the Phased LSTM recognition network as a baseline reference model, obtaining improvements in terms of both translation and scale invariance. Comment: WACV2019 camera-ready submission.
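    As an illustration of the first attentive model described above (tracking event activity to locate regions of interest), here is a minimal Python sketch; the event array layout, the exponential decay, and the sliding-window scan are assumptions for illustration and are not the paper's implementation.

```python
import numpy as np

def locate_roi(events, sensor_shape=(128, 128), decay=0.9, window=32):
    """Accumulate events into a decaying activity map and return the
    most active window as a region of interest.

    `events` is assumed to be an (N, 3) array of (t, x, y) rows,
    sorted by timestamp; this layout is illustrative only.
    """
    activity = np.zeros(sensor_shape, dtype=np.float32)
    for _, x, y in events:
        activity *= decay                # older events fade out
        activity[int(y), int(x)] += 1.0  # each new event reinforces its pixel

    # Scan window-sized patches (half-window stride) and keep the densest one.
    h, w = sensor_shape
    best, best_pos = -1.0, (0, 0)
    for i in range(0, h - window + 1, window // 2):
        for j in range(0, w - window + 1, window // 2):
            score = activity[i:i + window, j:j + window].sum()
            if score > best:
                best, best_pos = score, (i, j)
    return best_pos, window  # top-left corner and side of the attended crop
```

    The second model is different in spirit: a DRAW-style attention window is fully differentiable, so its position and scale are learned end-to-end with the recognition network rather than computed by a hand-designed activity tracker.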

    Transfer learning: bridging the gap between deep learning and domain-specific text mining

    Inspired by the success of deep learning techniques in Natural Language Processing (NLP), this dissertation tackles domain-specific text mining problems for which generic deep learning approaches would fail. More specifically, the domain-specific problems are: (1) success prediction in crowdfunding, (2) variant identification in biomedical literature, and (3) text data augmentation for low-resource domains. In the first part, transfer learning from a multimodal perspective is utilized to facilitate project success prediction in the crowdfunding application. Even though the information in a project profile can be of different modalities, such as text, images, and metadata, most existing prediction approaches leverage only the text modality. It is promising to utilize the visual images in project profiles to find out how images could contribute to success prediction. An advanced neural network scheme that combines information learned from different modalities is designed and evaluated for project success prediction. In the second part, transfer learning is combined with deep learning techniques to solve genomic variant Named Entity Recognition (NER) problems in biomedical literature. Most advanced generic NER algorithms can fail due to the restricted training corpus. However, those generic deep learning algorithms are capable of learning from a canonical corpus, without any effort on feature engineering. This work aims to build an end-to-end deep learning approach that transfers domain-specific knowledge to those advanced generic NER algorithms, addressing the challenges of low-resource training while requiring neither hand-crafted features nor post-processing rules. For the last part, transfer learning with knowledge distillation and active learning is utilized to solve text augmentation for low-resource domains. Most recent text augmentation methods rely heavily on large external resources. This work is dedicated to solving the text augmentation problem adaptively and consistently with minimal resources for token-level tasks like NER. The solution can also assure the reliability of machine labels for noisy data and can enhance training consistency with noisy labels. Each of these works is evaluated on its respective domain-specific benchmarks. Experimental results demonstrate the effectiveness of the proposed methods, and the observed advantages indicate promising potential for transfer learning in domain-specific applications.
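    To make the multimodal idea in the first part concrete, the snippet below combines a pre-extracted text embedding with an image embedding through late fusion for a binary success prediction; the embedding dimensions, projection layers, and fusion head are illustrative assumptions, not the dissertation's actual architecture.

```python
import torch
import torch.nn as nn

class MultimodalSuccessPredictor(nn.Module):
    """Late fusion of text and image embeddings for project success prediction.
    Embedding sizes and the single-layer fusion head are illustrative."""

    def __init__(self, text_dim=768, image_dim=2048, hidden=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.image_proj = nn.Linear(image_dim, hidden)
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden, 1),  # concatenated modalities -> success logit
        )

    def forward(self, text_emb, image_emb):
        fused = torch.cat([self.text_proj(text_emb), self.image_proj(image_emb)], dim=-1)
        return self.classifier(fused)

# Usage with pre-extracted features (e.g. a sentence embedding and a CNN pool vector):
model = MultimodalSuccessPredictor()
logit = model(torch.randn(4, 768), torch.randn(4, 2048))
```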

    An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

    Speech enhancement and speech separation are two related tasks whose purpose is to extract, respectively, one or several target speech signals from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for speech enhancement and speech separation systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets; and objective functions. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation. Finally, we survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, because they are generally used to compare different systems and determine their performance.
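    As a concrete instance of the fusion step this survey focuses on, here is a minimal sketch of a mask-based audio-visual enhancement model that concatenates per-frame acoustic and visual features and predicts a time-frequency mask with a BLSTM; the feature dimensions and the simple concatenation fusion are assumptions for illustration, not a particular system from the survey.

```python
import torch
import torch.nn as nn

class AVMaskEstimator(nn.Module):
    """Concatenation-based audio-visual fusion that predicts a magnitude mask.
    Dimensions (257 frequency bins, 512-d visual embedding) are illustrative."""

    def __init__(self, n_freq=257, visual_dim=512, hidden=300):
        super().__init__()
        self.rnn = nn.LSTM(n_freq + visual_dim, hidden,
                           batch_first=True, bidirectional=True)
        self.mask = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_spec, visual_emb):
        # noisy_spec: (batch, frames, n_freq); visual_emb: (batch, frames, visual_dim)
        fused, _ = self.rnn(torch.cat([noisy_spec, visual_emb], dim=-1))
        return self.mask(fused) * noisy_spec  # masked (enhanced) magnitude spectrogram

model = AVMaskEstimator()
enhanced = model(torch.rand(2, 100, 257), torch.rand(2, 100, 512))
```

    Other fusion strategies covered by such surveys (attention-based fusion, early fusion at the spectrogram level) swap out the concatenation step but keep the same overall mask-estimation structure.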

    Machine Learning for Human Activity Detection in Smart Homes

    Recognizing human activities in domestic environments from audio and active power consumption sensors is a challenging task since, on the one hand, environmental sound signals are multi-source, heterogeneous, and time-varying and, on the other hand, active power consumption varies significantly across similar types of electrical appliances. Many systems have been proposed to process environmental sound signals for event detection in ambient assisted living applications. Typically, these systems use feature extraction, selection, and classification. However, despite major advances, several important questions remain unanswered, especially in real-world settings. A part of this thesis contributes to the body of knowledge in the field by addressing the following problems for ambient sounds recorded in various real-world kitchen environments: 1) which features and which classifiers are most suitable in the presence of background noise? 2) what is the effect of signal duration on recognition accuracy? 3) how do the SNR and the distance between the microphone and the audio source affect recognition accuracy in an environment in which the system was not trained? We show that for systems that use traditional classifiers, it is beneficial to combine gammatone frequency cepstral coefficients and discrete wavelet transform coefficients and to use a gradient boosting classifier. For systems based on deep learning, we consider 1D and 2D CNNs using mel-spectrogram energies and mel-spectrogram images as inputs, respectively, and show that the 2D CNN outperforms the 1D CNN. We obtained competitive classification results for two such systems and validated the performance of our algorithms on public datasets (the Google Brain/TensorFlow Speech Recognition Challenge and the 2017 Detection and Classification of Acoustic Scenes and Events Challenge). Regarding the problem of energy-based human activity recognition in a household environment, machine learning techniques are applied to infer the state of household appliances from their energy consumption data, and rule-based scenarios that exploit these states are used to detect human activity. Since most activities within a house are related to the operation of an electrical appliance, this unimodal approach has a significant advantage, using inexpensive smart plugs and smart meters for each appliance. This part of the thesis proposes the use of unobtrusive and easy-to-install tools (smart plugs) for data collection and a decision engine that combines energy signal classification using dominant classifiers (compared in advance via grid search) with a probabilistic measure for appliance usage. This helps preserve the privacy of the resident, since all activities are stored in a local database. DNNs have received great research interest in the field of computer vision. In this thesis we adapted different architectures to the problem of human activity recognition. We analyze the quality of the extracted features and, more specifically, how model architectures and parameters affect the ability of the features automatically extracted by DNNs to separate activity classes in the final feature space. Additionally, the architectures that we applied to our main problem were also applied to text classification, in which we treat the input text as an image and apply 2D CNNs to learn the local and global semantics of the sentences from the variations of the visual patterns of words. This work serves as a first step toward creating a dialogue agent that would not require any natural language preprocessing. Finally, since in many domestic environments human speech is present alongside other environmental sounds, we developed a Convolutional Recurrent Neural Network to separate the sound sources and applied novel post-processing filters in order to obtain an end-to-end noise-robust system. Our algorithm ranked first in the Apollo-11 Fearless Steps Challenge. Acknowledgement: Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement No. 676157, project ACROSSIN.
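    A minimal sketch of the 2D-CNN branch mentioned above, which classifies mel-spectrogram images of ambient sounds; the input resolution, channel widths, and number of classes are illustrative assumptions rather than the thesis's exact configuration.

```python
import torch
import torch.nn as nn

class MelSpectrogramCNN(nn.Module):
    """Small 2D CNN over a (1, n_mels, n_frames) mel-spectrogram image;
    sizes are illustrative."""

    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # global pooling over time and frequency
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, mel):
        # mel: (batch, 1, n_mels, n_frames)
        return self.classifier(self.features(mel).flatten(1))

model = MelSpectrogramCNN()
logits = model(torch.rand(8, 1, 64, 128))  # e.g. 64 mel bands, 128 frames
```

    The 1D-CNN variant described above would instead convolve along the time axis over per-frame mel-energy vectors; the rest of the pipeline (pooling followed by a linear classifier) stays the same.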

    Video Based Automatic Speech Recognition Using Neural Networks

    Neural network approaches have become popular in the field of automatic speech recognition (ASR). Most ASR methods use audio data to classify words. Lip-reading ASR techniques utilize only video data, which compensates for noisy environments where audio may be compromised. A comprehensive approach to video-based ASR is developed, including the vetting of datasets and the development of a preprocessing chain. This approach is based on neural networks, namely 3D convolutional neural networks (3D-CNN) and long short-term memory (LSTM) networks, which are designed to take in temporal data such as videos. Various combinations of neural network architectures and preprocessing techniques are explored. The best-performing architecture, a CNN with bidirectional LSTM, compares favorably against recent works on video-based ASR.
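    A minimal sketch of the kind of front-end described above, a 3D convolution over a sequence of mouth-region frames followed by a bidirectional LSTM for word classification; the crop size, channel widths, and vocabulary size are illustrative assumptions rather than the thesis's exact architecture.

```python
import torch
import torch.nn as nn

class LipReadingNet(nn.Module):
    """3D-CNN front-end + bidirectional LSTM for word classification from video.
    Input: (batch, 1, frames, 64, 64) grayscale mouth crops; sizes are illustrative."""

    def __init__(self, n_words=500, hidden=256):
        super().__init__()
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),     # pool spatially, keep time steps
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),       # collapse space, keep time steps
        )
        self.rnn = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_words)

    def forward(self, clip):
        feats = self.frontend(clip).squeeze(-1).squeeze(-1)  # (batch, 64, frames)
        feats = feats.transpose(1, 2)                        # (batch, frames, 64)
        out, _ = self.rnn(feats)
        return self.classifier(out[:, -1])                   # word logits from last step

model = LipReadingNet()
logits = model(torch.rand(2, 1, 16, 64, 64))  # two clips of 16 frames each
```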