
    Learning long-range spatial dependencies with horizontal gated-recurrent units

    Full text link
    Progress in deep learning has spawned great successes in many engineering applications. As a prime example, convolutional neural networks, a type of feedforward neural network, are now approaching -- and sometimes even surpassing -- human accuracy on a variety of visual recognition tasks. Here, however, we show that these neural networks and their recent extensions struggle in recognition tasks where co-dependent visual features must be detected over long spatial ranges. We introduce the horizontal gated-recurrent unit (hGRU) to learn intrinsic horizontal connections -- both within and across feature columns. We demonstrate that a single hGRU layer matches or outperforms all tested feedforward hierarchical baselines, including state-of-the-art architectures with orders of magnitude more free parameters. We further discuss the biological plausibility of the hGRU in comparison to anatomical data from the visual cortex as well as human behavioral data on a classic contour detection task.

    Comment: Published at NeurIPS 2018, https://papers.nips.cc/paper/7300-learning-long-range-spatial-dependencies-with-horizontal-gated-recurrent-units
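
    A minimal sketch of the core idea, assuming a PyTorch setting: a convolutional GRU whose gates and candidate state are spatial convolutions over the hidden state, so repeated application lets activity propagate laterally across the feature map. This is an illustrative simplification, not the paper's exact hGRU formulation (which uses separate inhibition and excitation stages); all names and hyperparameters below are placeholders.

```python
# Simplified convolutional GRU with lateral ("horizontal") connections.
# Not the paper's exact hGRU; a sketch of the recurrent-lateral idea.
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 5):
        super().__init__()
        pad = kernel_size // 2
        # Gates mix the feedforward drive X with lateral context from H.
        self.update = nn.Conv2d(2 * channels, channels, kernel_size, padding=pad)
        self.reset = nn.Conv2d(2 * channels, channels, kernel_size, padding=pad)
        self.cand = nn.Conv2d(2 * channels, channels, kernel_size, padding=pad)

    def forward(self, x, h):
        z = torch.sigmoid(self.update(torch.cat([x, h], dim=1)))
        r = torch.sigmoid(self.reset(torch.cat([x, h], dim=1)))
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde

# The same image drives the cell for several timesteps, letting lateral
# connections carry information over long spatial ranges.
cell = ConvGRUCell(channels=16)
x = torch.randn(1, 16, 64, 64)   # feedforward feature map
h = torch.zeros_like(x)
for _ in range(8):               # recurrent timesteps
    h = cell(x, h)
print(h.shape)                   # torch.Size([1, 16, 64, 64])
```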

    Deep Learning for Audio Signal Processing

    Full text link
    Given the recent surge in developments of deep learning, this article provides a review of state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side by side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and the potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, and more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e., audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.

    Comment: 15 pages, 2 PDF figures
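
    As a concrete illustration of the log-mel representation the review cites as dominant, here is a minimal sketch using librosa; the frame sizes and mel-band count are typical values chosen for the example, not prescriptions from the article.

```python
# Log-mel spectrogram extraction: STFT -> mel filterbank -> log compression.
import numpy as np
import librosa

sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 440.0 * t).astype(np.float32)  # stand-in signal

mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=512, hop_length=160, n_mels=40)  # power mel spectrogram
log_mel = librosa.power_to_db(mel)                      # log compression
print(log_mel.shape)  # (40, n_frames)
```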

    Learning to Recognize Actions from Limited Training Examples Using a Recurrent Spiking Neural Model

    Full text link
    A fundamental challenge in machine learning today is to build a model that can learn from few examples. Here, we describe a reservoir-based spiking neural model for learning to recognize actions from a limited number of labeled videos. First, we propose a novel encoding, inspired by how microsaccades influence visual perception, to extract spike information from raw video data while preserving the temporal correlation across different frames. Using this encoding, we show that the reservoir generalizes its rich dynamical activity toward signature actions/movements, enabling it to learn from few training examples. We evaluate our approach on the UCF-101 dataset. Our experiments demonstrate that the proposed reservoir achieves 81.3%/87% Top-1/Top-5 accuracy, respectively, on the 101-class data while requiring just 8 video examples per class for training. Our results establish a new benchmark for action recognition from limited video examples for spiking neural models while yielding competitive accuracy with respect to state-of-the-art non-spiking neural models.

    Comment: 13 figures (includes supplementary information)
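
    A minimal reservoir-computing sketch of the principle the paper builds on: a fixed random recurrent pool is driven by an input stream, and only a linear readout is trained, which is why few labeled examples can suffice. The paper's model uses spiking neurons and a microsaccade-inspired encoding, which are not reproduced here; all sizes and signals below are placeholders.

```python
# Echo-state-style reservoir with a ridge-regression readout.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, n_cls = 32, 300, 5

W_in = rng.normal(0, 0.5, (n_res, n_in))         # fixed input weights
W = rng.normal(0, 1.0, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1

def run(u):                                      # u: (T, n_in) input stream
    x = np.zeros(n_res)
    for u_t in u:
        x = np.tanh(W_in @ u_t + W @ x)          # untrained reservoir update
    return x                                     # final state as the feature

# Train only the readout (ridge regression) on reservoir states.
X = np.stack([run(rng.normal(size=(20, n_in))) for _ in range(100)])
y = rng.integers(0, n_cls, 100)
Y = np.eye(n_cls)[y]
W_out = np.linalg.solve(X.T @ X + 1e-2 * np.eye(n_res), X.T @ Y)
pred = (X @ W_out).argmax(1)
print("train accuracy:", (pred == y).mean())
```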

    Action Search: Spotting Actions in Videos and Its Application to Temporal Action Localization

    Full text link
    State-of-the-art temporal action detectors inefficiently search the entire video for specific actions. Despite the encouraging progress these methods achieve, it is crucial to design automated approaches that only explore the parts of the video that are most relevant to the actions being searched for. To address this need, we propose the new problem of action spotting in video, which we define as finding a specific action in a video while observing only a small portion of that video. Inspired by the observation that humans are extremely efficient and accurate at spotting and finding action instances in video, we propose Action Search, a novel Recurrent Neural Network approach that mimics the way humans spot actions. Moreover, to address the absence of data recording the behavior of human annotators, we put forward the Human Searches dataset, which compiles the search sequences employed by human annotators spotting actions in the AVA and THUMOS14 datasets. We consider temporal action localization as an application of the action spotting problem. Experiments on the THUMOS14 dataset reveal that our model is not only able to explore the video efficiently (observing on average 17.3% of the video) but also accurately finds human activities, with 30.8% mAP.

    Comment: Accepted to ECCV 2018
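
    A minimal sketch of the spotting loop the abstract describes, assuming a PyTorch setting: a recurrent model consumes features from the currently observed temporal position and predicts where to look next, so only a fraction of the video is ever visited. The architecture, feature dimensions, and starting position are placeholders, not the paper's exact design.

```python
# Recurrent "where to look next" search over per-frame video features.
import torch
import torch.nn as nn

class SearchRNN(nn.Module):
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.hidden = hidden
        self.cell = nn.LSTMCell(feat_dim + 1, hidden)
        self.next_pos = nn.Linear(hidden, 1)   # next location, mapped to [0, 1]

    def forward(self, video_feats, steps=6):
        T = video_feats.shape[0]
        h = torch.zeros(1, self.hidden)
        c = torch.zeros(1, self.hidden)
        pos = torch.tensor([[0.5]])            # start searching mid-video
        visited = []
        for _ in range(steps):
            idx = int(pos.item() * (T - 1))
            visited.append(idx)
            obs = torch.cat([video_feats[idx].unsqueeze(0), pos], dim=1)
            h, c = self.cell(obs, (h, c))
            pos = torch.sigmoid(self.next_pos(h))  # predict next frame to visit
        return visited

feats = torch.randn(300, 512)                  # stand-in per-frame features
print(SearchRNN()(feats))                      # indices visited by the search
```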

    Speech enhancement using deep learning

    Get PDF
    This thesis explores the possibility of enhancing noisy speech signals using deep neural networks. Signal enhancement is a classic problem in speech processing; in recent years, deep learning has been applied to many speech processing tasks, as it has provided very satisfactory results. As a first step, a Signal Analysis Module was implemented to compute the magnitude and phase of each audio file in the database. The signal is represented by its magnitude and its phase, where the magnitude is modified by the neural network and the signal is then reconstructed with the original phase. The implementation of the neural networks is divided into two stages. The first stage was the implementation of a Speech Activity Detection Deep Neural Network (SAD-DNN). The magnitude previously computed from the noisy data trains the SAD-DNN to classify each frame as speech or non-speech. This classification is useful for the network that performs the final cleaning. The Speech Activity Detection Deep Neural Network is followed by a Denoising Auto-Encoder (DAE). The magnitude and the speech/non-speech label are the input of this second deep neural network, which is in charge of denoising the speech signal. The first stage is also optimized to be adequate for the final task in this second stage. Neural networks require datasets for training; in this project, the TIMIT corpus [9] was used as the clean-voice dataset (target) and the QUT-NOISE TIMIT corpus [4] as the noisy dataset (source). Finally, a Signal Synthesis Module reconstructs the clean speech signal from the enhanced magnitudes and the original phase. In the end, the results produced by the system were analysed using both objective and subjective measures.
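
    A minimal sketch of the analysis/synthesis pipeline the thesis describes: split the signal into magnitude and phase, let a model alter only the magnitude, then resynthesize with the original phase. The "denoiser" here is a placeholder (simple spectral flooring), standing in for the thesis's SAD-DNN + DAE; the STFT parameters are assumptions.

```python
# Magnitude/phase split, magnitude-only enhancement, phase-preserving ISTFT.
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr).astype(np.float32)      # stand-in noisy signal

S = librosa.stft(y, n_fft=512, hop_length=128)  # complex spectrogram
mag, phase = np.abs(S), np.angle(S)             # "Signal Analysis Module"

# Placeholder for the neural enhancement stage (SAD-DNN + DAE in the thesis).
enhanced_mag = np.maximum(mag - 0.1 * mag.mean(), 0.0)

S_hat = enhanced_mag * np.exp(1j * phase)       # keep the original phase
y_hat = librosa.istft(S_hat, hop_length=128)    # "Signal Synthesis Module"
print(y_hat.shape)
```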

    Classification of Occluded Objects using Fast Recurrent Processing

    Full text link
    Recurrent neural networks are powerful tools for handling incomplete-data problems in computer vision, thanks to their significant generative capabilities. However, the computational demand of these algorithms is too high for real-time operation without specialized hardware or software solutions. In this paper, we propose a framework for augmenting a feedforward network with recurrent processing capabilities without sacrificing much computational efficiency. We assume a mixture model and generate samples of the last hidden layer according to the class decisions of the output layer, modify the hidden layer activity using the samples, and propagate to lower layers. For the visual occlusion problem, this iterative procedure emulates a feedforward-feedback loop, filling in the missing hidden layer activity with meaningful representations. The proposed algorithm is tested on a widely used dataset and shown to achieve a 2× improvement in classification accuracy for occluded objects. When compared to Restricted Boltzmann Machines, our algorithm shows superior performance for occluded object classification.

    Comment: arXiv admin note: text overlap with arXiv:1409.8576 by other authors
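
    A minimal sketch of the feedforward-feedback loop: classify, draw a class-conditioned template for the top hidden layer, blend it with the observed (occluded) activity, and classify again. The per-class templates stand in for the paper's mixture-model samples; weights, shapes, and the blend factor are all illustrative.

```python
# Iterative class-conditioned filling-in of occluded hidden-layer activity.
import numpy as np

rng = np.random.default_rng(1)
n_hidden, n_cls = 64, 10

W_out = rng.normal(0, 0.1, (n_cls, n_hidden))        # stand-in trained readout
class_means = rng.normal(0, 1.0, (n_cls, n_hidden))  # per-class hidden templates

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = rng.normal(size=n_hidden)
h[: n_hidden // 2] = 0.0                             # "occluded" activity

alpha = 0.5                                          # blend factor
for _ in range(3):                                   # iterative refinement
    p = softmax(W_out @ h)
    sample = class_means[p.argmax()]                 # sample for winning class
    h = (1 - alpha) * h + alpha * sample             # fill in missing activity
print("final class:", int(softmax(W_out @ h).argmax()))
```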

    FastDeepIoT: Towards Understanding and Optimizing Neural Network Execution Time on Mobile and Embedded Devices

    Full text link
    Deep neural networks show great potential as solutions to many sensing application problems, but their excessive resource demand drives up execution time, posing a serious impediment to deployment on low-end devices. To address this challenge, recent literature has focused on compressing neural networks to improve performance. We show that changing neural network size does not proportionally affect performance attributes of interest, such as execution time. Rather, extreme run-time nonlinearities exist over the network configuration space. Hence, we propose a novel framework, called FastDeepIoT, that uncovers the non-linear relation between neural network structure and execution time, then exploits that understanding to find network configurations that significantly improve the trade-off between execution time and accuracy on mobile and embedded devices. FastDeepIoT makes two key contributions. First, it automatically learns an accurate and highly interpretable execution-time model for deep neural networks on the target device. This is done without prior knowledge of either the hardware specifications or the detailed implementation of the deep learning library used. Second, FastDeepIoT informs a compression algorithm how to minimize execution time on the profiled device without impacting accuracy. We evaluate FastDeepIoT using three different sensing-related tasks on two mobile devices: Nexus 5 and Galaxy Nexus. FastDeepIoT further reduces neural network execution time by 48% to 78% and energy consumption by 37% to 69% compared with state-of-the-art compression algorithms.

    Comment: Accepted by SenSys '18
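
    A minimal sketch of the interpretable execution-time modeling idea: profile run times for sampled network configurations and fit a linear model over simple structural features (here, multiply-accumulate counts and output channels). The timing function and features are placeholders invented for this example; the paper learns its model from actual on-device profiling.

```python
# Fit an interpretable linear execution-time model from profiled samples.
import numpy as np

rng = np.random.default_rng(2)

def profile(config):
    # Placeholder "device": time grows with MACs plus a per-channel step cost.
    c_in, c_out, hw = config
    macs = c_in * c_out * 9 * hw * hw            # 3x3 conv multiply-accumulates
    return 1e-9 * macs + 5e-4 * (c_out // 32) + rng.normal(0, 1e-4)

configs = [(rng.integers(16, 256), rng.integers(16, 256), 32) for _ in range(200)]
X = np.array([[ci * co * 9 * hw * hw, co] for ci, co, hw in configs], float)
t = np.array([profile(c) for c in configs])

coef, *_ = np.linalg.lstsq(X, t, rcond=None)     # interpretable linear fit
print("per-MAC cost:", coef[0], "per-output-channel cost:", coef[1])
```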