
    Deep Learning for Audio Signal Processing

    Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side by side in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and the potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, and more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e., audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, and generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified. Comment: 15 pages, 2 PDF figures.
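    The review highlights log-mel spectra as one of the dominant input representations. A minimal sketch of computing such features with librosa follows; the parameter values (sample rate, FFT size, hop, mel-band count) are illustrative choices, not taken from the article.

```python
# Minimal sketch: computing a log-mel spectrogram as a network input.
# Assumes librosa is installed; parameter values are illustrative only.
import numpy as np
import librosa

def log_mel(y: np.ndarray, sr: int = 16000, n_fft: int = 512,
            hop_length: int = 160, n_mels: int = 64) -> np.ndarray:
    """Return a (n_mels, frames) log-mel spectrogram in dB."""
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

if __name__ == "__main__":
    y = np.random.randn(16000).astype(np.float32)  # 1 s of noise as a stand-in signal
    feats = log_mel(y)
    print(feats.shape)  # e.g. (64, 101)
```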

    Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview

    We present a structured overview of adaptation algorithms for neural network-based speech recognition, considering both hybrid hidden Markov model / neural network systems and end-to-end neural network systems, with a focus on speaker adaptation, domain adaptation, and accent adaptation. The overview characterizes adaptation algorithms as based on embeddings, model parameter adaptation, or data augmentation. We present a meta-analysis of the performance of speech recognition adaptation algorithms, based on relative error rate reductions as reported in the literature. Comment: Submitted to IEEE Open Journal of Signal Processing. 30 pages, 27 figures.
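    The meta-analysis aggregates relative error rate reductions reported in the literature. A minimal sketch of that statistic under its common definition follows; the function name and example figures are illustrative, not taken from the overview.

```python
# Minimal sketch: relative word error rate reduction (RERR), the statistic
# the overview's meta-analysis aggregates. Values below are illustrative.

def relative_error_rate_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """RERR = (baseline - adapted) / baseline, expressed as a fraction."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer

if __name__ == "__main__":
    # e.g. speaker adaptation takes WER from 12.0% down to 10.2%
    print(f"{relative_error_rate_reduction(12.0, 10.2):.1%}")  # 15.0%
```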

    Understanding the Role of Dynamics in Brain Networks: Methods, Theory and Application

    The brain is inherently a dynamical system whose networks interact at multiple spatial and temporal scales. Understanding the functional role of these dynamic interactions is a fundamental question in neuroscience. In this research, we approach this question through the development of new methods for characterizing brain dynamics from real data and new theories for linking dynamics to function. We perform our study at two scales: macro (at the level of brain regions) and micro (at the level of individual neurons). In the first part of this dissertation, we develop methods to identify the underlying dynamics at the macro scale that govern brain networks during states of health and disease in humans. First, we establish an optimization framework to actively probe connections in brain networks when the underlying network dynamics are changing over time. Then, we extend this framework to develop a data-driven approach for analyzing neurophysiological recordings without active stimulation, describing the spatiotemporal structure of neural activity at different timescales. The overall goal is to detect how the dynamics of brain networks may change within and between particular cognitive states. We present the efficacy of this approach in characterizing spatiotemporal motifs of correlated neural activity during the transition from wakefulness to general anesthesia in functional magnetic resonance imaging (fMRI) data. Moreover, we demonstrate how such an approach can be utilized to construct an automatic classifier for detecting different levels of coma in electroencephalogram (EEG) data. In the second part, we study how ongoing function can constrain dynamics at the micro scale in recurrent neural networks, with particular application to sensory systems. Specifically, we develop theoretical conditions, in a linear recurrent network subject to both disturbance and noise, for exact and stable recovery of dynamic sparse stimuli applied to the network. We show how network dynamics can affect decoding performance in such systems. Moreover, we formulate the problem of efficient encoding of an afferent input and its history in a nonlinear recurrent network. We show that a linear neural network architecture with a thresholding activation function emerges if we assume that neurons optimize their activity according to a particular cost function. Such an architecture can enable lightweight, history-sensitive encoding schemes.
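    The second part describes a linear recurrent network with a thresholding activation that sparsely encodes a dynamic stimulus. A minimal sketch of that kind of network follows; this is not the dissertation's exact model, and the weights, threshold, and sizes are illustrative assumptions.

```python
# Minimal sketch (not the dissertation's exact model): a discrete-time linear
# recurrent network whose state is read out through a soft-threshold
# nonlinearity, a standard motif for sparse encoding of a time-varying input.
# Matrix sizes, weights, and the threshold value are illustrative assumptions.
import numpy as np

def soft_threshold(u: np.ndarray, lam: float) -> np.ndarray:
    """Shrinkage nonlinearity: zero out small activities, shrink the rest."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def run_network(stimulus: np.ndarray, W: np.ndarray, F: np.ndarray,
                lam: float = 0.1, leak: float = 0.9) -> np.ndarray:
    """Simulate x[t+1] = leak*x[t] + F @ s[t] + W @ a[t], with a[t] = soft_threshold(x[t], lam)."""
    n = W.shape[0]
    x = np.zeros(n)
    activities = []
    for s_t in stimulus:
        a = soft_threshold(x, lam)
        activities.append(a)
        x = leak * x + F @ s_t + W @ a
    return np.array(activities)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, m, n = 200, 5, 20                   # time steps, stimulus dim, neurons
    F = rng.standard_normal((n, m)) * 0.1  # feedforward weights
    W = -0.05 * np.eye(n)                  # weak recurrent inhibition (keeps dynamics stable)
    stim = rng.standard_normal((T, m))
    a = run_network(stim, W, F)
    print(a.shape, (np.abs(a) > 0).mean())  # fraction of active (nonzero) units
```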

    Development of High-speed Photoacoustic Imaging technology and Its Applications in Biomedical Research

    Photoacoustic (PA) tomography (PAT) is a novel imaging modality that combines the fine lateral resolution of optical imaging with the deep penetration of ultrasonic imaging, and provides rich optical-absorption-based images. PAT has been widely used to extract structural and functional information from both ex vivo tissue samples and in vivo animals and humans at different length scales, by imaging various endogenous and exogenous contrasts across the ultraviolet-to-infrared spectrum. For example, hemoglobin in red blood cells is of particular interest in PAT, since it is one of the dominant absorbers in tissue at visible wavelengths.
    The main focus of this dissertation is to develop high-speed PA microscopy (PAM) technologies. Novel optical scanning, ultrasonic detection, and laser source techniques are introduced to advance the performance of PAM systems. These upgrades open up new avenues for PAM to address important biomedical challenges and enable fundamental physiological studies.
    First, we investigated the feasibility of applying high-speed PAM to the detection and imaging of circulating tumor cells (CTCs) in melanoma models, which can provide valuable information about a tumor’s metastatic potential. We probed the melanoma CTCs at the near-infrared wavelength of 1064 nm, where melanosomes absorb more strongly than hemoglobin. Our high-speed PA flow cytography system successfully imaged melanoma CTCs in travelling trunk vessels. We also developed a concurrent laser therapy device, hardware-triggered by the CTC signal, to photothermally lyse CTCs on the spot in an effort to inhibit metastasis.
    Next, we addressed the detection sensitivity issue of the previous study. We employed the stimulated Raman scattering (SRS) effect to construct a high-repetition-rate Raman laser at 658 nm, where the contrast between a melanoma CTC and the blood background is near its highest. Our upgraded PA flow cytography successfully captured sequential images of CTCs in a mouse melanoma xenograft model, with a significantly improved contrast-to-noise ratio compared to our previous results. This technology is readily translatable to the clinic for assessing a tumor’s metastasis risk.
    We then extended the Raman laser technology to the field of brain functional studies. We developed a MEMS (micro-electromechanical systems) scanner for fast optical scanning, and incorporated it into a dual-wavelength functional PAM (fPAM) system for high-speed imaging of cerebral hemodynamics in mice. This fPAM system successfully imaged transient changes in blood oxygenation in cerebral micro-vessels in response to brief somatic stimulations, making it a powerful tool for neurological studies.
    Finally, we explored approaches to reducing the size of the PAM imaging head in an effort to translate our work to the field of wearable biometric monitors. To miniaturize the ultrasonic detection device, we fabricated a thin-film, optically transparent piezoelectric detector for detecting PA waves. This technology could enable longitudinal studies on free-moving animals through a wearable version of PAM.
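    Dual-wavelength fPAM estimates blood oxygenation because oxy- and deoxyhemoglobin absorb differently at the two wavelengths, so sO2 can be obtained by linear spectral unmixing. A minimal sketch of that estimate follows; this is not the dissertation's processing pipeline, and the demo numbers are placeholders, with real tabulated extinction coefficients required in practice.

```python
# Minimal sketch (illustrative, not the dissertation's pipeline): estimating
# blood oxygen saturation (sO2) from photoacoustic amplitudes at two
# wavelengths by linear spectral unmixing. The extinction coefficients must
# come from tabulated hemoglobin spectra; the demo numbers are placeholders.
import numpy as np

def estimate_so2(pa1: float, pa2: float,
                 eps_hbo2: tuple, eps_hb: tuple) -> float:
    """Solve PA(lambda_i) ∝ eps_HbO2(lambda_i)*C_HbO2 + eps_Hb(lambda_i)*C_Hb
    for the two relative concentrations, then return sO2 = C_HbO2 / (C_HbO2 + C_Hb)."""
    E = np.array([[eps_hbo2[0], eps_hb[0]],
                  [eps_hbo2[1], eps_hb[1]]])
    c_hbo2, c_hb = np.linalg.solve(E, np.array([pa1, pa2]))
    return c_hbo2 / (c_hbo2 + c_hb)

if __name__ == "__main__":
    # Placeholder extinction coefficients and PA amplitudes, for illustration only.
    print(estimate_so2(pa1=1.0, pa2=0.8, eps_hbo2=(2.0, 1.0), eps_hb=(1.0, 2.0)))
```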

    Text detection and recognition in natural scene images

    This thesis addresses the problem of end-to-end text detection and recognition in natural scene images based on deep neural networks. Scene text detection and recognition aim to find regions in an image that a human would consider text, generate a bounding box for each word, and output the corresponding sequence of characters. As a useful task in image analysis, scene text detection and recognition attract much attention in the computer vision field. In this thesis, we tackle this problem by taking advantage of recent successes in deep learning. Car license plates can be viewed as a special case of scene text, as both consist of characters and appear in natural scenes; nevertheless, each has its own specific characteristics. We therefore start from car license plate detection and recognition, and then extend the methods to general scene text with additional ideas. For both tasks, we develop two approaches: a stepwise one and an integrated one. Stepwise methods tackle text detection and recognition step by step with separate models, while integrated methods handle both simultaneously with a single model. All approaches are based on deep Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), given the tremendous breakthroughs they have brought to the computer vision community. To begin with, a stepwise framework is proposed to tackle text detection and recognition, applied to car license plates and general scene text respectively. A character CNN classifier is trained to detect characters in an image in a sliding-window manner. The detected characters are then grouped into license plates or text lines according to heuristic rules. A sequence-labeling-based method is proposed to recognize the whole license plate or text line without character-level segmentation. Building on this recognition method, and to accelerate processing, an integrated deep neural network is then proposed to address car license plate detection and recognition concurrently. It integrates both CNNs and RNNs in one network and can be trained end-to-end. Car license plate bounding boxes and their labels are generated in a single forward pass of the network. The whole process involves no heuristic rules and avoids intermediate procedures such as image cropping or feature recomputation, which not only prevents error accumulation but also reduces the computational burden. Lastly, the unified network is extended to simultaneous general text detection and recognition in natural scenes. Compared with the license plate network, several innovations are introduced to accommodate the special characteristics of general text. A varying-size RoI encoding method is proposed to handle the varied aspect ratios of general text, and an attention-based sequence-to-sequence learning structure is adopted for word recognition, in the expectation that a character-level language model can be learnt in this manner. The whole framework can be trained end-to-end, requiring only images, ground-truth bounding boxes, and text labels. Through end-to-end training, the learned features become more discriminative, which improves the overall performance, and the convolutional features are computed only once and shared by both detection and recognition, which saves processing time. The proposed method achieves state-of-the-art performance on several standard benchmark datasets.
    Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 201
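    The sequence-labeling recognizer described above reads a whole plate or text line without segmenting characters. A minimal PyTorch sketch of that general idea follows, a CNN feature extractor feeding a bidirectional RNN trained with CTC; the layer sizes and alphabet are illustrative assumptions, not the thesis's actual configuration.

```python
# Minimal sketch (illustrative, not the thesis's architecture): recognizing a
# whole license plate / text line as a sequence-labeling problem with CTC,
# so no character-level segmentation is needed. Sizes are arbitrary choices.
import torch
import torch.nn as nn

NUM_CLASSES = 37  # e.g. 26 letters + 10 digits + 1 CTC blank (assumed alphabet)

class CRNN(nn.Module):
    def __init__(self, num_classes: int = NUM_CLASSES, hidden: int = 128):
        super().__init__()
        # Convolutional feature extractor: collapse height, keep width as the time axis.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # H/2, W/2
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # H/4, W/4
            nn.AdaptiveAvgPool2d((1, None)),                              # height -> 1
        )
        self.rnn = nn.LSTM(64, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.cnn(images)                   # (B, 64, 1, W')
        feats = feats.squeeze(2).permute(0, 2, 1)  # (B, W', 64): width = time steps
        out, _ = self.rnn(feats)
        return self.fc(out).log_softmax(-1)        # (B, W', num_classes) for CTC loss

if __name__ == "__main__":
    model = CRNN()
    x = torch.randn(2, 1, 32, 128)                 # two 32x128 grayscale crops
    log_probs = model(x)                           # (2, 32, 37): 128/4 = 32 time steps
    ctc = nn.CTCLoss(blank=NUM_CLASSES - 1)        # CTC loss expects (T, B, C)
    targets = torch.randint(0, NUM_CLASSES - 1, (2, 7))  # dummy 7-character labels
    loss = ctc(log_probs.permute(1, 0, 2), targets,
               input_lengths=torch.full((2,), 32, dtype=torch.long),
               target_lengths=torch.full((2,), 7, dtype=torch.long))
    print(log_probs.shape, float(loss))
```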