Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 PDF figures
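To make the log-mel feature representation mentioned in the abstract concrete, here is a minimal NumPy sketch of log-mel spectrogram extraction (HTK mel formula, Hann window; all parameter values are illustrative defaults, not taken from the article):

```python
import numpy as np

def hz_to_mel(f):
    # HTK mel formula: m = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr, n_fft=512, hop=256, n_mels=40):
    # Frame the signal and apply a Hann window
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular mel filterbank between 0 Hz and the Nyquist frequency
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # Log compression, with a small floor to avoid log(0)
    return np.log(power @ fbank.T + 1e-10)

# Usage: one second of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
feat = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t), sr)
print(feat.shape)  # (61, 40): 61 frames, 40 mel bands
```

The resulting (frames x mel-bands) matrix is the typical input to the CNN and LSTM models the article reviews; raw-waveform models skip this step and consume the samples directly.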
Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview
We present a structured overview of adaptation algorithms for neural
network-based speech recognition, considering both hybrid hidden Markov model /
neural network systems and end-to-end neural network systems, with a focus on
speaker adaptation, domain adaptation, and accent adaptation. The overview
characterizes adaptation algorithms as based on embeddings, model parameter
adaptation, or data augmentation. We present a meta-analysis of the performance
of speech recognition adaptation algorithms, based on relative error rate
reductions as reported in the literature.
Comment: Submitted to IEEE Open Journal of Signal Processing. 30 pages, 27 figures
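The relative error rate reduction used in the meta-analysis is a simple derived quantity; a sketch of its computation (the example numbers are illustrative, not figures from the overview):

```python
def relative_error_reduction(baseline_wer, adapted_wer):
    """Fraction of the baseline word error rate removed by adaptation:
    RERR = (baseline - adapted) / baseline."""
    return (baseline_wer - adapted_wer) / baseline_wer

# A baseline WER of 12.0% reduced to 9.6% by adaptation
# is a 20% relative error rate reduction.
print(round(relative_error_reduction(12.0, 9.6), 4))  # 0.2
```

Reporting reductions relative to each system's own baseline is what makes results from different corpora and model families comparable in such a meta-analysis.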
Understanding the Role of Dynamics in Brain Networks: Methods, Theory and Application
The brain is inherently a dynamical system whose networks interact at multiple spatial and temporal scales. Understanding the functional role of these dynamic interactions is a fundamental question in neuroscience. In this research, we approach this question through the development of new methods for characterizing brain dynamics from real data and new theories for linking dynamics to function. We perform our study at two scales: macro (at the level of brain regions) and micro (at the level of individual neurons).
In the first part of this dissertation, we develop methods to identify the underlying dynamics at macro-scale that govern brain networks during states of health and disease in humans. First, we establish an optimization framework to actively probe connections in brain networks when the underlying network dynamics are changing over time. Then, we extend this framework to develop a data-driven approach for analyzing neurophysiological recordings without active stimulation, to describe the spatiotemporal structure of neural activity at different timescales. The overall goal is to detect how the dynamics of brain networks may change within and between particular cognitive states. We present the efficacy of this approach in characterizing spatiotemporal motifs of correlated neural activity during the transition from wakefulness to general anesthesia in functional magnetic resonance imaging (fMRI) data. Moreover, we demonstrate how such an approach can be utilized to construct an automatic classifier for detecting different levels of coma in electroencephalogram (EEG) data.
In the second part, we study how ongoing function can constrain dynamics at micro-scale in recurrent neural networks, with particular application to sensory systems. Specifically, we develop theoretical conditions in a linear recurrent network, in the presence of both disturbance and noise, for exact and stable recovery of dynamic sparse stimuli applied to the network. We show how network dynamics can affect the decoding performance in such systems. Moreover, we formulate the problem of efficient encoding of an afferent input and its history in a nonlinear recurrent network. We show that a linear neural network architecture with a thresholding activation function emerges if we assume that neurons optimize their activity based on a particular cost function. Such an architecture can enable the production of lightweight, history-sensitive encoding schemes.
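A toy sketch of the kind of network discussed in the second part — a linear recurrence passed through a thresholding (ReLU) nonlinearity, driven by a sparse stimulus. The weights, scaling, and stimulus here are random placeholders for illustration, not the dissertation's model or its recovery conditions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_steps = 50, 100

# Random recurrent weights, rescaled so the linear part has
# spectral radius 0.9 (a standard stability condition).
W = rng.normal(0, 1.0 / np.sqrt(n_neurons), (n_neurons, n_neurons))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))

# A sparse stimulus: only 5 of the 50 neurons receive input at each step
u = np.zeros((n_steps, n_neurons))
active = rng.choice(n_neurons, size=5, replace=False)
u[:, active] = rng.normal(0, 1, (n_steps, 5))

# Thresholded linear recurrence: x_{t+1} = max(0, W x_t + u_t)
x = np.zeros(n_neurons)
trajectory = []
for t in range(n_steps):
    x = np.maximum(0.0, W @ x + u[t])
    trajectory.append(x.copy())
trajectory = np.stack(trajectory)
print(trajectory.shape)  # (100, 50): time steps x neurons
```

The decoding question the dissertation studies is the inverse problem: given such a trajectory and knowledge of the dynamics, when can the sparse stimulus `u` be recovered exactly and stably.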
Development of High-speed Photoacoustic Imaging technology and Its Applications in Biomedical Research
Photoacoustic (PA) tomography (PAT) is a novel imaging modality that combines the fine lateral resolution of optical imaging with the deep penetration of ultrasonic imaging, and provides rich optical-absorption-based images. PAT has been widely used to extract structural and functional information from ex vivo tissue samples as well as in vivo animals and humans at different length scales, by imaging various endogenous and exogenous contrasts across the ultraviolet-to-infrared spectrum. For example, hemoglobin in red blood cells is of particular interest in PAT since it is one of the dominant absorbers in tissue at visible wavelengths.
The main focus of this dissertation is to develop high-speed PA microscopy (PAM) technologies. Novel optical scanning, ultrasonic detection, and laser source techniques are introduced in this dissertation to advance the performance of PAM systems. These upgrades open up new avenues for PAM to address important biomedical challenges and enable fundamental physiological studies.
First, we investigated the feasibility of applying high-speed PAM to the detection and imaging of circulating tumor cells (CTCs) in melanoma models, which can provide valuable information about a tumor’s metastatic potential. We probed the melanoma CTCs at the near-infrared wavelength of 1064 nm, where melanosomes absorb more strongly than hemoglobin. Our high-speed PA flow cytography system successfully imaged melanoma CTCs travelling in trunk vessels. We also developed a concurrent laser therapy device, hardware-triggered by the CTC signal, to photothermally lyse CTCs on the spot in an effort to inhibit metastasis.
Next, we addressed the detection sensitivity issue in the previous study. We employed the stimulated Raman scattering (SRS) effect to construct a high-repetition-rate Raman laser at 658 nm, where the contrast between a melanoma CTC and the blood background is near its highest.
Our upgraded PA flow cytography successfully captured sequential images of CTCs in a mouse melanoma xenograft model, with a significantly improved contrast-to-noise ratio compared to our previous results. This technology is readily translatable to the clinic for assessing a tumor’s metastatic risk.
We extended the Raman laser technology to the field of brain functional studies. We developed a MEMS (micro-electromechanical systems) scanner for fast optical scanning, and incorporated it into a dual-wavelength functional PAM (fPAM) system for high-speed imaging of cerebral hemodynamics in mice. This fPAM system successfully imaged transient changes in blood oxygenation in cerebral micro-vessels in response to brief somatic stimulation. This fPAM technology is a powerful tool for neurological studies.
Finally, we explored approaches to reducing the size of the PAM imaging head in an effort to translate our work to the field of wearable biometric monitors. To miniaturize the ultrasonic detection device, we fabricated a thin-film, optically transparent piezoelectric detector for detecting PA waves. This technology could enable longitudinal studies on freely moving animals through a wearable version of PAM.
Text detection and recognition in natural scene images
This thesis addresses the problem of end-to-end text detection and recognition in
natural scene images based on deep neural networks. Scene text detection and recognition
aim to find regions in an image that are considered as text by human beings,
generate a bounding box for each word and output a corresponding sequence of
characters. As a useful task in image analysis, scene text detection and recognition
attract much attention in the computer vision field. In this thesis, we tackle this problem
by taking advantage of the success in deep learning techniques.
Car license plates can be viewed as a special case of scene text, as they both consist
of characters and appear in natural scenes. Nevertheless, each has its own
specificities. In this research, we start with car license plate detection
and recognition, and then extend the methods to general scene text, with additional
ideas proposed.
For both tasks, we develop two approaches: a stepwise one and
an integrated one. Stepwise methods tackle text detection and recognition step by
step with separate models, while integrated methods handle both text detection and
recognition simultaneously via one model. All approaches are based on powerful
deep Convolutional Neural Networks (CNNs) and Recurrent Neural Networks
(RNNs), considering the tremendous breakthroughs they have brought to the computer
vision community.
To begin with, a stepwise framework is proposed to tackle text detection and
recognition, with its application to car license plates and general scene text respectively.
A character CNN classifier is trained to detect characters from an image
in a sliding window manner. The detected characters are then grouped together as
license plates or text lines according to some heuristic rules. A sequence labeling
based method is proposed to recognize the whole license plate or text line without
character level segmentation.
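Sequence-labeling recognizers of this kind are commonly trained with Connectionist Temporal Classification (CTC), which maps per-frame label predictions to a character string without any character-level segmentation; whether the thesis uses CTC exactly is an assumption here. A minimal sketch of greedy CTC decoding:

```python
import numpy as np

def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Collapse per-frame argmax labels: merge repeats, then drop blanks.

    frame_probs: (T, C) array of per-frame label probabilities,
    where column `blank` is the CTC blank symbol and the remaining
    columns follow the order of `alphabet`.
    """
    best = frame_probs.argmax(axis=1)          # most likely label per frame
    decoded = []
    prev = None
    for label in best:
        if label != prev and label != blank:   # merge repeats, skip blanks
            decoded.append(alphabet[label - 1])
        prev = label
    return "".join(decoded)

# Toy example: 6 frames over a blank + "AB1" alphabet
alphabet = "AB1"
probs = np.array([
    [0.1, 0.8, 0.05, 0.05],   # A
    [0.1, 0.8, 0.05, 0.05],   # A again (repeat, merged)
    [0.9, 0.05, 0.03, 0.02],  # blank (separator)
    [0.1, 0.1, 0.7, 0.1],     # B
    [0.1, 0.1, 0.1, 0.7],     # 1
    [0.9, 0.05, 0.03, 0.02],  # blank
])
print(ctc_greedy_decode(probs, alphabet))  # AB1
```

The blank symbol is what lets the model emit variable-length plate or text-line strings from a fixed-rate sequence of frame features, which is exactly why no character segmentation step is needed.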
Building on the sequence labeling based recognition method, and to accelerate
processing, an integrated deep neural network is then proposed to address
car license plate detection and recognition concurrently. It integrates both CNNs
and RNNs in one network, and can be trained end-to-end. Both car license plate
bounding boxes and their labels are generated in a single forward evaluation of the
network. The whole process involves no heuristic rule, and avoids intermediate
procedures like image cropping or feature recalculation, which not only prevents
error accumulation, but also reduces the computational burden.
Lastly, the unified network is extended to simultaneous general text detection and
recognition in natural scenes. In contrast to the network for car license plates, several
innovations are proposed to accommodate the special characteristics of general text. A
varying-size RoI encoding method is proposed to handle the various aspect ratios of general text. An attention-based sequence-to-sequence learning structure is adopted
for word recognition. It is expected that a character-level language model can be
learnt in this manner. The whole framework can be trained end-to-end, requiring
only images, the ground-truth bounding boxes and text labels. Through end-to-end
training, the learned features can be more discriminative, which improves the overall
performance. The convolutional features are calculated only once and shared by both
detection and recognition, which saves the processing time. The proposed method
has achieved state-of-the-art performance on several standard benchmark datasets.
Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 201