    Blind Restoration of Real-World Audio by 1D Operational GANs

    Objective: Despite the numerous studies on audio restoration in the literature, most focus on an isolated restoration problem such as denoising or dereverberation and ignore other artifacts. Moreover, it is common practice to assume a noisy or reverberant environment with a limited number of fixed signal-to-distortion ratio (SDR) levels. However, real-world audio is often corrupted by a blend of artifacts such as reverberation, sensor noise, and background audio mixture with varying types, severities, and durations. In this study, we propose a novel approach for blind restoration of real-world audio signals by Operational Generative Adversarial Networks (Op-GANs) with temporal and spectral objective metrics to enhance the quality of the restored audio signal regardless of the type and severity of each artifact corrupting it. Methods: 1D Operational-GANs are used with a generative neuron model optimized for blind restoration of any corrupted audio signal. Results: The proposed approach has been evaluated extensively over the benchmark TIMIT-RAR (speech) and GTZAN-RAR (non-speech) datasets corrupted with a random blend of artifacts, each with a random severity, to mimic real-world audio signals. Average SDR improvements of over 7.2 dB and 4.9 dB are achieved, respectively, which are substantial when compared with the baseline methods. Significance: This is a pioneering study in blind audio restoration with the unique capability of direct (time-domain) restoration of real-world audio whilst achieving an unprecedented level of performance for a wide SDR range and artifact types. Conclusion: 1D Op-GANs can achieve robust and computationally effective real-world audio restoration with significantly improved performance. The source codes and the generated real-world audio datasets are shared publicly with the research community in a dedicated GitHub repository.
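
The SDR improvements quoted in this abstract can be read as the difference between two ratios expressed in decibels. The following is a minimal sketch, assuming the plain energy-ratio definition of SDR; the abstract does not state which SDR variant the authors use, and the function names `sdr_db` and `sdr_improvement_db` are illustrative rather than taken from the authors' repository.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Signal-to-distortion ratio in dB (assumed energy-ratio definition).

    reference: clean signal samples
    estimate:  corrupted or restored signal samples of the same length
    eps:       small constant that avoids division by zero (an assumption)
    """
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal_energy = sum(r * r for r in reference)
    distortion_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (distortion_energy + eps))


def sdr_improvement_db(reference, corrupted, restored):
    """SDR gain of the restored signal over the corrupted input, in dB."""
    return sdr_db(reference, restored) - sdr_db(reference, corrupted)


if __name__ == "__main__":
    clean = [1.0] * 100
    corrupted = [x + 0.1 for x in clean]   # larger distortion
    restored = [x + 0.01 for x in clean]   # ten times smaller distortion
    print(round(sdr_improvement_db(clean, corrupted, restored), 3))
```

Under this definition, reducing the distortion amplitude by a factor of ten corresponds to a gain of 20 dB, which gives a sense of scale for the reported averages.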

    Neural approaches to spoken content embedding

    Comparing spoken segments is a central operation in speech processing. Traditional approaches in this area have favored frame-level dynamic programming algorithms, such as dynamic time warping, because they require no supervision, but they are limited in performance and efficiency. As an alternative, acoustic word embeddings (fixed-dimensional vector representations of variable-length spoken word segments) have begun to be considered for such tasks as well. However, the current space of such discriminative embedding models, training approaches, and their application to real-world downstream tasks is limited. We start by considering "single-view" training losses where the goal is to learn an acoustic word embedding model that separates same-word and different-word spoken segment pairs. Then, we consider "multi-view" contrastive losses. In this setting, acoustic word embeddings are learned jointly with embeddings of character sequences to generate acoustically grounded embeddings of written words, or acoustically grounded word embeddings. In this thesis, we contribute new discriminative acoustic word embedding (AWE) and acoustically grounded word embedding (AGWE) approaches based on recurrent neural networks (RNNs). We improve model training in terms of both efficiency and performance. We take these developments beyond English to several low-resource languages and show that multilingual training improves performance when labeled data is limited. We apply our embedding models, both monolingual and multilingual, to the downstream tasks of query-by-example speech search and automatic speech recognition. Finally, we show how our embedding approaches compare with and complement more recent self-supervised speech models. Comment: PhD thesis
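
To make the efficiency contrast in this abstract concrete, the sketch below compares two segments first with textbook dynamic time warping over frames and then with a cosine distance between fixed-dimensional vectors. It is an illustration under assumptions: the Euclidean frame cost, the cosine distance, and the names `dtw_distance` and `embedding_distance` are chosen here and are not stated in the thesis, and the embedding vectors are taken as given rather than produced by an RNN.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping cost between two frame sequences.

    a, b: lists of frames, each frame a list of floats of equal dimension.
    Runs in O(len(a) * len(b)) time, which is the cost the abstract refers to.
    """
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean frame cost (assumption)
            cost[i][j] = d + min(cost[i - 1][j],      # skip a frame of b
                                 cost[i][j - 1],      # skip a frame of a
                                 cost[i - 1][j - 1])  # match both frames
    return cost[n][m]


def embedding_distance(u, v):
    """Cosine distance between two fixed-dimensional embeddings, O(dimension)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_u * norm_v)


if __name__ == "__main__":
    short = [[0.0], [1.0], [2.0]]
    stretched = [[0.0], [1.0], [1.0], [2.0]]  # same content, slower speech
    print(dtw_distance(short, stretched))
    print(embedding_distance([1.0, 0.0], [0.0, 1.0]))
```

The warping step absorbs differences in speaking rate, but its cost grows with the product of the two segment lengths, whereas comparing embeddings costs only one pass over the vector dimension.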

    Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

    Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks.

    Single-Channel Speech Enhancement Based on Deep Neural Networks

    Speech enhancement (SE) aims to improve the quality of degraded speech. Recently, researchers have resorted to deep learning as a primary tool for speech enhancement, which often features deterministic models adopting supervised training. Typically, a neural network is trained as a mapping function to convert some features of noisy speech to certain targets that can be used to reconstruct clean speech. These methods of speech enhancement using neural networks have focused on the estimation of the spectral magnitude of clean speech, considering that estimating spectral phase with neural networks is difficult due to the wrapping effect. As an alternative, complex spectrum estimation implicitly resolves the phase estimation problem and has been proven to outperform spectral magnitude estimation. In the first contribution of this thesis, a fully convolutional neural network (FCN) is proposed for complex spectrogram estimation. Stacked frequency-dilated convolution is employed to obtain an exponential growth of the receptive field in the frequency domain. The proposed network also features an efficient implementation that requires far fewer parameters than a conventional deep neural network (DNN) or convolutional neural network (CNN) while still yielding comparable performance. Speech enhancement is only useful in noisy conditions, yet conventional SE methods often do not adapt to different noise conditions. In the second contribution, we propose a model that provides an automatic "on/off" switch for speech enhancement. It is capable of scaling its computational complexity under different signal-to-noise ratio (SNR) levels by detecting clean or near-clean speech, which requires no processing. By adopting an information maximizing generative adversarial network (InfoGAN) in a deterministic, supervised manner, we incorporate the functionality of an SNR indicator into the model, which adds little additional cost to the system.
We evaluate the proposed SE methods with two objectives: speech intelligibility and application to automatic speech recognition (ASR). Experimental results show that the CNN-based model is applicable to both objectives, while the InfoGAN-based model is more useful in terms of speech intelligibility. The experiments also show that SE for ASR may be more challenging than improving speech intelligibility, as a series of factors, including the training dataset and the neural network model, impact ASR performance.
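
The claim that stacked frequency-dilated convolutions give exponential growth of the receptive field can be checked with a few lines of arithmetic. This is a sketch under assumptions not stated in the abstract: stride 1, a kernel size of 3, and dilation rates that double at each layer; the function name `receptive_field` is illustrative.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in frequency bins) of stacked stride-1 convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


if __name__ == "__main__":
    # Assumed configuration: four layers, kernel size 3.
    print(receptive_field(3, [1, 1, 1, 1]))  # no dilation: linear growth
    print(receptive_field(3, [1, 2, 4, 8]))  # doubling dilation: exponential growth
```

With doubling dilations the field after n layers is 2^(n+1) - 1 bins for a kernel of size 3, whereas undilated layers only reach 2n + 1, so the same depth covers far more of the frequency axis without extra parameters.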

    On the Analysis of DNA Methylation

    Recent genome-wide studies lend support to the idea that the patterns of DNA methylation are in some way related, either causally or as a readout, to cell-type-specific protein binding. We lay the groundwork for a framework to test whether the pattern of DNA methylation levels in a cell, combined with protein binding models, is sufficient to completely describe the location of the component of proteins binding to its genome in an assayed context. There is only one method, whole-genome bisulfite sequencing (WGBS), available to study DNA methylation genome-wide at such high resolution; however, its accuracy has not been determined on the scale of individual binding locations. We address this with a two-fold approach. First, we developed an alternative high-resolution, whole-genome assay, methylCRF, using a combination of an enrichment-based and a restriction-enzyme-based assay of methylation. While both assays are considered inferior to WGBS, using two distinct assays has the advantage that each assay in part cancels out the biases of the other. Additionally, this method is up to 15 times lower in cost than WGBS. By formulating the estimation of methylation from the two methods as a structured prediction problem using a conditional random field, this work also addresses the general problem of incorporating data of varying qualities (a common characteristic of biological data) for the purpose of prediction. We show that methylCRF is concordant with WGBS within the range of two WGBS methylomes. Due to the lower cost, we were able to analyze methylation at high resolution across more cell types than previously possible, and we estimate that 28% of CpGs, in regions comprising 11% of the genome, show variable methylation and are enriched in regulatory regions.
Secondly, we show that WGBS has inherent resolution limitations in a read-count-dependent manner and that the identification of unmethylated regions is highly affected by GC bias in the underlying protocol, suggesting that simple estimation procedures may not be sufficient for high-resolution analysis. To address this, we propose a novel approach to DNA methylation analysis using change-point detection instead of estimating methylation level directly. However, we show that current change-point detection methods are not robust to the methylation signal; we therefore explore how to extend current non-parametric methods to simultaneously find change-points as well as characteristic methylation levels. We believe this framework may have the power to examine the connection between changes in methylation and transcription factor binding in the context of cell-type-specific behaviors.
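
The idea of finding a change-point together with the characteristic methylation level on each side can be illustrated with a deliberately simple sketch. It is not the non-parametric method explored in the thesis: it assumes a single change-point, a least-squares criterion, and per-CpG methylation levels already given as numbers between 0 and 1; the function names are hypothetical.

```python
def _sum_squared_error(values):
    """Sum of squared deviations from the mean of a non-empty segment."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_single_change_point(levels):
    """Return (index, left_level, right_level) for the best single split.

    levels: methylation levels along the genome, ordered by position.
    index:  first position of the right-hand segment.
    The split minimises the total squared error of a two-level step function
    (an assumed criterion, chosen only for illustration).
    """
    if len(levels) < 2:
        raise ValueError("need at least two positions")
    best_cost, best_index = None, None
    for k in range(1, len(levels)):
        cost = _sum_squared_error(levels[:k]) + _sum_squared_error(levels[k:])
        if best_cost is None or cost < best_cost:
            best_cost, best_index = cost, k
    left, right = levels[:best_index], levels[best_index:]
    return best_index, sum(left) / len(left), sum(right) / len(right)


if __name__ == "__main__":
    # Toy signal: a methylated stretch followed by an unmethylated stretch.
    signal = [0.9, 0.8, 0.9, 0.1, 0.2, 0.1]
    print(best_single_change_point(signal))
```

The output pairs the boundary with one level per segment, which is the kind of joint estimate the abstract describes, here in its simplest parametric form.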

    Segmentation and quantification of spinal cord gray matter–white matter structures in magnetic resonance images

    This thesis focuses on finding ways to differentiate the gray matter (GM) and white matter (WM) in magnetic resonance (MR) images of the human spinal cord (SC). The aim of this project is to quantify tissue loss in these compartments to study their implications on the progression of multiple sclerosis (MS). To this end, we propose segmentation algorithms that we evaluated on MR images of healthy volunteers. Segmentation of GM and WM in MR images can be done manually by human experts, but manual segmentation is tedious and prone to intra- and inter-rater variability. Therefore, a deterministic automation of this task is necessary. On axial 2D images acquired with a recently proposed MR sequence, called AMIRA, we experiment with various automatic segmentation algorithms. We first use variational model-based segmentation approaches combined with appearance models and later directly apply supervised deep learning to train segmentation networks. Evaluation of the proposed methods shows accurate and precise results, which are on par with manual segmentations. We test the developed deep learning approach on images of conventional MR sequences in the context of a GM segmentation challenge, resulting in superior performance compared to the other competing methods. To further assess the quality of the AMIRA sequence, we apply an already published GM segmentation algorithm to our data, yielding higher accuracy than the same algorithm achieves on images of conventional MR sequences. On a different topic, but related to segmentation, we develop a high-order slice interpolation method to address the large slice distances of images acquired with the AMIRA protocol at different vertebral levels, enabling us to resample our data to intermediate slice positions. 
From a methodological point of view, this work provides an introduction to computer vision, a mathematically focused perspective on variational segmentation approaches and supervised deep learning, as well as a brief overview of the underlying project's anatomical and medical background.

    Contributions to statistical machine learning algorithm

    This thesis focuses on computational statistics, in particular on the DEAR (differential equation associated regression) model. With that in mind, the journal papers are written as contributions to the literature on statistical machine learning algorithms.

    Bag-of-words representations for computer audition

    Computer audition is omnipresent in everyday life, in applications ranging from personalised virtual agents to health care. From a technical point of view, the goal is to robustly classify the content of an audio signal in terms of a defined set of labels, such as the acoustic scene, a medical diagnosis, or, in the case of speech, what is said or how it is said. Typical approaches employ machine learning (ML), which means that task-specific models are trained by means of examples. Despite recent successes in neural network-based end-to-end learning, taking the raw audio signal as input, models relying on hand-crafted acoustic features are still superior in some domains, especially for tasks where data is scarce. One major issue is nevertheless that a sequence of acoustic low-level descriptors (LLDs) cannot be fed directly into many ML algorithms, as they require a static and fixed-length input. Moreover, even for dynamic classifiers, compressing the information of the LLDs over a temporal block by summarising them can be beneficial. However, the type of instance-level representation has a fundamental impact on the performance of the model. In this thesis, the so-called bag-of-audio-words (BoAW) representation is investigated as an alternative to the standard approach of statistical functionals. BoAW is an unsupervised method of representation learning, inspired by the bag-of-words method in natural language processing, which forms a histogram of the terms present in a document. The toolkit openXBOW is introduced, enabling systematic learning and optimisation of these feature representations, unified across arbitrary modalities of numeric or symbolic descriptors. A number of experiments on BoAW are presented and discussed, focussing on a large number of potential applications and corresponding databases, ranging from emotion recognition in speech to medical diagnosis.
The evaluations include a comparison of different acoustic LLD sets and configurations of the BoAW generation process. The key findings are that BoAW features are a meaningful alternative to statistical functionals, offering certain benefits, while being able to preserve the advantages of functionals, such as data-independence. Furthermore, it is shown that both representations are complementary and their fusion improves the performance of a machine listening system.
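
The bag-of-audio-words idea in this abstract, turning a variable-length sequence of low-level descriptors into a fixed-length histogram, can be sketched in a few lines. This is an illustration under assumptions, not the openXBOW implementation: the codebook is taken as given (how it is learned is left out), each frame is assigned to its single nearest codeword by Euclidean distance, and the names `nearest_codeword` and `bag_of_audio_words` are hypothetical.

```python
import math


def nearest_codeword(frame, codebook):
    """Index of the codebook entry closest to the frame (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda k: math.dist(frame, codebook[k]))


def bag_of_audio_words(frames, codebook, normalise=True):
    """Histogram of codeword assignments for a sequence of LLD frames.

    frames:    list of low-level descriptor vectors, any length
    codebook:  list of "audio words" with the same dimension as the frames
    normalise: divide counts by the number of frames, so that recordings of
               different durations become comparable (an assumed choice)
    """
    histogram = [0.0] * len(codebook)
    for frame in frames:
        histogram[nearest_codeword(frame, codebook)] += 1.0
    if normalise and len(frames) > 0:
        histogram = [count / len(frames) for count in histogram]
    return histogram


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0]]                  # toy audio words
    short_clip = [[0.1, 0.0], [0.9, 1.0]]
    long_clip = [[0.1, 0.0], [0.9, 1.0], [1.1, 0.8], [0.0, 0.2]]
    print(bag_of_audio_words(short_clip, codebook))
    print(bag_of_audio_words(long_clip, codebook))
```

Both clips map to a vector whose length equals the codebook size, regardless of how many frames they contain, which is the static, fixed-length input that many ML algorithms require.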

    Image-set, Temporal and Spatiotemporal Representations of Videos for Recognizing, Localizing and Quantifying Actions

    This dissertation addresses the problem of learning video representations, which is defined here as transforming the video so that its essential structure is made more visible or accessible for action recognition and quantification. In the literature, a video can be represented by a set of images, by modeling motion or temporal dynamics, and by a 3D graph with pixels as nodes. This dissertation contributes a set of models to localize, track, segment, recognize and assess actions: (1) image-set models via aggregating subset features given by regularizing normalized CNNs, (2) image-set models via inter-frame principal recovery and sparsely coding residual actions, (3) temporally local models with spatially global motion estimated by robust feature matching and local motion estimated by action detection with a motion model added, (4) spatiotemporal models (3D graphs and 3D CNNs) that model time as a spatial dimension, and (5) supervised hashing by jointly learning embedding and quantization. State-of-the-art performance is achieved for tasks such as quantifying facial pain and human diving.
Primary conclusions of this dissertation are categorized as follows: (i) Image sets can capture facial actions that are about collective representation; (ii) Sparse and low-rank representations can have the expression, identity and pose cues untangled and can be learned via an image-set model and also a linear model; (iii) Norm is related to recognizability; similarity metrics and loss functions matter; (iv) Combining the MIL-based boosting tracker with the Particle Filter motion model induces a good trade-off between appearance similarity and motion consistency; (v) Segmenting objects locally makes them amenable to shape priors; it is feasible to learn knowledge such as shape priors online from Web data with weak supervision; (vi) Representing videos as 3D graphs works locally in both space and time; 3D CNNs work effectively when given temporally meaningful clips; (vii) Richly labeled images or videos help to learn better hash functions, after learning binary embedded codes, than random projections do. In addition, models proposed for videos can be adapted to other sequential images, such as volumetric medical images, which are not included in this dissertation.

    Pulmonary Image Segmentation and Registration Algorithms: Towards Regional Evaluation of Obstructive Lung Disease

    Pulmonary imaging, including pulmonary magnetic resonance imaging (MRI) and computed tomography (CT), provides a way to sensitively and regionally measure spatially heterogeneous lung structural-functional abnormalities. These unique imaging biomarkers offer the potential for better understanding pulmonary disease mechanisms, monitoring disease progression and response to therapy, and developing novel treatments for improved patient care. To generate these regional lung structure-function measurements and enable broad clinical applications of quantitative pulmonary MRI and CT biomarkers, as a first step, accurate, reproducible and rapid lung segmentation and registration methods are required. In this regard, we first developed a 1H MRI lung segmentation algorithm that employs complementary hyperpolarized 3He MRI functional information for improved lung segmentation. The 1H-3He MRI joint segmentation algorithm was formulated as a coupled continuous min-cut model and solved through convex relaxation, for which a dual coupled continuous max-flow model was proposed and a max-flow-based efficient numerical solver was developed. Experimental results on a clinical dataset of 25 chronic obstructive pulmonary disease (COPD) patients ranging in disease severity demonstrated that the algorithm provided rapid lung segmentation with high accuracy, reproducibility and diminished user interaction. We then developed a general 1H MRI left-right lung segmentation approach by exploring the left-to-right lung volume proportion prior. The challenging volume proportion-constrained multi-region segmentation problem was approximated through convex relaxation and equivalently represented by a max-flow model with bounded flow conservation conditions. This gave rise to a multiplier-based high performance numerical implementation based on convex optimization theories. 
In 20 patients with mild-to-moderate and severe asthma, the approach demonstrated high agreement with manual segmentation, excellent reproducibility and computational efficiency. Finally, we developed a CT-3He MRI deformable registration approach that coupled the complementary CT-1H MRI registration. The joint registration problem was solved by exploring optical-flow techniques, primal-dual analyses and convex optimization theories. In a diverse group of patients with asthma and COPD, the registration approach demonstrated lower target registration error than single registration and provided fast regional lung structure-function measurements that were strongly correlated with a reference method. Collectively, these lung segmentation and registration algorithms demonstrated accuracy, reproducibility and workflow efficiency that may all be clinically acceptable. All of this is consistent with the need for broad and large-scale clinical applications of pulmonary MRI and CT.
