
    Robust speech recognition with spectrogram factorisation

    Communication by speech is intrinsic to humans. Since the breakthrough of mobile devices and wireless communication, digital transmission of speech has become ubiquitous, and the distribution and storage of audio and video data have likewise grown rapidly. However, although digital systems are technically capable of recording and processing audio signals, only a fraction of systems and services can actually work with spoken input, that is, operate on the lexical content of speech. One persistent obstacle to the practical deployment of automatic speech recognition systems is inadequate robustness against noise and other interferences, which regularly corrupt signals recorded in real-world environments. Speech and the diverse types of noise are both complex signals that are not trivially separable. Despite decades of research and a multitude of approaches, the problem has not been solved to a sufficient extent. In particular, the mathematically ill-posed problem of separating multiple sources from a single-channel input requires advanced models and algorithms to become solvable. One promising path is to use a composite model of long-context atoms to represent a mixture of non-stationary sources based on their spectro-temporal behaviour. Algorithms derived from the family of non-negative matrix factorisations have been applied to such problems to separate and recognise individual sources such as speech. This thesis describes a set of tools developed for non-negative modelling of audio spectrograms, especially involving speech and real-world noise sources. An overview of the complete framework is provided, starting from model and feature definitions, advancing to factorisation algorithms, and finally describing different routes for separation, enhancement, and recognition tasks. Current issues and their potential solutions are discussed both theoretically and from a practical point of view. The included publications describe factorisation-based recognition systems that have been evaluated on publicly available speech corpora in order to determine the efficiency of various separation and recognition algorithms. Several variants and system combinations proposed in the literature are also discussed. The work covers a broad span of factorisation-based system components, which together aim at providing a practically viable solution to robust processing and recognition of speech in everyday situations.
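As a rough illustration of the kind of factorisation-based separation described above, the following Python sketch estimates the activations of pre-trained speech and noise dictionaries for a noisy magnitude spectrogram and filters the mixture with a Wiener-like soft mask. The variable names, the plain KL-divergence multiplicative update, and the assumption that `V`, `W_speech`, and `W_noise` are already available are simplifications for illustration, not the thesis's actual framework.

```python
# Illustrative sketch: supervised NMF separation of a noisy magnitude
# spectrogram V using pre-trained speech and noise dictionaries.
# Names and setup are assumptions for this example, not the thesis code.
import numpy as np

def separate_speech(V, W_speech, W_noise, n_iter=200, eps=1e-10):
    """Estimate activations for fixed speech/noise dictionaries and
    return an enhanced speech magnitude spectrogram via a soft mask."""
    W = np.concatenate([W_speech, W_noise], axis=1)   # (n_freq, n_atoms)
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1]))          # (n_atoms, n_frames)

    # Multiplicative updates for H (KL divergence), dictionaries kept fixed.
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)

    k = W_speech.shape[1]
    speech = W_speech @ H[:k]      # speech-only reconstruction
    noise = W_noise @ H[k:]        # noise-only reconstruction
    mask = speech / (speech + noise + eps)
    return mask * V                # Wiener-like filtered mixture
```

In a robust-recognition front end of this general kind, the masked spectrogram (or the speech-dictionary activations themselves) would then be passed on to feature extraction and recognition.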

    Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016)


    Scientific Information Extraction with Semi-supervised Neural Tagging

    This paper addresses the problem of extracting keyphrases from scientific articles and categorizing them as corresponding to a task, process, or material. We cast the problem as sequence tagging and introduce semi-supervised methods into a neural tagging model, which builds on recent advances in named entity recognition. Since annotated training data is scarce in this domain, we introduce a graph-based semi-supervised algorithm together with a data selection scheme to leverage unannotated articles. Both inductive and transductive semi-supervised learning strategies outperform state-of-the-art information extraction systems on the 2017 SemEval Task 10 (ScienceIE) benchmark. (Accepted by EMNLP 2017.)
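As a small illustration of casting keyphrase extraction as sequence tagging, the sketch below converts character-offset keyphrase annotations of the ScienceIE kind (Task, Process, Material) into token-level BIO labels suitable for a neural tagger. The whitespace tokenisation, helper name, and data format are assumptions made for the example, not the paper's pipeline.

```python
# Illustrative sketch: turn span annotations into BIO tags for sequence tagging.
# Whitespace tokenisation and the annotation format are assumptions.
from typing import List, Tuple

def spans_to_bio(text: str, spans: List[Tuple[int, int, str]]) -> List[Tuple[str, str]]:
    """spans: (start_char, end_char, label) with label in {Task, Process, Material}."""
    tokens, offsets, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)          # locate token in the original string
        tokens.append(tok)
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)

    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        inside = [i for i, (s, e) in enumerate(offsets) if s >= start and e <= end]
        for j, i in enumerate(inside):
            tags[i] = ("B-" if j == 0 else "I-") + label
    return list(zip(tokens, tags))

# Example: one Material keyphrase ("titanium dioxide film").
print(spans_to_bio("We anneal the titanium dioxide film", [(14, 35, "Material")]))
```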

    NMF-based compositional models for audio source separation

    Ph.D. dissertation, Department of Electrical and Computer Engineering, Graduate School of Seoul National University, February 2017 (advisor: Nam Soo Kim).

    Many classes of data can be represented by constructive combinations of parts. Most signals and data from nature have nonnegative values and can be explained and reconstructed by constructive models, in which only additive combinations are allowed and no subtraction of parts occurs. Compositional models include dictionary learning, exemplar-based approaches, and nonnegative matrix factorization (NMF). They are desirable in many areas, including image and visual signal processing, text information processing, audio signal processing, and music information retrieval. In this dissertation, NMF is chosen as the compositional model, and NMF-based target source separation is performed as the application. Target source separation is the extraction or reconstruction of the target signals from mixture signals that consist of the target and interfering signals. It can be regarded as blind source separation (BSS), which aims to extract the original, unknown source signals with no or only very limited information. In practice, however, much prior information is frequently utilized, and various approaches have been proposed for single-channel source separation.

    NMF approximates a nonnegative data matrix V by a product of nonnegative basis and encoding matrices W and H, i.e., V ≈ WH. Since both W and H are nonnegative, NMF often leads to a parts-based representation of the data, and NMF-based methods have shown impressive results in single-channel source separation. The objective function of NMF is generally based on the Euclidean distance, the Kullback-Leibler divergence, or the Itakura-Saito divergence, and many optimization methods have been proposed and utilized, e.g., the multiplicative update rule, projected gradient descent, and NeNMF (a minimal sketch of the basic multiplicative updates follows the contents listing below). However, NMF-based audio source separation still has several issues: non-uniqueness of the bases, a high dependence on prior information, overlap between the subspaces spanned by the target and interfering bases, disregard of the encoding vectors obtained in the training phase, and insufficient analysis of sparse NMF.

    In this dissertation, we propose new approaches to resolve these issues. In Section 4, we propose a novel speech enhancement method that combines the statistical model-based enhancement scheme with the NMF-based gain function; for better performance in time-varying noise environments, both the speech and noise bases of NMF are adapted simultaneously with the help of the estimated speech presence probability. In Section 5, we propose a discriminative NMF (DNMF) algorithm which exploits the reconstruction error for the interfering signals as well as for the target signal based on the target bases. In Section 6, we propose an incremental approach to robust bases estimation: based on an analogy between clustering and NMF analysis, we incrementally estimate the NMF bases, similarly to the modified k-means and Linde-Buzo-Gray (LBG) algorithms popular in the data clustering area. In Section 7, the distribution of the encoding vector is modeled as a multivariate exponential PDF (MVE) with a single scaling factor for each source. In Section 8, several sparse penalty terms for NMF are analyzed and compared in terms of signal-to-distortion ratio, sparseness of the encoding vectors, reconstruction error, and entropy of the basis vectors.
A new objective function combining sparse representation and discriminative NMF (DNMF) is also proposed.

Contents:
1 Introduction
  1.1 Audio source separation
  1.2 Speech enhancement
  1.3 Measurements
  1.4 Outline of the dissertation
2 Compositional model and NMF
  2.1 Compositional model
  2.2 NMF
    2.2.1 Update rules: MuR, PGD
    2.2.2 Modified NMF
3 NMF-based audio source separation and issues
  3.1 NMF-based audio source separation
  3.2 Problems of NMF in audio source separation
    3.2.1 High dependency on prior knowledge
    3.2.2 Overlapped subspace between the target and interfering basis matrices
    3.2.3 Non-uniqueness of the bases
    3.2.4 Prior knowledge of the encoding vectors
    3.2.5 Sparse NMF for source separation
4 Online bases update
  4.1 Introduction
  4.2 NMF-based speech enhancement using a spectral gain function
  4.3 Speech enhancement combining statistical model-based and NMF-based methods with on-line bases update
    4.3.1 On-line update of speech and noise bases
    4.3.2 Determining maximum update rates
  4.4 Experimental results
5 Discriminative NMF
  5.1 Introduction
  5.2 Discriminative NMF utilizing cross-reconstruction error
    5.2.1 DNMF using the reconstruction error of the other source
    5.2.2 DNMF using the interference factors
  5.3 Experimental results
6 Incremental approach for bases estimation
  6.1 Introduction
  6.2 Incremental approach based on modified k-means clustering and the Linde-Buzo-Gray algorithm
    6.2.1 Based on modified k-means clustering
    6.2.2 LBG-based incremental approach
  6.3 Experimental results
    6.3.1 Modified k-means clustering based approach
    6.3.2 LBG-based approach
7 Prior model of encoding vectors
  7.1 Introduction
  7.2 Prior model of encoding vectors based on a multivariate exponential distribution
  7.3 Experimental results
8 Conclusions
Bibliography
Abstract (in Korean)
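For reference, a minimal sketch of the basic multiplicative update rule (MuR) for NMF with the Kullback-Leibler divergence, V ≈ WH with nonnegative W and H, is given below. The matrix sizes, iteration count, and toy input are arbitrary choices for illustration and do not reproduce the dissertation's implementation.

```python
# Minimal sketch of NMF with KL-divergence multiplicative updates: V ~= W H,
# all factors nonnegative. Sizes and iteration count are arbitrary examples.
import numpy as np

def nmf_kl(V, n_atoms, n_iter=200, eps=1e-10, seed=0):
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, n_atoms)) + eps    # basis (dictionary) matrix
    H = rng.random((n_atoms, n_frames)) + eps  # encoding (activation) matrix

    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)   # update encodings
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)   # update bases
    return W, H

# Usage on a toy nonnegative "spectrogram":
V = np.abs(np.random.default_rng(1).standard_normal((257, 100)))
W, H = nmf_kl(V, n_atoms=20)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))  # relative reconstruction error
```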