9 research outputs found

    Robust Audio and WiFi Sensing via Domain Adaptation and Knowledge Sharing From External Domains

    Recent advancements in machine learning have initiated a revolution in embedded sensing and inference systems. Acoustic and WiFi-based sensing and inference systems have enabled a wide variety of applications, ranging from home activity detection to health vitals monitoring. While many existing solutions have paved the way for acoustic event recognition and WiFi-based activity detection, the diverse characteristics of the sensors, systems, and environments used for data capture cause a shift in the data distribution and thus result in sub-optimal classification performance when the sensor or environment differs between the training and inference stages. Moreover, large-scale acoustic and WiFi data collection is non-trivial and cumbersome. Current acoustic and WiFi-based sensing systems therefore suffer when labeled samples are scarce, as they rely only on the provided training data. In this thesis, we aim to address the performance loss of machine learning-based classifiers for acoustic and WiFi-based sensing systems due to sensor and environment heterogeneity and a lack of labeled examples. We show that discovering latent domains (sensor type, environment, etc.) and removing domain bias from machine learning classifiers makes acoustic and WiFi-based sensing more robust and generalizable. We also propose a few-shot domain adaptation method that requires only one labeled sample for a new domain, which relieves users and developers of the painstaking task of collecting data in each new domain. Furthermore, to address the lack of labeled examples, we propose to exploit information or learned knowledge from sources where data already exists in volume, such as textual descriptions and the visual domain. We implemented our algorithms on mobile and embedded platforms and collected data from participants to evaluate our proposed algorithms and frameworks extensively.
    Doctor of Philosophy

    Robust speech recognition with spectrogram factorisation

    Communication by speech is intrinsic to humans. Since the breakthrough of mobile devices and wireless communication, digital transmission of speech has become ubiquitous. Similarly, the distribution and storage of audio and video data have increased rapidly. However, despite being technically capable of recording and processing audio signals, only a fraction of digital systems and services are actually able to work with spoken input, that is, to operate on the lexical content of speech. One persistent obstacle to the practical deployment of automatic speech recognition systems is inadequate robustness against noise and other interference, which regularly corrupt signals recorded in real-world environments. Speech and diverse noises are both complex signals, which are not trivially separable. Despite decades of research and a multitude of different approaches, the problem has not been solved to a sufficient extent. In particular, the mathematically ill-posed problem of separating multiple sources from a single-channel input requires advanced models and algorithms to be solvable. One promising path is to use a composite model of long-context atoms to represent a mixture of non-stationary sources based on their spectro-temporal behaviour. Algorithms derived from the family of non-negative matrix factorisations have been applied to such problems to separate and recognise individual sources such as speech. This thesis describes a set of tools developed for non-negative modelling of audio spectrograms, especially those involving speech and real-world noise sources. An overview of the complete framework is provided, starting from model and feature definitions, advancing to factorisation algorithms, and finally describing different routes for separation, enhancement, and recognition tasks. Current issues and their potential solutions are discussed both theoretically and from a practical point of view.
    The included publications describe factorisation-based recognition systems, which have been evaluated on publicly available speech corpora in order to determine the efficiency of various separation and recognition algorithms. Several variants and system combinations that have been proposed in the literature are also discussed. The work covers a broad span of factorisation-based system components, which together aim at providing a practically viable solution to robust processing and recognition of speech in everyday situations.

    Efficient and Robust Methods for Audio and Video Signal Analysis

    This thesis presents my research concerning audio and video signal processing and machine learning. Specifically, the topics of my research include computationally efficient classifier compounds, automatic speech recognition (ASR), music dereverberation, video cut point detection and video classification. Computational efficacy of information retrieval based on multiple measurement modalities has been considered in this thesis. Specifically, a cascade processing framework, including a training algorithm to set its parameters, has been developed for combining multiple detectors or binary classifiers in a computationally efficient way. The developed cascade processing framework has been applied to the video information retrieval tasks of video cut point detection and video classification. The results in video classification, compared to others found in the literature, indicate that the developed framework is capable of both accurate and computationally efficient classification. The idea of cascade processing has additionally been adapted for the ASR task. A procedure for combining multiple speech state likelihood estimation methods within an ASR framework in a cascaded manner has been developed. The results obtained clearly show that the computational load of ASR can be reduced using the cascaded speech state likelihood estimation process without impairing the transcription accuracy. Additionally, this thesis presents my work on noise robustness of ASR using a nonnegative matrix factorization (NMF)-based approach. Specifically, methods for transforming sparse NMF features into speech state likelihoods have been explored. The results reveal that learned transformations from NMF activations to speech state likelihoods provide better ASR transcription accuracy than dictionary-label-based transformations.
    The results, compared to others in a noisy speech recognition challenge, show that NMF-based processing is an efficient strategy for noise robustness in ASR. The thesis also presents my work on audio signal enhancement, specifically on removing the detrimental effect of reverberation from music audio. In this work, a linear-prediction-based dereverberation algorithm, originally developed for speech signal enhancement, was applied to music. The results obtained show that the algorithm performs well with music signals and indicate that dynamic compression of music does not impair the dereverberation performance.

    Annual Report of the Research Center for Computational Mechanics (計算力学研究センター年次報告書)


    Deconvolutive Clustering of Markov States

    Abstract. In this paper we formulate the problem of grouping the states of a discrete Markov chain of arbitrary order simultaneously with deconvolving its transition probabilities. As the name indicates, this problem is related to deconvolutive blind signal separation. However, whilst the latter has been studied in the context of continuous signal processing, e.g. as a model of the real-room mixing of sound signals, our technique tries to model computer-mediated group-discussion participation from a discrete event-log sequence. In this context, convolution occurs due to various time-delay factors, such as the network transmission bandwidth or simply the typing speed of the participants. We derive a computationally efficient maximum likelihood estimation algorithm associated with our model, which exploits the sparsity of state transitions and scales linearly with the number of observed higher-order transition patterns. Results obtained on a full day's worth of dynamic real-world Internet Relay Chat participation sequences demonstrate the advantages of our approach over state grouping alone, both in terms of penalised data likelihood and cluster clarity. Other potential applications of our model, viewed as a novel compact approximation of large Markov chains, are also discussed.