4 research outputs found

Modelling non-stationary noise with spectral factorisation in automatic speech recognition

Author: Antti Hurmalainen
Jort F. Gemmeke
Tuomas Virtanen
Publication venue: 'Elsevier BV'

Robust speech recognition with spectrogram factorisation

Author: Hurmalainen Antti
Publication venue: Tampere University of Technology
Publication date: 01/01/2014

Communication by speech is intrinsic to humans. Since the breakthrough of mobile devices and wireless communication, digital transmission of speech has become ubiquitous. Similarly, the distribution and storage of audio and video data have increased rapidly. However, despite being technically capable of recording and processing audio signals, only a fraction of digital systems and services are actually able to work with spoken input, that is, to operate on the lexical content of speech. One persistent obstacle to the practical deployment of automatic speech recognition systems is inadequate robustness against noise and other interference, which regularly corrupt signals recorded in real-world environments. Speech and diverse noises are both complex signals, which are not trivially separable. Despite decades of research and a multitude of different approaches, the problem has not been solved to a sufficient extent. In particular, the mathematically ill-posed problem of separating multiple sources from a single-channel input requires advanced models and algorithms to be solvable. One promising path is using a composite model of long-context atoms to represent a mixture of non-stationary sources based on their spectro-temporal behaviour. Algorithms derived from the family of non-negative matrix factorisations have been applied to such problems to separate and recognise individual sources like speech. This thesis describes a set of tools developed for non-negative modelling of audio spectrograms, especially involving speech and real-world noise sources. An overview of the complete framework is provided, starting from model and feature definitions, advancing to factorisation algorithms, and finally describing different routes for separation, enhancement, and recognition tasks. Current issues and their potential solutions are discussed both theoretically and from a practical point of view.
The included publications describe factorisation-based recognition systems, which have been evaluated on publicly available speech corpora in order to determine the efficiency of various separation and recognition algorithms. Several variants and system combinations that have been proposed in the literature are also discussed. The work covers a broad span of factorisation-based system components, which together aim at providing a practically viable solution to robust processing and recognition of speech in everyday situations.

Trepo - Institutional Repository of Tampere University

Modelling non-stationary noise with spectral factorisation in automatic speech recognition

Author: Gemmeke Jort
Hurmalainen Antti
Virtanen Tuomas
Publication venue: 'Elsevier BV'
Publication date: 01/05/2013

Hurmalainen A., Gemmeke J.F., Virtanen T., "Modelling non-stationary noise with spectral factorisation in automatic speech recognition", Computer Speech and Language, vol. 27, no. 3, pp. 763-779, May 2013. Status: published

Lirias

Object-based Modeling of Audio for Coding and Source Separation

Author: Nikunen Joonas
Publication venue: Tampere University of Technology
Publication date: 01/01/2015

This thesis studies several data decomposition algorithms for obtaining an object-based representation of an audio signal. The estimation of the representation parameters is coupled with audio-specific criteria, such as the spectral redundancy, sparsity, perceptual relevance and spatial position of sounds. The objective is to obtain an audio signal representation that is composed of meaningful entities called audio objects that reflect the properties of real-world sound objects and events. The estimation of the object-based model is based on magnitude spectrogram redundancy using non-negative matrix factorization with extensions to multichannel and complex-valued data. The benefits of working with object-based audio representations over the conventional time-frequency bin-wise processing are studied. The two main applications of the object-based audio representations proposed in this thesis are spatial audio coding and sound source separation from multichannel microphone array recordings. In the proposed spatial audio coding algorithm, the audio objects are estimated from the multichannel magnitude spectrogram. The audio objects are used for recovering the content of each original channel from a single downmixed signal, using time-frequency filtering. The perceptual relevance of modeling the audio signal is considered in the estimation of the parameters of the object-based model, and the sparsity of the model is utilized in encoding its parameters. Additionally, a quantization of the model parameters is proposed that reflects the perceptual relevance of each quantized element. The proposed object-based spatial audio coding algorithm is evaluated via listening tests, comparing its overall perceptual quality with that of conventional time-frequency block-wise methods at the same bitrates.
The proposed approach is found to produce comparable coding efficiency while providing additional functionality via the object-based coding domain representation, such as the blind separation of the mixture of sound sources in the encoded channels. For sound source separation from multichannel audio recorded by a microphone array, a method combining an object-based magnitude model and spatial covariance matrix estimation is considered. A direction-of-arrival-based model for the spatial covariance matrices of the sound sources is proposed. Unlike conventional approaches, the estimation of the parameters of the proposed spatial covariance matrix model ensures a spatially coherent solution for the spatial parameterization of the sound sources. The separation quality is measured with objective criteria, and the proposed method is shown to improve over state-of-the-art sound source separation methods on recordings made with a small microphone array.

Trepo - Institutional Repository of Tampere University