75 research outputs found
Studies on noise robust automatic speech recognition
Noise in everyday acoustic environments such as cars, traffic environments, and cafeterias remains one of the main challenges in automatic speech recognition (ASR). As a research theme, it has received wide attention in conferences and scientific journals focused on speech technology. This article collection reviews both the classic and novel approaches suggested for noise robust ASR. The articles are literature reviews written for the spring 2009 seminar course on noise robust automatic speech recognition (course code T-61.6060) held at TKK
Recognition of Harmonic Sounds in Polyphonic Audio using a Missing Feature Approach: Extended Report
A method based on local spectral features and missing-feature techniques is proposed for the recognition of harmonic sounds in mixture signals. A mask estimation algorithm is proposed to identify the spectral regions that contain reliable information for each sound source, and bounded marginalization is then employed to treat the feature-vector elements determined to be unreliable. The proposed method is tested on musical instrument sounds owing to the extensive availability of data, but it can be applied to other sounds (e.g. animal or environmental sounds) whenever these are harmonic. In simulations the proposed method clearly outperformed a baseline method on mixture signals
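The mask-plus-bounded-marginalization idea above can be sketched for a single frame under a diagonal Gaussian model. This is a hedged illustration, not the paper's implementation: the function names are ours, and we assume log-spectral energy features, where the clean value of an unreliable element is bounded above by the noisy observation.

```python
import math

def log_gauss(x, mu, var):
    # Log-density of a scalar Gaussian.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def log_gauss_cdf(x, mu, var):
    # Log of P(X <= x) for a scalar Gaussian, via the error function.
    p = 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2 * var)))
    return math.log(max(p, 1e-300))

def bounded_marginal_loglik(features, mask, means, variances):
    """Log-likelihood of one frame under a diagonal Gaussian.

    Reliable dimensions (mask == 1) are evaluated directly; unreliable
    dimensions are integrated from -inf up to the observed energy,
    since the clean value cannot exceed the noisy observation.
    """
    ll = 0.0
    for x, m, mu, var in zip(features, mask, means, variances):
        if m:                      # reliable: ordinary likelihood
            ll += log_gauss(x, mu, var)
        else:                      # unreliable: bounded marginalization
            ll += log_gauss_cdf(x, mu, var)
    return ll
```

During recognition, each sound class's model scores a frame this way, so classes whose reliable regions fit well are preferred while masked regions contribute only a bounded probability mass.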
Missing data mask models with global frequency and temporal constraints
Missing data recognition has been developed to increase noise robustness in automatic speech recognition. Many different factors, including the speech decoding process itself, must be considered to locate the masks. In this work, we consider Bayesian models of the masks, in which every spectral feature is classified as reliable or masked independently of the rest of the signal. This classification strategy can produce unrelated small ``spots'', whereas experiments suggest that oracle reliable and unreliable features tend to cluster into time-frequency blocks. We call this undesired effect the ``checkerboard'' effect. In this paper, we propose a new Bayesian missing data classifier that integrates frequency and temporal constraints in order to reduce, or avoid, this effect. The proposed classifier is evaluated on the Aurora2 connected digit corpora. Integrating such constraints into the missing data classification leads to significant improvements in recognition accuracy
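The paper's classifier is Bayesian; as a loose, non-Bayesian illustration of why frequency and temporal constraints suppress isolated ``spots'', the sketch below smooths a binary time-frequency mask by majority vote over each 3x3 neighborhood. The function name and vote threshold are our assumptions, not the paper's model.

```python
import numpy as np

def smooth_mask(mask, threshold=5):
    """Majority vote over each 3x3 time-frequency neighborhood.

    A cell is kept reliable only if at least `threshold` of the 9
    cells around it (including itself) are reliable, which removes
    isolated single-pixel "spots" in either direction.
    """
    padded = np.pad(mask.astype(int), 1, mode="edge")
    votes = np.zeros(mask.shape, dtype=int)
    for dt in (-1, 0, 1):          # neighboring time frames
        for df in (-1, 0, 1):      # neighboring frequency bands
            votes += padded[1 + dt : 1 + dt + mask.shape[0],
                            1 + df : 1 + df + mask.shape[1]]
    return votes >= threshold
```

An isolated reliable pixel in an unreliable region (or vice versa) is flipped to match its neighborhood, yielding the blocky masks that oracle experiments suggest.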
Mask Estimation For Missing Data Recognition Using Background Noise Sniffing
This paper addresses the problem of spectrographic mask estimation in the context of missing data recognition. Unlike other denoising methods, missing data recognition does not match the whole spectrum against the acoustic models, but instead considers that some time-frequency pixels are missing, i.e. corrupted by noise. Correctly estimating these ``masks'' is very important for missing data recognizers. We propose a new approach to this difficult challenge that exploits a priori knowledge about the masks observed in typical noisy environments. The proposed mask is obtained by combining these noise-dependent masks, with the combination driven by an environmental ``sniffing'' module that estimates the probability of being in each typical noisy condition. This missing data mask estimation procedure has been integrated into a complete missing data recognizer using bounded marginalization. Our approach is evaluated on the Aurora2 database
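The combination step described above can be sketched as a posterior-weighted average of noise-dependent masks. The environment names, mask values, and posterior probabilities below are invented placeholders; only the weighting scheme reflects the abstract.

```python
import numpy as np

# Hypothetical per-environment masks (1 = reliable pixel) and the
# posteriors produced by an environment-"sniffing" module.
masks = {
    "car":    np.array([[1.0, 0.0], [1.0, 1.0]]),
    "babble": np.array([[1.0, 1.0], [0.0, 1.0]]),
}
env_posterior = {"car": 0.7, "babble": 0.3}

# Soft combined mask: expectation of the noise-dependent masks under
# the estimated environment distribution.
combined = sum(p * masks[env] for env, p in env_posterior.items())
binary = combined >= 0.5   # optional hard decision per pixel
```

A soft mask of this kind can also be fed directly to a missing data recognizer that accepts per-pixel reliability probabilities.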
Sample Drop Detection for Distant-speech Recognition with Asynchronous Devices Distributed in Space
In many applications of multi-microphone, multi-device processing, the synchronization among different input channels can be affected by the lack of a common clock and by isolated drops of samples. In this work, we address the issue of sample drop detection in a conversational speech scenario recorded by a set of microphones distributed in space. The goal is to design a neural model that, given a short window in the time domain, detects whether one or more devices have been subjected to a sample drop event. The candidate time windows are selected from a set of large time intervals, possibly including a sample drop, by a preprocessing step based on normalized cross-correlation between signals acquired by different devices. The architecture of the neural network relies on a CNN-LSTM encoder followed by multi-head attention. The experiments are conducted using both artificial and real data. Our proposed approach obtained an F1 score of 88% on an evaluation set extracted from the CHiME-5 corpus. Comparable performance was found in a larger set of experiments conducted on a set of multi-channel artificial scenes.

Comment: Submitted to ICASSP 202
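The normalized cross-correlation preprocessing can be illustrated as follows: when one device drops samples, the lag at which its window best aligns with another device's window shifts by the number of lost samples. This is a simplified sketch under our own naming, not the authors' pipeline.

```python
import numpy as np

def normalized_xcorr(a, b):
    """Normalized cross-correlation of two equal-length windows."""
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return np.correlate(a, b, mode="full") / len(a)

def best_lag(a, b):
    """Lag (in samples) at which window `a` best aligns with `b`."""
    corr = normalized_xcorr(a, b)
    return int(np.argmax(corr)) - (len(b) - 1)
```

In a detection pipeline, a sudden change in the best-aligning lag between consecutive windows of two devices flags a candidate interval for the neural classifier.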
Estimating Single-Channel Source Separation Masks: Relevance Vector Machine Classifiers vs. Pitch-Based Masking
Audio sources frequently concentrate much of their energy into a relatively small proportion of the available time-frequency cells in a short-time Fourier transform (STFT). This sparsity makes it possible to separate sources, to some degree, simply by selecting STFT cells dominated by the desired source, setting all others to zero (or to an estimate of the obscured target value), and inverting the STFT to a waveform. The problem of source separation then becomes identifying the cells containing good target information. We treat this as a classification problem, and train a Relevance Vector Machine (a probabilistic relative of the Support Vector Machine) to perform this task. We compare this classifier both against SVMs (which achieve similar accuracy but are less efficient than RVMs) and against a traditional Computational Auditory Scene Analysis (CASA) technique based on a noise-robust pitch tracker, which the RVM outperforms significantly. Differences between the RVM- and pitch-tracker-based mask estimates suggest benefits could be obtained by combining the two
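The select-and-invert pipeline described above can be sketched with an oracle (ideal binary) mask. In the paper the mask is estimated by a trained classifier; the oracle mask here, which peeks at a clean reference, only illustrates the masking and inversion steps. The function name and parameters are our assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def ibm_separate(mix, reference, fs=16000, nperseg=512):
    """Separate the reference-dominated content from `mix` using an
    ideal binary mask: keep STFT cells where the reference carries
    more energy than the residual, zero the rest, then invert."""
    _, _, M = stft(mix, fs=fs, nperseg=nperseg)
    _, _, R = stft(reference, fs=fs, nperseg=nperseg)
    mask = np.abs(R) > np.abs(M - R)   # cells dominated by the target
    _, est = istft(M * mask, fs=fs, nperseg=nperseg)
    return est
```

Replacing the oracle comparison with a classifier's per-cell decision (an RVM in the paper) turns this into a practical single-channel separator.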