150 research outputs found

    Detection and handling of overlapping speech for speaker diarization

    Get PDF
    For the last several years, speaker diarization has been attracting substantial research attention as one of the spoken language technologies applied for the improvement, or enrichment, of recording transcriptions. Recordings of meetings, compared to other domains, exhibit an increased complexity due to the spontaneity of speech, reverberation effects, and the presence of overlapping speech. Overlapping speech refers to situations in which two or more speakers are speaking simultaneously. In meeting data, a substantial portion of the errors of conventional speaker diarization systems can be ascribed to speaker overlaps, since usually only one speaker label is assigned per segment. Furthermore, simultaneous speech included in training data can lead to corrupted single-speaker models and thus to worse segmentation. This thesis concerns the detection of overlapping speech segments and its application to the improvement of speaker diarization performance. We propose the use of three spatial cross-correlation-based parameters for overlap detection on distant microphone channel data. Spatial features from different microphone pairs are fused by means of principal component analysis, linear discriminant analysis, or a multi-layer perceptron. In addition, we investigate the possibility of employing long-term prosodic information. The most suitable subset from a set of candidate prosodic features is determined in two steps: first, a ranking according to the mRMR criterion is obtained, and then a standard hill-climbing wrapper approach is applied to determine the optimal number of features. The novel spatial and prosodic parameters are used in combination with spectral-based features suggested previously in the literature. In experiments conducted on AMI meeting data, we show that the newly proposed features do contribute to the detection of overlapping speech, especially on data originating from a single recording site. In speaker diarization, for segments including detected speaker overlap, a second speaker label is picked, and such segments are also discarded from model training. The proposed overlap labeling technique is integrated into Viterbi decoding, a part of the diarization algorithm. During system development it was discovered that it is favorable to optimize overlap exclusion and labeling independently with respect to the overlap detection system. We report improvements over the baseline diarization system on both single- and multi-site AMI data. Preliminary experiments with NIST RT data show DER improvement on the RT '09 meeting recordings as well. The addition of a beamforming and TDOA feature stream to the baseline diarization system, aimed at improving the clustering process, results in somewhat higher effectiveness of the overlap labeling algorithm. A more detailed analysis of the overlap exclusion behavior reveals large contrasts in improvement between individual meeting recordings, as well as between various settings of the overlap detection operating point. However, high performance variability across different recordings is also typical of the baseline diarization system, without any overlap handling.
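    As an illustration of the spatial cues involved, the sketch below computes a GCC-PHAT cross-correlation for one distant-microphone pair and derives simple peak statistics per pair; overlapping speech tends to flatten or split the correlation peak. This is a minimal sketch under assumed feature definitions (the names `gcc_phat` and `spatial_overlap_features` are hypothetical, not the thesis's exact parameters), and the PCA/LDA/MLP fusion stage is left to an off-the-shelf library.

```python
import numpy as np

def gcc_phat(x, y, n_fft=1024):
    """GCC-PHAT cross-correlation for one frame of a microphone pair.

    The sharpness of its peak is a crude single-speaker cue: with two
    simultaneous talkers the energy spreads over several delays.
    """
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(y, n_fft)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting
    return np.fft.fftshift(np.fft.irfft(cross, n_fft))

def spatial_overlap_features(frame_pairs):
    """Peak height and peak-to-runner-up ratio for every microphone pair.

    frame_pairs: list of (x, y) time-domain frames, one tuple per pair.
    The per-pair features would then be fused with PCA, LDA or an MLP.
    """
    feats = []
    for x, y in frame_pairs:
        cc = gcc_phat(x, y)
        peak = cc.max()
        runner_up = np.partition(cc, -2)[-2]
        feats.extend([peak, peak / (abs(runner_up) + 1e-12)])
    return np.asarray(feats)
```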

    Articulatory features for conversational speech recognition

    Get PDF

    Acoustic Approaches to Gender and Accent Identification

    Get PDF
    There has been considerable research on the problems of speaker and language recognition from samples of speech. A less researched problem is that of accent recognition. Although this is a similar problem to language identification, different accents of a language exhibit more fine-grained differences between classes than languages do. This presents a tougher problem for traditional classification techniques. In this thesis, we propose and evaluate a number of techniques for gender and accent classification. These techniques are novel modifications and extensions to state-of-the-art algorithms, and they result in enhanced performance on gender and accent recognition. The first part of the thesis focuses on the problem of gender identification and presents a technique that gives improved performance in situations where training and test conditions are mismatched. The bulk of this thesis is concerned with the application of the i-Vector technique to accent identification, which is the most successful approach to acoustic classification to have emerged in recent years. We show that it is possible to achieve high-accuracy accent identification without reliance on transcriptions and without utilising phoneme recognition algorithms. The thesis describes various stages in the development of i-Vector-based accent classification that improve on the standard approaches usually applied for speaker or language identification, which are insufficient. We demonstrate that very good accent identification performance is possible with acoustic methods by considering different i-Vector projections, front-end parameters, i-Vector configuration parameters, and an optimised fusion of the resulting i-Vector classifiers that can be obtained from the same data. We claim to have achieved the best accent identification performance on the test corpus for acoustic methods, with up to a 90% identification rate. This performance is even better than that of previously reported acoustic-phonotactic systems on the same corpus, and is very close to the performance obtained via transcription-based accent identification. Finally, we demonstrate that the utilisation of our techniques for speech recognition purposes leads to considerably lower word error rates. Keywords: Accent Identification, Gender Identification, Speaker Identification, Gaussian Mixture Model, Support Vector Machine, i-Vector, Factor Analysis, Feature Extraction, British English, Prosody, Speech Recognition.
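    The backbone of such a system is easy to outline. Below is a minimal cosine-scoring sketch over precomputed i-vectors, assuming one i-vector per utterance has already been extracted; the thesis's actual pipeline (projections, configuration tuning, classifier fusion) is considerably richer, and the function names here are illustrative only.

```python
import numpy as np

def length_norm(ivecs):
    """Project i-vectors onto the unit sphere (standard length normalisation)."""
    return ivecs / np.linalg.norm(ivecs, axis=1, keepdims=True)

def train_accent_means(ivecs, labels):
    """One mean i-vector per accent class, computed after length normalisation."""
    ivecs = length_norm(ivecs)
    labels = np.asarray(labels)
    classes = sorted(set(labels.tolist()))
    means = np.stack([ivecs[labels == c].mean(axis=0) for c in classes])
    return classes, length_norm(means)

def classify(ivec, classes, means):
    """Cosine scoring: pick the accent whose mean is closest on the unit sphere."""
    scores = means @ (ivec / np.linalg.norm(ivec))
    return classes[int(np.argmax(scores))]
```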

    Robust Anomaly Detection with Applications to Acoustics and Graphs

    Get PDF
    Our goal is to develop a robust anomaly detector that can be incorporated into pattern recognition systems that may need to learn, but will never be shunned for making egregious errors. The ability to know what we do not know is a concept often overlooked when developing classifiers to discriminate between different types of normal data in controlled experiments. We believe that an anomaly detector should be used to produce warnings in real applications when operating conditions change dramatically, especially when other classifiers only have a fixed set of bad candidates from which to choose. Our approach to distributional anomaly detection is to gather local information using features tailored to the domain, aggregate all such evidence to form a global density estimate, and then compare it to a model of normal data. A good match to a recognizable distribution is not required. By design, this process can detect the "unknown unknowns" [1] and properly react to the "black swan events" [2] that can have devastating effects on other systems. We demonstrate that our system is robust to anomalies that may not be well defined or well understood, even if they have contaminated the training data that is assumed to be non-anomalous. In order to develop a more robust speech activity detector, we reformulate the problem to include acoustic anomaly detection and demonstrate state-of-the-art performance using simple distribution modeling techniques that can run at very high speed. We begin by demonstrating our approach when training on purely normal conversational speech, then remove all annotation from our training data and demonstrate that our techniques can robustly accommodate anomalous training data contamination. When comparing continuous distributions in higher dimensions, we develop a novel method of discarding portions of a semi-parametric model to form a robust estimate of the Kullback-Leibler divergence. Finally, we demonstrate the generality of our approach by using the divergence between distributions of vertex invariants as a graph distance metric, and achieve state-of-the-art performance when detecting graph anomalies with neighborhoods of excessive or negligible connectivity. [1] D. Rumsfeld. (2002) Transcript: DoD news briefing - Secretary Rumsfeld and Gen. Myers. [2] N. N. Taleb, The Black Swan: The Impact of the Highly Improbable. Random House, 2007.
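    The distribution-comparison step can be sketched briefly. KL divergence between two mixture densities has no closed form, so a common workaround is a Monte Carlo estimate; the sketch below fits Gaussian mixtures with scikit-learn and compares them that way. It deliberately omits the thesis's robust trimming of the semi-parametric model, and the threshold and component count are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mc_kl_divergence(gmm_p, gmm_q, n_samples=10_000):
    """Monte Carlo estimate of KL(p || q) between two fitted mixture models:
    sample from p and average log p(x) - log q(x)."""
    x, _ = gmm_p.sample(n_samples)
    return float(np.mean(gmm_p.score_samples(x) - gmm_q.score_samples(x)))

# Usage sketch: flag a recording when its feature distribution drifts
# too far from the model of normal data (threshold is hypothetical).
# normal_model = GaussianMixture(n_components=8).fit(normal_features)
# test_model   = GaussianMixture(n_components=8).fit(test_features)
# is_anomalous = mc_kl_divergence(test_model, normal_model) > 3.0
```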

    EEG Signal Processing in Motor Imagery Brain Computer Interfaces with Improved Covariance Estimators

    Get PDF
    The research and development in the field of Brain Computer Interfaces (BCI) has been growing during the last years, motivated by several factors. As the knowledge about how the human brain works (of which we still know very little) grows, new advances in BCI systems are emerging that, in turn, serve as motivation for further research on this organ. In addition, BCI systems open a door for anyone to interact with their environment regardless of any physical disabilities they may have, simply by using their thoughts. Recently, the technology industry has begun to show interest in these systems, motivated both by the advances in what we know of the brain and how it works, and by the constant use we make of technology nowadays, whether through our smartphones, tablets or computers, among many other devices. This motivates companies like Facebook to invest in the development of BCI systems so that people (with or without disabilities) can communicate with their devices using only their brain. The work developed in this thesis focuses on BCI systems based on motor imagery. This means that the user imagines motor movements that are interpreted by a computer as commands. The brain signals to be translated into commands are obtained with an EEG device that is placed on the scalp and measures the electromagnetic activity produced by the brain. Working with these signals is complex since they are non-stationary and, in addition, usually heavily contaminated by noise or artifacts. We have approached this subject from the point of view of statistical signal processing and through machine learning algorithms. To this end, the BCI system has been split into three blocks: preprocessing, feature extraction and classification. After reviewing the state of the art of these blocks, we summarize and attach a set of publications we have produced in recent years, in which the different contributions that, from our point of view, improve each of the aforementioned blocks can be found. In brief, for the preprocessing block we propose a method that normalizes the sources of the EEG signals; by equalizing the effective sources, we improve the estimation of the covariance matrices. For the feature extraction block, we have managed to extend the CSP algorithm to unsupervised cases. Finally, in the classification block we have also managed to perform class separation in an unsupervised manner, and we have observed an improvement when the LDA algorithm is regularized by a method specific to Gaussian distributions.
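    For context, the supervised pipeline that this work builds on can be sketched compactly: per-trial spatial covariances feed a generalised eigendecomposition (CSP), whose extreme filters give log-variance features. The unsupervised CSP extension and the source-normalisation covariance improvements contributed by the thesis are not reproduced here; this is only the textbook baseline, with illustrative function names.

```python
import numpy as np
from scipy.linalg import eigh

def trial_covariance(x):
    """Spatial covariance of one EEG trial, x: (channels, samples),
    trace-normalised so that trials are on a comparable scale."""
    c = x @ x.T / x.shape[1]
    return c / np.trace(c)

def csp_filters(trials_a, trials_b, n_pairs=3):
    """Classic supervised CSP: generalised eigenvectors of the two
    class-average covariances; keep the extreme eigenvector pairs."""
    ca = np.mean([trial_covariance(t) for t in trials_a], axis=0)
    cb = np.mean([trial_covariance(t) for t in trials_b], axis=0)
    w, v = eigh(ca, ca + cb)                     # eigenvalues in ascending order
    idx = np.r_[np.arange(n_pairs), np.arange(len(w) - n_pairs, len(w))]
    return v[:, idx].T                           # (2*n_pairs, channels) filters

def log_variance_features(trial, filters):
    """Standard CSP features: log-variance of the spatially filtered trial."""
    return np.log(np.var(filters @ trial, axis=1))
```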

    Speech dereverberation and speaker separation using microphone arrays in realistic environments

    Get PDF
    This thesis concentrates on comparing novel and existing dereverberation and speaker separation techniques using multiple corpora, including a new corpus collected using a microphone array. Many corpora currently used for these techniques are recorded using head-mounted microphones in anechoic chambers; this novel corpus contains recordings with noise and reverberation made in office and workshop environments. The novel algorithms present a different way of approximating the reverberation, producing results that are competitive with existing algorithms. Dereverberation is evaluated using seven correlation-based algorithms applied to two different corpora, three of which are novel (Hs NTF, Cauchy WPE and Cauchy MIMO WPE). Both non-learning and learning algorithms are tested, with the learning algorithms performing better. For single- and multi-channel speaker separation, unsupervised non-negative matrix factorization (NMF) algorithms are compared using three cost functions combined with sparsity, convolution, and direction-of-arrival constraints. The results show that the choice of cost function is important for improving the separation result. Furthermore, six different supervised deep learning algorithms are applied to single-channel speaker separation, where using historical information improves the results. When comparing NMF to deep learning, NMF converges faster to a solution and provides a better result for the corpora used in this thesis.
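    As a pointer to how the unsupervised separation stage operates, the sketch below implements plain multiplicative-update NMF with the Euclidean cost on a magnitude spectrogram; the thesis additionally compares other cost functions and adds sparsity, convolution and direction-of-arrival constraints, none of which appear in this minimal version.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative-update NMF minimising the Euclidean cost ||V - WH||^2.

    V: magnitude spectrogram (freq x time). Columns of W act as spectral
    templates; rows of H are their activations over time.
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update templates
    return W, H

# Usage sketch: factorise the mixture spectrogram, assign template/
# activation pairs to speakers, and Wiener-filter the mixture with the
# per-speaker reconstructions to obtain separated signals.
```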

    Learning representations for speech recognition using artificial neural networks

    Get PDF
    Learning representations is a central challenge in machine learning. For speech recognition, we are interested in learning robust representations that are stable across different acoustic environments, recording equipment, and irrelevant inter- and intra-speaker variabilities. This thesis is concerned with representation learning for acoustic model adaptation to speakers and environments, the construction of acoustic models in low-resource settings, and learning representations from multiple acoustic channels. The investigations are primarily focused on the hybrid approach to acoustic modelling based on hidden Markov models and artificial neural networks (ANN). The first contribution concerns acoustic model adaptation. This comprises two new adaptation transforms operating in ANN parameter space. Both operate at the level of activation functions and treat a trained ANN acoustic model as a canonical set of fixed-basis functions, from which one can later derive variants tailored to the specific distribution present in adaptation data. The first technique, termed Learning Hidden Unit Contributions (LHUC), depends on learning distribution-dependent linear combination coefficients for hidden units. This technique is then extended to altering groups of hidden units with parametric and differentiable pooling operators. We found that the proposed adaptation techniques have many desirable properties: they are relatively low-dimensional, do not overfit, and can work in both a supervised and an unsupervised manner. For LHUC we also present extensions to speaker adaptive training and environment factorisation. On average, depending on the characteristics of the test set, relative word error rate reductions (WERR) of 5-25% are obtained in an unsupervised two-pass adaptation setting. The second contribution concerns building acoustic models in low-resource data scenarios. In particular, we are concerned with insufficient amounts of transcribed acoustic material for estimating acoustic models in the target language, while assuming that resources like lexicons or texts for estimating language models are available. First, we propose an ANN with a structured output layer which models both context-dependent and context-independent speech units, with the context-independent predictions used at runtime to aid the prediction of context-dependent states. We also propose to perform multi-task adaptation with a structured output layer. We obtain consistent WERRs of up to 6.4% in low-resource speaker-independent acoustic modelling. Adapting those models in a multi-task manner with LHUC reduces WERs by an additional 13.6%, compared to 12.7% for non-multi-task LHUC. We then demonstrate that one can build better acoustic models with unsupervised multi- and cross-lingual initialisation, and find that pre-training is largely language-independent. WERRs of up to 14.4% are observed, depending on the amount of transcribed acoustic data available in the target language. The third contribution concerns building acoustic models from multi-channel acoustic data. For this purpose, we investigate various ways of integrating and learning multi-channel representations. In particular, we investigate channel concatenation and the applicability of convolutional layers for this purpose. We propose a multi-channel convolutional layer with cross-channel pooling, which can be seen as a data-driven, non-parametric auditory attention mechanism. We find that for unconstrained microphone arrays, our approach is able to match the performance of comparable models trained on beamformer-enhanced signals.
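    The LHUC transform itself is compact enough to sketch. Following the (0, 2) amplitude re-parameterisation used in the LHUC literature, each hidden unit of the frozen network is rescaled by a speaker-dependent amplitude, and only those amplitudes are learned from adaptation data; the code below is a minimal numpy illustration, not the thesis implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lhuc_layer(h, alpha):
    """Learning Hidden Unit Contributions: rescale each hidden unit of a
    trained layer by a speaker-dependent amplitude constrained to (0, 2).

    h:     hidden activations of the frozen network, shape (batch, units)
    alpha: per-unit LHUC parameters for one speaker, shape (units,)
    """
    return h * (2.0 * sigmoid(alpha))

# Adaptation sketch: keep every ANN weight fixed, initialise alpha = 0
# (an amplitude of exactly 1, i.e. the unadapted model), and backpropagate
# the adaptation-data loss into alpha alone.
```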

    On robust spatial filtering of EEG in nonstationary environments

    Full text link

    Speech Recognition

    Get PDF
    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and the methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications able to operate in real-world environments, such as mobile communication services and smart homes.