46 research outputs found

    Microphone Array Post-filter based on Noise Field Coherence

    This article introduces a novel technique for estimating the signal power spectral density to be used in the transfer function of a microphone array post-filter. The technique is a generalisation of the existing Zelinski post-filter, which uses the auto- and cross-spectral densities of the array inputs to estimate the signal and noise spectral densities. The Zelinski technique, however, assumes zero cross-correlation between the noise on different sensors. This assumption is inaccurate, particularly at low frequencies and for arrays with closely spaced sensors, and thus the corresponding post-filter is sub-optimal in realistic noise conditions. In this article, a more general expression of the post-filter estimation is developed based on an assumed knowledge of the complex coherence of the noise field. This general expression can be used to construct a more appropriate post-filter in a variety of different noise fields. In experiments using real noise recordings from a computer office, the modified post-filter, constructed using a diffuse noise model, results in significant improvement in terms of objective speech quality measures and speech recognition performance.
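
    The estimation step can be illustrated with a short sketch. The following Python fragment is a minimal illustration rather than the article's implementation: it assumes the per-band auto- and cross-spectral densities have already been estimated for each microphone pair, removes the noise contribution predicted by an assumed noise coherence Gamma_ij(f) from each cross-spectrum, and averages across pairs in the same way as the Zelinski post-filter (the function name and argument layout are illustrative).

        import numpy as np

        def coherence_postfilter(auto_psd, cross_psd, noise_coherence, eps=1e-10):
            """Estimate a post-filter gain from array auto-/cross-spectral densities.

            auto_psd        : (M, F) real auto-spectral densities of the M inputs
            cross_psd       : dict {(i, j): length-F complex cross-spectral density}
            noise_coherence : dict {(i, j): length-F complex noise coherence Gamma_ij}
            Returns a length-F real gain, clipped to [0, 1].
            """
            pair_estimates = []
            for (i, j), phi_ij in cross_psd.items():
                gamma = np.real(noise_coherence[(i, j)])
                # Keep away from the singularity where the noise coherence approaches 1.
                gamma = np.minimum(gamma, 0.99)
                numerator = np.real(phi_ij) - 0.5 * gamma * (auto_psd[i] + auto_psd[j])
                pair_estimates.append(numerator / (1.0 - gamma))
            phi_ss = np.mean(pair_estimates, axis=0)   # averaged signal PSD estimate
            phi_xx = np.mean(auto_psd, axis=0)         # averaged input PSD
            return np.clip(phi_ss / (phi_xx + eps), 0.0, 1.0)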

    A Sector-Based Approach for Localization of Multiple Speakers with Microphone Arrays

    Microphone arrays are useful in meeting rooms, where speech needs to be acquired and segmented. For example, automatic speech segmentation allows an enhanced browsing experience and facilitates automatic analysis of large amounts of data. Spontaneous multi-party speech includes many overlaps between speakers; moreover, other audio sources such as laptops and projectors can be active. For these reasons, locating multiple wideband sources in a reasonable amount of time is highly desirable. In existing multisource localization approaches, search initialization is very often an issue left open. We propose here a methodology for estimating speech activity in a given sector of the space rather than at a particular point. In experiments on more than one hour of speech from real meeting room multisource recordings, we show that the sector-based approach greatly reduces the search space while achieving effective localization of multiple concurrent speakers.
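
    The sector-level idea can be sketched as follows. This is an illustrative Python fragment only, assuming the space around the array has been partitioned into sectors, each represented by a small set of candidate points; activity in a sector is scored by aggregating steered-response power over those points instead of evaluating every point of a fine grid (the names, arguments and exact scoring rule are assumptions, not the paper's measure).

        import numpy as np

        def sector_activity(frame_stft, mic_pos, sectors, fs, c=343.0):
            """Score speech activity per spatial sector for one analysis frame.

            frame_stft : (M, F) complex STFT of one frame for M microphones
            mic_pos    : (M, 3) microphone coordinates in metres
            sectors    : list of (P, 3) arrays of candidate points, one array per sector
            Returns one steered-response power score per sector.
            """
            M, F = frame_stft.shape
            freqs = np.fft.rfftfreq(2 * (F - 1), d=1.0 / fs)
            scores = []
            for points in sectors:
                power = 0.0
                for p in points:
                    # Propagation delay from the candidate point to each microphone.
                    tau = np.linalg.norm(mic_pos - p, axis=1) / c
                    # Phase compensation aligns the channels before summing them.
                    steering = np.exp(2j * np.pi * freqs[None, :] * tau[:, None])
                    power += np.sum(np.abs(np.sum(frame_stft * steering, axis=0)) ** 2)
                scores.append(power / len(points))
            return np.array(scores)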

    Microphone Array Post-filter for Diffuse Noise Field

    This paper proposes a novel technique for estimating the signal power spectral density to be used in the transfer function of a microphone array post-filter. The technique is a modification of the existing Zelinski post-filter, which uses the auto- and cross-spectral densities of the array inputs to estimate the signal and noise spectral densities. The Zelinski technique, however, assumes zero cross-correlation between noise on different sensors. This assumption is inaccurate in real conditions, particularly at low frequencies and for arrays with closely spaced sensors. In this paper, we replace this assumption with that of a theoretically diffuse noise field, which is more appropriate in a variety of realistic noise environments. In experiments using noise recordings from an office of computer workstations, the modified post-filter results in significant improvement in terms of objective speech quality measures and speech recognition performance.
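
    For reference, the coherence of an ideal spherically diffuse noise field between two sensors spaced d metres apart is the well-known sinc function of frequency; this theoretical quantity is what stands in for the measured noise coherence under the diffuse-field assumption. A one-line illustrative sketch (not the paper's code) is given below.

        import numpy as np

        def diffuse_noise_coherence(freqs, d, c=343.0):
            """Theoretical coherence of a spherically diffuse noise field for a
            sensor spacing of d metres: Gamma(f) = sin(2*pi*f*d/c) / (2*pi*f*d/c).

            numpy's sinc is the normalised sin(pi*x)/(pi*x), hence the argument 2*f*d/c.
            """
            return np.sinc(2.0 * np.asarray(freqs) * d / c)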

    Location Based Speaker Segmentation

    This paper proposes a technique that segments audio into speaker turns based on speaker location, essentially implementing a discrete source tracking system. In many multi-party conversations, such as meetings or teleconferences, the location of participants is restricted to a small number of regions, such as seats around a table. In such cases, segmentation according to these discrete regions would be a reliable means of determining speaker turns. We propose a system that uses microphone pair time delays as features to represent speaker locations. A GMM/HMM framework is used to determine an optimal segmentation of the audio according to these locations. We also demonstrate how this approach is easily extended to more complex cases, such as the presence of two simultaneous speakers. Experiments testing the system on real recordings from a meeting room show that the proposed location features can provide greater discrimination than standard cepstral features, and also demonstrate the success of the extension to handle dual-speaker overlap.
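
    The location features referred to above are time delays, one per microphone pair and per analysis frame. A minimal Python sketch of delay estimation with GCC-PHAT follows (the paper's exact feature extraction may differ; the function is illustrative). Each frame's vector of pair-wise delays can then be modelled with one GMM per seating region, and an HMM over these region states yields the segmentation.

        import numpy as np

        def gcc_phat_tdoa(x1, x2, fs, max_delay=None):
            """Estimate the time delay (in seconds) between two microphone signals
            using the PHAT-weighted generalised cross-correlation."""
            n = len(x1) + len(x2)
            X1, X2 = np.fft.rfft(x1, n=n), np.fft.rfft(x2, n=n)
            cross = X1 * np.conj(X2)
            cross /= np.abs(cross) + 1e-12               # PHAT weighting
            cc = np.fft.irfft(cross, n=n)
            max_shift = int(max_delay * fs) if max_delay else n // 2
            # Re-centre the circular correlation around zero lag.
            cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
            return (np.argmax(np.abs(cc)) - max_shift) / fs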

    Speech/Music Discrimination using Entropy and Dynamism Features in a HMM Classification Framework

    In this paper, we present a new approach towards high-performance speech/music discrimination on realistic tasks related to the automatic transcription of broadcast news. In the approach presented here, the (local) Probability Density Function (PDF) estimators trained on clean microphone speech (as used in a standard large vocabulary speech recognition system) are used as a channel model, at the output of which the entropy and "dynamism" will be measured and integrated over time through a 2-state (speech and non-speech) hidden Markov model (HMM) with minimum duration constraints. Indeed, in the case of entropy, it is clear that, on average, the entropy at the output of the local PDF estimators will be larger for speech signals than for non-speech signals presented at their input. In our case, local probabilities will be estimated from a multilayer perceptron (MLP) as used in hybrid HMM/MLP systems, thus guaranteeing the use of "real" probabilities in the estimation of the entropy. The 2-state speech/non-speech HMM will thus take these two-dimensional features (entropy and "dynamism"), whose distributions will be modeled through (two-dimensional) multi-Gaussian densities or an MLP, whose parameters are trained through a Viterbi algorithm. Different experiments, including different speech and music styles, as well as different (a priori) distributions of the speech and music signals (real data distribution, mostly speech, or mostly music), will illustrate the robustness of the approach, always resulting in a correct segmentation performance higher than 90%. Finally, we will show how a confidence measure can be used to further improve the segmentation results, and also discuss how this may be used to extend the technique to the case of speech/music mixtures.
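
    The two features can be computed directly from the MLP output. The sketch below is purely illustrative (the exact definition of "dynamism" and any temporal averaging before HMM decoding are assumptions here, not taken from the paper): per-frame entropy measures how peaked the posterior distribution is, while dynamism measures how strongly the posteriors change between consecutive frames.

        import numpy as np

        def entropy_dynamism(posteriors, eps=1e-12):
            """Turn a (T, K) matrix of per-frame MLP phoneme posteriors into the
            two-dimensional (entropy, dynamism) feature stream for the 2-state HMM."""
            p = np.clip(posteriors, eps, 1.0)
            entropy = -np.sum(p * np.log(p), axis=1)            # frame-wise entropy
            delta = np.diff(posteriors, axis=0, prepend=posteriors[:1])
            dynamism = np.sum(delta ** 2, axis=1)               # frame-to-frame change
            return np.stack([entropy, dynamism], axis=1)        # shape (T, 2)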

    An Online Audio Indexing System

    This paper presents an overview of an online audio indexing system, which creates a searchable index of speech content embedded in digitized audio files. This system is based on our recently proposed offline audio segmentation techniques. As the data arrives continuously, the system first finds boundaries of acoustically homogeneous segments. Next, each of these segments is classified as speech, music or mixture, where mixtures are defined as regions where speech and other non-speech sounds are present simultaneously and noticeably. The speech segments are then clustered together to provide consistent speaker labels. The speech and mixture segments are converted to text via an ASR system. The resulting words are time-stamped together with other metadata information (speaker identity, speech confidence score) in an XML file to rapidly identify and access target segments. In this paper, we analyze the performance at each stage of this audio indexing system and also compare it with the performance of the corresponding offline modules.
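
    As a rough illustration of what such a time-stamped index record might look like, the fragment below builds one hypothetical XML word entry with Python's standard library; the element and attribute names are invented for the example and are not taken from the system described in the paper.

        import xml.etree.ElementTree as ET

        def index_entry(word, start, end, speaker, confidence):
            """Build one hypothetical index element: a recognized word with its
            time stamps, speaker label and ASR confidence score."""
            el = ET.Element("word", {
                "start": f"{start:.2f}",
                "end": f"{end:.2f}",
                "speaker": speaker,
                "conf": f"{confidence:.2f}",
            })
            el.text = word
            return el

        root = ET.Element("audio_index", {"source": "recording_001"})
        root.append(index_entry("hello", 12.34, 12.71, "spk1", 0.92))
        print(ET.tostring(root, encoding="unicode"))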

    Robust Audio Segmentation

    Audio segmentation, in general, is the task of segmenting a continuous audio stream into acoustically homogeneous regions, where the rule of homogeneity depends on the task. This thesis aims at developing and investigating efficient, robust and unsupervised techniques for three important tasks related to audio segmentation, namely speech/music segmentation, speaker change detection and speaker clustering.

    The speech/music segmentation technique proposed in this thesis is based on the functioning of an HMM/ANN hybrid ASR system, where an MLP estimates the posterior probabilities of different phonemes. These probabilities exhibit a particular pattern when the input is a speech signal. This pattern is captured in the form of feature vectors, which are then integrated in an HMM framework. The technique thus segments the audio data into recognizable and non-recognizable segments. The efficiency of the proposed technique is demonstrated by a number of experiments conducted on broadcast news data exhibiting real-life scenarios (different speech and music styles, overlapping speech and music, non-speech sounds other than music, etc.).

    A novel distance metric is proposed in this thesis for the purpose of finding speaker segment boundaries (speaker change detection). The proposed metric can be seen as a special case of the Log Likelihood Ratio (LLR) or the Bayesian Information Criterion (BIC), where the number of parameters in the two models (or hypotheses) is forced to be equal. However, the advantage of the proposed metric over LLR, BIC and other metric-based approaches is that it achieves comparable performance without requiring an adjustable threshold/penalty term, hence also eliminating the need for a development dataset.

    Speaker clustering is the task of unsupervised classification of the audio data in terms of speakers. For this purpose, a novel HMM-based agglomerative clustering algorithm is proposed, where, starting from a large number of clusters, the closest clusters are merged in an iterative process. A novel merging criterion is proposed for this purpose, which does not require an adjustable threshold value; the stopping criterion is therefore met automatically when no clusters remain to be merged. The efficiency of the proposed algorithm is demonstrated with various experiments on broadcast news data, and it is shown that the proposed criterion outperforms the use of LLR, even when LLR is used with an optimal threshold value.

    These tasks play an important role in the pre-processing stages of ASR. For example, correctly identifying non-recognizable segments in the audio stream and excluding them from recognition saves computation time in ASR and results in more meaningful transcriptions. Moreover, researchers have clearly shown the positive impact on transcription accuracy of further clustering the identified speech segments in terms of speakers (speaker clustering). This processing also has various other interesting and practical applications. For example, it provides characteristic information about the data (metadata), which is useful for the indexing of audio documents. One such application is investigated in this thesis: the metadata is extracted and combined with the ASR output, resulting in a Rich Transcription (RT) which is much easier for an end-user to understand. In a further application, speaker clustering was combined with the precise location information available in scenarios like smart meeting rooms to segment meeting recordings jointly in terms of speakers and their locations in a meeting room. This is useful for automatic meeting summarization, as it enables answering questions like "who is speaking and where". This could be used to access, for example, a specific presentation made by a particular speaker, or all the speech segments belonging to a particular speaker.
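
    The idea behind the penalty-free change/merging criterion can be sketched as follows. This Python fragment is an illustration under simplifying assumptions (single full-covariance Gaussians per segment and a standard EM-trained GMM, rather than the thesis's exact HMM-based estimation): because both hypotheses use the same number of parameters, the usual BIC penalty term cancels and the two log-likelihoods can be compared directly, with no tunable threshold.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        def change_score(seg1, seg2):
            """Compare two hypotheses of equal complexity over two feature segments
            (rows = frames): (a) one Gaussian per segment versus (b) a 2-component
            GMM on the pooled data. A positive score favours keeping the segments
            separate (a change / no-merge decision); a negative score favours merging.
            """
            pooled = np.vstack([seg1, seg2])
            g_pool = GaussianMixture(n_components=2, covariance_type="full").fit(pooled)
            g1 = GaussianMixture(n_components=1, covariance_type="full").fit(seg1)
            g2 = GaussianMixture(n_components=1, covariance_type="full").fit(seg2)
            ll_separate = g1.score(seg1) * len(seg1) + g2.score(seg2) * len(seg2)
            ll_pooled = g_pool.score(pooled) * len(pooled)
            return ll_separate - ll_pooled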

    Linguistic and Psychometric Validation (Cultural Adaptation) of the PLUTSS Scale for Lower Urinary Tract Dysfunction in Colombian Children

    Eighty percent of children with recurrent urinary tract infection have some symptom of lower urinary tract dysfunction. According to the ICCS (International Children's Continence Society), these symptoms are classified by the phase of bladder function in which the alteration occurs: filling symptoms, voiding symptoms and associated symptoms. Characterizing these symptoms objectively, so that they are not simply descriptive accounts of patient complaints and can be used for diagnosis and for monitoring treatment, required scales that score each of them. Such scales originate from the concept of the I-PSS (International Prostate Symptom Score), a very useful tool for classifying prostatic hypertrophy. Today there are three validated tools for assessing lower urinary tract disorders in children; however, none of them has been translated into Spanish or culturally adapted to the Hispanic American population. The objective of this study is the cultural adaptation (linguistic and psychometric validation) of the PLUTSS scale (4), which is already validated and widely used, in order to apply it to a group of Colombian children, thereby establishing the behavior of these symptoms in our population and providing a diagnostic and follow-up tool for children with lower urinary tract dysfunction.

    METHODOLOGY: The PLUTSS scale (Pediatric Lower Urinary Tract Symptom Score) was translated into Spanish and adapted to Colombian usage following the accepted standards of translation, synthesis, back-translation and expert review, and was then applied to a group of 34 patients with a clinical diagnosis of lower urinary tract disorder and to 95 healthy controls. Face validity and construct validity were assessed, the internal consistency of the instrument was evaluated, and the results were compared with those obtained with the original scale.

    RESULTS: The medians of the two groups (healthy and affected) were significantly different, with a sensitivity and specificity of 90% at a cut-off point of 1.5. Internal consistency of the 13-item scale was high (Cronbach's alpha 0.88). Criterion validity against the clinical diagnosis showed a strong and significant correlation.

    CONCLUSIONS: The PLUTSS scale, linguistically and psychometrically validated according to international standards for scale validation, is the only such scale adapted to Spanish. It showed a high correlation with the clinical diagnosis and a high power to discriminate urinary symptoms.

    Robust Speech Recognition with Small Microphone Arrays using the Missing Data Approach

    Traditional microphone array speech recognition systems simply recognise the enhanced output of the array. As the level of signal enhancement depends on the number of microphones, such systems do not achieve acceptable speech recognition performance for arrays having only a few microphones. For small microphone arrays, we instead propose using the enhanced output to estimate a reliability mask, which is then used in missing data speech recognition. In missing data speech recognition, the decoded sequence depends on the reliability of each input feature. This reliability is usually based on the signal-to-noise ratio in each frequency band. In this paper, we use the energy difference between the noisy input and the enhanced output of a small microphone array to determine the frequency band reliability. Recognition experiments with a small array demonstrate the effectiveness of the technique, compared to both traditional microphone array enhancement and a baseline missing data system.
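
    The masking step can be sketched in a few lines. The fragment below is illustrative only (the margin and the exact comparison are assumptions, not the paper's settings): a spectro-temporal feature is flagged as unreliable when the array removed substantially more energy from that band than a chosen margin, suggesting the band was dominated by noise.

        import numpy as np

        def reliability_mask(noisy_spec, enhanced_spec, margin_db=3.0, eps=1e-12):
            """Binary reliability mask for missing-data ASR.

            noisy_spec, enhanced_spec : (T, F) magnitude spectrograms of a reference
            noisy channel and of the microphone-array output.
            Returns a boolean (T, F) array, True where the feature is kept as reliable.
            """
            removed_db = 20.0 * np.log10((noisy_spec + eps) / (enhanced_spec + eps))
            return removed_db < margin_db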