
    The use of long-term features for GMM- and i-vector-based speaker diarization systems

    Several factors contribute to the performance of speaker diarization systems. For instance, the appropriate selection of speech features is one of the key aspects that affect speaker diarization systems. Other factors include the techniques employed to perform both segmentation and clustering. While static mel-frequency cepstral coefficients are the most widely used features in speech-related tasks including speaker diarization, several studies have shown the benefits of augmenting these static features with additional ones. In this work, we have proposed and assessed the use of voice-quality features (i.e., jitter, shimmer, and Glottal-to-Noise Excitation ratio) within the framework of speaker diarization. These acoustic attributes are employed together with state-of-the-art short-term cepstral and long-term prosodic features. Additionally, the use of delta dynamic features is also explored separately for both the segmentation and bottom-up clustering sub-tasks. The combination of the different feature sets is carried out at several levels. At the feature level, the long-term speech features are stacked in the same feature vector. At the score level, the short- and long-term speech features are modeled independently and fused at the score likelihood level. Various feature combinations have been applied to both Gaussian mixture modeling and i-vector-based speaker diarization systems. The experiments have been carried out on the Augmented Multi-party Interaction (AMI) meeting corpus. The best result, in terms of diarization error rate, is obtained using i-vector-based cosine-distance clustering together with a signal parameterization consisting of a combination of static cepstral coefficients, delta, voice-quality, and prosodic features. The best result shows about 24% relative diarization error rate improvement compared to the baseline system, which is based on Gaussian mixture modeling and short-term static cepstral coefficients.
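    As a reading aid, the following is a minimal sketch of the two combination strategies described above (feature-level stacking and score-level fusion), written with generic tools (librosa, scikit-learn) rather than the authors' actual system. Frame-level F0 stands in for the long-term prosodic stream, since jitter, shimmer, and GNE require a dedicated voice-quality toolkit; the file name and fusion weight are illustrative.

        # Hedged sketch: feature-level stacking vs. score-level fusion.
        import numpy as np
        import librosa
        from sklearn.mixture import GaussianMixture

        y, sr = librosa.load("meeting.wav", sr=16000)   # hypothetical input

        # Short-term stream: static MFCCs plus their deltas.
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        delta = librosa.feature.delta(mfcc)
        cep = np.vstack([mfcc, delta]).T                # (frames, 26)

        # Long-term stand-in: frame-level F0 (real voice-quality features
        # such as jitter/shimmer/GNE would come from a dedicated toolkit).
        f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
        f0 = np.nan_to_num(f0)
        n = min(len(f0), cep.shape[0])
        cep, f0 = cep[:n], f0[:n]

        # Feature-level combination: stack streams into one vector.
        stacked = np.hstack([cep, f0[:, None]])         # (frames, 27)

        # Score-level combination: model streams separately, then fuse the
        # per-frame log-likelihoods with a weight tuned on development data.
        gmm_cep = GaussianMixture(n_components=8, random_state=0).fit(cep)
        gmm_pro = GaussianMixture(n_components=4, random_state=0).fit(f0[:, None])
        alpha = 0.9                                     # illustrative weight
        fused = (alpha * gmm_cep.score_samples(cep)
                 + (1 - alpha) * gmm_pro.score_samples(f0[:, None]))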

    A Nonlinear Mixture Autoregressive Model For Speaker Verification

    In this work, we apply a nonlinear mixture autoregressive (MixAR) model to supplant the Gaussian mixture model (GMM) for speaker verification. MixAR is a statistical model that is a probabilistically weighted combination of components, each of which is an autoregressive filter in addition to a mean. The probabilistic mixing and the data-dependent weights are responsible for the nonlinear nature of the model. Our experiments with synthetic as well as real speech data from standard speech corpora show that the MixAR model outperforms the GMM, especially under unseen noisy conditions. Moreover, MixAR did not require delta features and used 2.5x fewer parameters to achieve performance comparable to or better than that of a GMM using static as well as delta features. MixAR also suffered less from overfitting than the GMM when training data was sparse. However, MixAR performance deteriorated more quickly than that of the GMM when the duration of the evaluation data was reduced. This could pose limitations on the minimum amount of evaluation data required when using the MixAR model for speaker verification.
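    To make the model concrete, here is a small numerical sketch of the MixAR conditional density as the abstract describes it: each component is an autoregressive filter plus a mean, and the mixture weights depend on the recent samples. The softmax gating and all parameter values below are illustrative assumptions, not the paper's estimates.

        # Hedged sketch of a 1-D MixAR conditional log-likelihood.
        import numpy as np

        def mixar_loglik(x, a, m, sigma, g):
            """Per-sample log p(x_t | x_{t-1..t-p}).

            a     : (K, p) AR coefficients per component
            m     : (K,)   per-component means
            sigma : (K,)   per-component noise standard deviations
            g     : (K, p) gating weights (softmax form assumed here)
            """
            K, p = a.shape
            out = []
            for t in range(p, len(x)):
                past = x[t - p:t][::-1]            # x_{t-1}, ..., x_{t-p}
                logits = g @ past                  # data-dependent weights...
                w = np.exp(logits - logits.max())
                w /= w.sum()                       # ...normalized by softmax
                mu = a @ past + m                  # AR prediction per component
                pdf = np.exp(-0.5 * ((x[t] - mu) / sigma) ** 2) \
                      / (np.sqrt(2 * np.pi) * sigma)
                out.append(np.log(np.dot(w, pdf)))
            return np.array(out)

        # Two-component toy model of order p = 2 on random data.
        x = np.random.default_rng(0).standard_normal(200)
        ll = mixar_loglik(x,
                          a=np.array([[0.5, -0.2], [0.1, 0.3]]),
                          m=np.array([0.0, 0.5]),
                          sigma=np.array([1.0, 0.8]),
                          g=np.array([[1.0, 0.0], [-1.0, 0.0]]))
        print(ll.mean())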

    Emotional Prosody Processing in the Schizophrenia Spectrum.

    Emotional prosody processing impairment is proposed to be a main contributing factor in the formation of auditory verbal hallucinations in patients with schizophrenia. To evaluate this assumption, five experiments in healthy, highly schizotypal, and schizophrenia populations are presented. The first part of the thesis seeks to reveal the neural underpinnings of emotional prosody comprehension (EPC) in a non-clinical population, as well as the modulation of prosodic abilities by hallucination traits. By revealing the brain representation of EPC, an overlap at the neural level between EPC and auditory verbal hallucinations (AVH) was strongly suggested. By assessing the influence of hallucinatory traits on EPC abilities, a continuum in the schizophrenia spectrum was established in which the highly schizotypal population mirrors the neurocognitive profile of schizophrenia patients. Moreover, by studying the relation between AVH and EPC in a non-clinical population, potential confounding effects of medication on the findings were minimized. The second part of the thesis assessed two EPC-related abilities in schizophrenia patients with and without hallucinations. First, voice identity recognition, a skill which relies on the analysis of some of the same acoustical features as EPC, was evaluated in patients and controls. Finally, the last study presented in the current thesis assessed the influence that implicit processing of emotional prosody has on selective attention in patients and controls. Both patient studies demonstrate that voice identity recognition deficits, as well as abnormal modulation of selective attention by implicit emotional prosody, are related to hallucinations exclusively and not to schizophrenia in general. In the final discussion, a model in which EPC deficits are a crucial factor in the formation of AVH is evaluated. The experimental findings presented in the previous chapters strongly suggest that the perception of prosodic features is impaired in patients with AVH, resulting in the aberrant perception of irrelevant auditory objects with emotionally salient prosody that capture the attention of the hearer and whose sources (speaker identity) cannot be recognized. Such impairments may be due to structural and functional abnormalities in a network which comprises the superior temporal gyrus as a central element.

    Frame-level features conveying phonetic information for language and speaker recognition

    This thesis, developed in the Software Technologies Working Group of the Department of Electricity and Electronics of the University of the Basque Country, focuses on the research field of spoken language and speaker recognition technologies. More specifically, the research carried out studies the design of a set of features conveying spectral acoustic and phonotactic information, searches for the optimal feature extraction parameters, and analyses the integration and usage of the features in language recognition systems, as well as the complementarity of these approaches with regard to state-of-the-art systems. The study reveals that systems trained on the proposed set of features, denoted as Phone Log-Likelihood Ratios (PLLRs), are highly competitive, outperforming other state-of-the-art systems in several benchmarks. Moreover, PLLR-based systems also provide complementary information with regard to other phonotactic and acoustic approaches, which makes them suitable in fusions to improve the overall performance of spoken language recognition systems. The usage of these features is also studied in speaker recognition tasks. In this context, the results attained by the approaches based on PLLR features are not as remarkable as those of systems based on standard acoustic features, but they still provide complementary information that can be used to enhance the overall performance of speaker recognition systems.
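    Since PLLRs are usually defined as the logit of frame-level phone posteriors produced by a phone decoder, the following minimal sketch shows that transform on stand-in data; the random posterior matrix is only a placeholder for real decoder output.

        # Hedged sketch: phone posteriors -> PLLR features via the logit.
        import numpy as np

        def pllr(posteriors, eps=1e-10):
            """Map a (frames, phones) posterior matrix to PLLR features."""
            p = np.clip(posteriors, eps, 1.0 - eps)   # guard against log(0)
            return np.log(p / (1.0 - p))              # per-phone log-odds

        # Stand-in posteriors: 100 frames over a 40-phone inventory.
        raw = np.random.default_rng(0).random((100, 40))
        post = raw / raw.sum(axis=1, keepdims=True)   # rows sum to one

        features = pllr(post)
        print(features.shape)                          # (100, 40)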

    Subsidia: Tools and Resources for Speech Sciences

    This book, the result of a collaboration among researchers who are experts in their respective areas, aims to assist the scientific community by compiling and describing a series of highly useful materials for continuing to advance research.

    Voice Modeling Methods for Automatic Speaker Recognition

    Building a voice model means capturing the characteristics of a speaker's voice in a data structure. This data structure is then used by a computer for further processing, such as comparison with other voices. Voice modeling is a vital step in the process of automatic speaker recognition, which itself is the foundation of several applied technologies: (a) biometric authentication, (b) speech recognition, and (c) multimedia indexing.

    Several challenges arise in the context of automatic speaker recognition. First, there is the problem of data shortage, i.e., the unavailability of sufficiently long utterances for speaker recognition. It stems from the fact that the speech signal conveys different aspects of the sound in a single, one-dimensional time series: linguistic (what is said?), prosodic (how is it said?), individual (who said it?), locational (where is the speaker?), and emotional features of the speech sound itself (to name a few) are contained in the speech signal, as well as acoustic background information. To analyze a specific aspect of the sound regardless of the other aspects, analysis methods have to be applied at a specific time scale (length) of the signal at which this aspect stands out from the rest. For example, linguistic information (i.e., which phone or syllable has been uttered?) is found in very short time spans of only milliseconds in length. On the contrary, speaker-specific information emerges more clearly the longer the analyzed sound is. Long utterances, however, are not always available for analysis. Second, the speech signal is easily corrupted by background sound sources (noise, such as music or sound effects). Their characteristics, if present, tend to dominate a voice model, such that model comparison might then be driven mainly by background features instead of speaker characteristics.

    Current automatic speaker recognition works well under relatively constrained circumstances, such as studio recordings, or when prior knowledge of the number and identity of occurring speakers is available. Under more adverse conditions, such as in feature films or amateur material on the web, the achieved speaker recognition scores drop below a rate that is acceptable for an end user or for further processing. For example, the typical speaker turn duration of only one second and the sound-effect background in cinematic movies render most current automatic analysis techniques useless.

    In this thesis, methods for voice modeling that are robust with respect to short utterances and background noise are presented. The aim is to facilitate movie analysis with respect to occurring speakers. Therefore, algorithmic improvements are suggested that (a) improve the modeling of very short utterances, (b) facilitate voice model building even in the case of severe background noise, and (c) allow for efficient voice model comparison to support the indexing of large multimedia archives. The proposed methods improve the state of the art in terms of recognition rate and computational efficiency.

    Going beyond selective algorithmic improvements, subsequent chapters also investigate the question of what is lacking in principle in current voice modeling methods. By reporting on a study with human probands, it is shown that the exclusion of time coherence information from a voice model induces an artificial upper bound on the recognition accuracy of automatic analysis methods. A proof-of-concept implementation confirms the usefulness of exploiting this kind of information by halving the error rate. This result questions the general speaker modeling paradigm of the last two decades and presents a promising new way.

    The approach taken to arrive at the previous results is based on a novel methodology of algorithm design and development called "eidetic design". It uses a human-in-the-loop technique that analyses existing algorithms in terms of their abstract intermediate results. The aim is to detect flaws or failures in them intuitively and to suggest solutions. The intermediate results often consist of large matrices of numbers whose meaning is not clear to a human observer. Therefore, the core of the approach is to transform them into a suitable domain of perception (such as, e.g., the auditory domain of speech sounds in the case of speech feature vectors) where their content, meaning, and flaws are intuitively clear to the human designer. This methodology is formalized, and the corresponding workflow is explicated through several use cases.

    Finally, the use of the proposed methods in video analysis and retrieval is presented. This demonstrates the applicability of the developed methods and the accompanying software library sclib by means of improved results using a multimodal analysis approach. The sclib source code is available to the public upon request to the author. A summary of the contributions together with an outlook on short- and long-term future work concludes this thesis.
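    The perceptual-transformation step at the heart of eidetic design can be illustrated with off-the-shelf tools: resynthesizing audio from MFCC vectors lets a designer listen to what the features retain and lose. The sketch below uses librosa's inverse-MFCC utility rather than the thesis's sclib library, and the file names are hypothetical.

        # Hedged sketch: hear what MFCC feature vectors preserve.
        import librosa
        import soundfile as sf

        y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical file
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

        # Invert the MFCCs back to a waveform; the result is intelligible
        # but degraded, which makes flaws in the features audible.
        y_hat = librosa.feature.inverse.mfcc_to_audio(mfcc, sr=sr)
        sf.write("utterance_from_mfcc.wav", y_hat, sr)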