
    Speech Decomposition and Enhancement

    The goal of this study is to investigate the roles of steady-state speech sounds, and of the transitions between these sounds, in the intelligibility of speech. The motivation for this approach is that the auditory system may be particularly sensitive to time-varying frequency edges, which in speech are produced primarily by transitions between vowels and consonants and within vowels. The possibility that selectively amplifying these edges may enhance speech intelligibility is examined. Computer algorithms were developed to decompose speech into two components. The first, defined as the tonal component, was intended to predominantly capture formant activity. The second, defined as the non-tonal component, was intended to predominantly capture transitions between and within formants.

    The decomposition uses a set of time-varying filters whose center frequencies and bandwidths are controlled to track the strongest formant components in speech. Each center frequency and bandwidth is estimated from the FM and AM information of the corresponding formant component. The tonal component is the sum of the filter outputs; the non-tonal component is the difference between the original speech signal and the tonal component.

    The relative energy and intelligibility of the tonal and non-tonal components were compared to those of the original speech, using psychoacoustic growth functions to assess intelligibility. Most of the speech energy was in the tonal component, but this component had a significantly lower maximum word recognition score than either the original speech or the non-tonal component. The non-tonal component averaged only 2% of the original speech energy, yet its maximum word recognition was nearly equal to that of the original speech. The non-tonal component was then amplified and recombined with the original speech to generate enhanced speech. The energy of the enhanced speech was adjusted to equal that of the original, and the intelligibility of the two was compared in background noise. The enhanced speech showed significantly higher recognition scores at lower SNRs; at higher SNRs, the original and enhanced speech showed similar recognition scores. These results suggest that amplification of transient information can enhance speech in noise, and that this enhancement method is most effective under severe noise conditions.
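    The enhancement step described above (amplify the residual, then renormalize energy) can be sketched as follows. This is a minimal illustration, not the study's implementation: the AM/FM-tracked time-varying filter bank that produces the tonal component is not reproduced here, and `tonal` is assumed to come from such a front end.

    ```python
    import numpy as np

    def enhance(speech, tonal, gain=2.0):
        """Amplify the non-tonal (transient) residual, then renormalize.

        speech : original signal; tonal : estimated formant (tonal) component.
        The gain value is an illustrative assumption, not the study's setting.
        """
        non_tonal = speech - tonal            # residual: transitions, frequency edges
        enhanced = speech + gain * non_tonal  # recombine amplified residual
        # adjust so the enhanced signal has the same energy as the original
        scale = np.sqrt(np.sum(speech ** 2) / np.sum(enhanced ** 2))
        return enhanced * scale
    ```

    The energy renormalization matters for the comparison: without it, any intelligibility gain could be attributed simply to the enhanced signal being louder.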

    Comparison of VQ and DTW classifiers for speaker verification

    © IEEE. Personal use of this material is permitted; permission to reprint or republish for other purposes must be obtained from the IEEE. An investigation into the relative speaker verification performance of various types of vector quantisation (VQ) and dynamic time warping (DTW) classifiers is presented. The study covers a number of algorithmic issues involved in these classifiers and examines their effects on verification accuracy. The experiments are based on a subset of the Brent (telephone-quality) speech database, consisting of repetitions of the isolated digit utterances 1 to 9 and zero. The paper describes the experimental work and presents an analysis of the results.
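    For readers unfamiliar with the two classifier families being compared, textbook sketches of each scoring rule are shown below. These are generic formulations, assuming Euclidean local costs; the paper's exact classifier configurations (codebook sizes, warping constraints) may differ.

    ```python
    import numpy as np

    def dtw_distance(a, b):
        """DTW distance between feature sequences a (n, d) and b (m, d),
        using the classic match/insert/delete recursion."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    def vq_distortion(frames, codebook):
        """Average distance from each frame to its nearest codebook vector:
        the usual VQ verification score against a speaker's codebook."""
        d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
        return d.min(axis=1).mean()
    ```

    The key design difference: DTW scores a test utterance against stored templates while preserving temporal order, whereas VQ discards temporal order and scores frames against an unordered codebook.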

    Text-Independent Automatic Speaker Identification Using Partitioned Neural Networks

    This dissertation introduces a binary partitioned approach to statistical pattern classification, applied to talker identification using neural networks. In recent years artificial neural networks have been shown to work exceptionally well for small but difficult pattern classification tasks. However, their application to large tasks (i.e., more than 10 to 20 categories) is limited by a dramatic increase in required training time: the time required to train a single network to perform N-way classification grows nearly exponentially with N. In contrast, the binary partitioned approach requires training time on the order of N². Beyond partitioning, related issues were also investigated, such as acoustic feature selection for speaker identification and neural network optimization. The binary partitioned approach was used to develop an automatic speaker identification system for 120 male and 130 female speakers from a standard speech database. The system performs with 100% accuracy in a text-independent mode when trained with about 9 to 14 seconds of speech and tested with 6 to 8 seconds of speech.
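    The N² scaling comes from training one small binary network per unordered speaker pair rather than one huge N-way network. A sketch of the counting and of a pairwise majority-vote decision rule is below; the `pair_models` callable interface is a hypothetical stand-in for the trained binary networks, not the dissertation's API.

    ```python
    from itertools import combinations

    def pairwise_classifier_count(n_speakers):
        # one binary classifier per unordered speaker pair: N(N-1)/2, i.e. O(N^2)
        return n_speakers * (n_speakers - 1) // 2

    def identify(x, pair_models, speakers):
        """Majority vote over pairwise binary classifiers.

        pair_models[(a, b)] is assumed to be a callable that returns the
        winning speaker label for feature vector x (hypothetical interface).
        """
        votes = {s: 0 for s in speakers}
        for a, b in combinations(speakers, 2):
            votes[pair_models[(a, b)](x)] += 1
        return max(votes, key=votes.get)
    ```

    For the 250 speakers used here (120 male + 130 female), this means 31,125 small pairwise classifiers, each of which trains quickly because it only separates two speakers.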

    Localization and Selection of Speaker Specific Information with Statistical Modeling

    Statistical modeling of the speech signal has been widely used in speaker recognition. The performance obtained with this type of modeling is excellent in the laboratory but decreases dramatically for telephone or noisy speech. Moreover, it is difficult to know which pieces of information the system actually takes into account. To address this problem and improve current systems, a better understanding of the nature of the information used by statistical methods is needed. This knowledge should make it possible to select only the relevant information or to add new sources of information. The first part of this paper presents experiments that aim to localize the most useful acoustic events for speaker recognition. The relation between discriminative ability and the nature of the speech events is studied; in particular, the phonetic content, the signal stability, and the frequency domain are explored. Finally, the potential of the dynamic information contained in the relation between a frame and its p neighbours is investigated. In the second part, the authors propose a new selection procedure designed to select the pertinent features. Conventional feature selection techniques (ascendant selection, knockout) provide only global, a posteriori knowledge about the relevance of an information source. However, some speech clusters may be very effective for recognizing a particular speaker while being uninformative for another. Moreover, some information classes may be corrupted or even missing under particular recording conditions. This necessity fo…
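    A common way to expose the dynamic information "between a frame and its p neighbours" is to stack each frame with its temporal context before modeling. The sketch below is a generic context-stacking formulation under that assumption; the paper's exact parameterization of dynamic information may differ.

    ```python
    import numpy as np

    def stack_context(frames, p):
        """Append each frame's p left and p right neighbours.

        frames : (T, d) array of per-frame features.
        Returns a (T, d * (2p + 1)) array; edge frames are padded by
        repeating the first/last frame.
        """
        T, d = frames.shape
        padded = np.pad(frames, ((p, p), (0, 0)), mode="edge")
        # column block k holds the frame at relative offset k - p
        return np.hstack([padded[i:i + T] for i in range(2 * p + 1)])
    ```

    A statistical model trained on the stacked vectors can then exploit inter-frame dynamics that are invisible to a model of independent frames.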

    Linguistically-constrained formant-based i-vectors for automatic speaker recognition

    This is the author’s version of a work that was accepted for publication in Speech Communication. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document. A definitive version was subsequently published in Speech Communication, Vol. 76 (2016), DOI 10.1016/j.specom.2015.11.002. This paper presents a large-scale study of the discriminative abilities of formant frequencies for automatic speaker recognition. Exploiting both the static and dynamic information in formant frequencies, we present linguistically-constrained formant-based i-vector systems providing well-calibrated likelihood ratios per comparison of occurrences of the same isolated linguistic units in two given utterances. As a first result, the reported analysis of the discriminative and calibration properties of the different linguistic units provides useful insights, for instance, to forensic phonetic practitioners. Furthermore, it is shown that the set of units which are most discriminative varies from speaker to speaker. Secondly, linguistically-constrained systems are combined at score level through average and logistic-regression speaker-independent fusion rules, exploiting the different speaker-distinguishing information spread among the linguistic units. Testing on the English-only trials of the core condition of the NIST 2006 SRE (24,000 voice comparisons of 5-minute telephone conversations from 517 speakers: 219 male and 298 female), we report equal error rates of 9.57% and 12.89% for male and female speakers respectively, using only formant frequencies as speaker-discriminative information.
    Additionally, when the formant-based system is fused with a cepstral i-vector system, we obtain relative improvements of ∼6% in EER (from 6.54% to 6.13%) and ∼15% in minDCF (from 0.0327 to 0.0279) compared to the cepstral system alone. This work has been supported by the Spanish Ministry of Economy and Competitiveness (project CMC-V2: Caracterizacion, Modelado y Compensacion de Variabilidad en la Señal de Voz, TEC2012-37585-C02-01). The authors would also like to thank SRI for providing the Decipher phonetic transcriptions of the NIST 2004, 2005 and 2006 SREs, which made this work possible.
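    The score-level fusion and the relative-improvement figures can be made concrete with a short sketch. The affine logistic-regression fusion form below is the standard one for this task; the specific weights are assumed to be trained on held-out data and are not taken from the paper.

    ```python
    import numpy as np

    def fuse(scores, weights, bias):
        """Logistic-regression (affine) score-level fusion.

        scores : (n_trials, n_systems) raw per-subsystem scores.
        weights, bias : fusion parameters, assumed trained on held-out
        data; the output behaves as a calibrated LLR-like score.
        """
        return scores @ np.asarray(weights) + bias

    def relative_improvement(baseline, fused):
        """Relative gain in percent, e.g. for EER or minDCF."""
        return 100.0 * (baseline - fused) / baseline
    ```

    Applying `relative_improvement` to the reported numbers reproduces the quoted gains: EER 6.54% → 6.13% is a ≈6% relative improvement, and minDCF 0.0327 → 0.0279 is a ≈15% relative improvement.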

    Automatic Detection of Parkinson’s Disease Using Modulation Components of Speech Signals

    Parkinson’s Disease (PD) is the second most common neurodegenerative disorder after Alzheimer’s disease. It mainly affects older adults, at a rate of about 2%, and about 89% of people diagnosed with PD also develop speech disorders. This has led the scientific community to investigate the information embedded in the speech signals of Parkinson’s patients, enabling not only diagnosis of the pathology but also follow-up of its evolution. In recent years, a large number of studies have focused on the automatic detection of voice-related pathologies, in order to make objective evaluations of the voice in a non-invasive manner. In cases where the pathology primarily affects the vibratory patterns of the vocal folds, as in Parkinson’s, the analyses are typically performed on sustained vowel phonations. This article proposes using the information in the slow and rapid variations of speech signals, also known as modulation components, combined with an effective dimensionality-reduction approach whose output serves as input to the classification system. The proposed approach achieves classification rates above 88%, surpassing the classical approach based on Mel-Frequency Cepstral Coefficients (MFCCs). The results show that the information extracted from slowly varying components is highly discriminative for the task at hand and could support assisted-diagnosis systems for PD.
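    One common way to obtain a slowly varying modulation component is to take the amplitude envelope of the analytic signal and smooth it. The sketch below uses an FFT-based analytic signal; it is a generic illustration of the "slow modulation component" idea, not the article's exact decomposition, and the smoothing window length is an assumption.

    ```python
    import numpy as np

    def analytic_signal(x):
        """Analytic signal via the FFT (one-sided spectrum doubling)."""
        N = len(x)
        X = np.fft.fft(x)
        h = np.zeros(N)
        h[0] = 1.0
        if N % 2 == 0:
            h[N // 2] = 1.0
            h[1:N // 2] = 2.0
        else:
            h[1:(N + 1) // 2] = 2.0
        return np.fft.ifft(X * h)

    def slow_envelope(x, win=64):
        """Slowly varying amplitude envelope of a voice signal:
        magnitude of the analytic signal, smoothed by a moving average."""
        env = np.abs(analytic_signal(x))
        kernel = np.ones(win) / win
        return np.convolve(env, kernel, mode="same")
    ```

    Features computed from such envelopes capture the slow vibratory irregularities that PD introduces, which is what makes them complementary to frame-level MFCCs.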