38 research outputs found

    Frame-level features conveying phonetic information for language and speaker recognition

    Get PDF
    150 p.This Thesis, developed in the Software Technologies Working Group of the Departmentof Electricity and Electronics of the University of the Basque Country, focuseson the research eld of spoken language and speaker recognition technologies.More specically, the research carried out studies the design of a set of featuresconveying spectral acoustic and phonotactic information, searches for the optimalfeature extraction parameters, and analyses the integration and usage of the featuresin language recognition systems, and the complementarity of these approacheswith regard to state-of-the-art systems. The study reveals that systems trained onthe proposed set of features, denoted as Phone Log-Likelihood Ratios (PLLRs), arehighly competitive, outperforming in several benchmarks other state-of-the-art systems.Moreover, PLLR-based systems also provide complementary information withregard to other phonotactic and acoustic approaches, which makes them suitable infusions to improve the overall performance of spoken language recognition systems.The usage of this features is also studied in speaker recognition tasks. In this context,the results attained by the approaches based on PLLR features are not as remarkableas the ones of systems based on standard acoustic features, but they still providecomplementary information that can be used to enhance the overall performance ofthe speaker recognition systems

    PHONOTACTIC AND ACOUSTIC LANGUAGE RECOGNITION

    Get PDF
    Práce pojednává o fonotaktickém a akustickém přístupu pro automatické rozpoznávání jazyka. První část práce pojednává o fonotaktickém přístupu založeném na výskytu fonémových sekvenci v řeči. Nejdříve je prezentován popis vývoje fonémového rozpoznávače jako techniky pro přepis řeči do sekvence smysluplných symbolů. Hlavní důraz je kladen na dobré natrénování fonémového rozpoznávače a kombinaci výsledků z několika fonémových rozpoznávačů trénovaných na různých jazycích (Paralelní fonémové rozpoznávání následované jazykovými modely (PPRLM)). Práce také pojednává o nové technice anti-modely v PPRLM a studuje použití fonémových grafů místo nejlepšího přepisu. Na závěr práce jsou porovnány dva přístupy modelování výstupu fonémového rozpoznávače -- standardní n-gramové jazykové modely a binární rozhodovací stromy. Hlavní přínos v akustickém přístupu je diskriminativní modelování cílových modelů jazyků a první experimenty s kombinací diskriminativního trénování a na příznacích, kde byl odstraněn vliv kanálu. Práce dále zkoumá různé druhy technik fúzi akustického a fonotaktického přístupu. Všechny experimenty jsou provedeny na standardních datech z NIST evaluaci konané v letech 2003, 2005 a 2007, takže jsou přímo porovnatelné s výsledky ostatních skupin zabývajících se automatickým rozpoznáváním jazyka. S fúzí uvedených technik jsme posunuli state-of-the-art výsledky a dosáhli vynikajících výsledků ve dvou NIST evaluacích.This thesis deals with phonotactic and acoustic techniques for automatic language recognition (LRE). The first part of the thesis deals with the phonotactic language recognition based on co-occurrences of phone sequences in speech. A thorough study of phone recognition as tokenization technique for LRE is done, with focus on the amounts of training data for phone recognizer and on the combination of phone recognizers trained on several language (Parallel Phone Recognition followed by Language Model - PPRLM). The thesis also deals with novel technique of anti-models in PPRLM and investigates into using phone lattices instead of strings. The work on phonotactic approach is concluded by a comparison of classical n-gram modeling techniques and binary decision trees. The acoustic LRE was addressed too, with the main focus on discriminative techniques for training target language acoustic models and on initial (but successful) experiments with removing channel dependencies. We have also investigated into the fusion of phonotactic and acoustic approaches. All experiments were performed on standard data from NIST 2003, 2005 and 2007 evaluations so that the results are directly comparable to other laboratories in the LRE community. With the above mentioned techniques, the fused systems defined the state-of-the-art in the LRE field and reached excellent results in NIST evaluations.

    Acoustic Approaches to Gender and Accent Identification

    Get PDF
    There has been considerable research on the problems of speaker and language recognition from samples of speech. A less researched problem is that of accent recognition. Although this is a similar problem to language identification, di�erent accents of a language exhibit more fine-grained di�erences between classes than languages. This presents a tougher problem for traditional classification techniques. In this thesis, we propose and evaluate a number of techniques for gender and accent classification. These techniques are novel modifications and extensions to state of the art algorithms, and they result in enhanced performance on gender and accent recognition. The first part of the thesis focuses on the problem of gender identification, and presents a technique that gives improved performance in situations where training and test conditions are mismatched. The bulk of this thesis is concerned with the application of the i-Vector technique to accent identification, which is the most successful approach to acoustic classification to have emerged in recent years. We show that it is possible to achieve high accuracy accent identification without reliance on transcriptions and without utilising phoneme recognition algorithms. The thesis describes various stages in the development of i-Vector based accent classification that improve the standard approaches usually applied for speaker or language identification, which are insu�cient. We demonstrate that very good accent identification performance is possible with acoustic methods by considering di�erent i-Vector projections, frontend parameters, i-Vector configuration parameters, and an optimised fusion of the resulting i-Vector classifiers we can obtain from the same data. We claim to have achieved the best accent identification performance on the test corpus for acoustic methods, with up to 90% identification rate. This performance is even better than previously reported acoustic-phonotactic based systems on the same corpus, and is very close to performance obtained via transcription based accent identification. Finally, we demonstrate that the utilization of our techniques for speech recognition purposes leads to considerably lower word error rates. Keywords: Accent Identification, Gender Identification, Speaker Identification, Gaussian Mixture Model, Support Vector Machine, i-Vector, Factor Analysis, Feature Extraction, British English, Prosody, Speech Recognition

    Phonotactic language recognition using i-vectors and phoneme posteriogram counts

    Get PDF
    This paper describes a novel approach to phonotactic LID, where instead of using soft-counts based on phoneme lattices, we use posteriogram to obtain n-gram counts. The high-dimensional vectors of counts are reduced to low-dimensional units for which we adapted the commonly used term i-vectors. The reduction is based on multinomial subspace modeling and is designed to work in the total-variability space. The proposed technique was tested on the NIST 2009 LRE set with better results to a system based on using soft-counts (Cavg on 30s: 3.15% vs 3.43%), and with very good results when fused with an acoustic i-vector LID system (Cavg on 30s acoustic 2.4% vs 1.25%). The proposed technique is also compared with another low dimensional projection system based on PCA. In comparison with the original soft-counts, the proposed technique provides better results, reduces the problems due to sparse counts, and avoids the process of using pruning techniques when creating the lattices

    Intersession Variability Compensation in Language and Speaker Identification

    Get PDF
    Variabilita kanálu a hovoru je velmi důležitým problémem v úloze rozpoznávání mluvčího. V současné době je ve velkém množství vědeckých článků uvedeno několik technik pro kompenzaci vlivu kanálu. Kompenzace vlivu kanálu může být implementována jak v doméně modelu, tak i v doménách příznaků i skóre. Relativně nová výkoná technika je takzvaná eigenchannel adaptace pro GMM (Gaussian Mixture Models). Mevýhodou této metody je nemožnost její aplikace na jiné klasifikátory, jako napřílad takzvané SVM (Support Vector Machines), GMM s různým počtem Gausových komponent nebo v rozpoznávání řeči s použitím skrytých markovových modelů (HMM). Řešením může být aproximace této metody, eigenchannel adaptace v doméně příznaků. Obě tyto techniky, eigenchannel adaptace v doméně modelu a doméně příznaků v systémech rozpoznávání mluvčího, jsou uvedeny v této práci. Po dosažení dobrých výsledků v rozpoznávání mluvčího, byl přínos těchto technik zkoumán pro akustický systém rozpoznávání jazyka zahrnující 14 jazyků. V této úloze má nežádoucí vliv nejen variabilita kanálu, ale i variabilita mluvčího. Výsledky jsou prezentovány na datech definovaných pro evaluaci rozpoznávání mluvčího z roku 2006 a evaluaci rozpoznávání jazyka v roce 2007, obě organizované Amerických Národním Institutem pro Standard a Technologie (NIST)Varibiality in the channel and session is an important issue in the text-independent speaker recognition task. To date, several techniques providing channel and session variability compensation were introduced in a number of scientic papers. Such implementation can be done in feature, model and score domain. Relatively new and powerful approach to remove channel distortion is so-called eigenchannel adaptation for Gaussian Mixture Models (GMM). The drawback of the technique is that it is not applicable in its original implementation to different types of classifiers, eg. Support Vector Machines (SVM), GMM with different number of Gaussians or in speech recognition task using Hidden Markov Models (HMM). The solution can be the approximation of the technique, eigenchannel adaptation in feature domain. Both, the original eigenchannel adaptation and eigenchannel adaptation on features in task of speaker recognition are presented. After achieving good results in speaker recognition, contribution of the same techniques was examined in acoustic language identification system with 1414 languages. In this task undesired factors are channel and speaker variability. Presented results are presented on the NIST Speaker Recognition Evaluation 2006 data and NIST Language Recognition Evaluation 2007 data.

    Subspace Gaussian Mixture Models for Language Identification and Dysarthric Speech Intelligibility Assessment

    Get PDF
    En esta Tesis se ha investigado la aplicación de técnicas de modelado de subespacios de mezclas de Gaussianas en dos problemas relacionados con las tecnologías del habla, como son la identificación automática de idioma (LID, por sus siglas en inglés) y la evaluación automática de inteligibilidad en el habla de personas con disartria. Una de las técnicas más importantes estudiadas es el análisis factorial conjunto (JFA, por sus siglas en inglés). JFA es, en esencia, un modelo de mezclas de Gaussianas en el que la media de cada componente se expresa como una suma de factores de dimensión reducida, y donde cada factor representa una contribución diferente a la señal de audio. Esta factorización nos permite compensar nuestros modelos frente a contribuciones indeseadas presentes en la señal, como la información de canal. JFA se ha investigado como clasficador y como extractor de parámetros. En esta última aproximación se modela un solo factor que representa todas las contribuciones presentes en la señal. Los puntos en este subespacio se denominan i-Vectors. Así, un i-Vector es un vector de baja dimensión que representa una grabación de audio. Los i-Vectors han resultado ser muy útiles como vector de características para representar señales en diferentes problemas relacionados con el aprendizaje de máquinas. En relación al problema de LID, se han investigado dos sistemas diferentes de acuerdo al tipo de información extraída de la señal. En el primero, la señal se parametriza en vectores acústicos con información espectral a corto plazo. En este caso, observamos mejoras de hasta un 50% con el sistema basado en i-Vectors respecto al sistema que utilizaba JFA como clasificador. Se comprobó que el subespacio de canal del modelo JFA también contenía información del idioma, mientras que con los i-Vectors no se descarta ningún tipo de información, y además, son útiles para mitigar diferencias entre los datos de entrenamiento y de evaluación. En la fase de clasificación, los i-Vectors de cada idioma se modelaron con una distribución Gaussiana en la que la matriz de covarianza era común para todos. Este método es simple y rápido, y no requiere de ningún post-procesado de los i-Vectors. En el segundo sistema, se introdujo el uso de información prosódica y formántica en un sistema de LID basado en i-Vectors. La precisión de éste estaba por debajo de la del sistema acústico. Sin embargo, los dos sistemas son complementarios, y se obtuvo hasta un 20% de mejora con la fusión de los dos respecto al sistema acústico solo. Tras los buenos resultados obtenidos para LID, y dado que, teóricamente, los i-Vectors capturan toda la información presente en la señal, decidimos usarlos para la evaluar de manera automática la inteligibilidad en el habla de personas con disartria. Los logopedas están muy interesados en esta tecnología porque permitiría evaluar a sus pacientes de una manera objetiva y consistente. En este caso, los i-Vectors se obtuvieron a partir de información espectral a corto plazo de la señal, y la inteligibilidad se calculó a partir de los i-Vectors obtenidos para un conjunto de palabras dichas por el locutor evaluado. Comprobamos que los resultados eran mucho mejores si en el entrenamiento del sistema se incorporaban datos de la persona que iba a ser evaluada. No obstante, esta limitación podría aliviarse utilizando una mayor cantidad de datos para entrenar el sistema.In this Thesis, we investigated how to effciently apply subspace Gaussian mixture modeling techniques onto two speech technology problems, namely automatic spoken language identification (LID) and automatic intelligibility assessment of dysarthric speech. One of the most important of such techniques in this Thesis was joint factor analysis (JFA). JFA is essentially a Gaussian mixture model where the mean of the components is expressed as a sum of low-dimension factors that represent different contributions to the speech signal. This factorization makes it possible to compensate for undesired sources of variability, like the channel. JFA was investigated as final classiffer and as feature extractor. In the latter approach, a single subspace including all sources of variability is trained, and points in this subspace are known as i-Vectors. Thus, one i-Vector is defined as a low-dimension representation of a single utterance, and they are a very powerful feature for different machine learning problems. We have investigated two different LID systems according to the type of features extracted from speech. First, we extracted acoustic features representing short-time spectral information. In this case, we observed relative improvements with i-Vectors with respect to JFA of up to 50%. We realized that the channel subspace in a JFA model also contains language information whereas i-Vectors do not discard any language information, and moreover, they help to reduce mismatches between training and testing data. For classification, we modeled the i-Vectors of each language with a Gaussian distribution with covariance matrix shared among languages. This method is simple and fast, and it worked well without any post-processing. Second, we introduced the use of prosodic and formant information with the i-Vectors system. The performance was below the acoustic system but both were found to be complementary and we obtained up to a 20% relative improvement with the fusion with respect to the acoustic system alone. Given the success in LID and the fact that i-Vectors capture all the information that is present in the data, we decided to use i-Vectors for other tasks, specifically, the assessment of speech intelligibility in speakers with different types of dysarthria. Speech therapists are very interested in this technology because it would allow them to objectively and consistently rate the intelligibility of their patients. In this case, the input features were extracted from short-term spectral information, and the intelligibility was assessed from the i-Vectors calculated from a set of words uttered by the tested speaker. We found that the performance was clearly much better if we had available data for training of the person that would use the application. We think that this limitation could be relaxed if we had larger databases for training. However, the recording process is not easy for people with disabilities, and it is difficult to obtain large datasets of dysarthric speakers open to the research community. Finally, the same system architecture for intelligibility assessment based on i-Vectors was used for predicting the accuracy that an automatic speech recognizer (ASR) system would obtain with dysarthric speakers. The only difference between both was the ground truth label set used for training. Predicting the performance response of an ASR system would increase the confidence of speech therapists in these systems and would diminish health related costs. The results were not as satisfactory as in the previous case, probably because an ASR is a complex system whose accuracy can be very difficult to be predicted only with acoustic information. Nonetheless, we think that we opened a door to an interesting research direction for the two problems

    Supervector pre-processing for PRSVM-based Chinese and Arabic dialect identification

    Full text link

    Acoustic model selection for recognition of regional accented speech

    Get PDF
    Accent is cited as an issue for speech recognition systems. Our experiments showed that the ASR word error rate is up to seven times greater for accented speech compared with standard British English. The main objective of this research is to develop Automatic Speech Recognition (ASR) techniques that are robust to accent variation. We applied different acoustic modelling techniques to compensate for the effects of regional accents on the ASR performance. For conventional GMM-HMM based ASR systems, we showed that using a small amount of data from a test speaker to choose an accent dependent model using an accent identification system, or building a model using the data from N neighbouring speakers in AID space, will result in superior performance compared to that obtained with unsupervised or supervised speaker adaptation. In addition we showed that using a DNN-HMM rather than a GMM-HMM based acoustic model would improve the recognition accuracy considerably. Even if we apply two stages of accent followed by speaker adaptation to the GMM-HMM baseline system, the GMM-HMM based system will not outperform the baseline DNN-HMM based system. For more contemporary DNN-HMM based ASR systems we investigated how adding different types of accented data to the training set can provide better recognition accuracy on accented speech. Finally, we proposed a new approach for visualisation of the AID feature space. This is helpful in analysing the AID recognition accuracies and analysing AID confusion matrices

    Unsupervised crosslingual adaptation of tokenisers for spoken language recognition

    Get PDF
    Phone tokenisers are used in spoken language recognition (SLR) to obtain elementary phonetic information. We present a study on the use of deep neural network tokenisers. Unsupervised crosslingual adaptation was performed to adapt the baseline tokeniser trained on English conversational telephone speech data to different languages. Two training and adaptation approaches, namely cross-entropy adaptation and state-level minimum Bayes risk adaptation, were tested in a bottleneck i-vector and a phonotactic SLR system. The SLR systems using the tokenisers adapted to different languages were combined using score fusion, giving 7-18% reduction in minimum detection cost function (minDCF) compared with the baseline configurations without adapted tokenisers. Analysis of results showed that the ensemble tokenisers gave diverse representation of phonemes, thus bringing complementary effects when SLR systems with different tokenisers were combined. SLR performance was also shown to be related to the quality of the adapted tokenisers

    Knowledge Modelling and Learning through Cognitive Networks

    Get PDF
    One of the most promising developments in modelling knowledge is cognitive network science, which aims to investigate cognitive phenomena driven by the networked, associative organization of knowledge. For example, investigating the structure of semantic memory via semantic networks has illuminated how memory recall patterns influence phenomena such as creativity, memory search, learning, and more generally, knowledge acquisition, exploration, and exploitation. In parallel, neural network models for artificial intelligence (AI) are also becoming more widespread as inferential models for understanding which features drive language-related phenomena such as meaning reconstruction, stance detection, and emotional profiling. Whereas cognitive networks map explicitly which entities engage in associative relationships, neural networks perform an implicit mapping of correlations in cognitive data as weights, obtained after training over labelled data and whose interpretation is not immediately evident to the experimenter. This book aims to bring together quantitative, innovative research that focuses on modelling knowledge through cognitive and neural networks to gain insight into mechanisms driving cognitive processes related to knowledge structuring, exploration, and learning. The book comprises a variety of publication types, including reviews and theoretical papers, empirical research, computational modelling, and big data analysis. All papers here share a commonality: they demonstrate how the application of network science and AI can extend and broaden cognitive science in ways that traditional approaches cannot
    corecore