Search CORE

401 research outputs found

Language Identification Using Visual Features

Author: Cox S
Newman J
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 11/04/2012
Field of study

Automatic visual language identification (VLID) is the technology of using information derived from the visual appearance and movement of the speech articulators to iden- tify the language being spoken, without the use of any audio information. This technique for language identification (LID) is useful in situations in which conventional audio processing is ineffective (very noisy environments), or impossible (no audio signal is available). Research in this field is also beneficial in the related field of automatic lip-reading. This paper introduces several methods for visual language identification (VLID). They are based upon audio LID techniques, which exploit language phonology and phonotactics to discriminate languages. We show that VLID is possible in a speaker-dependent mode by discrimi- nating different languages spoken by an individual, and we then extend the technique to speaker-independent operation, taking pains to ensure that discrimination is not due to artefacts, either visual (e.g. skin-tone) or audio (e.g. rate of speaking). Although the low accuracy of visual speech recognition currently limits the performance of VLID, we can obtain an error-rate of < 10% in discriminating between Arabic and English on 19 speakers and using about 30s of visual speech

Crossref

University of East Anglia digital repository

Recommended from our members

Evaluation and analysis of hybrid intelligent pattern recognition techniques for speaker identification

Author: Almaadeed Noor
Publication venue: Brunel University School of Engineering and Design PhD Theses
Publication date: 01/01/2014
Field of study

This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.The rapid momentum of the technology progress in the recent years has led to a tremendous rise in the use of biometric authentication systems. The objective of this research is to investigate the problem of identifying a speaker from its voice regardless of the content (i.e. text-independent), and to design efficient methods of combining face and voice in producing a robust authentication system. A novel approach towards speaker identification is developed using wavelet analysis, and multiple neural networks including Probabilistic Neural Network (PNN), General Regressive Neural Network (GRNN)and Radial Basis Function-Neural Network (RBF NN) with the AND voting scheme. This approach is tested on GRID and VidTIMIT cor-pora and comprehensive test results have been validated with state- of-the-art approaches. The system was found to be competitive and it improved the recognition rate by 15% as compared to the classical Mel-frequency Cepstral Coe±cients (MFCC), and reduced the recognition time by 40% compared to Back Propagation Neural Network (BPNN), Gaussian Mixture Models (GMM) and Principal Component Analysis (PCA). Another novel approach using vowel formant analysis is implemented using Linear Discriminant Analysis (LDA). Vowel formant based speaker identification is best suitable for real-time implementation and requires only a few bytes of information to be stored for each speaker, making it both storage and time efficient. Tested on GRID and Vid-TIMIT, the proposed scheme was found to be 85.05% accurate when Linear Predictive Coding (LPC) is used to extract the vowel formants, which is much higher than the accuracy of BPNN and GMM. Since the proposed scheme does not require any training time other than creating a small database of vowel formants, it is faster as well. Furthermore, an increasing number of speakers makes it di±cult for BPNN and GMM to sustain their accuracy, but the proposed score-based methodology stays almost linear. Finally, a novel audio-visual fusion based identification system is implemented using GMM and MFCC for speaker identi¯cation and PCA for face recognition. The results of speaker identification and face recognition are fused at different levels, namely the feature, score and decision levels. Both the score-level and decision-level (with OR voting) fusions were shown to outperform the feature-level fusion in terms of accuracy and error resilience. The result is in line with the distinct nature of the two modalities which lose themselves when combined at the feature-level. The GRID and VidTIMIT test results validate that the proposed scheme is one of the best candidates for the fusion of face and voice due to its low computational time and high recognition accuracy

Brunel University Research Archive

Feature extraction based on bio-inspired model for robust emotion recognition

Author: A Batliner
A Kapoor
AI Iliev
B Schuller
B Yang
C Clavel
C Martínez
C Martínez
D Giakoumis
D Morrison
Diego H. Milone
EM Albornoz
Enrique M. Albornoz
G Chanel
Hugo L. Rufiner
I Luengo Gil
J Adell Mercado
J Kim
J Kim
JR Deller Jr
K Schindler
KP Truong
M Ayadi El
M Wöllmer
N Cummins
N Mesgarani
S Koolagudi
S Shojaeilangari
S Yildirim
SA Shamma
T Chi
X Yang
Y Wang
Z Zeng
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/03/2016
Field of study

Emotional state identification is an important issue to achieve more natural speech interactive systems. Ideally, these systems should also be able to work in real environments in which generally exist some kind of noise. Several bio-inspired representations have been applied to artificial systems for speech processing under noise conditions. In this work, an auditory signal representation is used to obtain a novel bio-inspired set of features for emotional speech signals. These characteristics, together with other spectral and prosodic features, are used for emotion recognition under noise conditions. Neural models were trained as classifiers and results were compared to the well-known mel-frequency cepstral coefficients. Results show that using the proposed representations, it is possible to significantly improve the robustness of an emotion recognition system. The results were also validated in a speaker independent scheme and with two emotional speech corpora.Fil: Albornoz, Enrique Marcelo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; ArgentinaFil: Milone, Diego Humberto. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; ArgentinaFil: Rufiner, Hugo Leonardo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentin

Crossref

CONICET Digital

On the development of an automatic voice pleasantness classiﬁcation and intensity estimation system

Author: Braga Daniela
Coelho Luís
Garcia-Mateo Carmen
Sales-Dias Miguel
Publication venue: 'Elsevier BV'
Publication date: 01/01/2013
Field of study

In the last few years, the number of systems and devices that use voice based interaction has grown signiﬁcantly. For a continued use of these systems, the interface must be reliable and pleasant in order to provide an optimal user experience. However there are currently very few studies that try to evaluate how pleasant is a voice from a perceptual point of view when the ﬁnal application is a speech based interface. In this paper we present an objective deﬁnition for voice pleasantness based on the composition of a representative feature subset and a new automatic voice pleasantness classiﬁcation and intensity estimation system. Our study is based on a database composed by European Portuguese female voices but the methodology can be extended to male voices or to other languages. In the objective performance evaluation the system achieved a 9.1% error rate for voice pleasantness classiﬁcation and a 15.7% error rate for voice pleasantness intensity estimation.Work partially supported by ERDF funds, the Spanish Government (TEC2009-14094-C04-04), and Xunta de Galicia (CN2011/019, 2009/062

Repositório Científico do Instituto Politécnico do Porto

Repositório Institucional do ISCTE-IUL

Session varaibility compensation in automatic speaker and language recognition

Author: González Domínguez Javier
Publication venue
Publication date: 01/01/2011
Field of study

Tesis doctoral inédita. Universidad Autónoma de Madrid, Escuela Politécnica Superior, octubre de 201

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Biblos-e Archivo

Acoustic Approaches to Gender and Accent Identification

Author: DeMarco Andrea
Publication venue
Publication date: 01/06/2015
Field of study

There has been considerable research on the problems of speaker and language recognition from samples of speech. A less researched problem is that of accent recognition. Although this is a similar problem to language identification, di�erent accents of a language exhibit more fine-grained di�erences between classes than languages. This presents a tougher problem for traditional classification techniques. In this thesis, we propose and evaluate a number of techniques for gender and accent classification. These techniques are novel modifications and extensions to state of the art algorithms, and they result in enhanced performance on gender and accent recognition. The first part of the thesis focuses on the problem of gender identification, and presents a technique that gives improved performance in situations where training and test conditions are mismatched. The bulk of this thesis is concerned with the application of the i-Vector technique to accent identification, which is the most successful approach to acoustic classification to have emerged in recent years. We show that it is possible to achieve high accuracy accent identification without reliance on transcriptions and without utilising phoneme recognition algorithms. The thesis describes various stages in the development of i-Vector based accent classification that improve the standard approaches usually applied for speaker or language identification, which are insu�cient. We demonstrate that very good accent identification performance is possible with acoustic methods by considering di�erent i-Vector projections, frontend parameters, i-Vector configuration parameters, and an optimised fusion of the resulting i-Vector classifiers we can obtain from the same data. We claim to have achieved the best accent identification performance on the test corpus for acoustic methods, with up to 90% identification rate. This performance is even better than previously reported acoustic-phonotactic based systems on the same corpus, and is very close to performance obtained via transcription based accent identification. Finally, we demonstrate that the utilization of our techniques for speech recognition purposes leads to considerably lower word error rates. Keywords: Accent Identification, Gender Identification, Speaker Identification, Gaussian Mixture Model, Support Vector Machine, i-Vector, Factor Analysis, Feature Extraction, British English, Prosody, Speech Recognition

University of East Anglia digital repository

Multilevel and session variability compensated language recognition: ATVS-UAM systems at NIST LRE 2009

Author: Franco-Pedroso Javier
González Domínguez Javier
González-Rodríguez Joaquín
López Moreno Ignacio
Ramos Daniel
Toledano Doroteo T.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2010
Field of study

Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. J. Gonzalez-Dominguez, I. Lopez-Moreno, J. Franco-Pedroso, D. Ramos, D. T. Toledano, and J. Gonzalez-Rodriguez, "Multilevel and Session Variability Compensated Language Recognition: ATVS-UAM Systems at NIST LRE 2009" IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 6, pp. 1084 – 1093, December 2010This work presents the systems submitted by the ATVS Biometric Recognition Group to the 2009 Language Recognition Evaluation (LRE’09), organized by NIST. New challenges included in this LRE edition can be summarized by three main differences with respect to past evaluations. Firstly, the number of languages to be recognized expanded to 23 languages from 14 in 2007, and 7 in 2005. Secondly, the data variability has been increased by including telephone speech excerpts extracted from Voice of America (VOA) radio broadcasts through Internet in addition to Conversational Telephone Speech (CTS). The third difference was the volume of data, involving in this evaluation up to 2 terabytes of speech data for development, which is an order of magnitude greater than past evaluations. LRE’09 thus required participants to develop robust systems able not only to successfully face the session variability problem but also to do it with reasonable computational resources. ATVS participation consisted of state-of-the-art acoustic and high-level systems focussing on these issues. Furthermore, the problem of finding a proper combination and calibration of the information obtained at different levels of the speech signal was widely explored in this submission. In this work, two original contributions were developed. The first contribution was applying a session variability compensation scheme based on Factor Analysis (FA) within the statistics domain into a SVM-supervector (SVM-SV) approach. The second contribution was the employment of a novel backend based on anchor models in order to fuse individual systems prior to one-vs-all calibration via logistic regression. Results both in development and evaluation corpora show the robustness and excellent performance of the submitted systems, exemplified by our system ranked 2nd in the 30 second open-set condition, with remarkably scarce computational resources.This work has been supported by the Spanish Ministry of Education under project TEC2006-13170-C02-01. Javier Gonzalez-Dominguez also thanks Spanish Ministry of Education for supporting his doctoral research under project TEC2006-13141-C03-03. Special thanks are given to Dr. David Van Leeuwen from TNO Human Factors (Utrech, The Netherlands) for his strong collaboration, valuable discussions and ideas. Also, authors thank to Dr. Patrick Lucey for his final support on (non-target) Australian English review of the manuscript

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Biblos-e Archivo

Subspace Gaussian Mixture Models for Language Identification and Dysarthric Speech Intelligibility Assessment

Author: Lleida Solano Eduardo
Martínez González David
Miguel Artiaga Antonio
Publication venue: Universidad de Zaragoza, Prensas de la Universidad
Publication date: 01/01/2015
Field of study

En esta Tesis se ha investigado la aplicación de técnicas de modelado de subespacios de mezclas de Gaussianas en dos problemas relacionados con las tecnologías del habla, como son la identificación automática de idioma (LID, por sus siglas en inglés) y la evaluación automática de inteligibilidad en el habla de personas con disartria. Una de las técnicas más importantes estudiadas es el análisis factorial conjunto (JFA, por sus siglas en inglés). JFA es, en esencia, un modelo de mezclas de Gaussianas en el que la media de cada componente se expresa como una suma de factores de dimensión reducida, y donde cada factor representa una contribución diferente a la señal de audio. Esta factorización nos permite compensar nuestros modelos frente a contribuciones indeseadas presentes en la señal, como la información de canal. JFA se ha investigado como clasficador y como extractor de parámetros. En esta última aproximación se modela un solo factor que representa todas las contribuciones presentes en la señal. Los puntos en este subespacio se denominan i-Vectors. Así, un i-Vector es un vector de baja dimensión que representa una grabación de audio. Los i-Vectors han resultado ser muy útiles como vector de características para representar señales en diferentes problemas relacionados con el aprendizaje de máquinas. En relación al problema de LID, se han investigado dos sistemas diferentes de acuerdo al tipo de información extraída de la señal. En el primero, la señal se parametriza en vectores acústicos con información espectral a corto plazo. En este caso, observamos mejoras de hasta un 50% con el sistema basado en i-Vectors respecto al sistema que utilizaba JFA como clasificador. Se comprobó que el subespacio de canal del modelo JFA también contenía información del idioma, mientras que con los i-Vectors no se descarta ningún tipo de información, y además, son útiles para mitigar diferencias entre los datos de entrenamiento y de evaluación. En la fase de clasificación, los i-Vectors de cada idioma se modelaron con una distribución Gaussiana en la que la matriz de covarianza era común para todos. Este método es simple y rápido, y no requiere de ningún post-procesado de los i-Vectors. En el segundo sistema, se introdujo el uso de información prosódica y formántica en un sistema de LID basado en i-Vectors. La precisión de éste estaba por debajo de la del sistema acústico. Sin embargo, los dos sistemas son complementarios, y se obtuvo hasta un 20% de mejora con la fusión de los dos respecto al sistema acústico solo. Tras los buenos resultados obtenidos para LID, y dado que, teóricamente, los i-Vectors capturan toda la información presente en la señal, decidimos usarlos para la evaluar de manera automática la inteligibilidad en el habla de personas con disartria. Los logopedas están muy interesados en esta tecnología porque permitiría evaluar a sus pacientes de una manera objetiva y consistente. En este caso, los i-Vectors se obtuvieron a partir de información espectral a corto plazo de la señal, y la inteligibilidad se calculó a partir de los i-Vectors obtenidos para un conjunto de palabras dichas por el locutor evaluado. Comprobamos que los resultados eran mucho mejores si en el entrenamiento del sistema se incorporaban datos de la persona que iba a ser evaluada. No obstante, esta limitación podría aliviarse utilizando una mayor cantidad de datos para entrenar el sistema.In this Thesis, we investigated how to effciently apply subspace Gaussian mixture modeling techniques onto two speech technology problems, namely automatic spoken language identification (LID) and automatic intelligibility assessment of dysarthric speech. One of the most important of such techniques in this Thesis was joint factor analysis (JFA). JFA is essentially a Gaussian mixture model where the mean of the components is expressed as a sum of low-dimension factors that represent different contributions to the speech signal. This factorization makes it possible to compensate for undesired sources of variability, like the channel. JFA was investigated as final classiffer and as feature extractor. In the latter approach, a single subspace including all sources of variability is trained, and points in this subspace are known as i-Vectors. Thus, one i-Vector is defined as a low-dimension representation of a single utterance, and they are a very powerful feature for different machine learning problems. We have investigated two different LID systems according to the type of features extracted from speech. First, we extracted acoustic features representing short-time spectral information. In this case, we observed relative improvements with i-Vectors with respect to JFA of up to 50%. We realized that the channel subspace in a JFA model also contains language information whereas i-Vectors do not discard any language information, and moreover, they help to reduce mismatches between training and testing data. For classification, we modeled the i-Vectors of each language with a Gaussian distribution with covariance matrix shared among languages. This method is simple and fast, and it worked well without any post-processing. Second, we introduced the use of prosodic and formant information with the i-Vectors system. The performance was below the acoustic system but both were found to be complementary and we obtained up to a 20% relative improvement with the fusion with respect to the acoustic system alone. Given the success in LID and the fact that i-Vectors capture all the information that is present in the data, we decided to use i-Vectors for other tasks, specifically, the assessment of speech intelligibility in speakers with different types of dysarthria. Speech therapists are very interested in this technology because it would allow them to objectively and consistently rate the intelligibility of their patients. In this case, the input features were extracted from short-term spectral information, and the intelligibility was assessed from the i-Vectors calculated from a set of words uttered by the tested speaker. We found that the performance was clearly much better if we had available data for training of the person that would use the application. We think that this limitation could be relaxed if we had larger databases for training. However, the recording process is not easy for people with disabilities, and it is difficult to obtain large datasets of dysarthric speakers open to the research community. Finally, the same system architecture for intelligibility assessment based on i-Vectors was used for predicting the accuracy that an automatic speech recognizer (ASR) system would obtain with dysarthric speakers. The only difference between both was the ground truth label set used for training. Predicting the performance response of an ASR system would increase the confidence of speech therapists in these systems and would diminish health related costs. The results were not as satisfactory as in the previous case, probably because an ASR is a complex system whose accuracy can be very difficult to be predicted only with acoustic information. Nonetheless, we think that we opened a door to an interesting research direction for the two problems

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio Universidad de Zaragoza

Toward a personalized real-time diagnosis in neonatal seizure detection

Author: Boylan Geraldine B.
Lightbody Gordon
Marnane William P.
Mathieson Sean
Sarkar Achintya K. R.
Temko Andriy
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 11/09/2017
Field of study

The problem of creating a personalized seizure detection algorithm for newborns is tackled in this paper. A probabilistic framework for semi-supervised adaptation of a generic patient-independent neonatal seizure detector is proposed. A system that is based on a combination of patient-adaptive (generative) and patient-independent (discriminative) classifiers is designed and evaluated on a large database of unedited continuous multichannel neonatal EEG recordings of over 800 h in duration. It is shown that an improvement in the detection of neonatal seizures over the course of long EEG recordings is achievable with on-the-fly incorporation of patient-specific EEG characteristics. In the clinical setting, the employment of the developed system will maintain a seizure detection rate at 70% while halving the number of false detections per hour, from 0.4 to 0.2 FD/h. This is the first study to propose the use of online adaptation without clinical labels, to build a personalized diagnostic system for the detection of neonatal seizures

Irish Universities

VBN

Cork Open Research Archive