448 research outputs found
Code-Switched Urdu ASR for Noisy Telephonic Environment using Data Centric Approach with Hybrid HMM and CNN-TDNN
Call Centers have huge amount of audio data which can be used for achieving
valuable business insights and transcription of phone calls is manually tedious
task. An effective Automated Speech Recognition system can accurately
transcribe these calls for easy search through call history for specific
context and content allowing automatic call monitoring, improving QoS through
keyword search and sentiment analysis. ASR for Call Center requires more
robustness as telephonic environment are generally noisy. Moreover, there are
many low-resourced languages that are on verge of extinction which can be
preserved with help of Automatic Speech Recognition Technology. Urdu is the
most widely spoken language in the world, with 231,295,440 worldwide
still remains a resource constrained language in ASR. Regional call-center
conversations operate in local language, with a mix of English numbers and
technical terms generally causing a "code-switching" problem. Hence, this paper
describes an implementation framework of a resource efficient Automatic Speech
Recognition/ Speech to Text System in a noisy call-center environment using
Chain Hybrid HMM and CNN-TDNN for Code-Switched Urdu Language. Using Hybrid
HMM-DNN approach allowed us to utilize the advantages of Neural Network with
less labelled data. Adding CNN with TDNN has shown to work better in noisy
environment due to CNN's additional frequency dimension which captures extra
information from noisy speech, thus improving accuracy. We collected data from
various open sources and labelled some of the unlabelled data after analysing
its general context and content from Urdu language as well as from commonly
used words from other languages, primarily English and were able to achieve WER
of 5.2% with noisy as well as clean environment in isolated words or numbers as
well as in continuous spontaneous speech.Comment: 32 pages, 19 figures, 2 tables, preprin
Automatic Speaker Identification System for Urdu Speech
Speaker recognition is the process of recognizing a speaker from a verbal phrase. Such systems generally operates in two ways: to identify a speaker or to verify speaker’s claimed identity. Availability of valuable research material witnessed efforts paid to Automatic Speaker Identification (ASI) in East Asian, English and European languages. But unfortunately languages of South Asia especially “Urdu” have got very less attention. This paper aims to describe a new feature set for ASI in Urdu speech, achieving improved performance than baseline systems. Classifiers like Neural Net, Naïve Bayes and K nearest neighbor (K-NN) have been used for modeling. Results are provided on the dataset of 40 speakers with 82% correct identification. Lastly, improvement in system performance is also reported by changing number of recordings per speaker
Recommended from our members
Speaker recognition with hybrid features from a deep belief network
Learning representation from audio data has shown advantages over the handcrafted features such as mel-frequency cepstral coefficients (MFCCs) in many audio applications. In most of the representation learning approaches, the connectionist systems have been used to learn and extract latent features from the fixed length data. In this paper, we propose an approach to combine the learned features and the MFCC features for speaker recognition task, which can be applied to audio scripts of different lengths. In particular, we study the use of features from different levels of deep belief network for quantizing the audio data into vectors of audio word counts. These vectors represent the audio scripts of different lengths that make them easier to train a classifier. We show in the experiment that the audio word count vectors generated from mixture of DBN features at different layers give better performance than the MFCC features. We also can achieve further improvement by combining the audio word count vector and the MFCC features
Cross-Corpus Multilingual Speech Emotion Recognition: Amharic vs. Other Languages
In a conventional Speech emotion recognition (SER) task, a classifier for a
given language is trained on a pre-existing dataset for that same language.
However, where training data for a language does not exist, data from other
languages can be used instead. We experiment with cross-lingual and
multilingual SER, working with Amharic, English, German and URDU. For Amharic,
we use our own publicly-available Amharic Speech Emotion Dataset (ASED). For
English, German and Urdu we use the existing RAVDESS, EMO-DB and URDU datasets.
We followed previous research in mapping labels for all datasets to just two
classes, positive and negative. Thus we can compare performance on different
languages directly, and combine languages for training and testing. In
Experiment 1, monolingual SER trials were carried out using three classifiers,
AlexNet, VGGE (a proposed variant of VGG), and ResNet50. Results averaged for
the three models were very similar for ASED and RAVDESS, suggesting that
Amharic and English SER are equally difficult. Similarly, German SER is more
difficult, and Urdu SER is easier. In Experiment 2, we trained on one language
and tested on another, in both directions for each pair: AmharicGerman,
AmharicEnglish, and AmharicUrdu. Results with Amharic as target suggested
that using English or German as source will give the best result. In Experiment
3, we trained on several non-Amharic languages and then tested on Amharic. The
best accuracy obtained was several percent greater than the best accuracy in
Experiment 2, suggesting that a better result can be obtained when using two or
three non-Amharic languages for training than when using just one non-Amharic
language. Overall, the results suggest that cross-lingual and multilingual
training can be an effective strategy for training a SER classifier when
resources for a language are scarce.Comment: 16 pages, 9 tables, 5 figure
Arabic Speaker-Independent Continuous Automatic Speech Recognition Based on a Phonetically Rich and Balanced Speech Corpus
This paper describes and proposes an efficient and effective framework for the design and development of a
speaker-independent continuous automatic Arabic speech recognition system based on a phonetically rich and balanced
speech corpus. The speech corpus contains a total of 415 sentences recorded by 40 (20 male and 20 female) Arabic native
speakers from 11 different Arab countries representing the three major regions (Levant, Gulf, and Africa) in the Arab world.
The proposed Arabic speech recognition system is based on the Carnegie Mellon University (CMU) Sphinx tools, and the
Cambridge HTK tools were also used at some testing stages. The speech engine uses 3-emitting state Hidden Markov Models
(HMM) for tri-phone based acoustic models. Based on experimental analysis of about 7 hours of training speech data, the
acoustic model is best using continuous observation’s probability model of 16 Gaussian mixture distributions and the state
distributions were tied to 500 senones. The language model contains both bi-grams and tri-grams. For similar speakers but
different sentences, the system obtained a word recognition accuracy of 92.67% and 93.88% and a Word Error Rate (WER) of
11.27% and 10.07% with and without diacritical marks respectively. For different speakers with similar sentences, the system
obtained a word recognition accuracy of 95.92% and 96.29% and a WER of 5.78% and 5.45% with and without diacritical
marks respectively. Whereas different speakers and different sentences, the system obtained a word recognition accuracy of
89.08% and 90.23% and a WER of 15.59% and 14.44% with and without diacritical marks respectively
Subspace Gaussian Mixture Models for Language Identification and Dysarthric Speech Intelligibility Assessment
En esta Tesis se ha investigado la aplicación de técnicas de modelado de subespacios de mezclas de Gaussianas en dos problemas relacionados con las tecnologías del habla, como son la identificación automática de idioma (LID, por sus siglas en inglés) y la evaluación automática de inteligibilidad en el habla de personas con disartria. Una de las técnicas más importantes estudiadas es el análisis factorial conjunto (JFA, por sus siglas en inglés). JFA es, en esencia, un modelo de mezclas de Gaussianas en el que la media de cada componente se expresa como una suma de factores de dimensión reducida, y donde cada factor representa una contribución diferente a la señal de audio. Esta factorización nos permite compensar nuestros modelos frente a contribuciones indeseadas presentes en la señal, como la información de canal. JFA se ha investigado como clasficador y como extractor de parámetros. En esta última aproximación se modela un solo factor que representa todas las contribuciones presentes en la señal. Los puntos en este subespacio se denominan i-Vectors. Así, un i-Vector es un vector de baja dimensión que representa una grabación de audio. Los i-Vectors han resultado ser muy útiles como vector de características para representar señales en diferentes problemas relacionados con el aprendizaje de máquinas. En relación al problema de LID, se han investigado dos sistemas diferentes de acuerdo al tipo de información extraída de la señal. En el primero, la señal se parametriza en vectores acústicos con información espectral a corto plazo. En este caso, observamos mejoras de hasta un 50% con el sistema basado en i-Vectors respecto al sistema que utilizaba JFA como clasificador. Se comprobó que el subespacio de canal del modelo JFA también contenía información del idioma, mientras que con los i-Vectors no se descarta ningún tipo de información, y además, son útiles para mitigar diferencias entre los datos de entrenamiento y de evaluación. En la fase de clasificación, los i-Vectors de cada idioma se modelaron con una distribución Gaussiana en la que la matriz de covarianza era común para todos. Este método es simple y rápido, y no requiere de ningún post-procesado de los i-Vectors. En el segundo sistema, se introdujo el uso de información prosódica y formántica en un sistema de LID basado en i-Vectors. La precisión de éste estaba por debajo de la del sistema acústico. Sin embargo, los dos sistemas son complementarios, y se obtuvo hasta un 20% de mejora con la fusión de los dos respecto al sistema acústico solo. Tras los buenos resultados obtenidos para LID, y dado que, teóricamente, los i-Vectors capturan toda la información presente en la señal, decidimos usarlos para la evaluar de manera automática la inteligibilidad en el habla de personas con disartria. Los logopedas están muy interesados en esta tecnología porque permitiría evaluar a sus pacientes de una manera objetiva y consistente. En este caso, los i-Vectors se obtuvieron a partir de información espectral a corto plazo de la señal, y la inteligibilidad se calculó a partir de los i-Vectors obtenidos para un conjunto de palabras dichas por el locutor evaluado. Comprobamos que los resultados eran mucho mejores si en el entrenamiento del sistema se incorporaban datos de la persona que iba a ser evaluada. No obstante, esta limitación podría aliviarse utilizando una mayor cantidad de datos para entrenar el sistema.In this Thesis, we investigated how to effciently apply subspace Gaussian mixture modeling techniques onto two speech technology problems, namely automatic spoken language identification (LID) and automatic intelligibility assessment of dysarthric speech. One of the most important of such techniques in this Thesis was joint factor analysis (JFA). JFA is essentially a Gaussian mixture model where the mean of the components is expressed as a sum of low-dimension factors that represent different contributions to the speech signal. This factorization makes it possible to compensate for undesired sources of variability, like the channel. JFA was investigated as final classiffer and as feature extractor. In the latter approach, a single subspace including all sources of variability is trained, and points in this subspace are known as i-Vectors. Thus, one i-Vector is defined as a low-dimension representation of a single utterance, and they are a very powerful feature for different machine learning problems. We have investigated two different LID systems according to the type of features extracted from speech. First, we extracted acoustic features representing short-time spectral information. In this case, we observed relative improvements with i-Vectors with respect to JFA of up to 50%. We realized that the channel subspace in a JFA model also contains language information whereas i-Vectors do not discard any language information, and moreover, they help to reduce mismatches between training and testing data. For classification, we modeled the i-Vectors of each language with a Gaussian distribution with covariance matrix shared among languages. This method is simple and fast, and it worked well without any post-processing. Second, we introduced the use of prosodic and formant information with the i-Vectors system. The performance was below the acoustic system but both were found to be complementary and we obtained up to a 20% relative improvement with the fusion with respect to the acoustic system alone. Given the success in LID and the fact that i-Vectors capture all the information that is present in the data, we decided to use i-Vectors for other tasks, specifically, the assessment of speech intelligibility in speakers with different types of dysarthria. Speech therapists are very interested in this technology because it would allow them to objectively and consistently rate the intelligibility of their patients. In this case, the input features were extracted from short-term spectral information, and the intelligibility was assessed from the i-Vectors calculated from a set of words uttered by the tested speaker. We found that the performance was clearly much better if we had available data for training of the person that would use the application. We think that this limitation could be relaxed if we had larger databases for training. However, the recording process is not easy for people with disabilities, and it is difficult to obtain large datasets of dysarthric speakers open to the research community. Finally, the same system architecture for intelligibility assessment based on i-Vectors was used for predicting the accuracy that an automatic speech recognizer (ASR) system would obtain with dysarthric speakers. The only difference between both was the ground truth label set used for training. Predicting the performance response of an ASR system would increase the confidence of speech therapists in these systems and would diminish health related costs. The results were not as satisfactory as in the previous case, probably because an ASR is a complex system whose accuracy can be very difficult to be predicted only with acoustic information. Nonetheless, we think that we opened a door to an interesting research direction for the two problems
Deep learning methods in speaker recognition: a review
This paper summarizes the applied deep learning practices in the field of
speaker recognition, both verification and identification. Speaker recognition
has been a widely used field topic of speech technology. Many research works
have been carried out and little progress has been achieved in the past 5-6
years. However, as deep learning techniques do advance in most machine learning
fields, the former state-of-the-art methods are getting replaced by them in
speaker recognition too. It seems that DL becomes the now state-of-the-art
solution for both speaker verification and identification. The standard
x-vectors, additional to i-vectors, are used as baseline in most of the novel
works. The increasing amount of gathered data opens up the territory to DL,
where they are the most effective
A prior case study of natural language processing on different domain
In the present state of digital world, computer machine do not understand the human’s ordinary language. This is the great barrier between humans and digital systems. Hence, researchers found an advanced technology that provides information to the users from the digital machine. However, natural language processing (i.e. NLP) is a branch of AI that has significant implication on the ways that computer machine and humans can interact. NLP has become an essential technology in bridging the communication gap between humans and digital data. Thus, this study provides the necessity of the NLP in the current computing world along with different approaches and their applications. It also, highlights the key challenges in the development of new NLP model
- …