13 research outputs found

    Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi

    In this paper we discuss in-progress work on the development of a speech corpus for four low-resource Indo-Aryan languages -- Awadhi, Bhojpuri, Braj and Magahi -- using the field methods of linguistic data collection. The total size of the corpus currently stands at approximately 18 hours (roughly 4-5 hours per language), transcribed and annotated with grammatical information such as part-of-speech tags, morphological features and Universal Dependencies relations. We discuss our methodology for data collection in these languages, most of which was carried out in the middle of the COVID-19 pandemic, with one of the aims being to generate additional income for low-income groups speaking these languages. We also discuss the results of baseline experiments for automatic speech recognition systems in these languages.
    Comment: Speech for Social Good Workshop, 2022, Interspeech 2022
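    As a rough illustration of the annotation layers mentioned above (POS tags, morphological features, dependency relations), the sketch below parses one token line in the CoNLL-U format used by Universal Dependencies; the sample token and its analysis are hypothetical, not drawn from the corpus itself.

```python
# Minimal CoNLL-U token parser, assuming the corpus annotations follow
# the Universal Dependencies file format; the sample line is hypothetical.

CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

def parse_token(line: str) -> dict:
    """Split one tab-separated CoNLL-U token line into named fields."""
    return dict(zip(CONLLU_FIELDS, line.rstrip("\n").split("\t")))

# A hypothetical token annotated with POS, morphology and a dependency
# relation (a nominative singular pronoun acting as subject of token 2).
sample = "1\tu\tu\tPRON\t_\tCase=Nom|Number=Sing\t2\tnsubj\t_\t_"
token = parse_token(sample)
print(token["upos"], token["feats"], token["deprel"])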

    Is Attention always needed? A Case Study on Language Identification from Speech

    Language Identification (LID) is a crucial preliminary process in the field of Automatic Speech Recognition (ASR) that involves identifying a spoken language from audio samples. Contemporary systems that can process speech in multiple languages require users to expressly designate one or more languages prior to use. The LID task plays a significant role in scenarios where ASR systems cannot determine the spoken language in multilingual settings, leading to unsuccessful speech recognition outcomes. The present study introduces a convolutional recurrent neural network (CRNN) based LID system, designed to operate on the Mel-frequency Cepstral Coefficient (MFCC) features of audio samples. Furthermore, we replicate certain state-of-the-art methodologies, specifically the Convolutional Neural Network (CNN) and the attention-based Convolutional Recurrent Neural Network (CRNN with attention), and conduct a comparative analysis with our CRNN-based approach. We conducted comprehensive evaluations on thirteen distinct Indian languages, and our model achieved over 98% classification accuracy. The LID model exhibits high performance, ranging from 97% to 100%, for languages that are linguistically similar. The proposed LID model extends readily to additional languages and demonstrates strong resistance to noise, achieving 91.2% accuracy in a noisy setting when applied to a European Language (EU) dataset.
    Comment: Accepted for publication in Natural Language Engineering
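    A minimal sketch of the kind of CRNN described above, written in PyTorch; the layer sizes, 39 MFCC coefficients, and the 13-language output head are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CRNNLanguageID(nn.Module):
    """Convolutional recurrent network over MFCC frames (sketch only;
    all hyperparameters are assumptions, not the paper's settings)."""

    def __init__(self, n_mfcc: int = 39, n_langs: int = 13):
        super().__init__()
        # 2-D convolutions over the (time, coefficient) plane.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        # Recurrent layer summarises the convolutional features over time.
        self.gru = nn.GRU(64 * (n_mfcc // 4), 128, batch_first=True,
                          bidirectional=True)
        self.classifier = nn.Linear(2 * 128, n_langs)

    def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
        # mfcc: (batch, time, n_mfcc) -> add a channel dimension.
        x = self.conv(mfcc.unsqueeze(1))       # (B, C, T', F')
        x = x.permute(0, 2, 1, 3).flatten(2)   # (B, T', C*F')
        _, h = self.gru(x)                     # h: (2, B, 128)
        h = torch.cat([h[0], h[1]], dim=-1)    # (B, 256)
        return self.classifier(h)              # language logits

model = CRNNLanguageID()
logits = model(torch.randn(4, 200, 39))  # 4 clips, 200 frames, 39 MFCCs
print(logits.shape)                      # torch.Size([4, 13])
```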

    Multinomial logistic regression probability ratio-based feature vectors for Malay vowel recognition

    Vowel recognition is a part of automatic speech recognition (ASR) systems that classifies speech signals into groups of vowels. The performance of Malay vowel recognition (MVR), like that of any multiclass classification problem, depends largely on the feature vectors (FVs) used. FVs such as Mel-frequency Cepstral Coefficients (MFCC) have produced high error rates due to poor phoneme information. Classifier-transformed probabilistic features have proved a better alternative for conveying phoneme information; however, their high dimensionality introduces additional complexity that degrades ASR performance. This study aims to improve MVR performance by proposing an algorithm that transforms MFCC FVs into a new set of features using multinomial logistic regression (MLR), thereby reducing the dimensionality of the probabilistic features. The study was carried out in four phases: pre-processing and feature extraction, generation of the best regression coefficients, feature transformation, and performance evaluation. The speech corpus consists of 1953 samples of the five Malay vowels /a/, /e/, /i/, /o/ and /u/, recorded from students of two public universities in Malaysia. Two algorithms were developed: DBRC and FELT. The DBRC algorithm determines the best set of regression coefficients (RCs) from the extracted 39-MFCC FVs through a resampling and data-swapping approach, while the FELT algorithm transforms the 39-MFCC FVs into FELT FVs using a logistic transformation method. Vowel recognition rates of FELT and 39-MFCC FVs were compared using four different classification techniques: artificial neural network, MLR, linear discriminant analysis, and k-nearest neighbour. Classification results showed that FELT FVs surpass 39-MFCC FVs in MVR: depending on the classifier used, FELT attained improvements of 1.48%-11.70% over MFCC, and it significantly improved the recognition accuracy of the vowels /o/ and /u/ by 5.13% and 8.04% respectively. This study contributes two algorithms, one for determining the best set of RCs and one for generating FELT FVs from MFCC. The FELT FVs eliminate the need for dimensionality reduction while delivering comparable performance, and they improved MVR for all five vowels, especially /o/ and /u/. The improved MVR performance will spur the development of Malay speech-based systems, especially for the Malaysian community.
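    The core idea, turning MFCC vectors into classifier-derived probability-ratio features with multinomial logistic regression, can be sketched with scikit-learn as below; the log-probability-ratio formulation and all data are illustrative assumptions, not the thesis's exact DBRC/FELT algorithms.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative stand-in data: 1953 frames of 39-MFCC features for the
# five Malay vowels /a/, /e/, /i/, /o/, /u/ (random here, real in the study).
rng = np.random.default_rng(0)
X = rng.normal(size=(1953, 39))    # 39-MFCC feature vectors
y = rng.integers(0, 5, size=1953)  # vowel class labels 0../a/ .. 4../u/

# Fit a multinomial logistic regression on the MFCC features
# (multinomial is the default behaviour with the lbfgs solver).
mlr = LogisticRegression(max_iter=1000).fit(X, y)

# Transform each MFCC vector into class log-probability-ratio features:
# log p(c | x) - log p(reference | x) for each non-reference class. This is
# one plausible reading of "probability ratio-based feature vectors", not
# the thesis's exact formulation.
log_p = mlr.predict_log_proba(X)                # (1953, 5)
ratio_features = log_p[:, 1:] - log_p[:, [0]]   # (1953, 4), /a/ as reference
print(ratio_features.shape)
```

    Note how the transformed features have only 4 dimensions per frame instead of 39, which mirrors the dimensionality-reduction motivation of the study.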

    Phonetic aware techniques for Speaker Verification

    The goal of this thesis is to improve current state-of-the-art techniques in speaker verification (SV), typically based on 'identity vectors' (i-vectors) and deep neural networks (DNNs), by exploiting diverse (phonetic) information extracted using techniques such as automatic speech recognition (ASR). Different speakers span different subspaces within a universal acoustic space, usually modelled by a 'universal background model'. The speaker-specific subspace depends on the speaker's voice characteristics, but also on the verbalised text. In current state-of-the-art SV systems, i-vectors are extracted by applying a factor analysis technique to obtain a low-dimensional speaker-specific representation. Furthermore, DNN outputs are also employed in a conventional i-vector framework to model the phonetic information embedded in the speech signal. This thesis proposes various techniques to exploit phonetic knowledge of speech to further enrich speaker characteristics. More specifically, the proposed techniques are applied to two SV tasks: text-independent and text-dependent SV. For the text-independent SV task, several ASR systems are developed and used to compute phonetic posterior probabilities, which are subsequently exploited to enhance the speaker-specific information included in i-vectors. These approaches are then extended to the text-dependent SV task, exploiting temporal information in a principled way, i.e., by applying dynamic time warping to speaker-informative vectors. Finally, as opposed to training the DNN with phonetic information, the DNN is trained in an end-to-end fashion to directly discriminate between speakers. The baseline end-to-end SV approach maps a variable-length speech segment to a fixed-dimensional speaker vector by estimating the mean of hidden representations in the DNN structure. We improve upon this technique by computing a distance function between two utterances that takes common phonetic units into account. The whole network is optimised with a triplet-loss objective function. The proposed approaches are evaluated on commonly used datasets such as NIST SRE 2010 and RSR2015; significant improvements over the baseline systems are observed on both the text-dependent and text-independent SV tasks by applying phonetic knowledge.
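    The baseline end-to-end approach described above (mean-pooled DNN embeddings trained with a triplet loss) might look roughly like the following PyTorch sketch; the network architecture, feature dimensions, and batch shapes are assumptions for illustration, not the thesis's exact setup.

```python
import torch
import torch.nn as nn

class SpeakerEmbedder(nn.Module):
    """Maps a variable-length sequence of acoustic frames to a fixed
    speaker vector by mean-pooling hidden representations (sketch only)."""

    def __init__(self, n_feats: int = 40, emb_dim: int = 256):
        super().__init__()
        self.frame_net = nn.Sequential(
            nn.Linear(n_feats, 512), nn.ReLU(),
            nn.Linear(512, emb_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, n_feats); the mean over time yields the
        # fixed-dimensional utterance-level speaker vector.
        return self.frame_net(frames).mean(dim=1)

embedder = SpeakerEmbedder()
triplet = nn.TripletMarginLoss(margin=1.0)

# Hypothetical triplet: anchor/positive from one speaker, negative from
# another; utterances may have different lengths thanks to mean pooling.
anchor   = embedder(torch.randn(8, 300, 40))
positive = embedder(torch.randn(8, 250, 40))
negative = embedder(torch.randn(8, 280, 40))
loss = triplet(anchor, positive, negative)
loss.backward()
```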