
    Master of Science

    Presently, speech recognition is gaining worldwide popularity in applications like Google Voice, speech-to-text reporting (transcription, video captioning, real-time transcription), hands-free computing, and video games. Research has been conducted for several years and many speech recognizers have been built. However, most speech recognizers fail to recognize speech accurately. Consider the well-known application of Google Voice, which helps users search the web by voice. Though Google Voice does a good job of transcribing spoken words, it does not accurately recognize words spoken with different accents. Given that accents continue to evolve around the world, it is essential to train speech recognizers on accented speech. Accent classification is defined as the problem of classifying the accents within a given language. This thesis explores various methods of identifying accents. We introduce a new approach that clusters windows of a speech signal and learns a distance metric, using a specific distance measure over phonetic strings, to classify accents. A language structure is incorporated into learning this distance metric. We also show how kernel approximation algorithms help in learning a distance metric.

    ACOUSTIC-PHONETIC FEATURE BASED DIALECT IDENTIFICATION IN HINDI SPEECH


    A Review of Accent-Based Automatic Speech Recognition Models for E-Learning Environment

    The adoption of electronic learning (e-learning) as a method of disseminating knowledge in the global educational system is growing at a rapid rate, and has created a shift in knowledge acquisition methods from conventional classrooms and tutors to the distributed e-learning technique that enables access to various learning resources much more conveniently and flexibly. However, notwithstanding the adaptive advantages of the learner-centric content of e-learning programmes, the distributed e-learning environment has unconsciously adopted a few international languages as the languages of communication among participants, despite the various accents (mother-tongue influence) among these participants. Adjusting to and accommodating these accents has brought about the introduction of accent-based automatic speech recognition into e-learning to resolve the effects of accent differences. This paper reviews over 50 research papers to determine the progress made between 2001 and 2021 in the design and implementation of accent-based automatic speech recognition models for e-learning. The analysis of the review shows that 50% of the models reviewed adopted English, 46.50% adopted the major Chinese and Indian languages, and 3.50% adopted Swedish as the mode of communication. It is therefore evident that the majority of ASR models are centred on European, American and Asian accents, while unconsciously excluding the accent peculiarities associated with less technologically resourced continents.

    Discovering Lexical Similarity Using Articulatory Feature-Based Phonetic Edit Distance

    Lexical Similarity (LS) between two languages uncovers many interesting linguistic insights such as phylogenetic relationships, mutual intelligibility, common etymology, and loan words. There are various methods through which LS is evaluated. This paper presents a method of Phonetic Edit Distance (PED) that uses a soft comparison of letters based on the articulatory features associated with their International Phonetic Alphabet (IPA) transcription. In particular, the comparison between the articulatory features of two letters taken from words belonging to different languages is used to compute the cost of replacement in the inner loop of the edit distance computation. As an example, PED gives an edit distance of 0.82 between the German word ‘vater’ ([fa:tər]) and the Persian word ‘ ’ ([pedær]), both meaning ‘father,’ and, similarly, a PED of 0.93 between the Hebrew word ‘ ’ ([ʃəɭam]) and the Arabic word ‘ ’ ([səɭa:m]), both meaning ‘peace,’ whereas the classical edit distances would be 4 and 2, respectively. We report the results of systematic experiments conducted on six languages: Arabic, Hindi, Marathi, Persian, Sanskrit, and Urdu. Universal Dependencies (UD) corpora were used to restrict the comparison to lists of words belonging to the same part of speech. The LS based on the average PED between pairs of words was then computed for each pair of languages, unveiling similarities otherwise masked by the adoption of different alphabets, grammars, and pronunciation rules.
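The soft-substitution edit distance described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the articulatory feature inventory below is a tiny invented subset, and the replacement cost is a Jaccard-style distance between feature sets rather than the paper's exact weighting.

```python
# Phonetic edit distance with a "soft" substitution cost derived from
# articulatory features (illustrative inventory, not the paper's own table).
FEATURES = {
    "p": {"bilabial", "plosive", "voiceless"},
    "b": {"bilabial", "plosive", "voiced"},
    "t": {"alveolar", "plosive", "voiceless"},
    "d": {"alveolar", "plosive", "voiced"},
    "s": {"alveolar", "fricative", "voiceless"},
    "f": {"labiodental", "fricative", "voiceless"},
}

def sub_cost(a, b):
    """Replacement cost in [0, 1]: 0 for identical letters, 1 for no shared features."""
    if a == b:
        return 0.0
    fa = FEATURES.get(a, {a})  # unknown letters fall back to a singleton set
    fb = FEATURES.get(b, {b})
    return 1.0 - len(fa & fb) / len(fa | fb)

def phonetic_edit_distance(s, t):
    """Standard dynamic-programming edit distance with soft substitution costs."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,                               # deletion
                d[i][j - 1] + 1.0,                               # insertion
                d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),  # soft substitution
            )
    return d[m][n]

# 'p' and 'b' share place and manner, so their substitution costs far less than 1:
print(phonetic_edit_distance("pat", "bat"))  # → 0.5
```

With a full IPA feature table in place of this toy inventory, the same inner loop reproduces the paper's behaviour of scoring pairs like [fa:tər]/[pedær] far below their classical edit distance.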

    A computational model for studying L1’s effect on L2 speech learning

    Much evidence has shown that the first language (L1) plays an important role in the formation of the L2 phonological system during the second language (L2) learning process. Combined with the fact that different L1s have distinct phonological patterns, this indicates diverse L2 speech learning outcomes for speakers from different L1 backgrounds. This dissertation hypothesizes that phonological distances between accented speech and speakers' L1 speech are also correlated with perceived accentedness, and that the correlations are negative for some phonological properties. Moreover, contrastive phonological distinctions between L1s and the L2 will manifest themselves in the accented speech produced by speakers from these L1s. To test these hypotheses, this study develops a computational model to analyze accented speech properties in both the segmental (short-term speech measurements at the short-segment or phoneme level) and suprasegmental (long-term speech measurements at the word, long-segment, or sentence level) feature spaces. The benefit of using a computational model is that it enables quantitative analysis of the L1's effect on accent in terms of different phonological properties. The core parts of this computational model are feature extraction schemes that extract pronunciation and prosody representations of accented speech based on existing techniques in the field of speech processing. Correlation analysis on both segmental and suprasegmental feature spaces is conducted to examine the relationship between acoustic measurements related to L1s and perceived accentedness across several L1s. Multiple regression analysis is employed to investigate how the L1's effect impacts the perception of foreign accent, and how accented speech produced by speakers from different L1s behaves distinctly in the segmental and suprasegmental feature spaces.
    The results unveil the potential of this methodology to provide quantitative analysis of accented speech and to extend current studies in L2 speech learning theory to a large scale. Practically, this study further shows that the proposed computational model can benefit automatic accentedness evaluation systems by adding features related to speakers' L1s.
    Doctoral Dissertation, Speech and Hearing Science, 201
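The correlation and regression stages of such a computational model can be sketched briefly. Everything below is illustrative: the data are synthetic, and the single "phonological distance" predictor is a placeholder for the segmental and suprasegmental measurements the dissertation actually extracts.

```python
import numpy as np

# Synthetic per-speaker data: a phonological-distance measure and a perceived
# accentedness rating constructed to correlate negatively with it.
rng = np.random.default_rng(0)
n_speakers = 40
distance = rng.uniform(0.0, 1.0, n_speakers)              # placeholder feature
accentedness = 5.0 - 3.0 * distance + rng.normal(0.0, 0.3, n_speakers)

# Correlation analysis: Pearson r between the feature and perceived accentedness.
r = np.corrcoef(distance, accentedness)[0, 1]

# Multiple regression: accentedness ~ intercept + distance + a second feature.
second_feature = rng.uniform(0.0, 1.0, n_speakers)
X = np.column_stack([np.ones(n_speakers), distance, second_feature])
beta, *_ = np.linalg.lstsq(X, accentedness, rcond=None)

# For this synthetic data, both r and the fitted slope beta[1] are negative,
# mirroring the hypothesized negative correlations for some properties.
```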

    On The Way To Linguistic Representation: Neuromagnetic Evidence of Early Auditory Abstraction in the Perception of Speech and Pitch

    The goal of this dissertation is to show that even at the earliest (non-invasively) recordable stages of auditory cortical processing, we find evidence that cortex is calculating abstract representations from the acoustic signal. Looking across two distinct domains (inferential pitch perception and vowel normalization), I present evidence demonstrating that the M100, an automatic evoked neuromagnetic component that localizes to primary auditory cortex, is sensitive to abstract computations. The M100 typically responds to physical properties of the stimulus in auditory and speech perception and integrates only over the first 25 to 40 ms of stimulus onset, providing a reliable dependent measure that allows us to tap into early stages of auditory cortical processing. In Chapter 2, I briefly present the episodicist position on speech perception and discuss research indicating that the strongest episodicist position is untenable. I then review findings from the mismatch negativity literature, where proposals have been made that the MMN allows access to linguistic representations supported by auditory cortex. Finally, I conclude the chapter with a discussion of previous findings on the M100/N1. In Chapter 3, I present neuromagnetic data showing that the response properties of the M100 are sensitive to the missing fundamental component, using well-controlled stimuli. These findings suggest that listeners are reconstructing the inferred pitch by 100 ms after stimulus onset. In Chapter 4, I propose a novel formant ratio algorithm in which the third formant (F3) is the normalizing factor. The goal of formant ratio proposals is to provide an explicit algorithm that successfully "eliminates" speaker-dependent acoustic variation of auditory vowel tokens.
    Results from two MEG experiments suggest that auditory cortex is sensitive to formant ratios and that the perceptual system shows heightened sensitivity to tokens located in more densely populated regions of the vowel space. In Chapter 5, I report MEG results suggesting that early auditory cortical processing is sensitive to violations of a phonological constraint on sound sequencing: listeners make highly specific, knowledge-based predictions about rather abstract anticipated properties of the upcoming speech signal, and violations of these predictions are evident in early cortical processing.

    Recognition of Correct Pronunciation for Arabic Letters Using Artificial Neural Networks

    Automatic speech recognition (ASR) plays an important role in taking technology to the people. There are numerous applications of speech recognition, such as direct voice input in aircraft, data entry and speech-to-text processing. The aim of this paper was to develop a voice-learning model for correct Arabic letter pronunciation using machine learning algorithms. The system was designed and implemented in three phases: signal preprocessing, feature extraction and feature classification. The MATLAB platform was used to extract voice features as Mel Frequency Cepstrum Coefficients (MFCC). The matrix of MFCC features was fed to backpropagation neural networks for Arabic letter classification. The overall classification accuracy was 65% (a 35% error) for one consonant letter; 87% (a 13% error) for 10 different isolated letters and 6 vowels each; and 95% (a 5% error) for 66 different examples of one letter (vowels, words and sentences) stored in one voice file.
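The MFCC feature-extraction stage can be sketched in NumPy rather than MATLAB; the frame length, hop, filterbank size, and coefficient count below are common defaults, not necessarily the paper's settings. The resulting feature matrix would then be fed to a backpropagation classifier (for example, scikit-learn's MLPClassifier).

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Toy MFCC pipeline: framing -> power spectrum -> mel filterbank -> log -> DCT-II."""
    # 1. Frame the signal and apply a Hamming window.
    frames = [signal[i:i + n_fft] * np.hamming(n_fft)
              for i in range(0, len(signal) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # 2. Build a triangular mel filterbank.
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # 3. Log mel energies, then DCT-II to decorrelate into cepstral coefficients.
    logmel = np.log(power @ fbank.T + 1e-10)
    k = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2 * n_mels))
    return logmel @ dct.T

# One second of a 440 Hz tone yields a (num_frames x 13) feature matrix.
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
features = mfcc(tone)
```

Each row of `features` describes one 32 ms frame; stacking such rows per utterance gives the feature matrix the paper passes to its neural network.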

    The Effect of Bilingual Proficiency in Indian English on Bilabial Plosive

    Background: Bilingual speech production studies have highlighted that the level of proficiency influences the acoustic-phonetic representation of phonemes in both languages (MacKay, Flege, Piske, & Schirru, 2001; Zárate-Sández, 2015). The results for bilingual speech production reveal that proficient/early bilinguals produce distinct acoustic properties for the same phoneme in each language, whereas less proficient/late bilinguals produce acoustic properties for a phoneme that are closer to the native language (Flege et al., 2003; Fowler et al., 2008). Acoustic-phonetic studies of Hindi (L1) and Indian English (L2) bilingual speakers are scarce, and the level of proficiency has not previously been considered for Hindi and Indian English bilingual speakers. The present study aimed to measure the acoustic differences produced by bilingual speakers of varying proficiencies for Indian English bilabial plosives, and to determine how those bilabial plosives differ from American English bilabial plosives. Methods: The planned sample size for this study was twenty-four; however, only twenty participants (eleven females) between the ages of eighteen and fifty, with normal speech and hearing, were recruited. The remaining four participants (n=4) could not be recruited owing to the difficulty of finding bilingual speakers who spoke Hindi as their first language and Indian English as their second language, and to COVID-19 restrictions on recruitment. The participants were divided into three groups based on language and proficiency: a monolingual American English group, a proficient bilingual Hindi-Indian English group, and a less-proficient bilingual Hindi-Indian English group. The bilinguals were divided into proficient and less-proficient groups based on the Language Experience and Proficiency Questionnaire (Marian, Blumenfeld, & Kaushanskaya, 2007). Following the screening, participants took part in a Nonword Repetition Task.
    Data were analyzed using Praat and VoiceSauce software. A linear mixed-effects model in R was used for the statistical analysis. Results: Data from 20 participants (seven proficient bilingual speakers, five less-proficient bilingual speakers, and eight monolingual speakers) were included in the analysis. Approximately four thousand repetitions were evaluated across these participants. There were no significant main effects across the four dependent variables, but there was an interaction effect between group and phoneme on two dependent variables. The closure duration for proficient bilingual speakers compared to less-proficient bilingual speakers was significantly different between the voiceless unaspirated bilabial plosive (VLE) and the voiceless aspirated bilabial plosive (VLH), as well as between the voiced unaspirated bilabial plosive (VE) and the voiced aspirated bilabial plosive (VH). For spectral tilt, there was a significant difference between the VLE and VLH for proficient bilingual speakers compared to less-proficient bilingual speakers. Discussion: The results of this study suggest that proficient bilingual speakers have a faster rate of speech in both their first and second languages. Therefore, it is difficult to determine whether this group has separate acoustic-phonetic characteristics for each phoneme in each language. In contrast, the less-proficient bilingual speakers seem to show a unidirectional relationship (i.e., the first language influences the second language). Furthermore, the acoustic characteristics of the control group (i.e., monolingual American English speakers) suggest that they may have a single acoustic-phonetic representation of the bilabial plosive with its voicing contrast.

    Learning cross-lingual phonological and orthographic adaptations: a case study in improving neural machine translation between low-resource languages

    Out-of-vocabulary (OOV) words can pose serious challenges for machine translation (MT) tasks, and in particular for low-resource language (LRL) pairs, i.e., language pairs for which few or no parallel corpora exist. Our work adapts variants of seq2seq models to perform transduction of such words from Hindi to Bhojpuri (an LRL instance), learning from a set of cognate pairs built from a bilingual dictionary of Hindi--Bhojpuri words. We demonstrate that our models can be effectively used for language pairs that have limited parallel corpora; our models work at the character level to grasp phonetic and orthographic similarities across multiple types of word adaptations, whether synchronic or diachronic, loan words or cognates. We describe the training aspects of several character-level NMT systems that we adapted to this task and characterize their typical errors. Our method improves the BLEU score by 6.3 on the Hindi-to-Bhojpuri translation task. Further, we show that such transductions can generalize well to other languages by applying the method successfully to Hindi--Bangla cognate pairs. Our work can be seen as an important step in the process of: (i) resolving the OOV word problem arising in MT tasks, (ii) creating effective parallel corpora for resource-constrained languages, and (iii) leveraging the enhanced semantic knowledge captured by word-level embeddings to perform character-level tasks.
    Comment: 47 pages, 4 figures, 21 tables (including Appendices)