
    Amharic Speech Recognition for Speech Translation

    No full text
    State-of-the-art speech translation can be seen as a cascade of Automatic Speech Recognition (ASR), Statistical Machine Translation (SMT) and Text-To-Speech (TTS) synthesis. In this study we experiment with Amharic speech recognition for Amharic-English speech translation in the tourism domain. Since no Amharic speech corpus was available, we developed a 7.43-hour read-speech corpus in the tourism domain, recorded after translating the standard Basic Traveler Expression Corpus (BTEC) under a normal working environment. In our ASR experiments, phoneme and syllable units are used for acoustic models, while morpheme and word units are used for language models. Encouraging ASR results are achieved with morpheme-based language models and phoneme-based acoustic models, with recognition accuracies of 89.1%, 80.9%, 80.6% and 49.3% at the character, morph, word and sentence level respectively. We are now working towards Amharic-English speech translation by cascading these components under different error correction algorithms.
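
    The cascade described above can be pictured as a small pipeline. The Python below is an illustrative sketch only; the component functions (amharic_asr, amharic_english_smt, english_tts) and the toy phrase table are hypothetical placeholders, not the authors' systems.

        def amharic_asr(audio):
            # Hypothetical placeholder: a real system would decode with a phoneme-based
            # acoustic model and a morpheme-based language model.
            return "selam"

        def amharic_english_smt(amharic_text):
            # Hypothetical placeholder for the BTEC-trained translation component.
            toy_phrase_table = {"selam": "hello"}
            return toy_phrase_table.get(amharic_text, "<unk>")

        def english_tts(english_text):
            # Hypothetical placeholder: a real TTS component would return a waveform.
            return f"[synthesised audio for: {english_text}]"

        def speech_translation(audio):
            # Cascade described in the abstract: ASR -> SMT -> TTS. Recognition errors
            # propagate into translation, motivating the error-correction work mentioned above.
            return english_tts(amharic_english_smt(amharic_asr(audio)))

        print(speech_translation(audio=None))  # [synthesised audio for: hello]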

    A Combined Approach towards Measuring Linguistic Distance: A Study on South Ethiosemitic Languages

    Get PDF
    The distance among closely related languages is usually measured along three dimensions: structural, functional and perceptual. The structural distance is determined by directly quantifying the phonetic, lexical, morphological and syntactic differences among the languages. The functional distance is measured from the actual usage of the languages, e.g. mutual intelligibility and inter-lingual comprehensibility. The perceptual distance is related to the subjective judgment of the speakers about the similarity or intelligibility between their native language and neighboring related languages. Studies on language variation measure linguistic distance along at least one of these dimensions. However, as Gooskens (2018) and Tang and Heuven (2009) noticed, languages do not differ in just one dimension; they can be, for example, phonetically similar but syntactically different. The present study therefore combined these three perspectives to examine the distance among ten purposely selected South Ethiosemitic languages (Chaha, Endegagn, Ezha, Gumer, Gura, Inor, Kistane, Mesqan, Muher and Silt'e). The study aims to (1) determine the areal classification of the languages; (2) illustrate the similarity or difference between this areal classification and previous classifications by historical linguists; (3) determine the degree of mutual intelligibility among the languages; (4) examine the relationship among the three dimensions of linguistic distance; and (5) explore the major determinants (linguistic and non-linguistic) that contribute to the linguistic distance among the languages.
    The structural distance was determined by computing the lexical and phonetic differences based on 240 randomly selected words. The lexical distance was defined as the average proportion of non-cognate pairs in the basic vocabularies. The Levenshtein algorithm (Heeringa, 2004; Kessler, 1995) was used to compute the phonetic distance, defined as the number of operations required to transform one sequence of phones into another. A Semantic Word Categorization test adapted from Tang and Heuven (2009) was used to measure the functional distance. A self-rating test, based on recordings of 'the North Wind and the Sun', was administered to determine the perceptual distance among the languages. With regard to the linguistic determinants, the degree of diffusion of phonetic and lexical features was estimated using Neighbor-net network representation and lexicostatistical skewing. The study also examined the influence of four non-linguistic determinants: geographical distance, population size, the degree of contact among the speakers and language attitude. Gabmap was used for clustering and cluster validation; multidimensional scaling and fuzzy clustering were employed for cluster validation. The classifications obtained from each of the distance matrices were compared to previous classifications (by historical linguists) based on the cophenetic distance among the various sub-groupings.
    The results of the cluster analysis show that the ten selected South Ethiosemitic language varieties can fairly be grouped into five clusters: {Chaha, Ezha, Gumer, Gura}, {Mesqan, Muher}, {Endegagn, Inor}, {Kistane} and {Silt'e}. This classification is very similar to the classifications previously proposed by historical linguists (e.g. Hetzron, 1972, 1977). There is also a very strong correlation among the measures of the three dimensions of distance. However, these measures have different degrees of reliability: the structural distance is the most reliable measure, while the perceptual distance is the least reliable. Furthermore, the Word Categorization test results show that many of these languages are mutually intelligible; Silt'e, however, is not mutually intelligible with any of the languages investigated in the present study. The results of the analysis of the linguistic determinants show that the similarity among the language varieties is mainly the result of contact among the languages. Moreover, the analysis of the non-linguistic variables indicates a strong positive correlation between geographical distance and linguistic distance, and a positive contribution of contact among the speakers. Nevertheless, there is no significant correlation between linguistic distance and population size. Finally, among the three dimensions of linguistic distance, the perceptual distance is the one most affected by the attitude of the speakers.
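
    As a concrete illustration of the phonetic distance measure, the sketch below implements the standard Levenshtein algorithm over phone sequences and averages length-normalised distances across word pairs. The phone transcriptions are invented placeholders, not data from the study, and the length normalisation is an assumption.

        def levenshtein(a, b):
            """Insertions, deletions and substitutions needed to turn one phone sequence into another."""
            prev = list(range(len(b) + 1))
            for i, x in enumerate(a, 1):
                curr = [i]
                for j, y in enumerate(b, 1):
                    curr.append(min(prev[j] + 1,              # deletion
                                    curr[j - 1] + 1,          # insertion
                                    prev[j - 1] + (x != y)))  # substitution
                prev = curr
            return prev[-1]

        def phonetic_distance(pairs):
            """Average length-normalised Levenshtein distance over aligned word pairs."""
            return sum(levenshtein(a, b) / max(len(a), len(b)) for a, b in pairs) / len(pairs)

        # Hypothetical cognate pairs from two related varieties (one symbol per phone).
        pairs = [(["b", "e", "t"], ["b", "i", "t"]),
                 (["s", "a", "b"], ["s", "a", "b", "a"])]
        print(phonetic_distance(pairs))  # ~0.29 for these invented pairs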

    How Phonotactics Affect Multilingual and Zero-shot ASR Performance

    Full text link
    The idea of combining multiple languages' recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of a universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree its performance suffered from differences in pronunciation or from the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. We then perform an extensive evaluation of monolingual, multilingual, and crosslingual (zero-shot) acoustic and language models on a set of 13 phonetically diverse languages. We show that the gain from modeling crosslingual phonotactics is limited, and that imposing too strong a model can hurt zero-shot transfer. Furthermore, we find that a multilingual LM hurts a multilingual ASR system's performance, and that retaining only the target language's phonotactic data in LM training is preferable. (Accepted for publication at IEEE ICASSP 2021; the first two authors contributed equally to this work.)
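
    The point of the explicit AM/LM factorisation can be illustrated with a small rescoring sketch: in a hybrid system the phonotactic language model can be swapped (monolingual, multilingual, or zero-shot) independently of the acoustic model. The hypotheses, scores, bigram table and interpolation weight below are hypothetical, not the paper's models.

        def rescore(hypotheses, lm):
            """Combine a fixed acoustic score with a swappable phonotactic LM score."""
            lm_weight = 1.0  # hypothetical interpolation weight
            best = max(hypotheses, key=lambda h: h["am_logprob"] + lm_weight * lm(h["phones"]))
            return best["phones"]

        def target_language_lm(phones):
            # Placeholder phone-bigram score; a real system would train this on
            # target-language phonotactic data only, as the abstract recommends.
            bigram_logprob = {("a", "b"): -0.5, ("b", "a"): -0.7}
            return sum(bigram_logprob.get(bg, -3.0) for bg in zip(phones, phones[1:]))

        hypotheses = [
            {"phones": ["a", "b", "a"], "am_logprob": -4.0},
            {"phones": ["a", "a", "a"], "am_logprob": -3.5},
        ]
        print(rescore(hypotheses, target_language_lm))  # ['a', 'b', 'a']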

    Stimulated training for automatic speech recognition and keyword search in limited resource conditions

    Get PDF
    © 2017 IEEE. Training neural network acoustic models on limited quantities of data is a challenging task. A number of techniques have been proposed to improve generalisation. This paper investigates one such technique, called stimulated training. It enables standard criteria such as cross-entropy to enforce spatial constraints on activations originating from different units. Having different regions of the network active depending on the input may help the network discriminate better and, as a consequence, yield lower error rates. This paper investigates stimulated training for automatic speech recognition of a number of languages representing different families, alphabets, phone sets and vocabulary sizes. In particular, it looks at ensembles of stimulated networks to ensure that the improved generalisation withstands system combination effects. To assess stimulated training beyond 1-best transcription accuracy, the paper uses keyword search as a proxy for lattice quality. Experiments are conducted on IARPA Babel program languages, including the surprise language of the OpenKWS 2016 competition.
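
    A minimal sketch of a stimulated-training style regulariser is given below, assuming hidden units laid out on a 2D grid whose activations are encouraged to match a smooth, class-specific target pattern; the grid size, Gaussian target and loss weighting are assumptions for illustration, not the paper's exact recipe.

        import numpy as np

        def target_pattern(grid, centre, sigma=1.5):
            """Gaussian 'stimulus' over a (rows, cols) unit grid for one class."""
            rows, cols = grid
            y, x = np.mgrid[0:rows, 0:cols]
            d2 = (y - centre[0]) ** 2 + (x - centre[1]) ** 2
            p = np.exp(-d2 / (2 * sigma ** 2))
            return p / p.sum()

        def stimulation_penalty(activations, centre, grid=(8, 8)):
            """Squared distance between normalised activations and the class target pattern."""
            a = activations.reshape(grid)
            a = a / (a.sum() + 1e-8)
            return float(np.sum((a - target_pattern(grid, centre)) ** 2))

        # Hypothetical usage: total loss = cross-entropy + lambda * stimulation penalty.
        acts = np.random.rand(64)  # activations of one 64-unit layer for one frame
        print(stimulation_penalty(acts, centre=(2, 5)))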

    Sensitivity to Consonantal Context in Reading English Vowels: The Case of Arabic Learners

    Get PDF
    Both experimental and anecdotal evidence document the difficulty Arabic learners of English demonstrate when learning to read and write in English. The complex phoneme-grapheme mapping rules of English may explain this difficulty in part, but the question remains why Arabic learners in particular have difficulty decoding English. This dissertation attempts to pinpoint which specific sub-word processes may contribute to the difficulty Arabic learners of English commonly experience. Vowel processing is an appropriate place to begin, given the inconsistency of the grapheme-phoneme mapping rules for English vowels. The statistical patterns of the English language itself, namely the relationships between onset and vowel or between vowel and coda, greatly increase the likelihood of a particular vowel pronunciation, reducing the inconsistency of vowel grapheme-phoneme mappings. When reading, native English speakers use the context (preceding and following consonants) in which a vowel occurs to narrow the range of possible pronunciations, and are thus said to demonstrate sensitivity to consonantal context. For this dissertation, sensitivity to consonantal context in reading English vowels was tested in three groups (Arabic speakers, native English speakers, and speakers from other language backgrounds) using an experiment based on prior studies of native English speakers. Results indicate that non-native speakers of English show less sensitivity to consonantal context than native English speakers, especially in their greater use of the critical vowel pronunciation in control contexts. Furthermore, Arabic speakers show even less sensitivity to consonantal context than both the native English speakers and the speakers from other language backgrounds, especially for vowel-to-coda associations. In fact, the results for the Arabic speakers in three of six vowel-to-coda test cases run counter to the expected outcome, resulting in what might be called an anti-sensitivity to consonantal context. The small number of participants in the Arabic group limits the ability to draw a strong conclusion, but the fact that the results for the Arabic group run opposite to the expected outcome for some test items warrants future study.
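
    The kind of consonantal-context statistic referred to above can be made concrete with a small worked example: the probability of a vowel pronunciation conditioned on the vowel grapheme and the following coda. The counts below are invented for illustration and are not data from the dissertation.

        from collections import Counter

        # (vowel grapheme, coda, pronunciation) counts from an imaginary word list.
        counts = Counter({
            ("ea", "d", "E"): 8,    # e.g. "head", "dead" -> short vowel
            ("ea", "d", "i:"): 2,   # e.g. "bead" -> long vowel
            ("ea", "t", "i:"): 9,   # e.g. "heat", "seat"
            ("ea", "t", "E"): 1,    # e.g. "sweat"
        })

        def p_pron_given_context(vowel, coda, pron):
            context_total = sum(c for (v, cd, _), c in counts.items() if v == vowel and cd == coda)
            return counts[(vowel, coda, pron)] / context_total

        # A reader sensitive to consonantal context would favour the short vowel before "d"
        # and the long vowel before "t" for the grapheme "ea" in this toy data.
        print(p_pron_given_context("ea", "d", "E"))   # 0.8
        print(p_pron_given_context("ea", "t", "i:"))  # 0.9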

    Automatic Speech Recognition without Transcribed Speech or Pronunciation Lexicons

    Get PDF
    Rapid deployment of automatic speech recognition (ASR) in new languages, with very limited data, is of great interest and importance for intelligence gathering, as well as for humanitarian assistance and disaster relief (HADR). Deploying ASR systems in these languages often relies on cross-lingual acoustic modeling followed by supervised adaptation, and almost always assumes that a pronunciation lexicon using the International Phonetic Alphabet (IPA) and/or some amount of transcribed speech exists in the new language of interest. For many languages, neither requirement is generally met: only a limited amount of text and untranscribed audio is available. This work focuses specifically on scalable techniques for building ASR systems in most languages without any existing transcribed speech or pronunciation lexicons. We first demonstrate how cross-lingual acoustic model transfer, when phonemic pronunciation lexicons do exist in a new language, can significantly reduce the need for target-language transcribed speech. We then explore three methods for handling languages without a pronunciation lexicon. First, we examine the effectiveness of graphemic acoustic model transfer, which allows pronunciation lexicons to be trivially constructed. We then present two methods for rapid construction of phonemic pronunciation lexicons based on submodular selection of a small set of words for manual annotation, or of words from other languages for which we have IPA pronunciations. We also explore techniques for training sequence-to-sequence models with very small amounts of data by transferring models trained on other languages and leveraging large unpaired text corpora in training. Finally, as an alternative to acoustic model transfer, we present a novel hybrid generative/discriminative semi-supervised training framework that merges recent progress in Energy Based Models (EBMs) and lattice-free maximum mutual information (LF-MMI) training, capable of making use of purely untranscribed audio. Together, these techniques enabled ASR capabilities that supported triage of spoken communications in real-world HADR workflows in many languages using fewer than 30 minutes of transcribed speech. These techniques were successfully applied in multiple NIST evaluations and were among the top-performing systems in each evaluation.
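
    To make concrete why a graphemic lexicon can be constructed trivially, the sketch below maps each word to its sequence of normalised graphemes; the normalisation choices and word list are assumptions for illustration, not the authors' exact recipe.

        import unicodedata

        def graphemic_lexicon(words):
            """Build a lexicon whose 'pronunciations' are just the words' grapheme sequences."""
            lexicon = {}
            for word in words:
                w = unicodedata.normalize("NFC", word.lower())
                lexicon[word] = list(w)  # one phone-like unit per grapheme
            return lexicon

        # Hypothetical word list harvested from target-language text.
        for word, prons in graphemic_lexicon(["Salam", "kitaab"]).items():
            print(word, " ".join(prons))
        # Salam  s a l a m
        # kitaab k i t a a b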