
    Automatic Speech Recognition without Transcribed Speech or Pronunciation Lexicons

    Rapid deployment of automatic speech recognition (ASR) in new languages, with very limited data, is of great interest and importance for intelligence gathering, as well as for humanitarian assistance and disaster relief (HADR). Deploying ASR systems in these languages often relies on cross-lingual acoustic modeling followed by supervised adaptation, and almost always assumes that a pronunciation lexicon using the International Phonetic Alphabet (IPA), some amount of transcribed speech, or both exist in the new language of interest. For many languages, neither requirement is generally true -- only a limited amount of text and untranscribed audio is available. This work focuses specifically on scalable techniques for building ASR systems in most languages without any existing transcribed speech or pronunciation lexicons. We first demonstrate how cross-lingual acoustic model transfer, when phonemic pronunciation lexicons do exist in a new language, can significantly reduce the need for target-language transcribed speech. We then explore three methods for handling languages without a pronunciation lexicon. First, we examine the effectiveness of graphemic acoustic model transfer, which allows pronunciation lexicons to be constructed trivially. We then present two methods for rapid construction of phonemic pronunciation lexicons, based on submodular selection either of a small set of words for manual annotation or of words from other languages for which we have IPA pronunciations. We also explore techniques for training sequence-to-sequence models with very small amounts of data by transferring models trained on other languages and by leveraging large unpaired text corpora during training. Finally, as an alternative to acoustic model transfer, we present a novel hybrid generative/discriminative semi-supervised training framework that merges recent progress in Energy Based Models (EBMs) with lattice-free maximum mutual information (LF-MMI) training and is capable of making use of purely untranscribed audio. Together, these techniques enabled ASR capabilities that supported triage of spoken communications in real-world HADR workflows in many languages using fewer than 30 minutes of transcribed speech. These techniques were successfully applied in multiple NIST evaluations and were among the top-performing systems in each evaluation.
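
    As an illustration of the submodular selection idea, the sketch below greedily picks words whose manual annotation would maximize frequency-weighted coverage of grapheme n-grams. The n-gram coverage objective, the annotation budget, and the toy word list are assumptions made purely for illustration; the objective and selection criteria actually used in the work may differ.

        from collections import Counter

        def ngrams(word, n=3):
            # Character n-grams with boundary markers, e.g. "#ki", "kit", ...
            padded = f"#{word}#"
            return {padded[i:i + n] for i in range(len(padded) - n + 1)}

        def select_words(candidates, corpus_counts, budget=200, n=3):
            # Greedily pick `budget` words maximizing frequency-weighted n-gram coverage.
            covered, selected = set(), []
            remaining = set(candidates)
            for _ in range(budget):
                best_word, best_gain = None, 0.0
                for w in remaining:
                    gain = sum(corpus_counts[g] for g in ngrams(w, n) - covered)
                    if gain > best_gain:
                        best_word, best_gain = w, gain
                if best_word is None:  # nothing adds coverage any more
                    break
                selected.append(best_word)
                covered |= ngrams(best_word, n)
                remaining.discard(best_word)
            return selected

        # Toy usage: select 3 words from a small candidate list.
        words = ["kitabu", "shule", "mwalimu", "kusoma", "kitu"]
        counts = Counter(g for w in words for g in ngrams(w))
        print(select_words(words, counts, budget=3))

    Because the coverage function is monotone submodular, the greedy choice enjoys the standard (1 - 1/e) approximation guarantee under a cardinality budget.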

    Acoustic Modelling for Under-Resourced Languages

    Automatic speech recognition systems have so far been developed for only a few of the 4,000-7,000 existing languages. In this thesis we examine methods to rapidly create acoustic models in new, possibly under-resourced languages, in a time- and cost-effective manner. For this we examine the use of multilingual models, the application of articulatory features across languages, and the automatic discovery of word-like units in unwritten languages.

    Pronunciation Ambiguities in Japanese Kanji

    Japanese writing is a complex system, and a large part of the complexity resides in the use of kanji. A single kanji character in modern Japanese may have multiple pronunciations, either as native vocabulary or as words borrowed from Chinese. This causes a problem for text-to-speech synthesis (TTS), because the system has to predict which pronunciation of each kanji character is appropriate in context; this problem is called homograph disambiguation. In Japanese TTS, as in reading Japanese text generally, the challenge is knowing which reading is the right one. To address the problem, this research provides a new annotated data set of Japanese single-kanji pronunciations and describes an experiment using a logistic regression (LR) classifier. A baseline is computed for comparison with the LR classifier's accuracy; the LR classifier improves modeling performance by 16%. This is the first experimental study of Japanese single-kanji homograph disambiguation. The annotated Japanese data is freely released to the public to support further work.
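
    As a rough illustration of such a classification setup, the sketch below trains a logistic regression classifier to predict a kanji reading from the characters surrounding the target. The window-based character features, the scikit-learn pipeline, and the toy sentences and reading labels are illustrative assumptions, not the paper's actual feature set or data format.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        def context_window(sentence, index, size=2):
            # Characters within `size` positions of the target kanji.
            left = sentence[max(0, index - size):index]
            right = sentence[index + 1:index + 1 + size]
            return left + right

        # Toy training data: (sentence, position of the target kanji, reading label).
        examples = [
            ("鹹がć¹ă", 0, "kaze"),
            ("台鹹が杄る", 1, "fuu"),
        ]
        X = [context_window(s, i) for s, i, _ in examples]
        y = [label for _, _, label in examples]

        # Character features from the context window feed a logistic regression.
        model = make_pipeline(CountVectorizer(analyzer="char"), LogisticRegression())
        model.fit(X, y)
        print(model.predict([context_window("ćŒ·éąšăŒć¹ă", 1)]))  # toy prediction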

    Low Resource Efficient Speech Retrieval

    Speech retrieval refers to the task of retrieving information that is useful or relevant to a user query from a speech collection. This thesis examines ways in which speech retrieval can be improved so as to require low resources - without the extensively annotated corpora on which automated processing systems are typically built - while achieving high computational efficiency. The work focuses on two speech retrieval technologies: spoken keyword retrieval and spoken document classification. Firstly, keyword retrieval - also referred to as keyword search (KWS) or spoken term detection - is the task of retrieving from speech collections the occurrences of a keyword specified by the user in text form. We make advances in an open-vocabulary KWS platform using a context-dependent Point Process Model (PPM), and further develop a PPM-based lattice generation framework, which improves KWS performance and enables automatic speech recognition (ASR) decoding. Secondly, the massive volumes of speech data motivate the effort to organize and search speech collections through spoken document classification. In classifying real-world unstructured speech into predefined classes, recordings collected in the wild can be extremely long, of varying length, and may contain multiple class-label shifts at variable locations in the audio. For this reason, each spoken document is often first split into sequential segments, and each segment is then classified independently. We present a general-purpose method for classifying spoken segments, using a cascade of language-independent acoustic modeling, foreign-language-to-English translation lexicons, and English-language classification. Next, instead of classifying each segment independently, we demonstrate that exploiting the contextual dependencies across sequential segments can provide large classification performance improvements. Lastly, we remove the need for any orthographic lexicon and instead exploit alternative unsupervised approaches to decoding speech in terms of automatically discovered word-like or phoneme-like units. We show that spoken segment representations based on such lexical or phonetic discovery can achieve classification performance competitive with representations based on a domain-mismatched ASR or a universal phone set ASR.
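
    To illustrate one simple way of exploiting contextual dependencies across sequential segments, the sketch below smooths each segment's class posteriors with those of its neighbours before making the final decision. The weighted-average smoother and the toy posteriors are assumptions for illustration; the thesis' actual context model may be quite different, and in practice the posteriors would come from the classification cascade described above.

        import numpy as np

        def smooth_posteriors(posteriors, weight=0.3):
            # posteriors: (num_segments, num_classes) per-segment class probabilities.
            smoothed = posteriors.copy()
            for t in range(len(posteriors)):
                neighbours = []
                if t > 0:
                    neighbours.append(posteriors[t - 1])
                if t + 1 < len(posteriors):
                    neighbours.append(posteriors[t + 1])
                if neighbours:
                    context = np.mean(neighbours, axis=0)
                    smoothed[t] = (1 - weight) * posteriors[t] + weight * context
            return smoothed

        # Toy example: three segments, two classes; the middle segment is ambiguous
        # on its own, but its neighbours pull it towards class 0.
        p = np.array([[0.9, 0.1], [0.5, 0.5], [0.8, 0.2]])
        print(smooth_posteriors(p).argmax(axis=1))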

    Single-word naming in a transparent alphabetic orthography.

    The cognitive processes involved in single-word naming of the transparent Turkish orthography were examined in a series of nine naming experiments with adult native readers. In Experiment 1, a significant word frequency effect was observed when high- and low-frequency words matched on initial phoneme, letter length and number of syllables were presented for naming. However, no frequency effect was found in Experiment 2, when an equal number of nonword fillers, matched on the same variables, were mixed with the target words. A null frequency effect was also found in Experiment 3 under mixed-block conditions, i.e. when high- and low-frequency words were presented in separate blocks, each mixed with an equal number of matched nonword fillers. Experiment 4 served the purpose of creating and validating nonwords (to be used in Experiments 5 and 6) that could be named as fast as high- and low-frequency words, by manipulating the letter length of the nonwords. A significant word frequency effect emerged with both the mixed-block design (Experiment 5) and the mixed design (Experiment 6) when the nonword fillers matched the target words in speed of naming. Experiment 7, however, found no frequency effect when high- and low-frequency words were mixed with word fillers that were slower to name (longer in length) than the target words. In Experiment 8, frequency was factorially manipulated with imageability (high vs. low) and level of skill (very skilled vs. skilled); significant main effects were found for word frequency and level of skill, together with a significant two-way interaction of skill by imageability and a significant three-way interaction of skill by imageability by frequency. In Experiment 9, however, there was only a main effect of frequency when previously skilled readers performed on the same words used in Experiment 8. These findings suggest that whilst a lexical route dominates in naming the transparent Turkish orthography, the explanation that readers shut down the operation of this route in the presence of nonword fillers is not entertained. Instead, the results suggest that both routes operate in naming, with the inclusion of filler stimuli and their “perceived difficulty” having an impact on the time criterion for articulation. Moreover, there are indications that a semantic route is involved in naming Turkish only when level of skill is taken into account. Implications of these findings for models of single-word naming are discussed.

    Scriptinformatics

    Scripts (writing systems) usually belong to specific languages and have temporal, spatial and cultural characteristics. The evolution of scripts has long been a subject of research, probably because the long-term development of human thinking is reflected in the surviving script relics, many of which remain undeciphered today. The book presents the study of script evolution with the mathematical tools of systematics, phylogenetics and bioinformatics. In the research described, the script is the evolutionary taxonomic unit (taxon), analogous to the concept of a biological species. Among the methods of phylogenetics, phenetics classifies the investigated taxa on the basis of their morphological similarity and does not primarily examine genealogical relationships. Because the morphological diversity of script features is limited, random coincidences of evolution-independent features (homoplasies) are much more common in scripts than in biological species, so phenetic modelling based solely on morphological features can lead to erroneous results. For this reason, phenetic modelling has been extended with evolutionary considerations, allowing the uncertainties that the many homoplasies introduce into models of script evolution to be addressed. The book describes this extended phenetic method developed to investigate script evolution; the data-driven approach helps to reduce the impact of the uncertainties inherent in the phenetic model. The elaborated phenetic and evolutionary analyses were applied to the Rovash scripts used on the Eurasian Steppe (grassland), including the Turkic Rovash (Turkic Runic/runiform) and the SzĂ©kely-Hungarian Rovash. The evaluation of the extended phenetic model of the scripts, the various phenograms, the script spectra and the group spectra helped to reconstruct the main ancestors and evolutionary stages of the investigated scripts.
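
    As a toy illustration of the phenetic (morphology-based) classification step, the sketch below clusters scripts described by binary graphic features using average-linkage (UPGMA-style) hierarchical clustering over Jaccard distances. The scripts, features and distance choice are invented for illustration and do not reflect the book's data or its extended evolutionary method.

        import numpy as np
        from scipy.cluster.hierarchy import dendrogram, linkage
        from scipy.spatial.distance import pdist

        scripts = ["Script A", "Script B", "Script C", "Script D"]
        # Rows: scripts; columns: presence/absence of shared graphic features.
        features = np.array([
            [1, 1, 0, 1, 0],
            [1, 1, 0, 0, 0],
            [0, 1, 1, 0, 1],
            [0, 0, 1, 0, 1],
        ], dtype=bool)

        distances = pdist(features, metric="jaccard")  # morphological dissimilarity
        tree = linkage(distances, method="average")    # UPGMA-style phenogram
        print(dendrogram(tree, labels=scripts, no_plot=True)["ivl"])  # leaf order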

    Ideas behind symbols – languages behind scripts

    Proceedings of the 60th Meeting of the Permanent International Altaistic Conference (PIAC), August 27 – September 1, 2017, SzĂ©kesfehĂ©rvĂĄr, Hungary, Vol. 52 (2018), printed in 2019. ISBN: 9789633066638 (printed); ISBN: 9789633066645 (pdf).

    Information-theoretic causal inference of lexical flow

    This volume seeks to infer large phylogenetic networks from phonetically encoded lexical data and contribute in this way to the historical study of language varieties. The technical step that enables progress in this case is the use of causal inference algorithms. Sample sets of words from language varieties are preprocessed into automatically inferred cognate sets, and then modeled as information-theoretic variables based on an intuitive measure of cognate overlap. Causal inference is then applied to these variables in order to determine the existence and direction of influence among the varieties. The directed arcs in the resulting graph structures can be interpreted as reflecting the existence and directionality of lexical flow, a unified model which subsumes inheritance and borrowing as the two main ways of transmission that shape the basic lexicon of languages. A flow-based separation criterion and domain-specific directionality detection criteria are developed to make existing causal inference algorithms more robust against imperfect cognacy data, giving rise to two new algorithms. The Phylogenetic Lexical Flow Inference (PLFI) algorithm requires lexical features of proto-languages to be reconstructed in advance, but yields fully general phylogenetic networks, whereas the more complex Contact Lexical Flow Inference (CLFI) algorithm treats proto-languages as hidden common causes and only returns hypotheses of historical contact situations between attested languages. The algorithms are evaluated both against a large lexical database of Northern Eurasia spanning many language families and against simulated data generated by a new model of language contact that builds on the opening and closing of directional contact channels as primary evolutionary events. The algorithms are found to infer the existence of contacts very reliably, whereas the inference of directionality remains difficult. This currently limits the new algorithms to a role as exploratory tools for quickly detecting salient patterns in large lexical datasets, but it should soon be possible to enhance the framework, e.g. with confidence values for each directionality decision.
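
    As an illustration of the cognate-overlap measure on which the information-theoretic variables are built, the sketch below computes, for a pair of language varieties, the fraction of shared concepts assigned to the same automatically inferred cognate class. The exact formulation in the volume (weighting, handling of synonyms and missing entries) may differ, and the language data here is only a toy example with arbitrary class IDs.

        def cognate_overlap(lang_a, lang_b):
            # lang_a, lang_b: dicts mapping concept -> inferred cognate-class ID.
            shared = set(lang_a) & set(lang_b)
            if not shared:
                return 0.0
            same = sum(1 for c in shared if lang_a[c] == lang_b[c])
            return same / len(shared)

        # Toy data: three varieties over four concepts (class IDs are arbitrary).
        finnish   = {"water": 1, "hand": 4, "fish": 7, "stone": 9}
        estonian  = {"water": 1, "hand": 4, "fish": 7, "stone": 10}
        hungarian = {"water": 2, "hand": 4, "fish": 8, "stone": 11}

        print(cognate_overlap(finnish, estonian))    # 0.75
        print(cognate_overlap(finnish, hungarian))   # 0.25

    In the causal inference step, it is (conditional) dependence relations among such overlap-based variables that drive the skeleton construction and the directionality decisions.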

    An introduction to Turkology
