A Combined Approach towards Measuring Linguistic Distance: A Study on South Ethiosemitic Languages
The distance among closely related languages is usually measured along three dimensions: structural, functional and perceptual. Structural distance is determined by directly quantifying the phonetic, lexical, morphological and syntactic differences among the languages. Functional distance is measured based on the actual usage of the languages, e.g., mutual intelligibility and inter-lingual comprehensibility. Perceptual distance is related to the speakers' subjective judgments about the similarity or intelligibility between their native language and neighboring related languages. Studies on language variation measure linguistic distance along at least one of these dimensions. However, as Gooskens (2018) and Tang and Heuven (2009) noted, languages do not differ in just one dimension; they can be, for example, phonetically similar but syntactically different. The present study therefore combined these three perspectives to examine the distance among ten purposively selected South Ethiosemitic languages (Chaha, Endegagn, Ezha, Gumer, Gura, Inor, Kistane, Mesqan, Muher and Silt'e). The study aims to (1) determine the areal classification of the languages; (2) illustrate the similarity or difference between this areal classification and previous classifications by historical linguists; (3) determine the degree of mutual intelligibility among the languages; (4) examine the relationship among the three dimensions of linguistic distance; and (5) explore the major determinants (linguistic and non-linguistic) that contribute to the linguistic distance among the languages. The structural distance was determined by computing lexical and phonetic differences over 240 randomly selected words. The lexical distance was defined as the average proportion of non-cognate pairs in the basic vocabularies. The Levenshtein algorithm (Heeringa, 2004; Kessler, 1995) was used to compute the phonetic distance, defined as the minimum number of edit operations required to transform one sequence of phones into another. A Semantic Word Categorization test was adapted from Tang and Heuven (2009) to measure the functional distance. A self-rating test, based on recordings of 'the North Wind and the Sun', was administered to determine the perceptual distance among the languages. With regard to the linguistic determinants, the degree of diffusion of phonetic and lexical features was estimated using a Neighbor-net network representation and lexicostatistical skewing. The study also examined the influence of four non-linguistic determinants: geographical distance, population size, the degree of contact among the speakers and language attitude. Gabmap was used for clustering and cluster validation; multidimensional scaling and fuzzy clustering were employed for cluster validation. The classifications obtained from each of the distance matrices were compared to previous classifications by historical linguists based on the cophenetic distance among the various sub-groupings. The results of the cluster analysis show that the ten selected South Ethiosemitic language varieties can be grouped fairly well into five clusters: {Chaha, Ezha, Gumer, Gura}, {Mesqan, Muher}, {Endegagn, Inor}, {Kistane} and {Silt'e}. This classification is very similar to the classifications previously proposed by historical linguists (e.g., Hetzron, 1972, 1977). There is also a very strong correlation among the measures of the three dimensions of distance.
However, these measures have different degrees of reliability: the structural distance is the most reliable measure, while the perceptual distance is the least reliable. Furthermore, the Word Categorization test results show that many of these languages are mutually intelligible; Silt'e, however, is not mutually intelligible with any of the languages investigated in the present study. The results obtained from the analysis of the linguistic determinants show that the similarity among the language varieties is mainly the result of contact among the languages. Moreover, the results of the analysis of the non-linguistic variables indicate a strong positive correlation between geographical distance and linguistic distance, and a positive contribution of contact among the speakers. Nevertheless, there is no significant correlation between linguistic distance and population size. Finally, among the three dimensions of linguistic distance, the perceptual distance is the one most affected by the attitudes of the speakers.
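As a rough illustration of the structural measure described in this abstract, the sketch below computes a length-normalized Levenshtein distance between phone sequences, averages it over a word list to obtain pairwise distances between varieties, and then clusters the varieties and reports the cophenetic correlation. This is a minimal Python sketch, not the Gabmap pipeline used in the study; the variety names, transcriptions and word lists are invented placeholders.

```python
# Minimal sketch of the structural distance measure (illustrative only):
# length-normalized Levenshtein distance over phone sequences, averaged
# over a word list, followed by hierarchical clustering and a cophenetic
# check. Transcriptions below are placeholders, not real data.
from itertools import combinations

import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import squareform


def levenshtein(a, b):
    """Edit distance between two phone sequences (lists of phone symbols)."""
    dp = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    dp[:, 0] = np.arange(len(a) + 1)
    dp[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i, j] = min(dp[i - 1, j] + 1,         # deletion
                           dp[i, j - 1] + 1,         # insertion
                           dp[i - 1, j - 1] + cost)  # substitution
    return dp[len(a), len(b)]


def phonetic_distance(words_a, words_b):
    """Average length-normalized edit distance over aligned word pairs."""
    dists = [levenshtein(a, b) / max(len(a), len(b))
             for a, b in zip(words_a, words_b)]
    return float(np.mean(dists))


# Toy data: phone transcriptions of the same concepts in three varieties.
lexicon = {
    "VarietyA": [["b", "e", "t"], ["s", "a", "b"]],
    "VarietyB": [["b", "e", "t"], ["s", "o", "b"]],
    "VarietyC": [["g", "o", "dZ"], ["s", "a", "b", "a"]],
}

varieties = sorted(lexicon)
dist = np.zeros((len(varieties), len(varieties)))
for (i, va), (j, vb) in combinations(enumerate(varieties), 2):
    dist[i, j] = dist[j, i] = phonetic_distance(lexicon[va], lexicon[vb])

# Cluster the varieties and report the cophenetic correlation, i.e. how
# faithfully the dendrogram preserves the original pairwise distances.
Z = linkage(squareform(dist), method="average")
coph_corr, _ = cophenet(Z, squareform(dist))
print("Phonetic distance matrix:\n", np.round(dist, 3))
print("Cophenetic correlation:", round(coph_corr, 3))
```

The lexical distance described in the abstract would be computed analogously, but as the share of non-cognate word pairs rather than an edit distance.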
Hybrid language models for speech transcription
This paper analyzes the use of hybrid language models for automatic speech transcription. The goal is to later use such an approach to support communication with deaf people, and to run it on an embedded decoder on a portable device, which introduces constraints on the model size. The main linguistic units considered for this task are words and syllables. Various lexicon sizes are studied by setting thresholds on the word occurrence frequencies in the training data, the less frequent words then being syllabified. A recognizer using this kind of language model can output between 62% and 96% of words, depending on the thresholds on the word occurrence frequencies; the other recognized lexical units are syllables. By setting different thresholds on the confidence measures associated with the recognized words, the most reliable word hypotheses can be identified; they have correct recognition rates between 70% and 92%.
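The construction of such a hybrid word/syllable unit inventory can be sketched roughly as follows (this is not the paper's actual implementation): words whose training-corpus frequency reaches a threshold are kept as whole-word units, and the remaining words are replaced by syllable units. The `syllabify` function and the toy corpus below are placeholders.

```python
# Rough sketch of building a hybrid word/syllable unit inventory:
# frequent words stay whole-word units, rare words are syllabified.
# `syllabify` is a placeholder for a real (language-specific) syllabifier.
from collections import Counter


def syllabify(word):
    """Placeholder syllabifier; a real system would use phonological rules."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


def hybrid_units(sentences, min_count=5):
    """Map each training sentence to hybrid word/syllable units."""
    counts = Counter(w for s in sentences for w in s.split())
    keep = {w for w, c in counts.items() if c >= min_count}
    hybrid_corpus = []
    for s in sentences:
        units = []
        for w in s.split():
            if w in keep:
                units.append(w)
            else:
                # Mark syllable units so they stay distinguishable from words.
                units.extend("@" + syl for syl in syllabify(w))
        hybrid_corpus.append(units)
    return hybrid_corpus, keep


# Lowering min_count keeps more whole words in the lexicon, which mirrors
# the trade-off between model size and word coverage studied in the paper.
corpus = ["the cat sat", "the cat slept", "the dog slept"] * 3
hybrid, lexicon = hybrid_units(corpus, min_count=4)
print(sorted(lexicon))   # -> ['cat', 'slept', 'the']
print(hybrid[0])         # -> ['the', 'cat', '@sat']
```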
Text-to-Speech Synthesis Using Found Data for Low-Resource Languages
Text-to-speech synthesis is a key component of interactive, speech-based systems. Typically, building a high-quality voice requires collecting dozens of hours of speech from a single professional speaker in an anechoic chamber with a high-quality microphone. There are about 7,000 languages spoken in the world, and most do not enjoy the speech research attention historically paid to such languages as English, Spanish, Mandarin, and Japanese. Speakers of these so-called "low-resource languages" therefore do not equally benefit from these technological advances. While it takes a great deal of time and resources to collect a traditional text-to-speech corpus for a given language, we may instead be able to make use of various sources of "found" data that may be available. In particular, sources such as radio broadcast news and ASR corpora are available for many languages. While this kind of data does not exactly match what one would collect for a more standard TTS corpus, it may nevertheless contain parts that are usable for producing natural and intelligible parametric TTS voices.
In the first part of this thesis, we examine various types of found speech data in comparison with data collected for TTS, in terms of a variety of acoustic and prosodic features. We find that radio broadcast news in particular is a good match. Audiobooks may also be a good match despite their generally more expressive style, and certain speakers in conversational and read ASR corpora also resemble TTS speakers in their manner of speaking, so their data may be usable for training TTS voices.
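One way to approximate this kind of corpus comparison, assuming the librosa library is available, is to extract a few simple per-utterance acoustic and prosodic statistics and compare their distributions across corpora. The sketch below is illustrative only and does not reproduce the thesis's exact feature set.

```python
# Sketch: per-utterance acoustic/prosodic statistics (F0 and energy) that
# could be used to compare a found-data corpus with a TTS corpus.
# Uses librosa's pYIN pitch tracker; feature choices here are illustrative.
import numpy as np
import librosa


def utterance_stats(path):
    """Return simple F0 and energy statistics for one audio file."""
    y, sr = librosa.load(path, sr=None)
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    voiced_f0 = f0[voiced_flag]
    rms = librosa.feature.rms(y=y)[0]
    return {
        "f0_mean": float(np.nanmean(voiced_f0)) if voiced_f0.size else 0.0,
        "f0_range": float(np.nanmax(voiced_f0) - np.nanmin(voiced_f0))
        if voiced_f0.size else 0.0,
        "energy_mean": float(rms.mean()),
    }


# Comparing corpora then reduces to comparing these per-utterance
# distributions (e.g. means and variances, or a statistical test).
# stats = [utterance_stats(p) for p in corpus_paths]  # corpus_paths: your files
```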
In the rest of the thesis, we conduct a variety of experiments in training voices on non-traditional sources of data, such as ASR data, radio broadcast news, and audiobooks. We aim to discover which methods produce the most intelligible and natural-sounding voices, focusing on three main approaches:
1) Training data subset selection. In noisy, heterogeneous data sources, we may wish to locate subsets of the data that are well-suited for building voices, based on acoustic and prosodic features that are known to correspond with TTS-style speech, while excluding utterances that introduce noise or other artifacts. We find that choosing subsets of speakers for training data can result in voices that are more intelligible (a sketch of this kind of selection follows this list).
2) Augmenting the frontend feature set with new features. In cleaner sources of found data, we may wish to train voices on all of the data, but we may get improvements in naturalness by including acoustic and prosodic features at the frontend and synthesizing in a manner that better matches the TTS style. We find that this approach is promising for creating more natural-sounding voices, regardless of the underlying acoustic model.
3) Adaptation. Another way to make use of high-quality data while also including informative acoustic and prosodic features is to adapt to subsets, rather than to select and train only on subsets. We also experiment with training on mixed high- and low-quality data, and adapting towards the high-quality set, which produces more intelligible voices than training on either type of data by itself.
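As an illustration of approach 1), the sketch below ranks speakers by how close their mean utterance-level features are to a hypothetical TTS-style reference profile and keeps the closest speakers as training data. The feature names, reference values and selection size are invented placeholders, not the criteria used in the thesis.

```python
# Sketch of training-data subset selection: rank speakers by how close
# their mean utterance features are to a TTS-style reference profile and
# keep the closest ones. All numbers below are toy placeholders.
import numpy as np

# Hypothetical per-utterance feature vectors grouped by speaker,
# e.g. [f0_mean, f0_range, speaking_rate].
speaker_feats = {
    "spk01": np.array([[180.0, 60.0, 4.2], [175.0, 55.0, 4.0]]),
    "spk02": np.array([[220.0, 140.0, 6.1], [230.0, 150.0, 6.4]]),
    "spk03": np.array([[185.0, 70.0, 4.5], [190.0, 65.0, 4.3]]),
}

# Reference profile of what "TTS-style" speech might look like for these
# features, with a rough per-feature scale (illustrative values only).
reference = np.array([180.0, 60.0, 4.0])
scale = np.array([50.0, 50.0, 1.0])


def speaker_score(utts):
    """Distance of a speaker's mean feature vector to the reference profile."""
    return float(np.linalg.norm((utts.mean(axis=0) - reference) / scale))


# Keep the n speakers closest to the reference as the TTS training subset.
n_keep = 2
ranked = sorted(speaker_feats, key=lambda s: speaker_score(speaker_feats[s]))
selected = ranked[:n_keep]
print(selected)  # -> ['spk01', 'spk03'] with these toy numbers
```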
We hope that our findings may serve as guidelines for anyone wishing to build their own TTS voice using non-traditional sources of found data.