
    A Combined Approach towards Measuring Linguistic Distance: A Study on South Ethiosemitic Languages

    The distance among closely related languages is usually measured along three dimensions: structural, functional and perceptual. The structural distance is determined by directly quantifying the phonetic, lexical, morphological and syntactic differences among the languages. The functional distance is measured based on the actual usage of the languages, e.g., mutual intelligibility and inter-lingual comprehensibility. The perceptual distance is related to the subjective judgment of the speakers about the similarity or intelligibility between their native language and the neighboring related languages. Studies on language variation measure linguistic distance along at least one of these dimensions. However, as Gooskens (2018) and Tang and Heuven (2009) noted, languages do not differ in just one dimension; they can be, for example, phonetically similar but syntactically different. The present study therefore combined these three perspectives to examine the distance among ten purposely selected South Ethiosemitic languages (Chaha, Endegagn, Ezha, Gumer, Gura, Inor, Kistane, Mesqan, Muher and Silt'e). The study aims to (1) determine the areal classification of the languages; (2) illustrate the similarity or difference between the areal classification of the languages and the previous classification by historical linguists; (3) determine the degree of mutual intelligibility among the languages; (4) examine the relationship among the three dimensions of linguistic distance; and (5) explore the major determinants (linguistic and non-linguistic) that contribute to the linguistic distance among the languages. The structural distance was determined by computing the lexical and phonetic differences based on 240 randomly selected words. The lexical distance was defined as the average of pairs of non-cognates in the basic vocabularies. The Levenshtein algorithm (Heeringa, 2004; Kessler, 1995) was used to compute the phonetic distance.
The phonetic distance was defined as the number of operations required to transform one sequence of phones into another. A Semantic Word Categorization test was adapted from Tang and Heuven (2009) to measure the functional distance. A self-rating test, based on recordings of ‘the North Wind and the Sun’, was administered to determine the perceptual distance among the languages. With regard to the linguistic determinants, the degree of diffusion of the phonetic and lexical features was estimated using Neighbor-net network representation and lexicostatistical skewing. The study also examined the influence of four non-linguistic determinants: geographical distance, population size, the degree of contact among the speakers and language attitude. Gabmap was used for clustering and cluster validation. Multidimensional scaling and fuzzy clustering were employed for the cluster validation. The classifications obtained from each of the distance matrices were compared to the previous classifications (by historical linguists) based on the cophenetic distance among various sub-groupings. The results of the cluster analysis show that the ten selected South Ethiosemitic language varieties can reasonably be grouped into five clusters: {Chaha, Ezha, Gumer, Gura}, {Mesqan, Muher}, {Endegagn, Inor}, {Kistane} and {Silt'e}. This classification is very similar to the classifications previously proposed by historical linguists (e.g. Hetzron, 1972, 1977). There is also a very strong correlation among the measures of the three dimensions of distance. However, these measures have different degrees of reliability: the structural distance is the most reliable measure, while the perceptual distance is the least reliable. Furthermore, the Word Categorization test results show that many of these languages are mutually intelligible. Silt'e, however, is not mutually intelligible with any of the other languages investigated in the present study.
The results obtained from the analysis of the linguistic determinants show that the similarity among the language varieties is mainly the result of contact among the languages. Moreover, the results of the analysis of the non-linguistic variables indicate a strong positive correlation between geographical distance and linguistic distance, and a positive contribution of contact among the speakers. Nevertheless, there is no significant correlation between linguistic distance and population size. Finally, of the three dimensions of linguistic distance, the perceptual distance is the one most affected by the attitude of the speakers.
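
The abstract above says that phonetic distance was computed with the Levenshtein algorithm over sequences of phones. As a rough illustration only, the sketch below computes a plain edit distance with unit costs and divides it by the length of the longer sequence. Both the unit costs and this normalization are assumptions made here for simplicity; the study's actual settings (for example, those used in Gabmap) are not given in the abstract. The phone sequences in the usage lines are invented placeholders, not words from any of the ten languages.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty sequence
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1  # unit substitution cost (assumption)
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Hypothetical phone sequences, used only to show the call shape.
form_a = ["b", "e", "t"]
form_b = ["b", "i", "t"]
print(levenshtein(form_a, form_b))
print(normalized_distance(form_a, form_b))
```

Averaging such normalized distances over all word pairs for two varieties would give one cell of a distance matrix of the kind the abstract describes feeding into clustering.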

    Hybrid language models for speech transcription

    This paper analyzes the use of hybrid language models for automatic speech transcription. The goal is to later use such an approach to support communication with deaf people, and to run it on an embedded decoder on a portable device, which introduces constraints on the model size. The main linguistic units considered for this task are words and syllables. Various lexicon sizes are studied by setting thresholds on the word occurrence frequencies in the training data, with the less frequent words being syllabified. A recognizer using this kind of language model can output between 62% and 96% of words, depending on the thresholds on the word occurrence frequencies; the other recognized lexical units are syllables. By setting different thresholds on the confidence measures associated with the recognized words, the most reliable word hypotheses can be identified, and they have correct recognition rates between 70% and 92%.
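
The frequency-threshold idea in the abstract above (frequent words kept whole, rarer words broken into syllables) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the `min_count` parameter and the toy training text are all hypothetical, and `naive_syllabify` is a fixed-size chunking placeholder rather than a real syllabifier.

```python
from collections import Counter


def build_lexicon(training_tokens, min_count):
    """Keep words seen at least min_count times in the training data."""
    counts = Counter(training_tokens)
    return {word for word, count in counts.items() if count >= min_count}


def naive_syllabify(word, size=2):
    """Placeholder: fixed-size character chunks, NOT a linguistic syllabifier."""
    return [word[i:i + size] for i in range(0, len(word), size)]


def to_hybrid_units(tokens, lexicon, syllabify=naive_syllabify):
    """Map tokens to tagged units: in-lexicon words stay whole, others are split."""
    units = []
    for token in tokens:
        if token in lexicon:
            units.append(("word", token))
        else:
            units.extend(("syl", part) for part in syllabify(token))
    return units


def word_share(units):
    """Fraction of units that are whole words rather than syllables."""
    if not units:
        return 0.0
    return sum(kind == "word" for kind, _ in units) / len(units)


# Toy training text (invented) to show how the threshold changes the unit mix.
training = "the cat sat on the mat the cat".split()
lexicon = build_lexicon(training, min_count=2)
units = to_hybrid_units(["the", "mat"], lexicon)
print(units)
print(word_share(units))
```

Raising `min_count` shrinks the lexicon and pushes more tokens into syllable units, which is the trade-off between model size and word coverage that the abstract describes.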