379 research outputs found

    On consonant frequency in Egyptian and other languages

    Multi-Agent Simulation of Emergence of Schwa Deletion Pattern in Hindi

    Recently, there has been a revival of interest in multi-agent simulation techniques for exploring the nature of language change. However, a lack of appropriate validation of simulation experiments against real language data often calls into question the general applicability of these methods in modeling realistic language change. We attempt to address this issue by modeling the phenomenon of schwa deletion in Hindi through a multi-agent simulation framework. The pattern of Hindi schwa deletion and its diachronic nature are well studied, not only out of general linguistic inquiry, but also to facilitate Hindi grapheme-to-phoneme conversion, which is a preprocessing step for text-to-speech synthesis. We show that under certain conditions, the schwa deletion pattern observed in modern Hindi emerges in the system from an initial state of no deletion. The simulation framework described in this work can be extended to model other phonological changes as well. Keywords: Language Change, Linguistic Agent, Language Game, Multi-Agent Simulation, Schwa Deletion

    PERCEPTION OF CONSONANT LENGTH OPPOSITION IN HUNGARIAN STOP CONSONANTS

    Phonetic Dictionary for Natural Language Processing: Kannada

    India has 22 officially recognized languages, including Assamese, Bengali, English, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Tamil, Telugu, and Urdu. Clearly, India faces a language diversity problem. In the age of the Internet, the multiplicity of languages makes it even more necessary to have sophisticated systems for natural language processing. In this paper we develop a phonetic dictionary for natural language processing, particularly for Kannada. Phonetics is the scientific study of speech sounds. Acoustic phonetics studies the physical properties of sounds and provides a language to distinguish one sound from another in quality and quantity. Kannada is one of the major Dravidian languages of India. The language uses forty-nine phonemic letters, divided into three groups: Swaragalu (thirteen letters), Yogavaahakagalu (two letters), and Vyanjanagalu (thirty-four letters); the Swaragalu and Vyanjanagalu are similar to the vowels and consonants of English, respectively.

    An Overview of Indian Spoken Language Recognition from Machine Learning Perspective

    Automatic spoken language identification (LID) is a very important research field in the era of multilingual voice-command-based human-computer interaction (HCI). A front-end LID module helps to improve the performance of many speech-based applications in the multilingual scenario. India is a populous country with diverse cultures and languages. The majority of the Indian population needs to use their respective native languages for verbal interaction with machines. Therefore, the development of efficient Indian spoken language recognition systems is useful for adapting smart technologies in every section of Indian society. The field of Indian LID has gained momentum in the last two decades, mainly due to the development of several standard multilingual speech corpora for the Indian languages. Even though significant research progress has already been made in this field, to the best of our knowledge, there have not been many attempts to review this work analytically and collectively. In this work, we make one of the first attempts to present a comprehensive review of the Indian spoken language recognition research field. An in-depth analysis is presented to emphasize the unique challenges of low-resource conditions and mutual influence for developing LID systems in Indian contexts. Several essential aspects of Indian LID research are discussed: a detailed description of the available speech corpora; the major research contributions, from the earlier attempts based on statistical modeling to the recent approaches based on different neural network architectures; and future research trends. This review will help active researchers and research enthusiasts from related fields assess the present state of Indian LID research.

    Production of Bangla stops by native English speakers learning Bangla: An acoustic analysis

    Differences in the phonetic and phonological systems of Bangla and English result in negative transfer in the Bangla stop productions of native English speakers. The phonetic realizations of Voice and Aspiration and their interactions with each other are the key factors in this. A production study was carried out focusing on sixteen of the twenty Bangla stops that are distinguished by a four-way voice/aspiration contrast at four different places of articulation, providing a contrastive acoustic analysis of the pronunciation of L1 and L2 adult speakers. Data containing these stops in an intervocalic environment in word-initial, word-medial, and word-final positions was elicited by digital recording from twelve native Bangla speakers and twelve native English speakers. The data from the L1 speakers was analyzed to investigate production characteristics related to the following acoustic variables: vowel voicing onset time, closure duration, closure voicing, preceding vowel duration, and duration of aspiration noise. The data from the L2 speakers was then analyzed using the same variables. The primary acoustic correlates of Voice and Aspiration in Bangla were found to be closure voicing and vowel voicing onset time, respectively, and the interaction of these two variables made a clear distinction between the four stop classes of Bangla: voiceless unaspirated, voiceless aspirated, voiced unaspirated, and voiced aspirated. Evidence was found supporting the work of various researchers who have suggested that a [breathy voice] feature is not necessary for a phonological description of the Indo-Aryan languages. The stop productions of the native English speakers indicated a conceptual awareness of the four stop classes, but it was also clear that they lacked a native-like control of the Voice and Aspiration features and their specific interactions with each other. 
    The degree to which the L2 productions of the four stop classes differed from those of the L1 speakers was directly correlated with each class’s similarity to English phonological patterns, providing evidence of certain predictable aspects of L1 transfer. In order to fully apply the results of this study in a pronunciation acquisition context, perceptual studies will need to be done to identify the salience of these acoustic variables for both L1 and L2 speakers. Perceptual studies involving L1 speakers may also contribute to the ongoing discussion on the best phonological description of the four-way stop systems of the Indo-Aryan languages.

    AxomiyaBERTa: A Phonologically-aware Transformer Model for Assamese

    Despite their successes in NLP, Transformer-based language models still require extensive computing resources and suffer in low-resource or low-compute settings. In this paper, we present AxomiyaBERTa, a novel BERT model for Assamese, a morphologically-rich low-resource language (LRL) of Eastern India. AxomiyaBERTa is trained only on the masked language modeling (MLM) task, without the typical additional next sentence prediction (NSP) objective, and our results show that in resource-scarce settings for very low-resource languages like Assamese, MLM alone can be successfully leveraged for a range of tasks. AxomiyaBERTa achieves SOTA on token-level tasks like Named Entity Recognition and also performs well on "longer-context" tasks like Cloze-style QA and Wiki Title Prediction, with the assistance of a novel embedding disperser and phonological signals respectively. Moreover, we show that AxomiyaBERTa can leverage phonological signals for even more challenging tasks, such as a novel cross-document coreference task on a translated version of the ECB+ corpus, where we present a new SOTA result for an LRL. Our source code and evaluation scripts may be found at https://github.com/csu-signal/axomiyaberta. Comment: 16 pages, 6 figures, 8 tables, appearing in Findings of the ACL: ACL 2023. This version compiled using pdfLaTeX-compatible Assamese script font. Assamese text may appear differently here than in the official ACL 2023 proceedings.

    Speech Communication

    Contains reports on four research projects. U. S. Air Force Cambridge Research Laboratories under Contract F19628-69-C-0044; National Institutes of Health (Grant 5 RO1 NS 04332-08)

    From linguistic to sociolinguistic reconstruction: the Kamta historical subgroup of Indo-Aryan
