17 research outputs found

    The Dilemma of the Bishnupriya Identity

    Ethnic identity is a dynamic, multidimensional construct that refers to one's identity, or sense of self, as a member of an ethnic group. The reconstruction of an identity interacts with historical and social identities in the contemporary world. This article discusses the reconstruction of the Bishnupriya identity in Manipur and examines it against that of the Bishnupriyas living outside Manipur.

    North East Indian linguistics 6

    The papers for this volume were initially presented at the sixth and seventh meetings of the North East Indian Linguistics Society, held in Guwahati, India, in 2011 and 2012. As with previous conferences, these meetings were held at the Don Bosco Institute in Guwahati, Assam, and hosted in collaboration with Gauhati University. The present collection of papers is a testament to the ongoing interest in North East India and to the continued success and growth of the community of North East Indian linguists. As in previous volumes, all the papers here were reviewed by leading international specialists in the relevant subfields. This volume, in particular, highlights the recent research of many scholars from the region. Out of eleven contributions, eight are from North East Indian scholars themselves. This book therefore shines a light on the work being done by North East Indian linguists on the languages of their own region. The remaining contributions are authored by international scholars from Australia, Singapore, Germany/USA, and Nepal.

    Word forms are structured for efficient use

    Zipf famously stated that, if natural language lexicons are structured for efficient communication, the words that are used the most frequently should require the least effort. This observation explains the famous finding that the most frequent words in a language tend to be short. A related prediction is that, even within words of the same length, the most frequent word forms should be the ones that are easiest to produce and understand. Using orthographics as a proxy for phonetics, we test this hypothesis using corpora of 96 languages from Wikipedia. We find that, across a variety of languages and language families and controlling for length, the most frequent forms in a language tend to be more orthographically well‐formed and have more orthographic neighbors than less frequent forms. We interpret this result as evidence that lexicons are structured by language usage pressures to facilitate efficient communication. Keywords: Lexicon; Word frequency; Phonology; Communication; Efficiency. Funding: National Science Foundation (Grant ES/N0174041/1).
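The notion of an orthographic neighbor in the abstract above can be illustrated with a small sketch. It assumes the common substitution-only definition (two words of the same length that differ in exactly one letter); the abstract does not state which definition the authors used, and the function name `neighbor_counts` is hypothetical.

```python
from collections import defaultdict


def neighbor_counts(lexicon):
    """Count orthographic neighbors for every word in a lexicon.

    Assumption: a neighbor is a same-length word that differs in exactly
    one position (substitution only). This is an illustrative definition,
    not necessarily the one used in the paper.
    """
    words = set(lexicon)

    # Group words by "masked" form: the word with one position removed,
    # stored as (prefix, suffix). Two words share a bucket exactly when
    # they have the same length and differ only at that position.
    buckets = defaultdict(set)
    for word in words:
        for i in range(len(word)):
            buckets[(word[:i], word[i + 1:])].add(word)

    counts = {}
    for word in words:
        total = 0
        for i in range(len(word)):
            # Subtract 1 so the word does not count itself.
            total += len(buckets[(word[:i], word[i + 1:])]) - 1
        counts[word] = total
    return counts


if __name__ == "__main__":
    toy_lexicon = ["cat", "cot", "cut", "dog", "cats"]
    for word, n in sorted(neighbor_counts(toy_lexicon).items()):
        print(word, n)
```

Bucketing by masked form avoids comparing every pair of words, which matters once the lexicon grows to corpus scale.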

    Design of an Offline Handwriting Recognition System Tested on the Bangla and Korean Scripts

    This dissertation presents a flexible and robust offline handwriting recognition system that is tested on the Bangla and Korean scripts. Offline handwriting recognition is one of the most challenging unsolved problems in machine learning. While a few popular scripts (like Latin) have received a lot of attention, many other widely used scripts (like Bangla) have seen very little progress. Features such as connectedness and vowels structured as diacritics make Bangla a challenging script to recognize. A simple and robust design for offline recognition is presented which not only works reliably, but also can be used for almost any alphabetic writing system. The framework has been rigorously tested on Bangla, and experiments on the Korean script, whose two-dimensional arrangement of characters makes it a challenge to recognize, demonstrate how it can be adapted to other scripts. The base of this design is a character spotting network which detects the location of different script elements (such as characters and diacritics) in an unsegmented word image. A transcript is formed from the detected classes based on their corresponding location information. This is the first reported lexicon-free offline recognition system for Bangla and achieves a Character Recognition Accuracy (CRA) of 94.8%. This is also one of the most flexible architectures ever presented. Recognition of Korean was achieved with a 91.2% CRA. Also, a powerful technique of autonomous tagging was developed which can drastically reduce the effort of preparing a dataset for any script. The combination of the character spotting method and the autonomous tagging brings the entire offline recognition problem very close to a singular solution. Additionally, a database named the Boise State Bangla Handwriting Dataset was developed. This is one of the richest offline datasets currently available for Bangla, and it has been made publicly accessible to accelerate research progress. Many other tools were developed and experiments were conducted to more rigorously validate this framework by evaluating the method against external datasets (CMATERdb 1.1.1, Indic Word Dataset and REID2019: Early Indian Printed Documents). Offline handwriting recognition is an extremely promising technology and the outcome of this research moves the field significantly ahead.
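The step described above, forming a transcript from detected classes and their locations, can be sketched roughly as follows. This is a simplified illustration, not the dissertation's method: it assumes each detection is a hypothetical `(label, x_center)` pair and orders detections left to right, ignoring the diacritic and two-dimensional placement rules a real Bangla or Korean system would need. The accuracy function assumes CRA is computed as one minus the edit distance divided by the reference length, which the abstract does not confirm.

```python
def form_transcript(detections):
    """Join detected labels in left-to-right order.

    Assumption: each detection is a (label, x_center) pair. Real scripts
    need extra rules for diacritics and stacked characters.
    """
    ordered = sorted(detections, key=lambda d: d[1])
    return "".join(label for label, _ in ordered)


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_recognition_accuracy(predicted, reference):
    """Assumed CRA definition: 1 - edit_distance / len(reference), floored at 0."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return max(0.0, 1.0 - edit_distance(predicted, reference) / len(reference))


if __name__ == "__main__":
    # Hypothetical detections, given out of order on purpose.
    spotted = [("t", 70.0), ("c", 10.0), ("a", 40.0)]
    transcript = form_transcript(spotted)
    print(transcript, character_recognition_accuracy(transcript, "cat"))
```

Sorting by a single coordinate is the simplest possible ordering rule; it is used here only to make the idea of location-based transcript formation concrete.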

    Using Comparable Corpora to Augment Statistical Machine Translation Models in Low Resource Settings

    Previously, statistical machine translation (SMT) models have been estimated from parallel corpora, or pairs of translated sentences. In this thesis, we directly incorporate comparable corpora into the estimation of end-to-end SMT models. In contrast to parallel corpora, comparable corpora are pairs of monolingual corpora that have some cross-lingual similarities, for example topic or publication date, but that do not necessarily contain any direct translations. Comparable corpora are more readily available in large quantities than parallel corpora, which require significant human effort to compile. We use comparable corpora to estimate machine translation model parameters and show that doing so improves performance in settings where a limited amount of parallel data is available for training. The major contributions of this thesis are the following:
    * We release ‘language packs’ for 151 human languages, which include bilingual dictionaries, comparable corpora of Wikipedia document pairs, comparable corpora of time-stamped news text that we harvested from the web, and, for non-Roman script languages, dictionaries of name pairs, which are likely to be transliterations.
    * We present a novel technique for using a small number of example word translations to learn a supervised model for bilingual lexicon induction which takes advantage of a wide variety of signals of translation equivalence that can be estimated over comparable corpora.
    * We show that using comparable corpora to induce new translations and estimate new phrase table feature functions improves end-to-end statistical machine translation performance for low resource language pairs as well as domains.
    * We present a novel algorithm for composing multiword phrase translations from multiple unigram translations and then use comparable corpora to prune the large space of hypothesis translations. We show that these induced phrase translations improve machine translation performance beyond that of component unigrams.
    This thesis focuses on critical low resource machine translation settings, where insufficient parallel corpora exist for training statistical models. We experiment with both low resource language pairs and low resource domains of text. We present results from our novel error analysis methodology, which show that most translation errors in low resource settings are due to unseen source language words and phrases and unseen target language translations. We also find room for fixing errors due to how different translations are weighted, or scored, in the models. We target both error types; we use comparable corpora to induce new word and phrase translations and estimate novel translation feature scores. Our experiments show that augmenting baseline SMT systems with new translations and features estimated over comparable corpora improves translation performance significantly. Additionally, our techniques expand the applicability of statistical machine translation to those language pairs for which zero parallel text is available.
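The idea of composing multiword phrase translations from unigram translations and then pruning the hypothesis space can be sketched in a few lines. This is an illustration under loud assumptions, not the thesis algorithm: the function `compose_phrase_translations` and the toy table are hypothetical, candidates keep the source word order, and pruning uses the product of per-word scores as a stand-in for the comparable-corpora signals the thesis describes.

```python
import heapq
from itertools import product


def compose_phrase_translations(source_phrase, unigram_table, top_k=3):
    """Compose phrase translation hypotheses from per-word candidates.

    Assumptions (illustrative only):
      * unigram_table maps a source word to a list of (translation, score).
      * Target words keep the source word order (no reordering).
      * Hypotheses are pruned by the product of unigram scores, a
        placeholder for similarity scores estimated over comparable corpora.
    """
    options = [unigram_table.get(word, []) for word in source_phrase.split()]
    if not options or any(not candidates for candidates in options):
        return []  # an untranslatable word means no phrase hypothesis

    hypotheses = []
    for combo in product(*options):
        score = 1.0
        for _, word_score in combo:
            score *= word_score
        phrase = " ".join(translation for translation, _ in combo)
        hypotheses.append((phrase, score))

    # Keep only the top_k highest-scoring hypotheses.
    return heapq.nlargest(top_k, hypotheses, key=lambda h: h[1])


if __name__ == "__main__":
    # Hypothetical toy dictionary with made-up scores.
    toy_table = {
        "casa": [("house", 0.8), ("home", 0.2)],
        "blanca": [("white", 0.9), ("blank", 0.1)],
    }
    for phrase, score in compose_phrase_translations("casa blanca", toy_table, top_k=2):
        print(phrase, round(score, 2))
```

The Cartesian product grows exponentially with phrase length, which is why some form of pruning, here a simple top-k cut, is needed before the hypotheses are useful.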

    Language and Culture in Northeast India and Beyond: In Honor of Robbins Burling

    This volume celebrates the life and work of Robbins Burling, Emeritus Professor of Anthropology and Linguistics at the University of Michigan, giant in the fields of anthropological linguistics, language evolution, and language pedagogy, and pioneer in the ethnography and linguistics of Tibeto-Burman-speaking groups in the Northeast Indian region. We offer it to Professor Burling – Rob – on the occasion of his 90th birthday, on the occasion of the 60th year of his extraordinary scholarly productivity, and on the occasion of yet another – yet another! – field trip to Northeast India, where his career in anthropology and linguistics effectively began so many decades ago, and where he has amassed so many devoted friends and colleagues – including ourselves. (First paragraph of Editor's Introduction)