12 research outputs found

    Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell

    We introduce the first treebank for a romanized user-generated content variety of Algerian, a North-African Arabic dialect known for its frequent use of code-switching. Made of 1500 sentences, fully annotated in morpho-syntax and Universal Dependency syntax, with full translation at both the word and the sentence levels, this treebank is made freely available. It is supplemented with 50k unlabeled sentences collected from Common Crawl and web-crawled data using intensive data-mining techniques. Preliminary experiments demonstrate its usefulness for POS tagging and dependency parsing. We believe that what we present in this paper is useful beyond the low-resource language community. This is the first time that enough unlabeled and annotated data is provided for an emerging user-generated content dialectal language with rich morphology and code-switching, making it a challenging test-bed for the most recent NLP approaches

    Theory and Applications for Advanced Text Mining

    Due to the growth of computer and web technologies, we can easily collect and store large amounts of text data, and we can expect this data to contain useful knowledge. Text mining techniques have been studied intensively since the late 1990s in order to extract that knowledge. Although many important techniques have been developed, the text mining research field continues to expand to meet the needs arising from various application fields. This book is composed of 9 chapters introducing advanced text mining techniques, ranging from relation extraction to techniques for under-resourced languages. I believe that this book will contribute new knowledge to the text mining field and help many readers open up new research fields

    NUWT: Jawi-specific Buckwalter corpus for Malays word tokenization

    This paper describes the design and creation of a monolingual parallel corpus for the Malay language written in Jawi. It proposes a new corpus called the National University of Malaysia Word Tokenization (NUWT) corpora. To the best of our knowledge, there is currently no sufficiently comprehensive, well-designed standard corpus for the Jawi script that is annotated and publicly available. This corpus contains the Jawi-specific Buckwalter character code and can be used to evaluate the performance of word tokenization tasks, as well as further language processing. The objective of this work is to conform and standardize the corpora between similar characters in Jawi. It consists of three subcorpora with documents from different genres. The gathering and processing steps, as well as the definition of several evaluation tasks regarding the use of these corpora, are included in this paper. Tokenization, one of the fundamental tasks the corpus supports, is also presented. The development of the Malay language tokenizer is based on the syntactic data compatibility of Malay words written in Jawi. A series of experiments was performed to validate the corpus and to fulfill the requirement of the Jawi script tokenizer, with an average error rate of 0.020255. Based on this promising result, the tokens will be used for disambiguation and unknown-word resolution, such as the out-of-vocabulary (OOV) problem in the tagging process

    The European Language Resources and Technologies Forum: Shaping the Future of the Multilingual Digital Europe

    Proceedings of the 1st FLaReNet Forum on the European Language Resources and Technologies, held in Vienna, at the Austrian Academy of Sciences, on 12-13 February 2009

    Endangered Languages and Languages in Danger

    This peer-reviewed collection brings together the latest research on language endangerment and language rights. It creates a vibrant, interdisciplinary platform for the discussion of the most pertinent and urgent topics central to vitality and equality of languages in today’s globalised world. The novelty of the volume lies in the multifaceted view on the variety of dangers that languages face today, such as extinction through dwindling speaker populations and lack of adequate preservation policies or inequality in different social contexts (e.g. access to justice, education and research resources). There are examples of both loss and survival, and discussion of multiple factors that condition these two different outcomes. We pose and answer difficult questions such as whether forced interventions in preventing loss are always warranted or indeed viable. The emerging shared perspective is that of hope to inspire action towards improving the position of different languages and their speakers through research of this kind


    Claiming and Making Muslim Worlds: Religion and Society in the Context of the Global

    To what extent can Islam be localized in an increasingly interconnected world? The contributions to this volume investigate different facets of Muslim lives in the context of increasingly dense transregional connections, highlighting how the circulation of ideas about ‘Muslimness’ contributed to the shaping of specific ideas about what constitutes Islam and its role in society and politics. Infrastructural changes have intensified scholarly and trade networks, prompted the circulation of new literary genres, and shaped stereotypical images of Muslims. This, in turn, had consequences in widely differing fields such as the self-representation and governance of Muslims. The contributions in this volume explore this issue in geographical contexts ranging from South Asia to Europe and the US. Coming from the disciplines of history, anthropology, religious studies, literary studies and political science, the authors collectively demonstrate the need to combine a translocal perspective with very specific local and historical constellations. The book complicates conventional academic divisions and invites readers to think in historically specific translocal contexts