
    Efficient Convolutional Neural Networks for Diacritic Restoration

    Diacritic restoration has gained importance with the growing need for machines to understand written text. The task is typically modeled as a sequence labeling problem, and Bidirectional Long Short-Term Memory (BiLSTM) models currently provide state-of-the-art results. Recently, Bai et al. (2018) showed the advantages of Temporal Convolutional Networks (TCN) over Recurrent Neural Networks (RNN) for sequence modeling in terms of performance and computational resources. As diacritic restoration benefits from both previous and subsequent timesteps, we further apply and evaluate a variant of TCN, the Acausal TCN (A-TCN), which incorporates context from both directions (past and future) rather than strictly the previous context, as in TCN. A-TCN yields significant improvements over TCN for diacritization in three different languages: Arabic, Yoruba, and Vietnamese. Furthermore, A-TCN and BiLSTM have comparable performance, making A-TCN an efficient alternative to BiLSTM since convolutions can be trained in parallel. A-TCN is significantly faster than BiLSTM at inference time (a 270%-334% improvement in the amount of text diacritized per minute). Comment: accepted in EMNLP 201
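    The causal/acausal distinction the abstract describes comes down to how a 1-D convolution is padded. A minimal sketch (hypothetical toy code, not the paper's implementation): a causal layer with kernel size k pads k-1 zeros on the left only, so each output sees only past timesteps, while an acausal layer splits the padding across both sides so each output also sees future context.

```python
# Toy 1-D convolution contrasting causal (TCN-style) and acausal
# (A-TCN-style) padding. Illustrative only; real TCNs stack dilated
# layers with residual connections.

def conv1d(seq, kernel, left_pad, right_pad):
    """Plain 1-D convolution with explicit zero padding."""
    padded = [0.0] * left_pad + list(seq) + [0.0] * right_pad
    k = len(kernel)
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(padded) - k + 1)]

def causal(seq, kernel):
    # Output at position t depends only on inputs at positions <= t.
    return conv1d(seq, kernel, left_pad=len(kernel) - 1, right_pad=0)

def acausal(seq, kernel):
    # Output at position t also sees up to k//2 future inputs.
    k = len(kernel)
    return conv1d(seq, kernel, left_pad=(k - 1) // 2, right_pad=k // 2)

seq = [1.0, 2.0, 3.0, 4.0]
kernel = [1.0, 1.0, 1.0]      # sliding sum, for illustration
print(causal(seq, kernel))    # each output sums current + two past values
print(acausal(seq, kernel))   # each output sums past + current + future
```

    Both variants produce one output per input position, which is what makes the acausal form a drop-in replacement for BiLSTM-style sequence labeling.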

    Spell-checking in Spanish: the case of diacritic accents

    This article presents the problem of diacritic restoration (or diacritization) in the context of spell-checking, with a focus on an orthographically rich language such as Spanish. We argue that, despite the large volume of work published on the topic of diacritization, currently available spell-checking tools have still not found a proper solution to the problem in cases where both forms of a word are listed in the checker’s dictionary. This is the case, for instance, when a word form exists with and without diacritics, such as continuo ‘continuous’ and continuó ‘he/she/it continued’, or when different diacritics make other word distinctions, as in continúo ‘I continue’. We propose a very simple solution based on a word bigram model derived from correctly typed Spanish texts and evaluate the ability of this model to restore diacritics in artificial as well as real errors. The case of diacritics is only meant as an example of the possible applications of this idea; we believe that the same method could be applied to other kinds of orthographic or even grammatical errors. Moreover, given that no explicit linguistic knowledge is required, the proposed model can be used with other languages provided that a large normative corpus is available.
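    The bigram idea above can be sketched in a few lines. This is a hypothetical illustration, not the article's code; the bigram counts and context words are invented for the continuo/continuó/continúo example.

```python
# Given a typed word with several diacritic variants in the dictionary,
# choose the variant whose bigram with the preceding word is most
# frequent in a corpus of correctly typed text. Counts are invented.
from collections import Counter

bigram_counts = Counter({
    ("proceso", "continuo"): 12,   # 'continuous process'
    ("luego", "continuó"): 30,     # 'then he/she/it continued'
    ("yo", "continúo"): 8,         # 'I continue'
})

def restore(prev_word, candidates):
    """Return the candidate with the highest bigram count after prev_word."""
    return max(candidates, key=lambda w: bigram_counts[(prev_word, w)])

variants = ["continuo", "continuó", "continúo"]
print(restore("luego", variants))    # -> continuó
print(restore("proceso", variants))  # -> continuo
```

    Because `Counter` returns 0 for unseen bigrams, unseen contexts fall back to an arbitrary candidate; a real checker would smooth the counts or back off to unigram frequencies.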

    Corpus-Based Approaches to Igbo Diacritic Restoration

    With natural language processing (NLP), researchers aim to get computers to identify and understand the patterns in human languages. This is often difficult because a language embeds many dynamic and varied properties in its syntax, pragmatics and phonology, which need to be captured and processed. The capacity of computers to process natural languages is increasing because NLP researchers keep pushing its boundaries, but this research focuses mostly on well-resourced languages such as English, Japanese, German, French, Russian and Mandarin Chinese. Over 95% of the world’s 7000 languages are low-resourced for NLP, i.e. they have little or no data, tools, or techniques for NLP work. In this thesis, we present an overview of diacritic ambiguity and a review of previous diacritic disambiguation approaches for other languages. Focusing on the Igbo language, we report the steps taken to develop a flexible framework for generating datasets for diacritic restoration. Three main approaches are proposed: standard n-gram models, classification models and embedding models. The standard n-gram models use the sequence of words preceding the target stripped word as key predictors of the correct variant. The classification models use a window of words on both sides of the target stripped word. The embedding models compare the similarity scores of the combined context word embeddings against the embedding of each candidate variant. The processes and techniques involved in projecting embeddings from a model trained on English texts into an Igbo embedding space, and the creation of intrinsic evaluation tasks to validate the models, are also discussed. A comparative analysis of the results indicates that all the approaches significantly improve on the baseline performance of the unigram model. The details of the processes involved in building the models, as well as possible directions for future work, are also discussed in this work.
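    The embedding approach described above can be sketched as follows. This is a toy illustration with invented 3-dimensional vectors and placeholder variant names, not the thesis's trained Igbo embeddings: average the embeddings of the context words, then pick the diacritic variant whose vector is most similar (by cosine) to that context vector.

```python
# Toy embedding-based variant selection. Vectors and the two variant
# labels (standing in for diacritic variants of an ambiguous stripped
# word) are invented for demonstration.
import math

emb = {
    "akwa_cloth": [0.9, 0.1, 0.0],
    "akwa_cry":   [0.0, 0.9, 0.2],
    "sewing":     [0.8, 0.2, 0.1],
    "tears":      [0.1, 0.8, 0.3],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def context_vector(words):
    """Mean of the context word embeddings."""
    dims = len(next(iter(emb.values())))
    return [sum(emb[w][d] for w in words) / len(words) for d in range(dims)]

def pick_variant(context_words, variants):
    ctx = context_vector(context_words)
    return max(variants, key=lambda v: cosine(emb[v], ctx))

print(pick_variant(["sewing"], ["akwa_cloth", "akwa_cry"]))  # -> akwa_cloth
print(pick_variant(["tears"],  ["akwa_cloth", "akwa_cry"]))  # -> akwa_cry
```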

    Hybrid model of post-processing techniques for Arabic optical character recognition

    Optical character recognition (OCR) is used to extract the text contained in an image. One of the stages in OCR is post-processing, which corrects errors in the OCR output text. The OCR multiple-outputs approach consists of three processes: differentiation, alignment, and voting. Existing differentiation techniques suffer from the loss of important features as they use N versions of the input image. On the other hand, alignment techniques in the literature are based on approximation, while the voting process is not context-aware. These drawbacks lead to a high error rate in OCR. This research proposed three improved techniques for differentiation, alignment, and voting to overcome the identified drawbacks. These techniques were then combined into a hybrid model that can recognize optical characters in the Arabic language. Each of the proposed techniques was separately evaluated against three other relevant existing techniques. The performance measures used in this study were Word Error Rate (WER), Character Error Rate (CER), and Non-word Error Rate (NWER). Experimental results showed a relative decrease in error rate on all measures for the evaluated techniques. Similarly, the hybrid model obtained lower WER, CER, and NWER by 30.35%, 52.42%, and 47.86% respectively when compared to the three relevant existing models. This study contributes to the OCR domain as the proposed hybrid model of post-processing techniques could facilitate the automatic recognition of Arabic text and hence lead to better information retrieval.
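    The voting stage in a multiple-outputs pipeline can be sketched minimally. This is hypothetical illustration code, not the thesis's model: given several OCR readings that have already been aligned to equal length, take a character-level majority vote at each position.

```python
# Character-level majority voting over pre-aligned OCR outputs.
# A context-aware voter (as the thesis proposes) would additionally
# weight candidates by language-model evidence; this sketch uses
# plain frequency only.
from collections import Counter

def majority_vote(aligned_outputs):
    """Pick the most common character at each aligned position."""
    voted = []
    for chars in zip(*aligned_outputs):
        voted.append(Counter(chars).most_common(1)[0][0])
    return "".join(voted)

outputs = ["kitab", "kitah", "kjtab"]  # three noisy readings of 'kitab'
print(majority_vote(outputs))          # -> kitab
```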

    The field of ancient Cham art in France: a 20th century creation: a study of museological and colonial contexts from the late 19th century to the present

    This thesis takes a new look at the art of ancient Champa. Breaking away from traditional studies, it looks at the art not in its ancient Cham context, but rather through its present and recent past contexts. The study asks “What exactly is Cham art?” To answer this, I examine not only the artworks, but also the museums and exhibitions, the display and classification. After an introduction explaining the background to the research, Chapter 2 contrasts two statues of Ganesh in French museums, tracing their biographies and questioning what constitutes Cham art. In Chapter 3, I examine the architectural line-drawings of Henri Parmentier, which have represented Ancient Champa visually for over a century, revealing the complex temporality within which they mediate between the present and multiple pasts. Chapter 4 looks at the history of the Danang Cham Sculpture Museum through the choices and decisions of the men who have shaped Cham art into what it is today. In Chapter 5 I investigate how Cham art was displayed in a series of exhibitions in museums and a department store basement in the United States, Paris and Brussels, while Chapter 6 is a study of a major Cham exhibition at the Musée Guimet, examining its narrative threads and historical and colonial interconnections and its implications for Cham art history. I conclude that Cham art is much more than just the physical traces of the Cham past. It is the preserving, displacing, labelling, copying, interpreting and displaying of the art that makes it what it is just as much as its original functions. I suggest, therefore, that the field of Cham art studies as we understand and view it today is actually something of our own invention, a largely 20th century construct. We do not yet know, therefore, what the Ancient Cham art of the future will be

    Coming to Terms with Legacies of the Vietnam War

    This report is the result of a symposium convened by the University of Dayton Human Rights Center in October 2020. For their contributions to that symposium we thank the following speakers: Allison Varzally, Selika Ducksworth-Lawton, Yen Le Espiritu, Tom Grace, David Cortright, Cynthia Enloe, David Kieran, Patrick Hagopian, Scott Laderman, Andrew Bacevich, Chuck Searcy, Dang Quang Toan, Colleen Murphy, Katherine Gallagher, John Goines III, Ben Schrader, Susan Hammond, Bich-Ngoc Turner, and Tim Rieser. Heather Bowser, Đạt Duthịnh, Garett Reppenhagen, and Mike Boehm enriched the symposium by discussing experiences of advocacy around war legacies; we are particularly thankful for the chance to share their stories here.

    The Future of Information Sciences: INFuture2009: Digital Resources and Knowledge Sharing
