5 research outputs found

    Rapid Resource Transfer for Multilingual Natural Language Processing

    Until recently, the focus of the Natural Language Processing (NLP) community has been on a handful of mostly European languages. However, the rapid changes taking place in the economic and political climate of the world precipitate a similar change in the relative importance given to various languages. The importance of rapidly acquiring NLP resources and computational capabilities in new languages is widely accepted. Statistical NLP models have a distinct advantage over rule-based methods in achieving this goal since they require far less manual labor. However, statistical methods require two fundamental resources for training: (1) online corpora and (2) manual annotations. Creating these two resources can be as difficult as porting rule-based methods. This thesis demonstrates the feasibility of acquiring both corpora and annotations by exploiting existing resources for well-studied languages. Basic resources for new languages can be acquired in a rapid and cost-effective manner by utilizing existing resources cross-lingually. Currently, the most viable method of obtaining online corpora is converting existing printed text into electronic form using Optical Character Recognition (OCR). Unfortunately, a language that lacks online corpora most likely lacks OCR as well. We tackle this problem by taking an existing OCR system that was designed for a specific language and using that system for a language with a similar script. We present a generative OCR model that allows us to post-process output from a non-native OCR system to achieve accuracy close to, or better than, a native one. Furthermore, we show that the performance of a native or trained OCR system can be improved by the same method. Next, we demonstrate cross-utilization of annotations on treebanks. We present an algorithm that projects dependency trees across parallel corpora.
We also show that a reasonable-quality treebank can be generated by combining projection with a small amount of language-specific post-processing. The projected treebank allows us to train a parser that performs comparably to a parser trained on manually generated data.
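The core of projection can be illustrated with a minimal sketch. The function name, the one-to-one alignment dictionary, and the head-array encoding below are illustrative assumptions; the thesis's actual algorithm must also handle many-to-many and missing alignments, which this sketch simply skips.

```python
def project_dependencies(src_heads, alignment):
    """Project source-side dependency heads onto target tokens through a
    one-to-one word alignment (a dict: source index -> target index).

    src_heads[i] is the index of token i's head, or -1 for the root.
    Returns a dict mapping aligned target indices to their projected heads.
    """
    tgt_heads = {}
    for s_dep, s_head in enumerate(src_heads):
        if s_dep not in alignment:
            continue                      # unaligned source token: nothing to project
        t_dep = alignment[s_dep]
        if s_head == -1:
            tgt_heads[t_dep] = -1         # the root projects to a root
        elif s_head in alignment:
            tgt_heads[t_dep] = alignment[s_head]
    return tgt_heads

# "the dog barked" with heads [1, 2, -1] ("the" <- "dog" <- "barked"),
# aligned token-for-token to a target sentence:
heads = project_dependencies([1, 2, -1], {0: 0, 1: 1, 2: 2})  # {0: 1, 1: 2, 2: -1}
```

Language-specific post-processing, as the abstract notes, would then repair the arcs this projection leaves unattached or misdirected.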

    Detecting grammatical errors with treebank-induced, probabilistic parsers

    Today's grammar checkers often use hand-crafted rule systems that define acceptable language. The development of such rule systems is labour-intensive and has to be repeated for each language. At the same time, grammars automatically induced from syntactically annotated corpora (treebanks) are successfully employed in other applications, for example text understanding and machine translation. At first glance, treebank-induced grammars seem to be unsuitable for grammar checking as they massively over-generate and fail to reject ungrammatical input due to their high robustness. We present three new methods for judging the grammaticality of a sentence with probabilistic, treebank-induced grammars, demonstrating that such grammars can be successfully applied to automatically judge the grammaticality of an input string. Our best-performing method exploits the differences between parse results for grammars trained on grammatical and ungrammatical treebanks. The second approach builds an estimator of the probability of the most likely parse using grammatical training data that has previously been parsed and annotated with parse probabilities. If the estimated probability of an input sentence (whose grammaticality is to be judged by the system) is higher by a certain amount than the actual parse probability, the sentence is flagged as ungrammatical. The third approach extracts discriminative parse tree fragments in the form of CFG rules from parsed grammatical and ungrammatical corpora and trains a binary classifier to distinguish grammatical from ungrammatical sentences. The three approaches are evaluated on a large test set of grammatical and ungrammatical sentences. The ungrammatical test set is generated automatically by inserting common grammatical errors into the British National Corpus. 
The results are compared to two traditional approaches, one that uses a hand-crafted, discriminative grammar, the XLE ParGram English LFG, and one based on part-of-speech n-grams. In addition, the baseline methods and the new methods are combined in a machine learning-based framework, yielding further improvements.
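The decision rule of the second approach can be sketched as follows. The per-token estimator, the function names, and the margin value are illustrative assumptions; the thesis presumably fits a richer estimator from the parsed, probability-annotated training data. All probabilities are in log space.

```python
def train_estimator(parsed_training):
    """Fit a trivial estimator of the expected log parse probability from
    grammatical training data: a list of (num_tokens, log_prob) pairs."""
    per_token = [lp / n for n, lp in parsed_training]
    mean_per_token = sum(per_token) / len(per_token)
    return lambda num_tokens: mean_per_token * num_tokens

def flag_ungrammatical(estimate, actual_log_prob, num_tokens, margin=5.0):
    """Flag the sentence when its best parse scores worse than expected
    by more than `margin` (i.e. the estimate exceeds the actual parse
    probability by that amount)."""
    return estimate(num_tokens) - actual_log_prob > margin

estimate = train_estimator([(10, -25.0), (20, -50.0)])  # about -2.5 per token
flag_ungrammatical(estimate, -40.0, 10)  # deficit of 15 exceeds the margin -> flagged
```

The margin absorbs the natural variance in parse probabilities among grammatical sentences; set too low, it would flag merely unusual but well-formed input.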

    Combining Trigram and Winnow in Thai OCR Error Correction

    For languages that have no explicit word boundary, such as Thai, Chinese and Japanese, correcting words in text is harder than in English because of additional ambiguities in locating error words. The traditional method handles this by hypothesizing that every substring in the input sentence could be an error word and trying to correct all of them. In this paper, we propose the idea of reducing the scope of spelling correction by focusing only on dubious areas in the input sentence. Boundaries of these dubious areas can be obtained approximately by applying a word segmentation algorithm and finding word sequences with low probability. To generate the candidate correction words, we used a modified edit distance that reflects the characteristics of Thai OCR errors. Finally, a part-of-speech trigram model and the Winnow algorithm are combined to determine the most probable correction.
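A modified edit distance of the kind described can be sketched as a standard Levenshtein dynamic program whose substitution cost is discounted for character pairs an OCR engine tends to confuse. The cost table below uses Latin characters purely for illustration; the paper's table would hold visually similar Thai glyph pairs, and the exact costs are an assumption.

```python
def ocr_edit_distance(observed, candidate, sub_cost):
    """Levenshtein distance in which substitutions between easily confused
    character pairs (listed in `sub_cost`) are cheaper than ordinary edits."""
    m, n = len(observed), len(candidate)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = float(i)
    for j in range(1, n + 1):
        dp[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a, b = observed[i - 1], candidate[j - 1]
            sub = 0.0 if a == b else sub_cost.get((a, b), 1.0)
            dp[i][j] = min(dp[i - 1][j] + 1.0,      # deletion
                           dp[i][j - 1] + 1.0,      # insertion
                           dp[i - 1][j - 1] + sub)  # (possibly discounted) substitution
    return dp[m][n]

# Visually similar glyphs get a low substitution cost:
confusions = {("0", "o"): 0.2, ("1", "l"): 0.2}
ocr_edit_distance("c0de", "code", confusions)  # 0.2, versus 1.0 for an arbitrary edit
```

Ranking dictionary words by this distance from a dubious substring yields the candidate set that the trigram model and Winnow then disambiguate.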

    'Pauper aliens' and 'political refugees': A corpus linguistic approach to the language of migration in nineteenth-century newspapers

    The widespread digitisation of their source base means that historians now face an overwhelming body of material. This historical ‘big data’ is only going to continue to expand, not just because digitisation features prominently on the agendas of institutions, but also because those studying late twentieth- and twenty-first-century history will have to deal with large quantities of ‘born digital’ material as they turn their gaze to the internet age. Although the interfaces currently used to access digital sources have their strengths, there is an increasing need for more effective ways for historians to work with large amounts of text. This thesis is one of the first studies to explore the potential of corpus linguistics, the computer-assisted analysis of language in very large bodies of text, as a means of approaching the ever-expanding historical archive. This thesis uses corpus linguistics to examine the representation of migrants in the British Library’s nineteenth-century newspaper collection, focusing specifically upon the discourses associated with ‘aliens’ and ‘refugees’, and how they changed over time. The nineteenth century saw an increase in global movement, which led to considerable legislative changes, including the development of many of Britain’s present-day migration controls. This thesis finds that ‘alien’ migration increased in topicality in the 1880s and 1890s and that ‘alien’ saw a striking shift in its associations that, significantly, coincided with an increase in predominantly Jewish migrants from the Russian Empire. Although only a small proportion of Britain’s ‘alien’ population, this group dominated newspaper reporting, which became characterised by increasingly negative language, including a strong association between the ‘alien’ and poverty.
Although ‘refugee’ was often associated with more positive language than ‘alien’, this thesis finds that the actions of a small number of violent individuals influenced newspaper reporting on political refugees, who became implicated in the alleged ‘abuse’ of the ‘right of asylum’.

    Proceedings of the Eighth Italian Conference on Computational Linguistics CLiC-it 2021

    The eighth edition of the Italian Conference on Computational Linguistics (CLiC-it 2021) was held at Università degli Studi di Milano-Bicocca from 26th to 28th January 2022. After the 2020 edition, which was held in fully virtual mode due to the health emergency related to Covid-19, CLiC-it 2021 represented the first opportunity for the Italian Computational Linguistics research community to meet in person after more than one year of full or partial lockdown.