Rapid Resource Transfer for Multilingual Natural Language Processing
Until recently, the focus of the Natural Language Processing (NLP)
community has been on a handful of mostly European languages. However,
the rapid changes taking place in the world's economic and political
climate are precipitating a corresponding shift in the relative
importance given to various languages. The importance of rapidly
acquiring NLP resources and computational capabilities in new languages
is widely accepted.
Statistical NLP models have a distinct advantage over rule-based methods
in achieving this goal since they require far less manual labor. However,
statistical methods require two fundamental resources for training: (1)
online corpora and (2) manual annotations. Creating these two resources
can be as difficult as porting rule-based methods.
This thesis demonstrates the feasibility of acquiring both corpora and
annotations by exploiting existing resources for well-studied languages.
Basic resources for new languages can be acquired in a rapid and
cost-effective manner by utilizing existing resources cross-lingually.
Currently, the most viable method of obtaining online corpora is
converting existing printed text into electronic form using Optical
Character Recognition (OCR). Unfortunately, a language that lacks online
corpora most likely lacks OCR as well. We tackle this problem by taking an
existing OCR system that was designed for a specific language and using
that OCR system for a language with a similar script. We present a
generative OCR model that allows us to post-process output from a
non-native OCR system to achieve accuracy close to, or better than, a
native one. Furthermore, we show that the performance of a native or
trained OCR system can be improved by the same method.
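Post-processing the output of a non-native OCR engine can be pictured as a noisy-channel rescoring problem. The sketch below is not the thesis's generative model; it is a minimal illustration, with toy probabilities (`LM`, `CHANNEL`) assumed purely for the example, of choosing the intended word that maximizes P(word) x P(observed | word):

```python
# Hypothetical noisy-channel sketch of OCR post-processing: rescore the
# OCR output against a language model of the target language and a
# channel model of the foreign engine's characteristic errors.

# Assumed toy models; a real system would estimate these from data.
LM = {"corpus": 0.6, "corpvs": 0.0001, "campus": 0.3}
CHANNEL = {("corpvs", "corpus"): 0.4,   # P(observed | intended)
           ("corpvs", "corpvs"): 0.1,
           ("corpvs", "campus"): 0.01}

def correct(observed, vocabulary):
    """Pick the intended word maximizing P(word) * P(observed | word)."""
    def score(word):
        return LM.get(word, 1e-9) * CHANNEL.get((observed, word), 1e-9)
    return max(vocabulary, key=score)

print(correct("corpvs", ["corpus", "corpvs", "campus"]))  # corpus
```

The same decomposition extends naturally to character-level models, which is closer to what an OCR post-processor actually needs.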
Next, we demonstrate cross-utilization of annotations on treebanks. We
present an algorithm that projects dependency trees across parallel
corpora. We also show that a reasonable quality treebank can be generated
by combining projection with a small amount of language-specific
post-processing. The projected treebank allows us to train a parser that
performs comparably to a parser trained on manually generated data.
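Projection of dependency trees across a parallel corpus can be sketched under a simplifying one-to-one word-alignment assumption. The function and toy data below are hypothetical illustrations, not the thesis's actual algorithm, which must also handle unaligned and many-to-many cases:

```python
# Hypothetical sketch of direct dependency projection via word alignments:
# given a source dependency tree and 1-to-1 word alignments, copy each
# head -> dependent edge onto the aligned target words.

def project_dependencies(source_heads, alignment):
    """source_heads: dict mapping source index -> head index (0 = root).
    alignment: dict mapping source index -> target index (1-to-1).
    Returns head indices for the aligned target words."""
    target_heads = {}
    for dep, head in source_heads.items():
        if dep not in alignment:
            continue  # unaligned source words are simply skipped here
        t_dep = alignment[dep]
        if head == 0:
            target_heads[t_dep] = 0          # root stays root
        elif head in alignment:
            target_heads[t_dep] = alignment[head]
    return target_heads

# Example: "the dog barks" with edges the<-dog, dog<-barks, barks<-root,
# projected through a diagonal alignment onto a 3-word target sentence.
src = {1: 2, 2: 3, 3: 0}
ali = {1: 1, 2: 2, 3: 3}
print(project_dependencies(src, ali))  # {1: 2, 2: 3, 3: 0}
```

The language-specific post-processing mentioned above would then repair the edges this direct copy gets wrong.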
Detecting grammatical errors with treebank-induced, probabilistic parsers
Today's grammar checkers often use hand-crafted rule systems that define acceptable language. The development of such rule systems is labour-intensive and has to be repeated for each language. At the same time, grammars automatically induced from syntactically annotated corpora (treebanks) are successfully employed in other applications, for example text understanding and machine translation. At first glance, treebank-induced grammars seem to be unsuitable for grammar checking as they massively over-generate and fail to reject ungrammatical input due to their high robustness. We present three new methods for judging the grammaticality of a sentence with probabilistic, treebank-induced grammars, demonstrating that such grammars can be successfully applied to automatically judge the grammaticality of an input string. Our best-performing method exploits the differences between parse results for grammars trained on grammatical and ungrammatical treebanks. The second approach builds an estimator of the probability of the most likely parse using grammatical training data that has previously been parsed and annotated with parse probabilities. If the estimated probability of an input sentence (whose grammaticality is to be judged by the system) is higher by a certain amount than the actual parse probability, the sentence is flagged as ungrammatical. The third approach extracts discriminative parse tree fragments in the form of CFG rules from parsed grammatical and ungrammatical corpora and trains a binary classifier to distinguish grammatical from ungrammatical sentences. The three approaches are evaluated on a large test set of grammatical and ungrammatical sentences. The ungrammatical test set is generated automatically by inserting common grammatical errors into the British National Corpus. The results are compared to two traditional approaches, one that uses a hand-crafted, discriminative grammar, the XLE ParGram English LFG, and one based on part-of-speech n-grams. 
In addition, the baseline methods and the new methods are combined in a machine learning-based framework, yielding further improvements.
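The second approach lends itself to a compact sketch: compare the parse probability the parser actually assigns to a sentence with the probability an estimator predicts for grammatical sentences of that kind, and flag the sentence when the gap exceeds a margin. Everything below, including the linear length-based estimator and the margin value, is an assumed toy stand-in rather than the authors' trained estimator:

```python
# Hypothetical sketch of the probability-threshold approach: a sentence is
# flagged as ungrammatical when the parse probability expected for a
# grammatical sentence of its kind exceeds the observed one by a margin.

def flag_ungrammatical(observed_logprob, estimated_logprob, margin=5.0):
    """observed_logprob: log parse probability from the parser.
    estimated_logprob: log probability predicted by an estimator trained
    on parsed grammatical text. margin: assumed tuning constant."""
    return (estimated_logprob - observed_logprob) > margin

# Toy estimator: assume log-probability falls roughly linearly with length.
def estimate_logprob(n_words, slope=-4.0, intercept=0.0):
    return intercept + slope * n_words

obs = -52.0                          # parser's score for a 10-word sentence
est = estimate_logprob(10)           # expected score for grammatical input
print(flag_ungrammatical(obs, est))  # True: far less probable than expected
```

Working in log space keeps the comparison numerically stable for the tiny probabilities real parsers produce.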
Combining Trigram and Winnow in Thai OCR Error Correction
For languages that have no explicit word boundary, such as Thai, Chinese and Japanese, correcting words in text is harder than in English because of the additional ambiguity in locating error words. The traditional method handles this by hypothesizing that every substring in the input sentence could be an error word and trying to correct all of them. In this paper, we propose the idea of reducing the scope of spelling correction by focusing only on dubious areas in the input sentence. Boundaries of these dubious areas can be obtained approximately by applying a word segmentation algorithm and finding word sequences with low probability. To generate the candidate correction words, we use a modified edit distance that reflects the characteristics of Thai OCR errors. Finally, a part-of-speech trigram model and the Winnow algorithm are combined to determine the most probable correction.
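A modified edit distance of the kind described can be sketched as standard dynamic programming with a confusion-aware substitution cost. The Latin-script confusion pairs below are illustrative stand-ins for the Thai OCR confusions the paper actually models:

```python
# Hypothetical sketch of an OCR-aware edit distance: the usual dynamic
# program, but substitutions between characters the OCR engine often
# confuses are cheaper than arbitrary ones.

CONFUSION_COST = {("o", "0"): 0.2, ("l", "1"): 0.2}  # assumed confusion pairs

def sub_cost(a, b):
    if a == b:
        return 0.0
    return CONFUSION_COST.get((a, b), CONFUSION_COST.get((b, a), 1.0))

def ocr_edit_distance(s, t):
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1.0,  # deletion
                          d[i][j - 1] + 1.0,  # insertion
                          d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
    return d[m][n]

print(ocr_edit_distance("c0de", "code"))  # 0.2: a cheap 0 <-> o confusion
print(ocr_edit_distance("cade", "code"))  # 1.0: an ordinary substitution
```

Ranking candidates by such a distance biases the corrector toward the errors the OCR engine actually makes.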
'Pauper aliens' and 'political refugees': A corpus linguistic approach to the language of migration in nineteenth-century newspapers
The widespread digitisation of their source base means that historians now face an overwhelming body of material. This historical ‘big data’ is only going to continue to expand, not just because digitisation features prominently on the agendas of institutions, but also because those studying late twentieth- and twenty-first-century history will have to deal with large quantities of ‘born digital’ material as they turn their gaze to the internet age. Although the interfaces currently used to access digital sources have their strengths, there is an increasing need for more effective ways for historians to work with large amounts of text. This thesis is one of the first studies to explore the potential of corpus linguistics, the computer-assisted analysis of language in very large bodies of text, as a means of approaching the ever-expanding historical archive. This thesis uses corpus linguistics to examine the representation of migrants in the British Library’s nineteenth-century newspaper collection, focusing specifically upon the discourses associated with ‘aliens’ and ‘refugees’, and how they changed over time. The nineteenth century saw an increase in global movement, which led to considerable legislative changes, including the development of many of Britain’s present-day migration controls. This thesis finds that ‘alien’ migration increased in topicality in the 1880s and 1890s and that ‘alien’ saw a striking shift in its associations that, significantly, coincided with an increase in predominantly Jewish migrants from the Russian Empire. Although only a small proportion of Britain’s ‘alien’ population, this group dominated newspaper reporting, which became characterised by increasingly negative language, including a strong association between the ‘alien’ and poverty.
Although ‘refugee’ was often associated with more positive language than ‘alien’, this thesis finds that the actions of a small number of violent individuals influenced newspaper reporting upon political refugees, who became implicated in the alleged ‘abuse’ of the ‘right of asylum’.
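The association findings above rest on collocation analysis, a core corpus-linguistic technique. A minimal, hypothetical sketch of window-based collocate counting (not the thesis's actual tooling, which works over millions of newspaper articles) looks like:

```python
# Hypothetical sketch of collocation analysis: count the words that
# co-occur with a node word (here "alien") inside a fixed window.

from collections import Counter

def collocates(tokens, node, window=4):
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

text = "the pauper alien and the destitute alien landed".split()
print(collocates(text, "alien", window=2))
```

Real corpus-linguistic work then scores these raw counts with association measures such as mutual information or log-likelihood to separate meaningful collocates from frequent function words.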
Proceedings of the Eighth Italian Conference on Computational Linguistics CLiC-it 2021
The eighth edition of the Italian Conference on Computational Linguistics (CLiC-it 2021) was held at Università degli Studi di Milano-Bicocca from 26th to 28th January 2022. After the 2020 edition, which was held fully virtually due to the health emergency related to Covid-19, CLiC-it 2021 was the first opportunity for the Italian Computational Linguistics research community to meet in person after more than one year of full or partial lockdown.