
    Historical Models and Serial Sources

    Serial sources such as records, registers, and inventories are the ‘classic’ sources for quantitative history. Unstructured, narrative texts such as newspaper articles or reports were out of reach for historical analyses, both for practical reasons—availability, the time needed for manual processing—and for methodological reasons: manual coding of texts is notoriously difficult and hampered by low inter-coder reliability. The recent availability of large amounts of digitized sources allows for the application of natural language processing, which has the potential to overcome these problems. However, the automatic evaluation of large amounts of texts—and historical texts in particular—for historical research also brings new challenges. First of all, it requires a source criticism that goes beyond the individual source and also considers the corpus as a whole. It is a well-known problem in corpus linguistics to determine the ‘balancedness’ of a corpus, but when analyzing the content of texts rather than ‘just’ the language, determining the ‘meaningfulness’ of a corpus is even more important. Second, automatic analyses require operationalizable descriptions of the information you are looking for. Third, automatically produced results require interpretation, in particular when—as in history—the ultimate research question is qualitative, not quantitative. This, finally, poses the question of whether the insights gained could inform formal, i.e., machine-processable, models, which could serve as foundations and stepping stones for further research.

    Learning languages from parallel corpora

    This work describes a blueprint for an application that generates language learning exercises from parallel corpora. Word alignment and parallel structures allow for the automatic assessment of sentence pairs in the source and target languages, while users of the application continuously improve the quality of the data with their interactions, thus crowdsourcing parallel language learning material. Through triangulation, their assessment can be transferred to language pairs other than the original ones if multiparallel corpora are used as a source. Several challenges need to be addressed for such an application to work, and we will discuss three of them here. First, the question of how adequate learning material can be identified in corpora has received some attention in the last decade, and we will detail what the structure of parallel corpora implies for that selection. Secondly, we will consider which types of exercises can be generated automatically from parallel corpora such that they foster learning and keep learners motivated. And thirdly, we will highlight the potential of employing users, that is, both teachers and learners, as crowdsourcers to help improve the material.
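The exercise-generation idea sketched in the abstract can be illustrated with a few lines of code. The following is a minimal, hypothetical sketch (not the paper's implementation): it turns one word-aligned sentence pair into a cloze exercise, using the aligned source word as a hint. The alignment format of (source_index, target_index) pairs is an assumption, modeled on the output of common word aligners.

```python
# Hypothetical sketch: build a cloze exercise from a word-aligned sentence
# pair. The (source_index, target_index) alignment format is an assumption.

def make_cloze(src_tokens, tgt_tokens, alignment, tgt_gap_index):
    """Blank out one target word; the aligned source word serves as a hint."""
    solution = tgt_tokens[tgt_gap_index]
    hint = [src_tokens[s] for s, t in alignment if t == tgt_gap_index]
    gapped = ["____" if i == tgt_gap_index else w
              for i, w in enumerate(tgt_tokens)]
    return {"exercise": " ".join(gapped), "solution": solution, "hint": hint}

src = "Je bois du café".split()
tgt = "I drink coffee".split()
alignment = [(0, 0), (1, 1), (3, 2)]  # café ↔ coffee, etc.

ex = make_cloze(src, tgt, alignment, tgt_gap_index=2)
print(ex["exercise"])  # I drink ____
print(ex["solution"])  # coffee
print(ex["hint"])      # ['café']
```

In a crowdsourced setting, learner answers to such gaps could feed back into the quality assessment of the underlying sentence pair.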

    Corpus-Based Approaches to Figurative Language: Metaphor and Austerity

    Austerity is a by-product of the ongoing financial crisis. As Kitson et al. (2001) explain, what was a “NICE” (‘non-inflationary consistent expansion’) economy has turned “VILE” (‘volatile inflation, little expansion’), and the economic and social fall-out is now becoming visible. Unemployment, redundancy, inflation, recession, insecurity, and poverty all loom, causing governments, businesses and individuals to reevaluate their priorities. A changing world changes attitudes, and the earliest manifestations of such change can often be found in figurative language. Political rhetoric attempts to sweeten the bitter pill that nations have no choice but to swallow; all are invited to share the pain, make sacrifices for the common good, and weather the storm. But more sinister undertones can also be perceived. In times of social and financial dire straits, scapegoats are sought and mercilessly pursued in the press. The elderly, unemployed, and disabled are under fire for “sponging off the state”; and as jobs become scarcer and the tax bill rises, migrant populations and asylum seekers are viewed with increasing suspicion and resentment. Calls for a “big society” fall on deaf ears. Society, it seems, is shrinking as self-preservation takes hold. Austerity is a timely area of study: although austerity measures have been implemented in the past, most of the contributions here address the current political and economic situation, which means that some of the studies reported are work in progress while others look at particular “windows” of language output from the recent past. Whichever their focus, the papers presented here feature up-to-the-minute research into the metaphors being used to comment upon our current socioeconomic situation. The picture of austerity that emerges from these snapshots is a complex one, and one which is likely to be developed further and more widely in the future.

    Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)


    Sentence Similarity and Machine Translation

    Neural machine translation (NMT) systems encode an input sentence into an intermediate representation and then decode that representation into the output sentence. Translation requires deep understanding of language; as a result, NMT models trained on large amounts of data develop a semantically rich intermediate representation. We leverage this rich intermediate representation of NMT systems—in particular, multilingual NMT systems, which learn to map many languages into and out of a joint space—for bitext curation, paraphrasing, and automatic machine translation (MT) evaluation. At a high level, all of these tasks are rooted in similarity: sentence and document alignment requires measuring similarity of sentences and documents, respectively; paraphrasing requires producing output which is similar to an input; and automatic MT evaluation requires measuring the similarity between MT system outputs and corresponding human reference translations. We use multilingual NMT for similarity in two ways: First, we use a multilingual NMT model with a fixed-size intermediate representation (Artetxe and Schwenk, 2018) to produce multilingual sentence embeddings, which we use in both sentence and document alignment. Second, we train a multilingual NMT model and show that it generalizes to the task of generative paraphrasing (i.e., “translating” from Russian to Russian), when used in conjunction with a simple generation algorithm to discourage copying from the input to the output. We also use this model for automatic MT evaluation, to force decode and score MT system outputs conditioned on their respective human reference translations. Since we leverage multilingual NMT models, each method works in many languages using a single model. 
We show that simple methods, which leverage the intermediate representation of multilingual NMT models trained on large amounts of bitext, outperform prior work in paraphrasing, sentence alignment, document alignment, and automatic MT evaluation. This finding is consistent with recent trends in the natural language processing community, where large language models trained on huge amounts of unlabeled text have achieved state-of-the-art results on tasks such as question answering, named entity recognition, and parsing.
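The similarity-based alignment described above can be sketched in a few lines. This is a toy illustration, not the thesis's system: the hand-made vectors stand in for fixed-size multilingual sentence embeddings such as those of Artetxe and Schwenk (2018), and the greedy pairing with a similarity threshold is one simple alignment strategy among several.

```python
# Toy sketch of similarity-based sentence alignment. The embeddings here
# are hand-made stand-ins for real multilingual NMT sentence embeddings.
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def align_greedy(src_embs, tgt_embs, threshold=0.5):
    """Pair each source sentence with its most similar target sentence,
    keeping a pair only if the similarity clears the threshold."""
    pairs = []
    for i, u in enumerate(src_embs):
        j, score = max(((j, cosine(u, v)) for j, v in enumerate(tgt_embs)),
                       key=lambda t: t[1])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs

# Two 'source' and two 'target' sentence embeddings, deliberately crossed.
src = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.1]]
tgt = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.3]]
print(align_greedy(src, tgt))  # src 0 pairs with tgt 1, src 1 with tgt 0
```

Document alignment works analogously, with one embedding per document instead of per sentence.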

    Challenges in building a multilingual alpine heritage corpus

    This paper describes our efforts to build a multilingual heritage corpus of alpine texts. Currently we digitize the yearbooks of the Swiss Alpine Club, which contain articles in French, German, Italian and Romansch. Articles comprise mountaineering reports from all corners of the earth, but also scientific topics such as topography, geology or glaciology, as well as occasional poetry and lyrics. We have already scanned close to 70,000 pages, which has resulted in a corpus of 25 million words, 10% of which is a parallel French-German corpus. We have solved a number of challenges in automatic language identification and text structure recognition. Our next goal is to identify the great variety of toponyms (e.g. names of mountains and valleys, glaciers and rivers, trails and cabins) in this corpus, and we sketch how a large gazetteer of Swiss topographical names can be exploited for this purpose. Despite the size of the resource, exact matching leads to low recall because of spelling variations, language mixtures and partial repetitions.
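The recall problem with exact matching can be illustrated with approximate string matching against a gazetteer. The following is a minimal sketch, not the paper's method: the toponyms and the similarity cutoff are illustrative, and `difflib` ratio-based matching is just one of several plausible fuzzy-matching strategies for OCR and spelling variants.

```python
# Illustrative sketch: fuzzy toponym lookup against a small gazetteer,
# tolerating OCR errors and spelling variation. Names and cutoff are
# illustrative, not taken from the paper.
import difflib

gazetteer = ["Matterhorn", "Jungfrau", "Aletschgletscher", "Monte Rosa"]

def match_toponym(surface_form, names=gazetteer, cutoff=0.8):
    """Return the closest gazetteer entry within the similarity cutoff,
    or an empty list if nothing is close enough."""
    return difflib.get_close_matches(surface_form, names, n=1, cutoff=cutoff)

print(match_toponym("Matterhom"))  # OCR confusion of 'rn' with 'm'
print(match_toponym("Jungfraw"))   # spelling variant
print(match_toponym("Rigi"))       # not in the gazetteer -> []
```

For a gazetteer with hundreds of thousands of entries, a scalable variant would index names by n-grams or phonetic keys rather than comparing against every entry.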

    Incremental Coreference Resolution for German

    The main contributions of this thesis are as follows:
    1. We introduce a general model for coreference and explore its application to German.
       • The model features an incremental discourse processing algorithm, which allows it to coherently address issues caused by the underspecification of mentions, an especially pressing problem for certain German pronouns.
       • We introduce novel features relevant to the resolution of German pronouns. A subset of these features is made accessible through the incremental architecture of the discourse processing model.
       • In evaluation, we show that the coreference model combined with our features provides new state-of-the-art results for coreference and pronoun resolution in German.
    2. We elaborate on the evaluation of coreference and pronoun resolution.
       • We discuss evaluation from the view of prospective downstream applications that benefit from coreference resolution as a preprocessing component. Addressing the shortcomings of the general evaluation framework in this regard, we introduce an alternative framework, the Application Related Coreference Scores (ARCS).
       • The ARCS framework enables a thorough comparison of different system outputs and the quantification of their similarities and differences beyond the common coreference evaluation. We demonstrate how the framework is applied to state-of-the-art coreference systems. This provides a method to track specific differences in system outputs, which assists researchers in comparing their approaches to related work in detail.
    3. We explore semantics for pronoun resolution.
       • Within the introduced coreference model, we explore distributional approaches to estimate the compatibility of an antecedent candidate and the occurrence context of a pronoun. To this end, we compare a state-of-the-art word embedding approach with syntactic co-occurrence profiles.
       • In comparison to related work, we extend the notion of context and thereby increase the applicability of our approach. We find that a combination of both compatibility models, coupled with the coreference model, offers large potential for improving pronoun resolution performance. We make available all our resources, including a web demo of the system, at: http://pub.cl.uzh.ch/purl/coreference-resolutio
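The distributional compatibility idea in contribution 3 can be sketched as a cosine similarity between an antecedent candidate and the centroid of the pronoun's occurrence context. This is an illustration in the spirit of the thesis, not its implementation: the tiny hand-made vectors are assumptions standing in for real word embeddings or syntactic co-occurrence profiles.

```python
# Illustrative sketch of antecedent-context compatibility scoring.
# The toy vectors below stand in for pretrained word embeddings or
# syntactic co-occurrence profiles (an assumption for this sketch).
import math

VEC = {
    "dog":    [0.9, 0.1, 0.0],
    "barked": [0.8, 0.2, 0.1],
    "idea":   [0.0, 0.1, 0.9],
    "loudly": [0.7, 0.1, 0.0],
}

def centroid(words):
    """Average the vectors of the context words that are in the lexicon."""
    vecs = [VEC[w] for w in words if w in VEC]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def compatibility(candidate, context_words):
    """Cosine similarity between an antecedent candidate and the centroid
    of the pronoun's occurrence context."""
    u, v = VEC[candidate], centroid(context_words)
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# For a pronoun occurring as 'It barked loudly', the candidate 'dog'
# should be more compatible with the context than 'idea'.
ctx = ["barked", "loudly"]
print(compatibility("dog", ctx) > compatibility("idea", ctx))  # True
```

In the incremental architecture described above, such a score would be one feature among many when ranking antecedent candidates for an underspecified pronoun.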

    Musterwandel – Sortenwandel : aktuelle Tendenzen der diachronen Text(sorten)linguistik

    The texts in this volume deal with the current transformation of text types and text patterns, and thus make a significant contribution to current issues in diachronic text (type) linguistics, addressing text types and the various transformation processes of recent usage history.
