566 research outputs found

    An introduction to crowdsourcing for language and multimedia technology research

    Get PDF
    Language and multimedia technology research often relies on large manually constructed datasets for training or evaluation of algorithms and systems. Constructing these datasets is often expensive with significant challenges in terms of recruitment of personnel to carry out the work. Crowdsourcing methods using scalable pools of workers available on-demand offers a flexible means of rapid low-cost construction of many of these datasets to support existing research requirements and potentially promote new research initiatives that would otherwise not be possible

    A Survey of Corpora for Germanic Low-Resource Languages and Dialects

    Get PDF
    Despite much progress in recent years, the vast majority of work in natural language processing (NLP) is on standard languages with many speakers. In this work, we instead focus on low-resource languages and in particular non-standardized low-resource languages. Even within branches of major language families, often considered well-researched, little is known about the extent and type of available resources and what the major NLP challenges are for these language varieties. The first step to address this situation is a systematic survey of available corpora (most importantly, annotated corpora, which are particularly valuable for NLP research). Focusing on Germanic low-resource language varieties, we provide such a survey in this paper. Except for geolocation (origin of speaker or document), we find that manually annotated linguistic resources are sparse and, if they exist, mostly cover morphosyntax. Despite this lack of resources, we observe that interest in this area is increasing: there is active development and a growing research community. To facilitate research, we make our overview of over 80 corpora publicly available

    A Survey of Corpora for Germanic Low-Resource Languages and Dialects

    Full text link
    Despite much progress in recent years, the vast majority of work in natural language processing (NLP) is on standard languages with many speakers. In this work, we instead focus on low-resource languages and in particular non-standardized low-resource languages. Even within branches of major language families, often considered well-researched, little is known about the extent and type of available resources and what the major NLP challenges are for these language varieties. The first step to address this situation is a systematic survey of available corpora (most importantly, annotated corpora, which are particularly valuable for NLP research). Focusing on Germanic low-resource language varieties, we provide such a survey in this paper. Except for geolocation (origin of speaker or document), we find that manually annotated linguistic resources are sparse and, if they exist, mostly cover morphosyntax. Despite this lack of resources, we observe that interest in this area is increasing: there is active development and a growing research community. To facilitate research, we make our overview of over 80 corpora publicly available. We share a companion website of this overview at https://github.com/mainlp/germanic-lrl-corpora .Comment: NoDaLiDa 202

    FinnFN 1.0: The Finnish frame semantic database

    Get PDF
    The article describes the process of creating a Finnish language FrameNet or FinnFN, based on the original English language FrameNet hosted at the International Computer Science Institute in Berkeley, California. We outline the goals and results relating to the FinnFN project and especially to the creation of the FinnFrame corpus. The main aim of the project was to test the universal applicability of frame semantics by annotating real Finnish using the same frames and annotation conventions as in the original Berkeley FrameNet project. From Finnish newspaper corpora, 40,721 sentences were automatically retrieved and manually annotated as example sentences evoking certain frames. This became the FinnFrame corpus. Applying the Berkeley FrameNet annotation conventions to the Finnish language required some modifications due to Finnish morphology, and a convention for annotating individual morphemes within words was introduced for phenomena such as compounding, comparatives and case endings. Various questions about cultural salience across the two languages arose during the project, but problematic situations occurred only in a few examples, which we also discuss in the article. The article shows that, barring a few minor instances, the universality hypothesis of frames is largely confirmed for languages as different as Finnish and English.Peer reviewe

    Evaluating MT for massive open online courses: a multifaceted comparison between PBSMT and NMT systems

    Get PDF
    This article reports a multifaceted comparison between statistical and neural machine translation (MT) systems that were developed for translation of data from Massive Open Online Courses (MOOCs). The study uses four language pairs: English to German, Greek, Portuguese, and Russian. Translation quality is evaluated using automatic metrics and human evaluation, carried out by professional translators. Results show that neural MT is preferred in side-by-side ranking, and is found to contain fewer overall errors. Results are less clear-cut for some error categories, and for temporal and technical post-editing effort. In addition, results are reported based on sentence length, showing advantages and disadvantages depending on the particular language pair and MT paradigm

    Evaluating MT for massive open online courses

    Get PDF
    This article reports a multifaceted comparison between statistical and neural machine translation (MT) systems that were developed for translation of data from massive open online courses (MOOCs). The study uses four language pairs: English to German, Greek, Portuguese, and Russian. Translation quality is evaluated using automatic metrics and human evaluation, carried out by professional translators. Results show that neuralMTis preferred in side-by-side ranking, and is found to contain fewer overall errors. Results are less clear-cut for some error categories, and for temporal and technical post-editing effort. In addition, results are reported based on sentence length, showing advantages and disadvantages depending on the particular language pair and MT paradigm

    Event extraction and representation: A case study for the portuguese language

    Get PDF
    Text information extraction is an important natural language processing (NLP) task, which aims to automatically identify, extract, and represent information from text. In this context, event extraction plays a relevant role, allowing actions, agents, objects, places, and time periods to be identified and represented. The extracted information can be represented by specialized ontologies, supporting knowledge-based reasoning and inference processes. In this work, we will describe, in detail, our proposal for event extraction from Portuguese documents. The proposed approach is based on a pipeline of specialized natural language processing tools; namely, a part-of-speech tagger, a named entities recognizer, a dependency parser, semantic role labeling, and a knowledge extraction module. The architecture is language-independent, but its modules are language-dependent and can be built using adequate AI (i.e., rule-based or machine learning) methodologies. The developed system was evaluated with a corpus of Portuguese texts and the obtained results are presented and analysed. The current limitations and future work are discussed in detail

    The Reference Corpus of Contemporary Portuguese and related resources

    Get PDF
    The extraordinary growth of computer applications, particularly over the last two decades, has enabled the easy compilation and exploration of large corpora and lexica. These linguistic resources play a fundamental role in the areas of theoretical linguistics and natural language engineering. Combining these two areas of knowledge can, in fact, result in the development of a large number of applications, such as new and straightforward descriptions of languages based on real data; contrastive studies between varieties of a particular language aiming at finding factors of unity and diversity; cross-linguistic contrastive studies; grammars; lexica and dictionaries; terminologies; assisted translation materials; language teaching materials; computer tools and applications for processing natural language. Having this principle in mind and following the tradition at the Centre of Linguistics of the University of Lisbon (CLUL)i of collecting and studying real language data, a large electronic corpus – the Corpus de Referência do Português Contemporâneo (Reference Corpus of Contemporary Portuguese, CRPC) – is being compiled at CLUL since 1988. The CRPC currently contains approximately 310 million words, searchable through a user-friendly interface, and it is envisaged as a monitor corpus (from which one can extract balanced subcorpora) that can serve as a sample of the Portuguese language (both in its written and spoken varieties). In the next sections, we will describe the CRPC and how it forms the basis for important resources developed at CLUL.info:eu-repo/semantics/publishedVersio

    Quantifying the effect of machine translation in a high-quality human translation production process

    Get PDF
    This paper studies the impact of machine translation (MT) on the translation workflow at the Directorate-General for Translation (DGT), focusing on two language pairs and two MT paradigms: English-into-French with statistical MT and English-into-Finnish with neural MT. We collected data from 20 professional translators at DGT while they carried out real translation tasks in normal working conditions. The participants enabled/disabled MT for half of the segments in each document. They filled in a survey at the end of the logging period. We measured the productivity gains (or losses) resulting from the use of MT and examined the relationship between technical effort and temporal effort. The results show that while the usage of MT leads to productivity gains on average, this is not the case for all translators. Moreover, the two technical effort indicators used in this study show weak correlations with post-editing time. The translators' perception of their speed gains was more or less in line with the actual results. Reduction of typing effort is the most frequently mentioned reason why participants preferred working with MT, but also the psychological benefits of not having to start from scratch were often mentioned
    corecore