The Danish Gigaword Project
Danish is a North Germanic/Scandinavian language spoken primarily in Denmark,
a country with a tradition of technological and scientific innovation. However,
from a technological perspective, the Danish language has received relatively
little attention and, as a result, Danish language technology is hard to
develop, in part due to a lack of large or broad-coverage Danish corpora. This
paper describes the Danish Gigaword project, which aims to construct a
freely-available one billion word corpus of Danish text that represents the
breadth of the written language.
Inferring Morphological Rules from Small Examples using 0/1 Linear Programming
We show how to express the problem of finding an optimal morpheme segmentation from a set of labelled words as a 0/1 linear programming problem, and how to build on this to analyse a language's morphology. The result is an automatic method for segmentation and labelling that works well even when there is very little training data available.
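A minimal sketch of how such a 0/1 formulation can look: one binary variable per candidate morpheme span, an exact-cover constraint over character positions, and an objective that minimises the number of morphemes used. The lexicon, objective, and brute-force search standing in for a real 0/1 LP solver are all illustrative assumptions, not the paper's method.

```python
from itertools import product

def segment(word, lexicon):
    """Segment `word` via a 0/1 formulation: one binary variable per
    candidate span (i, j) whose substring is in `lexicon`; constraint:
    every character position is covered by exactly one chosen span;
    objective: minimise the number of morphemes used."""
    spans = [(i, j) for i in range(len(word))
             for j in range(i + 1, len(word) + 1)
             if word[i:j] in lexicon]
    best = None
    # Brute-force search over the 0/1 assignment space; a real LP/ILP
    # solver replaces this loop, but the search space is tiny here.
    for bits in product([0, 1], repeat=len(spans)):
        chosen = [s for s, b in zip(spans, bits) if b]
        covered = sorted(p for i, j in chosen for p in range(i, j))
        if covered == list(range(len(word))):  # exact-cover constraint
            if best is None or len(chosen) < len(best):
                best = sorted(chosen)
    return [word[i:j] for i, j in best] if best else None

print(segment("unhappiness", {"un", "happi", "ness", "unhappi"}))
# → ['unhappi', 'ness']
```

With a richer objective (e.g. weighting morphemes by corpus frequency), the same 0/1 variables and cover constraints carry over unchanged.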
Learning languages from parallel corpora
This work describes a blueprint for an application that generates language learning exercises from parallel corpora. Word alignment and parallel structures allow for the automatic assessment of sentence pairs in the source and target languages, while users of the application continuously improve the quality of the data with their interactions, thus crowdsourcing parallel language learning material. Through triangulation, their assessment can be transferred to language pairs other than the original ones if multiparallel corpora are used as a source.
Several challenges need to be addressed for such an application to work, and we will discuss three of them here. First, the question of how adequate learning material can be identified in corpora has received some attention in the last decade, and we will detail what the structure of parallel corpora implies for that selection. Secondly, we will consider which types of exercises can be generated automatically from parallel corpora such that they foster learning and keep learners motivated. And thirdly, we will highlight the potential of employing users, that is, both teachers and learners, as crowdsourcers to help improve the material.
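The triangulation step mentioned above can be sketched as composing word alignments through a shared pivot sentence in a multiparallel corpus. The function name and the alignment representation (sets of token-index pairs) are illustrative assumptions, not the application's actual interface.

```python
def triangulate(src_pivot, pivot_tgt):
    """Compose src->pivot and pivot->tgt word alignments (lists of
    token-index pairs) through the shared pivot sentence, yielding a
    src->tgt alignment for a pair that was never directly assessed."""
    by_pivot = {}
    for p, t in pivot_tgt:
        by_pivot.setdefault(p, set()).add(t)
    return sorted({(s, t)
                   for s, p in src_pivot
                   for t in by_pivot.get(p, ())})

# Hypothetical French->English and English->German alignments
# for one sentence triple in a multiparallel corpus.
fr_en = [(0, 0), (1, 2), (2, 1)]
en_de = [(0, 0), (1, 1), (2, 2)]
print(triangulate(fr_en, en_de))  # → [(0, 0), (1, 2), (2, 1)]
```

Assessments attached to the French-English pair can then be carried over to the derived French-German alignment the same way.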
The Parallel Meaning Bank: A Framework for Semantically Annotating Multiple Languages
This paper gives a general description of the ideas behind the Parallel Meaning Bank, a framework that aims to provide an easy way to annotate compositional semantics for texts written in languages other than English. The annotation procedure is semi-automatic and comprises seven layers of linguistic information: segmentation, symbolisation, semantic tagging, word sense disambiguation, syntactic structure, thematic role labelling, and co-reference. New languages can be added to the meaning bank as long as the documents are based on translations from English, but they also introduce interesting new challenges for the linguistic assumptions underlying the Parallel Meaning Bank.
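A per-token record for the seven layers might look roughly as follows; the field names and example values are illustrative, not the Parallel Meaning Bank's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TokenAnnotation:
    """One token carrying the seven annotation layers listed above."""
    token: str                     # 1. segmentation
    symbol: str                    # 2. symbolisation (normalised symbol)
    semtag: str                    # 3. semantic tag
    sense: Optional[str] = None    # 4. word sense disambiguation
    syntax: Optional[str] = None   # 5. syntactic category
    role: Optional[str] = None     # 6. thematic role
    coref: Optional[int] = None    # 7. co-reference chain id

# A hypothetical annotated token.
ann = TokenAnnotation(token="bought", symbol="buy", semtag="EPS",
                      sense="buy.v.01", syntax=r"(S\NP)/NP")
print(ann.symbol)  # → buy
```

Keeping the layers as separate fields lets a semi-automatic pipeline fill some of them (e.g. semantic tags) automatically while leaving others for human correction.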
A Comparison between NMT and PBSMT Performance for Translating Noisy User-Generated Content
This work compares the performance achieved by Phrase-Based Statistical Machine Translation (PBSMT) systems and attention-based Neural Machine Translation (NMT) systems when translating User-Generated Content (UGC), as encountered in social media, from French to English. We show that, contrary to what might be expected, PBSMT outperforms NMT when translating non-canonical inputs. Our error analysis uncovers the specificities of UGC that are problematic for sequential NMT architectures and suggests new avenues for improving NMT models.