The Danish Gigaword Project
Danish is a North Germanic/Scandinavian language spoken primarily in Denmark,
a country with a tradition of technological and scientific innovation. However,
from a technological perspective, the Danish language has received relatively
little attention and, as a result, Danish language technology is hard to
develop, in part due to a lack of large or broad-coverage Danish corpora. This
paper describes the Danish Gigaword project, which aims to construct a
freely-available one billion word corpus of Danish text that represents the
breadth of the written language.
Inferring Morphological Rules from Small Examples using 0/1 Linear Programming
We show how to express the problem of finding an optimal morpheme segmentation from a set of labelled words as a 0/1 linear programming problem, and how to build on this to analyse a language's morphology. The result is an automatic method for segmentation and labelling that works well even when there is very little training data available.
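A minimal sketch of how such a 0/1 formulation can look: one binary variable per candidate morpheme span, an exact-cover constraint over character positions, and an objective that minimises the number of morphemes used. The lexicon, objective, and brute-force search standing in for a real 0/1 LP solver are all illustrative assumptions, not the paper's method.

```python
from itertools import product

def segment(word, lexicon):
    """Segment `word` via a 0/1 formulation: one binary variable per
    candidate span (i, j) whose substring is in `lexicon`; constraint:
    every character position is covered by exactly one chosen span;
    objective: minimise the number of morphemes used."""
    spans = [(i, j) for i in range(len(word))
             for j in range(i + 1, len(word) + 1)
             if word[i:j] in lexicon]
    best = None
    # Brute-force search over the 0/1 assignment space; a real LP/ILP
    # solver replaces this loop, but the search space is tiny here.
    for bits in product([0, 1], repeat=len(spans)):
        chosen = [s for s, b in zip(spans, bits) if b]
        covered = sorted(p for i, j in chosen for p in range(i, j))
        if covered == list(range(len(word))):  # exact-cover constraint
            if best is None or len(chosen) < len(best):
                best = sorted(chosen)
    return [word[i:j] for i, j in best] if best else None

print(segment("unhappiness", {"un", "happi", "ness", "unhappi"}))
# → ['unhappi', 'ness']
```

With a richer objective (e.g. weighting morphemes by corpus frequency), the same 0/1 variables and cover constraints carry over unchanged.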
Learning languages from parallel corpora
This work describes a blueprint for an application that generates language learning exercises from parallel corpora. Word alignment and parallel structures allow for the automatic assessment of sentence pairs in the source and target languages, while users of the application continuously improve the quality of the data with their interactions, thus crowdsourcing parallel language learning material. Through triangulation, their assessment can be transferred to language pairs other than the original ones if multiparallel corpora are used as a source.
Several challenges need to be addressed for such an application to work, and we will discuss three of them here. First, the question of how adequate learning material can be identified in corpora has received some attention in the last decade, and we will detail what the structure of parallel corpora implies for that selection. Secondly, we will consider which types of exercises can be generated automatically from parallel corpora such that they foster learning and keep learners motivated. And thirdly, we will highlight the potential of employing users, that is, both teachers and learners, as crowdsourcers to help improve the material.
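The triangulation step mentioned above can be sketched as composing word alignments through a shared pivot sentence in a multiparallel corpus. The function name and the alignment representation (sets of token-index pairs) are illustrative assumptions, not the application's actual interface.

```python
def triangulate(src_pivot, pivot_tgt):
    """Compose src->pivot and pivot->tgt word alignments (lists of
    token-index pairs) through the shared pivot sentence, yielding a
    src->tgt alignment for a pair that was never directly assessed."""
    by_pivot = {}
    for p, t in pivot_tgt:
        by_pivot.setdefault(p, set()).add(t)
    return sorted({(s, t)
                   for s, p in src_pivot
                   for t in by_pivot.get(p, ())})

# Hypothetical French->English and English->German alignments
# for one sentence triple in a multiparallel corpus.
fr_en = [(0, 0), (1, 2), (2, 1)]
en_de = [(0, 0), (1, 1), (2, 2)]
print(triangulate(fr_en, en_de))  # → [(0, 0), (1, 2), (2, 1)]
```

Assessments attached to the French-English pair can then be carried over to the derived French-German alignment the same way.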
The Parallel Meaning Bank: A Framework for Semantically Annotating Multiple Languages
This paper gives a general description of the ideas behind the Parallel Meaning Bank, a framework that aims to provide an easy way to annotate compositional semantics for texts written in languages other than English. The annotation procedure is semi-automatic and comprises seven layers of linguistic information: segmentation, symbolisation, semantic tagging, word sense disambiguation, syntactic structure, thematic role labelling, and co-reference. New languages can be added to the meaning bank as long as the documents are based on translations from English, but they also introduce interesting new challenges for the linguistic assumptions underlying the Parallel Meaning Bank.
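A per-token record for the seven layers might look roughly as follows; the field names and example values are illustrative, not the Parallel Meaning Bank's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TokenAnnotation:
    """One token carrying the seven annotation layers listed above."""
    token: str                     # 1. segmentation
    symbol: str                    # 2. symbolisation (normalised symbol)
    semtag: str                    # 3. semantic tag
    sense: Optional[str] = None    # 4. word sense disambiguation
    syntax: Optional[str] = None   # 5. syntactic category
    role: Optional[str] = None     # 6. thematic role
    coref: Optional[int] = None    # 7. co-reference chain id

# A hypothetical annotated token.
ann = TokenAnnotation(token="bought", symbol="buy", semtag="EPS",
                      sense="buy.v.01", syntax=r"(S\NP)/NP")
print(ann.symbol)  # → buy
```

Keeping the layers as separate fields lets a semi-automatic pipeline fill some of them (e.g. semantic tags) automatically while leaving others for human correction.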
A Comparison between NMT and PBSMT Performance for Translating Noisy User-Generated Content
This work compares the performance achieved by Phrase-Based Statistical Machine Translation (PBSMT) systems and attention-based Neural Machine Translation (NMT) systems when translating User-Generated Content (UGC), as encountered in social media, from French to English. We show that, contrary to what might be expected, PBSMT outperforms NMT when translating non-canonical inputs. Our error analysis uncovers the specificities of UGC that are problematic for sequential NMT architectures and suggests new avenues for improving NMT models.