Controlled Natural Language Generation from a Multilingual FrameNet-based Grammar
This paper presents a currently bilingual but potentially multilingual
FrameNet-based grammar library implemented in Grammatical Framework. The
contribution of this paper is two-fold. First, it offers a methodological
approach to automatically generate the grammar based on semantico-syntactic
valence patterns extracted from FrameNet-annotated corpora. Second, it provides
a proof of concept for two use cases illustrating how the acquired multilingual
grammar can be exploited in different CNL applications in the domains of arts
and tourism.
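The pattern-extraction step described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the annotated examples, frame names, and grammatical-function labels are invented placeholders standing in for FrameNet-annotated corpus data, and the "grammar rule" is reduced to picking the most frequent valence pattern per frame.

```python
from collections import Counter

# Hypothetical frame-annotated examples: each maps a frame to the
# grammatical functions realizing its core frame elements.
annotated = [
    ("Creating", [("Creator", "Subj"), ("Created_entity", "Obj")]),
    ("Creating", [("Creator", "Subj"), ("Created_entity", "Obj")]),
    ("Creating", [("Created_entity", "Subj")]),  # passive-like realization
    ("Travel",   [("Traveler", "Subj"), ("Goal", "Obl")]),
]

def valence_patterns(examples):
    """Count semantico-syntactic valence patterns per frame."""
    counts = Counter()
    for frame, realization in examples:
        pattern = tuple(sorted(realization))
        counts[(frame, pattern)] += 1
    return counts

patterns = valence_patterns(annotated)
# The most frequent pattern for a frame would seed a grammar rule.
best = max((p for p in patterns if p[0] == "Creating"), key=patterns.get)
```

In the paper's setting, the counted patterns would be compiled into Grammatical Framework abstract and concrete syntax rules rather than kept as a frequency table.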
The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages
We present a new, unique and freely available parallel corpus containing
European Union (EU) documents of a mostly legal nature. It is available in all
20 official EU languages, with additional documents available in the languages
of the EU candidate countries. The corpus consists of almost 8,000 documents
per language, with an average size of nearly 9 million words per language.
Pair-wise paragraph alignment information produced by two different aligners
(Vanilla and HunAlign) is available for all 190+ language pair combinations.
Most texts have been manually classified according to the EUROVOC subject
domains so that the collection can also be used to train and test multi-label
classification algorithms and keyword-assignment software. The corpus is
encoded in XML, according to the Text Encoding Initiative Guidelines. Due to
the large number of parallel texts in many languages, the JRC-Acquis is
particularly suitable to carry out all types of cross-language research, as
well as to test and benchmark text analysis software across different languages
(for instance for alignment, sentence splitting and term extraction).
Comment: A multilingual textual resource with meta-data freely available for
download at http://langtech.jrc.it/JRC-Acquis.htm
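Consuming such pairwise alignment information can be sketched as follows. The link representation used here (lists of paragraph indices on each side) is a simplification for illustration, not the actual output format of the Vanilla or HunAlign aligners, and the paragraph texts are invented.

```python
# Toy monolingual versions of a document; real JRC-Acquis texts are
# TEI-encoded XML, not plain lists.
en_paras = ["Article 1 ...", "Article 2 ...", "Article 3 ..."]
fr_paras = ["Article premier ...", "Articles 2 et 3 ..."]

# Alignment links may be 1-1, 2-1, etc.; each link pairs index lists.
links = [([0], [0]), ([1, 2], [1])]

def aligned_units(src, tgt, links):
    """Join the paragraphs on each side of every alignment link."""
    return [(" ".join(src[i] for i in s), " ".join(tgt[j] for j in t))
            for s, t in links]

pairs = aligned_units(en_paras, fr_paras, links)
```

Materializing links into text pairs like this is the usual first step before feeding a parallel corpus to alignment evaluation or MT training tools.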
Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources
The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on its parallel training data; phrases that are not present in the training data are not translated correctly. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system, not by adding more parallel data but by using external morphological resources. A set of new phrase associations is added to the translation and reordering models; each corresponds to a morphological variation of the source, target, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations, and the results showed improved performance in terms of automatic scores (BLEU and Meteor) and a reduction of out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.
JRC.G.2 - Global security and crisis management
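The expansion idea can be sketched as below. Everything here is a stand-in: the one-entry phrase table, the variant lexicons, and the use of plain `difflib` string similarity in place of the paper's morphosyntactically informed score are all illustrative assumptions, and real phrase tables carry several feature scores, not one probability.

```python
from difflib import SequenceMatcher

# Hypothetical phrase table: source phrase -> (target phrase, score).
phrase_table = {"red car": ("voiture rouge", 0.9)}

# Hypothetical morphological lexicons of inflectional variants.
src_variants = {"red car": ["red cars"]}
tgt_variants = {"voiture rouge": ["voitures rouges"]}

def expand(table):
    """Add variant phrase associations, discounted by string similarity."""
    new = {}
    for src, (tgt, prob) in table.items():
        for sv in src_variants.get(src, []):
            for tv in tgt_variants.get(tgt, []):
                sim = SequenceMatcher(None, src, sv).ratio()
                new[sv] = (tv, prob * sim)  # penalize less similar variants
    return new

expanded = expand(phrase_table)
```

The discounting step reflects the general intuition that a generated association should score no higher than the attested one it was derived from.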
Building Morphological Chains for Agglutinative Languages
In this paper, we build morphological chains for agglutinative languages by
using a log-linear model for the morphological segmentation task. The model is
based on the unsupervised morphological segmentation system called
MorphoChains. We extend the MorphoChains log-linear model by expanding the
candidate space recursively to cover more split points for agglutinative
languages such as Turkish, whereas in the original model candidates are
generated by considering only binary segmentation of each word. The results
show that we improve the state-of-the-art Turkish scores by 12%, reaching an
F-measure of 72%, and the English scores by 3%, reaching an F-measure of 74%.
Eventually, the system outperforms both MorphoChains and other well-known
unsupervised morphological segmentation systems. The results indicate that
candidate generation plays an important role in such an unsupervised log-linear
model that is learned using contrastive estimation with negative samples.
Comment: 10 pages, accepted and presented at CICLing 2017 (18th International
Conference on Intelligent Text Processing and Computational Linguistics)
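The contrast between binary and recursively expanded candidate generation can be sketched as follows. This is a simplified illustration of the candidate spaces only (the log-linear scoring and contrastive estimation are omitted), and `min_len` is an assumed minimum-morph-length heuristic, not a parameter from the paper.

```python
def binary_candidates(word):
    """Original-style candidates: a single split point per word."""
    return [(word[:i], word[i:]) for i in range(1, len(word))]

def recursive_candidates(word, min_len=2):
    """Expanded candidate space: recursively split the remainder,
    covering multi-morpheme words typical of agglutinative languages."""
    results = [[word]]  # the unsegmented word is always a candidate
    for i in range(min_len, len(word) - min_len + 1):
        stem, rest = word[:i], word[i:]
        for tail in recursive_candidates(rest, min_len):
            results.append([stem] + tail)
    return results

# Turkish "evlerde" = ev (house) + ler (plural) + de (locative):
cands = recursive_candidates("evlerde", min_len=2)
```

A binary segmenter can never propose all three morphs of "evlerde" at once, while the recursive expansion includes the full ev+ler+de analysis among its candidates.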
A survey of cross-lingual word embedding models
Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and are a key facilitator of cross-lingual transfer when developing natural language processing models for low-resource languages. In this survey, we provide a comprehensive typology of cross-lingual word embedding models. We compare their data requirements and objective functions. The recurring theme of the survey is that many of the models presented in the literature optimize for the same objectives, and that seemingly different models are often equivalent, modulo optimization strategies, hyper-parameters, and the like. We also discuss the different ways cross-lingual word embeddings are evaluated, as well as future challenges and research horizons.
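One widely used family covered by such surveys, mapping-based methods, can be sketched with the orthogonal Procrustes solution: learn an orthogonal map from source to target embedding space over a seed dictionary. The embeddings below are toy random vectors constructed so that an exact rotation exists; this illustrates the technique in general, not any single paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))            # source-language seed embeddings
R_true, _ = np.linalg.qr(rng.normal(size=(4, 4)))
Y = X @ R_true                         # target embeddings: rotated source

def procrustes(X, Y):
    """W = argmin ||XW - Y||_F subject to W orthogonal (SVD solution)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

W = procrustes(X, Y)
err = np.linalg.norm(X @ W - Y)        # near zero on this toy data
```

The orthogonality constraint is one of the "modulo" choices the survey alludes to: relaxing or keeping it is a main axis along which seemingly different mapping models vary.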
D4.1. Technologies and tools for corpus creation, normalization and annotation
The objectives of the Corpus Acquisition and Annotation (CAA) subsystem are the acquisition and processing of the monolingual and bilingual language resources (LRs) required in the PANACEA context. The CAA subsystem therefore includes: i) a Corpus Acquisition Component (CAC) for extracting monolingual and bilingual data from the web, ii) a component for the cleanup and normalization (CNC) of these data, and iii) a text processing component (TPC) consisting of NLP tools with modules for sentence splitting, POS tagging, lemmatization, parsing and named entity recognition.
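Chaining components of this kind can be sketched as below. The regexes are deliberately naive placeholders, not the project's actual CNC/TPC tools, which the deliverable describes as full NLP modules (taggers, lemmatizers, parsers).

```python
import re

def clean(text):
    """CNC-style step: normalize whitespace."""
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(text):
    """TPC-style step: naive sentence splitting on terminal punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def tokenize(sentence):
    """TPC-style step: naive word/punctuation tokenization."""
    return re.findall(r"\w+|[^\w\s]", sentence)

raw = "  Hello   world. This is  a test! "
sentences = [tokenize(s) for s in split_sentences(clean(raw))]
```

The point of the sketch is the composition order, cleanup before linguistic processing, which mirrors the CAC, then CNC, then TPC arrangement of the subsystem.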