Using BabelNet to improve OOV coverage in SMT
Out-of-vocabulary words (OOVs) are a ubiquitous and difficult problem in statistical machine translation (SMT). This paper studies
different strategies of using BabelNet to alleviate the negative impact brought about by OOVs. BabelNet is a multilingual encyclopedic
dictionary and a semantic network, which not only includes lexicographic and encyclopedic terms, but connects concepts and named
entities in a very large network of semantic relations. By taking advantage of the knowledge in BabelNet, three different methods –
using direct training data, domain-adaptation techniques and the BabelNet API – are proposed in this paper to obtain translations for
OOVs to improve system performance. Experimental results on English–Polish and English–Chinese language pairs show that domain
adaptation can better utilize BabelNet knowledge and performs better than other methods. The results also demonstrate that BabelNet is
a useful tool for improving the translation performance of SMT systems.
Language resources and linked data: a practical perspective
Recently, experts and practitioners in language resources
have started recognizing the benefits of the linked data (LD) paradigm
for the representation and exploitation of linguistic data on the Web.
The adoption of the LD principles is leading to an emerging ecosystem of
multilingual open resources that conform to the Linguistic Linked Open
Data Cloud, in which datasets of linguistic data are interconnected and
represented following common vocabularies, which facilitates linguistic
information discovery, integration and access. In order to contribute to
this initiative, this paper summarizes several key aspects of the representation
of linguistic information as linked data from a practical perspective.
The main goal of this document is to provide the basic ideas and
tools for migrating language resources (lexicons, corpora, etc.) to LD on
the Web and for carrying out some useful NLP tasks with them (e.g., word
sense disambiguation). This material was the basis of a tutorial given
at the EKAW’14 conference, which is also reported in the paper.
Cross-Lingual Link Discovery for Under-Resourced Languages
CC BY-NC 4.0
In this paper, we provide an overview of current technologies for cross-lingual link discovery, and we discuss challenges,
experiences and prospects of their application to under-resourced languages. We first introduce the goals of cross-lingual
linking and associated technologies, and in particular, the role that the Linked Data paradigm (Bizer et al., 2011) applied
to language data can play in this context. We define under-resourced languages with a specific focus on languages actively
used on the internet, i.e., languages with a digitally versatile speaker community, but limited support in terms of language
technology. We argue that, for languages for which considerable amounts of textual data and (at least) a bilingual word list are
available, techniques for cross-lingual linking can be readily applied, and that these enable the implementation of downstream
applications for under-resourced languages via the localisation and adaptation of existing technologies and resources.
Evaluating Multiple Caching Strategies for Semantic Network Applications
Semantic networks are often used to relate multiple pieces of data to each other. ConceptNet is a semantic network that contains information about words and how they relate to other words. ConceptNet and other semantic networks are often hosted remotely and accessed as a service, so data retrieval times can be long. This project examines multiple data caching strategies and their impact on the performance of two existing applications that make use of ConceptNet data. We found that the largest factor in whether caching improves the performance of semantic network applications is the access pattern of the particular application.
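The dependence on access pattern can be illustrated with a minimal sketch. It is not the project's implementation: the least-recently-used (LRU) policy, the `LRUCache` class, the `fake_remote_lookup` stand-in for a remote semantic-network query, and both access sequences are assumptions chosen for illustration only.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity least-recently-used cache with hit/miss counters."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, fetch):
        if key in self.entries:
            # Cache hit: mark the entry as most recently used.
            self.entries.move_to_end(key)
            self.hits += 1
            return self.entries[key]
        # Cache miss: fall back to the (slow) fetch function.
        self.misses += 1
        value = fetch(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            # Evict the least recently used entry.
            self.entries.popitem(last=False)
        return value


def fake_remote_lookup(term):
    # Hypothetical stand-in for a slow remote semantic-network query.
    return {"term": term, "related": [term + "_related"]}


def hit_rate(access_sequence, capacity):
    """Fraction of lookups in access_sequence served from the cache."""
    cache = LRUCache(capacity)
    for term in access_sequence:
        cache.get(term, fake_remote_lookup)
    return cache.hits / len(access_sequence)


# Two hypothetical access patterns with the same number of lookups.
repetitive = ["dog", "cat", "dog", "cat"] * 25   # 100 lookups, 2 distinct terms
scanning = [f"term{i}" for i in range(100)]      # 100 lookups, all distinct

print(hit_rate(repetitive, capacity=2))
print(hit_rate(scanning, capacity=2))
```

With the same cache and capacity, the repetitive sequence is served almost entirely from the cache, while the scanning sequence never hits it, so caching would add overhead without reducing remote calls.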