    An Empirical Analysis of NMT-Derived Interlingual Embeddings and their Use in Parallel Sentence Identification

    End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with large amounts of parallel data. Besides this palpable improvement, neural networks provide several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. Furthermore, the internal representations learned by the network serve as a new semantic representation of words or sentences which, unlike standard word embeddings, are learned in an essentially bilingual or even multilingual context. In view of these properties, the contribution of the present work is two-fold. First, we systematically study the NMT context vectors, i.e. the output of the encoder, and their power as an interlingua representation of a sentence. We assess their quality and effectiveness by measuring similarities across translations, as well as between semantically related and semantically unrelated sentence pairs. Second, as an extrinsic evaluation of the first point, we identify parallel sentences in comparable corpora, obtaining an F1 of 98.2% on data from a shared task when using only NMT context vectors; using context vectors jointly with similarity measures, F1 reaches 98.9%.
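
    A minimal sketch of the second contribution, assuming the per-token encoder outputs for each sentence are already available as arrays: sentence representations are obtained by pooling the context vectors, and candidate pairs are scored by cosine similarity. The mean-pooling step and the 0.8 decision threshold are illustrative assumptions, not the paper's exact setup (which also combines the vectors with additional similarity measures).

        import numpy as np

        def sentence_vector(context_vectors: np.ndarray) -> np.ndarray:
            # Collapse the per-token encoder outputs (T x d) into one sentence vector.
            return context_vectors.mean(axis=0)

        def cosine(u: np.ndarray, v: np.ndarray) -> float:
            return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

        def find_parallel_pairs(src_ctx, tgt_ctx, threshold=0.8):
            # Return (i, j, score) for sentence pairs whose pooled context vectors
            # are similar enough to be considered mutual translations.
            pairs = []
            for i, s in enumerate(src_ctx):
                for j, t in enumerate(tgt_ctx):
                    score = cosine(sentence_vector(s), sentence_vector(t))
                    if score >= threshold:
                        pairs.append((i, j, score))
            return pairs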

    Building a semantically annotated corpus of clinical texts

    In this paper, we describe the construction of a semantically annotated corpus of clinical texts for use in the development and evaluation of systems for automatically extracting clinically significant information from the textual component of patient records. The paper details the sampling of textual material from a collection of 20,000 cancer patient records, the development of a semantic annotation scheme, the annotation methodology, the distribution of annotations in the final corpus, and the use of the corpus for the development of an adaptive information extraction system. The resulting corpus is the most richly semantically annotated resource for clinical text processing built to date; its value has been demonstrated through its use in developing an effective information extraction system. The detailed presentation of our corpus construction and annotation methodology will be of value to others seeking to build high-quality semantically annotated corpora in biomedical domains.

    Determining the Characteristic Vocabulary for a Specialized Dictionary using Word2vec and a Directed Crawler

    Specialized dictionaries are used to understand concepts in specific domains, especially where those concepts are not part of the general vocabulary or have meanings that differ from those in ordinary language. The first step in creating a specialized dictionary is detecting the characteristic vocabulary of the domain in question. Classical methods for detecting this vocabulary involve gathering a domain corpus, calculating statistics on the terms found there, and then comparing these statistics to those from a background or general-language corpus. Terms that occur significantly more often in the specialized corpus than in the background corpus are candidates for the characteristic vocabulary of the domain. Here we present two tools, a directed crawler and a distributional semantics package, that can be used together, circumventing the need for a background corpus. Both tools are available on the web.
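
    As a rough illustration of how the two tools could be combined, the sketch below trains word2vec embeddings (via gensim, an assumption about tooling) on a corpus gathered by the directed crawler and expands a handful of seed terms with their nearest neighbours as candidates for the characteristic vocabulary; the seed terms and hyperparameters are placeholders, not the authors' exact configuration.

        from gensim.models import Word2Vec

        def candidate_terms(tokenized_sentences, seed_terms, topn=20):
            # tokenized_sentences: list of token lists from the crawled domain corpus.
            model = Word2Vec(tokenized_sentences, vector_size=100, window=5,
                             min_count=2, workers=4)
            candidates = set()
            for seed in seed_terms:
                if seed in model.wv:
                    # Neighbours in the domain embedding space become vocabulary candidates.
                    candidates.update(w for w, _ in model.wv.most_similar(seed, topn=topn))
            return sorted(candidates)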

    Identification of Fertile Translations in Medical Comparable Corpora: a Morpho-Compositional Approach

    This paper defines a method for lexicon extraction in the biomedical domain from comparable corpora. The method is based on compositional translation and exploits morpheme-level translation equivalences. It can generate translations for a large variety of morphologically constructed words and can also generate 'fertile' translations. We show that fertile translations increase the overall quality of the extracted lexicon for English-to-French translation.
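
    The sketch below illustrates the general idea of morpho-compositional candidate generation under toy assumptions: the morpheme table and the two composition patterns are hypothetical stand-ins for the paper's resources, and a real system would filter the generated candidates against the target side of the comparable corpus.

        from itertools import product

        # Hypothetical English->French morpheme-level equivalences.
        MORPHEME_TABLE = {
            "cardio": ["cardio", "cardiaque"],   # bound prefix or free adjective
            "toxicity": ["toxicité"],
        }

        def candidate_translations(prefix, head):
            # Generate fused (single-word) and fertile (multi-word) candidates
            # for a source word decomposed as prefix + head.
            candidates = []
            for p, h in product(MORPHEME_TABLE[prefix], MORPHEME_TABLE[head]):
                candidates.append(p + h)        # fused, e.g. "cardiotoxicité"
                candidates.append(h + " " + p)  # fertile, e.g. "toxicité cardiaque"
            return candidates

        # candidate_translations("cardio", "toxicity") yields both fused and fertile
        # candidates; only corpus-attested ones would be kept in practice.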

    Using a Probabilistic Class-Based Lexicon for Lexical Ambiguity Resolution

    This paper presents the use of probabilistic class-based lexica for disambiguation in target-word selection. Our method employs minimal but precise contextual information for disambiguation: only information provided by the target verb, enriched by the condensed information of a probabilistic class-based lexicon, is used. Induction of classes and fine-tuning to verbal arguments is done in an unsupervised manner by EM-based clustering techniques. The method shows promising results in an evaluation on real-world translations.
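
    The selection step can be pictured as scoring each candidate translation of an ambiguous word by how well it fits the argument slot of the target verb, marginalising over latent classes. The toy probability tables below stand in for the lexicon the paper induces with EM-based clustering; the numbers, class names, and example words are made up for illustration.

        # p(class | verb, slot): how strongly a verb's object slot selects each class.
        P_CLASS_GIVEN_SLOT = {
            ("drink", "obj"): {"BEVERAGE": 0.9, "DOCUMENT": 0.1},
        }
        # p(noun | class): class membership of candidate target nouns.
        P_NOUN_GIVEN_CLASS = {
            "BEVERAGE": {"juice": 0.5, "wine": 0.5},
            "DOCUMENT": {"journal": 0.7, "paper": 0.3},
        }

        def score(verb, slot, candidate):
            # Marginalise over latent classes: sum_c p(c | verb, slot) * p(candidate | c).
            return sum(p_c * P_NOUN_GIVEN_CLASS[c].get(candidate, 0.0)
                       for c, p_c in P_CLASS_GIVEN_SLOT[(verb, slot)].items())

        def select_translation(verb, slot, candidates):
            return max(candidates, key=lambda noun: score(verb, slot, noun))

        # select_translation("drink", "obj", ["juice", "journal"]) -> "juice"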

    Weaving creativity into the Semantic Web: a language-processing approach

    This paper describes a novel language-processing approach to the analysis of creativity and the development of a machine-readable ontology of creativity. The ontology provides a conceptualisation of creativity in terms of a set of fourteen key components or building blocks, and has application to research into the nature of creativity in general and to the evaluation of creative practice in particular. We further argue that the provision of a machine-readable conceptualisation of creativity provides a small but important step towards addressing the problem of automated evaluation, 'the Achilles' heel of AI research on creativity' (Boden 1999).
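
    A machine-readable ontology of this kind could be encoded, for instance, as RDF. The fragment below uses rdflib and two placeholder component names (the fourteen components are not listed in the abstract), so it shows the representational idea rather than the published ontology.

        from rdflib import Graph, Namespace, RDF, RDFS, Literal

        CREA = Namespace("http://example.org/creativity#")  # placeholder namespace

        g = Graph()
        g.bind("crea", CREA)

        # Root concept and two hypothetical component classes of creativity.
        g.add((CREA.CreativityComponent, RDF.type, RDFS.Class))
        for name in ("Novelty", "Value"):  # placeholder component labels
            comp = CREA[name]
            g.add((comp, RDF.type, RDFS.Class))
            g.add((comp, RDFS.subClassOf, CREA.CreativityComponent))
            g.add((comp, RDFS.label, Literal(name)))

        print(g.serialize(format="turtle"))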