149 research outputs found
Extending, trimming and fusing WordNet for technical documents
This paper describes a tool for the automatic
extension and trimming of a multilingual
WordNet database for cross-lingual retrieval
and multilingual ontology building in
intranets and domain-specific document
collections. Hierarchies, built from
automatically extracted terms and combined
with the WordNet relations, are trimmed
with a disambiguation method based on the
document salience of the words in the
glosses. The disambiguation is tested in a
cross-lingual retrieval task, showing
considerable improvement (7%-11%). The
condensed hierarchies can be used as
browse interfaces to the documents,
complementing retrieval.
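The gloss-based trimming described in this abstract can be illustrated with a minimal sketch. The abstract does not say how document salience is computed, so the sketch assumes relative term frequency in the collection as a stand-in. The function names `salience` and `pick_sense`, the sense labels, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import Counter

def salience(documents):
    """Relative frequency of each lowercase token in the collection.

    This is an assumed proxy for document salience, not the paper's measure.
    """
    counts = Counter(tok for doc in documents for tok in doc.lower().split())
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

def pick_sense(glosses, sal):
    """Return the sense label whose gloss words carry the most salience."""
    def score(gloss):
        return sum(sal.get(tok, 0.0) for tok in gloss.lower().split())
    return max(glosses, key=lambda sense: score(glosses[sense]))

# Toy domain collection and two hypothetical senses of "pump".
docs = [
    "the pump moves fluid through the valve",
    "replace the valve seal on the pump",
]
glosses = {
    "pump.n.01": "a mechanical device that moves fluid through a valve",
    "pump.n.02": "a low-cut shoe without fastenings",
}
best = pick_sense(glosses, salience(docs))
print(best)
```

In this toy collection the mechanical sense wins because its gloss shares words with the documents, so the shoe sense would be the one trimmed from the hierarchy.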
MEANING-full effects in information retrieval
This deliverable reports on testing the use and effect
of the integration of the MEANING technology in the
TwentyOne search engine of Irion
Embedding Words and Senses Together via Joint Knowledge-Enhanced Training
Word embeddings are widely used in Natural Language Processing, mainly due to their success in capturing semantic information from massive corpora. However, their creation process does not allow the different meanings of a word to be automatically separated, as it conflates them into a single vector. We address this issue by proposing a new model which learns word and sense embeddings jointly. Our model exploits large corpora and knowledge from semantic networks in order to produce a unified vector space of word and sense embeddings. We evaluate the main features of our approach both qualitatively and quantitatively in a variety of tasks, highlighting the advantages of the proposed method in comparison to state-of-the-art word- and sense-based models.
Automatic Detection of Modality with ITGETARUNS
In this paper we present a system for modality detection, which is then used for Subjectivity and Factuality evaluation. The system was recently tested on a task for Subjectivity and Irony detection in Italian tweets, where it ranked 10th and 4th, respectively, out of 27 participants. We focus on an internal evaluation covering three national newspapers: Il Corriere, Repubblica and Libero. This task was prompted by a project on the evaluation of press stylistic features in political discourse. The project used newspaper articles from the same sources over a period of three months, thus including the most recent governmental crisis, in 2013. We intended to produce a similar experiment and compare the results with those for the previous 2011 crisis. In this evaluation, we focused on Subjectivity, Polarity and Factuality, which include Modality evaluation. Graphs at the end of the paper show results confirming our previous findings about differences in style, with Il Corriere emerging as the most atypical.
DISI - Via Sommarive 14 - 38123 Povo
Handling everyday tasks such as search, classification and integration is becoming increasingly difficult, and sometimes even impossible, due to the increasing streams of data available. To overcome this information overload we need more accurate information processing tools capable of handling large amounts of data. In particular, handling metadata can give us leverage over the data and enable structured processing of it. However, while some of this metadata is in a computer-readable format, some of it is manually created in ambiguous natural language. Thus, accessing the semantics of natural language can increase the quality of information processing. We propose a natural language metadata understanding architecture that enables applications such as semantic matching, classification and search based on natural language metadata by providing a translation into a formal language which outperforms the state of the art by 15%.
Lexical simplification for the systematic support of cognitive accessibility guidelines
The Internet has come a long way in recent years, contributing to the proliferation of
large volumes of digitally available information. We can access this content through user
interfaces; however, it is not accessible to everyone. The users mainly affected are
people with disabilities, who already make up a considerable number, but accessibility barriers
affect a wide range of user groups and contexts of use in accessing digital information.
Some of these barriers are caused by language inaccessibility when texts contain long
sentences, unusual words and complex linguistic structures. These accessibility barriers
directly affect people with cognitive disabilities.
For the purpose of making textual content more accessible, there are initiatives such
as the Easy Reading guidelines, the Plain Language guidelines and some of the language-specific
Web Content Accessibility Guidelines (WCAG). These guidelines provide documentation,
but do not specify methods for meeting the requirements implicit in these
guidelines in a systematic way. To obtain a solution, methods from the Natural Language
Processing (NLP) discipline can provide support for achieving compliance with the cognitive
accessibility guidelines for the language.
The task of text simplification aims at reducing the linguistic complexity of a text from
a syntactic and lexical perspective, the latter being the main focus of this thesis. One
approach is to identify which words in a text are complex or uncommon
and, where such words exist, to provide a more common and simpler synonym together
with a simple definition, all aimed at people with cognitive disabilities.
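The identify-and-substitute idea in the paragraph above can be sketched as follows. This is an illustration under stated assumptions, not the architecture developed in the thesis: it assumes a word-frequency table as the complexity signal and a hand-made synonym dictionary, and the name `simplify`, the `threshold` parameter and all counts are invented for the example. The thesis targets Spanish and also supplies definitions; the toy data here is English and omits definitions for brevity.

```python
def simplify(tokens, freq, synonyms, threshold=100):
    """Replace tokens rarer than `threshold` with their most frequent synonym.

    freq: token -> corpus count (toy values, assumed complexity signal).
    synonyms: token -> list of candidate replacements (hypothetical resource).
    """
    out = []
    for tok in tokens:
        candidates = synonyms.get(tok, [])
        if freq.get(tok, 0) < threshold and candidates:
            best = max(candidates, key=lambda w: freq.get(w, 0))
            # Substitute only when the candidate is actually more common.
            out.append(best if freq.get(best, 0) > freq.get(tok, 0) else tok)
        else:
            out.append(tok)
    return out

# Invented toy counts and synonym list.
freq = {"the": 5000, "doctor": 900, "physician": 40, "arrived": 700}
synonyms = {"physician": ["doctor", "medic"]}
result = simplify(["the", "physician", "arrived"], freq, synonyms)
print(result)
```

A frequency threshold is only one possible complexity signal. The abstract also mentions word sense disambiguation and word embeddings, which would matter when a complex word has several senses with different synonyms.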
With this goal in mind, this Thesis presents the study, analysis, design and development
of an architecture, NLP methods, resources and tools for the lexical simplification of
texts for the Spanish language in a generic domain in the field of cognitive accessibility.
To achieve this, each of the steps present in the lexical simplification processes is studied,
together with methods for word sense disambiguation. As a contribution, different
types of word embedding are explored and created, supported by traditional and dynamic
embedding methods, such as transfer learning methods. In addition, since most of the
NLP methods require data for their operation, a resource in the framework of cognitive
accessibility is presented as a contribution.
Doctoral Programme in Computer Science and Technology, Universidad Carlos III de Madrid. President: José Antonio Macías Iglesias. Secretary: Israel González Carrasco. Member: Raquel Hervás Ballestero
Evaluating cross-language annotation transfer in the MultiSemCor corpus
In this paper we illustrate and evaluate an approach to the creation of high quality linguistically annotated resources based on the exploitation of aligned parallel corpora. This approach is based on the assumption that if a text in one language has been annotated and its translation has not, annotations can be transferred from the source text to the target using word alignment as a bridge. The transfer approach has been tested in the creation of the MultiSemCor corpus, an English/Italian parallel corpus created on the basis of the English SemCor corpus. In MultiSemCor texts are aligned at the word level and semantically annotated with a shared inventory of senses. We present some experiments carried out to evaluate the different steps involved in the methodology. The results of the evaluation suggest that the cross-language annotation transfer methodology is a promising solution allowing for the exploitation of existing (mostly English) annotated resources to bootstrap the creation of annotated corpora in new (resource-poor) languages with greatly reduced human effort.
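The transfer step described in this abstract, projecting annotations across a word alignment, can be sketched in a few lines. This is a simplified illustration and not the MultiSemCor pipeline: it assumes the alignment is already given as index pairs, and the function name `transfer_annotations` and the sense labels are hypothetical placeholders.

```python
def transfer_annotations(source_senses, alignment, target_len):
    """Project sense tags from source tokens onto aligned target tokens.

    source_senses: one sense tag (or None) per source token.
    alignment: (source_index, target_index) pairs from a word aligner.
    target_len: number of tokens in the target sentence.
    """
    target_senses = [None] * target_len
    for src_i, tgt_i in alignment:
        tag = source_senses[src_i]
        if tag is not None:
            target_senses[tgt_i] = tag
    return target_senses

# English "the dog barks" aligned one-to-one with Italian "il cane abbaia".
source_senses = [None, "dog-sense-1", "bark-sense-2"]  # placeholder labels
alignment = [(0, 0), (1, 1), (2, 2)]
projected = transfer_annotations(source_senses, alignment, 3)
print(projected)
```

Target tokens without an alignment link keep `None`, which is where manual checking would still be needed in a real corpus.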
- …