Composing Measures for Computing Text Similarity
We present a comprehensive study of computing similarity between texts. We start from the observation that while the concept of similarity is well grounded in psychology, text similarity is much less well-defined in the natural language processing community. We thus define the notion of text similarity and distinguish it from related tasks such as textual entailment and near-duplicate detection. We then identify multiple text dimensions, i.e. characteristics inherent to texts that can be used to judge text similarity, for which we provide empirical evidence.
We discuss state-of-the-art text similarity measures previously proposed in the literature, before continuing with a thorough discussion of common evaluation metrics and datasets. Based on the analysis, we devise an architecture which combines text similarity measures in a unified classification framework. We apply our system in two evaluation settings, for which it consistently outperforms prior work and competing systems: (a) an intrinsic evaluation in the context of the Semantic Textual Similarity Task as part of the Semantic Evaluation (SemEval) exercises, and (b) an extrinsic evaluation for the detection of text reuse.
As a basis for future work, we introduce DKPro Similarity, an open source software package which streamlines the development of text similarity measures and complete experimental setups.
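The composition idea above can be sketched in a few lines: several surface-level similarity measures are computed per text pair and combined into a single score. This is an illustrative toy, not the DKPro Similarity API (which is a Java framework); the measures and weights are invented, and the abstract's actual system learns the combination with a classifier rather than fixing weights by hand.

```python
# Minimal sketch of composing text similarity measures into one score.
# Measures and weights are illustrative assumptions, not DKPro Similarity.

def jaccard_words(a: str, b: str) -> float:
    """Word-overlap similarity on lowercased token sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def char_ngrams(text: str, n: int = 3) -> set:
    """Set of character n-grams, a common surface-level feature."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard_ngrams(a: str, b: str, n: int = 3) -> float:
    """Character-level Jaccard similarity, robust to small spelling edits."""
    sa, sb = char_ngrams(a.lower(), n), char_ngrams(b.lower(), n)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def composed_similarity(a: str, b: str, weights=(0.5, 0.5)) -> float:
    """Linear combination of the individual measures."""
    feats = (jaccard_words(a, b), jaccard_ngrams(a, b))
    return sum(w * f for w, f in zip(weights, feats))

score = composed_similarity("the cat sat on the mat", "a cat sat on a mat")
assert 0.0 <= score <= 1.0
```

In the abstract's system the per-measure scores serve as features of a trained classifier instead of a fixed linear combination, which lets the data decide how much each text dimension matters.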
Controlling Hallucinations at Word Level in Data-to-Text Generation
Data-to-Text Generation (DTG) is a subfield of Natural Language Generation
aiming at transcribing structured data into natural language descriptions. The
field has been recently boosted by the use of neural-based generators which
exhibit, on the one hand, great syntactic skills without the need for
hand-crafted pipelines; on the other hand, the quality of the generated text
reflects the quality of the training data, which in realistic settings only
offer imperfectly aligned structure-text pairs. Consequently, state-of-the-art
neural models include misleading statements - usually called hallucinations -
in their outputs. The control of this phenomenon is today a major challenge
for DTG, and is the problem addressed in this paper.
Previous work deals with this issue at the instance level, using an alignment
score for each table-reference pair. In contrast, we propose a finer-grained
approach, arguing that hallucinations should rather be treated at the word
level. Specifically, we propose a Multi-Branch Decoder which is able to
leverage word-level labels to learn the relevant parts of each training
instance. These labels are obtained following a simple and efficient scoring
procedure based on co-occurrence analysis and dependency parsing. Extensive
evaluations, via automated metrics and human judgment on the standard WikiBio
benchmark, show the accuracy of our alignment labels and the effectiveness of
the proposed Multi-Branch Decoder. Our model is able to reduce and control
hallucinations, while keeping fluency and coherence in generated texts. Further
experiments on a degraded version of ToTTo show that our model could be
successfully used in very noisy settings.
Comment: 20 pages, 6 figures, 5 tables (excluding Appendix). Source code:
https://github.com/KaijuML/dtt-multi-branc
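The word-level labelling idea can be illustrated with a much-simplified scoring pass: each word of the reference sentence is marked as supported if it occurs among the source-table values, and as a potential hallucination otherwise. The paper's actual procedure is richer (co-occurrence analysis and dependency parsing); the plain string matching and the example table below are illustrative assumptions.

```python
# Simplified sketch of word-level alignment labelling for data-to-text:
# 1 = word supported by the source table, 0 = potential hallucination.
# Exact string match stands in for the paper's co-occurrence/parsing-based
# scoring procedure.

def word_labels(table: dict, sentence: str) -> list:
    table_tokens = {t.lower() for v in table.values() for t in str(v).split()}
    return [(w, 1 if w.lower().strip(".,") in table_tokens else 0)
            for w in sentence.split()]

# Hypothetical WikiBio-style record and reference sentence.
table = {"name": "John Doe", "occupation": "engineer"}
labels = word_labels(table, "John Doe is a famous engineer.")
# "famous" has no support in the table and receives label 0
```

In the paper, labels of this kind are what the Multi-Branch Decoder consumes at training time, so that unsupported words can be routed away from the content-grounded branch.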
Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 2018 : 10-12 December 2018, Torino
On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall “Cavallerizza Reale”. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after five years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.
Automatic reconstruction of itineraries from descriptive texts
This thesis falls within the scope of the PERDIDO project, whose objectives are the extraction and reconstruction of itineraries from textual documents. The work was carried out in collaboration between the LIUPPA laboratory of the Université de Pau et des Pays de l'Adour (France), the Advanced Information Systems group (IAAA) of the Universidad de Zaragoza, and the COGIT laboratory of the IGN (France). The goal of this thesis is to design an automatic system able to extract, from travel guides or itinerary descriptions, the displacements described, and to represent them on a map. We propose an approach for the automatic representation of itineraries described in natural language. Our proposal is divided into two main tasks. The first aims to identify and extract, from texts describing itineraries, information such as spatial entities and expressions of displacement or perception. The objective of the second task is the reconstruction of the itinerary. Our proposal combines local information extracted through natural language processing with data retrieved from external geographical sources (e.g., gazetteers). The spatial-information annotation stage relies on an approach that combines part-of-speech tagging and lexico-syntactic patterns (a cascade of transducers) in order to annotate spatial named entities and expressions of displacement and perception. A first contribution to the first task is toponym disambiguation, a problem still poorly solved within Named Entity Recognition (NER) and essential to geographical information retrieval.
We propose an unsupervised georeferencing algorithm based on a clustering technique, capable of disambiguating the toponyms found in external geographical resources while, at the same time, locating unreferenced toponyms. We propose a generic graph model for the automatic reconstruction of itineraries, where each node represents a place and each edge represents a path linking two places. The originality of our model is that, besides taking into account the usual elements (paths and waypoints), it can represent other elements involved in the description of an itinerary, such as visual landmarks. A minimum spanning tree is computed from a weighted graph to automatically obtain an itinerary in the form of a graph. Each edge of the initial graph is weighted by a multi-criteria analysis method combining qualitative and quantitative criteria. The value of these criteria is determined from information extracted from the text and from external geographical resources. For example, information produced by natural language processing, such as spatial relations describing an orientation (e.g., head south), is combined with the geographical coordinates of places found in the resources to determine the value of the "spatial relation" criterion. In addition, starting from the definition of the concept of an itinerary and from the information used in language to describe one, we have modelled a spatial-information annotation language adapted to the description of displacements, building on the recommendations of the TEI (Text Encoding and Interchange) consortium.
Finally, the different stages of our approach were implemented and evaluated on a multilingual corpus of hiking and excursion descriptions (French, Spanish, Italian).
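The reconstruction step described above, a minimum spanning tree over a multi-criteria weighted graph of places, can be sketched with Kruskal's algorithm. The places and edge weights below are invented stand-ins for the thesis's multi-criteria edge scores.

```python
# Sketch of itinerary reconstruction as a minimum spanning tree over a
# weighted graph of places (Kruskal's algorithm with union-find).
# Place names and weights are illustrative, not from the thesis corpus.

def mst_kruskal(nodes, edges):
    """edges are (weight, u, v) tuples; returns the tree as (u, v, weight)."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    tree = []
    for w, u, v in sorted(edges):           # cheapest edges first
        ru, rv = find(u), find(v)
        if ru != rv:                        # adding the edge creates no cycle
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

places = ["village", "bridge", "summit", "lake"]
paths = [(2, "village", "bridge"), (5, "village", "summit"),
         (1, "bridge", "summit"), (4, "summit", "lake"),
         (7, "bridge", "lake")]
itinerary = mst_kruskal(places, paths)  # 3 edges connect the 4 places
```

In the thesis, each weight would come from the multi-criteria analysis (spatial relations from the text combined with gazetteer coordinates) rather than being given directly.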
A graph-based framework for data retrieved from criminal-related documents
The digitalization of company processes has enhanced the treatment and analysis of a growing volume
of data from heterogeneous sources, with emerging challenges, namely those related to knowledge representation.
The Criminal Police face similar challenges, considering the amount of unstructured data from
police reports manually analyzed by criminal investigators, consuming time and resources.
There is a need to automatically extract and represent the unstructured data existing in criminal-related
documents and reduce the manual analysis by criminal investigators. Computer science faces the
challenge of applying emergent computational models that can extract and represent these data
using new or existing methods.
A broad set of computational methods has been applied to the criminal domain, such as the
identification and classification of named entities (NEs), for example narcotics, or the extraction of
relations between entities relevant to the criminal investigation. However, these methods have mainly
been applied to the English language. In Portugal, research in this domain applying computational
methods lacks related work, making its application to criminal investigation unfeasible.
This thesis proposes an integrated solution for the representation of unstructured data retrieved from
documents, using a set of computational methods: a Preprocessing Criminal-Related Documents module,
supported by Extraction, Transformation, and Loading tasks; a Natural Language Processing pipeline
applied to the Portuguese language, for syntactic and semantic analysis of textual data; the 5W1H
Information Extraction Method, which combines the Named-Entity Recognition, Semantic Role Labelling,
and Criminal Terms Extraction tasks; and the Graph Database Population and Enrichment module, which
represents the retrieved data in a Neo4j graph database.
Globally, the framework presents promising results, which were validated using prototypes developed
for this purpose. In addition, the feasibility of extracting unstructured data, its syntactic and semantic
interpretation, and its representation in the graph database has also been demonstrated.
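The Graph Database Population step can be sketched as follows: each extracted 5W1H element becomes a node linked to its source event. A real pipeline would issue Cypher MERGE statements against Neo4j; the in-memory structure below, and the sample extraction result, are illustrative assumptions.

```python
# Sketch of graph population from 5W1H extraction results: events and
# their who/what/where/when elements become nodes, linked by role edges.
# A plain dict stands in for the thesis's Neo4j database.

def populate_graph(events):
    graph = {"nodes": set(), "edges": []}
    for event_id, fields in events.items():
        graph["nodes"].add(("Event", event_id))
        for role, value in fields.items():   # role is who/what/where/when/...
            graph["nodes"].add(("Entity", value))
            graph["edges"].append((event_id, role.upper(), value))
    return graph

# Hypothetical extraction result for one police-report sentence.
events = {"ev1": {"who": "suspect A", "what": "theft",
                  "where": "Lisbon", "when": "2020-01-15"}}
g = populate_graph(events)
```

With Neo4j, each `(event_id, role, value)` triple would map to a relationship such as `(:Event)-[:WHERE]->(:Entity)`, which is what makes cross-report queries over shared entities possible.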
An automated pipeline for the discovery of conspiracy and conspiracy theory narrative frameworks: Bridgegate, Pizzagate and storytelling on the web
Although a great deal of attention has been paid to how conspiracy theories
circulate on social media and their factual counterpart conspiracies, there has
been little computational work done on describing their narrative structures.
We present an automated pipeline for the discovery and description of the
generative narrative frameworks of conspiracy theories on social media, and
actual conspiracies reported in the news media. We base this work on two
separate repositories of posts and news articles describing the well-known
conspiracy theory Pizzagate from 2016, and the New Jersey conspiracy Bridgegate
from 2013. We formulate a graphical generative machine learning model where
nodes represent actors/actants, and multi-edges and self-loops among nodes
capture context-specific relationships. Posts and news items are viewed as
samples of subgraphs of the hidden narrative network. The problem of
reconstructing the underlying structure is posed as a latent model estimation
problem. We automatically extract and aggregate the actants and their
relationships from the posts and articles. We capture context-specific actants
and interactant relationships by developing a system of supernodes and
subnodes. We use these to construct a network, which constitutes the underlying
narrative framework. We show how the Pizzagate framework relies on the
conspiracy theorists' interpretation of "hidden knowledge" to link otherwise
unlinked domains of human interaction, and hypothesize that this multi-domain
focus is an important feature of conspiracy theories. While Pizzagate relies on
the alignment of multiple domains, Bridgegate remains firmly rooted in the
single domain of New Jersey politics. We hypothesize that the narrative
framework of a conspiracy theory might stabilize quickly, in contrast to the
narrative framework of an actual conspiracy, which may develop more slowly as
revelations come to light.
Comment: conspiracy theory, narrative structur
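The network construction described above can be sketched as a multigraph built from extracted (actant, relation, actant) triples, with context-specific subnodes mapped to their supernodes. The triples and groupings below are invented examples that only loosely echo the Pizzagate case; the pipeline's actual extraction uses learned models over the post and article text.

```python
# Sketch of narrative-network construction: actants are nodes, each
# extracted relationship is one edge of a multigraph (parallel edges keep
# distinct contexts), and subnodes collapse into supernodes.
from collections import defaultdict

def build_network(triples, supernodes):
    """triples: (actant, relation, actant); supernodes: subnode -> supernode."""
    edges = defaultdict(list)            # (u, v) -> list of relation labels
    for a, rel, b in triples:
        u, v = supernodes.get(a, a), supernodes.get(b, b)
        edges[(u, v)].append(rel)        # multi-edge: append, don't overwrite
    return dict(edges)

# Invented triples and grouping for illustration.
triples = [("owner", "runs", "restaurant"),
           ("politician", "emails", "owner"),
           ("politician", "visits", "restaurant")]
supernodes = {"owner": "restaurant_actors", "restaurant": "restaurant_actors"}
net = build_network(triples, supernodes)
# the ("restaurant_actors", "restaurant_actors") entry is a self-loop
```

Collapsing subnodes into supernodes is what lets the pipeline aggregate many context-specific mentions of the same actant while the parallel edges preserve each distinct relationship.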
Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme
Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologie