10,390 research outputs found
Probabilistic Bag-Of-Hyperlinks Model for Entity Linking
Many fundamental problems in natural language processing rely on determining
what entities appear in a given text. Commonly referenced as entity linking,
this step is a fundamental component of many NLP tasks such as text
understanding, automatic summarization, semantic search or machine translation.
Name ambiguity, word polysemy, context dependencies and a heavy-tailed
distribution of entities contribute to the complexity of this problem.
We here propose a probabilistic approach that makes use of an effective
graphical model to perform collective entity disambiguation. Input mentions
(i.e.,~linkable token spans) are disambiguated jointly across an entire
document by combining a document-level prior of entity co-occurrences with
local information captured from mentions and their surrounding context. The
model is based on simple sufficient statistics extracted from data, thus
relying on few parameters to be learned.
Our method does not require extensive feature engineering, nor an expensive
training procedure. We use loopy belief propagation to perform approximate
inference. The low complexity of our model makes this step sufficiently fast
for real-time usage. We demonstrate the accuracy of our approach on a wide
range of benchmark datasets, showing that it matches, and in many cases
outperforms, existing state-of-the-art methods
A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration
In practical data integration systems, it is common for the data sources
being integrated to provide conflicting information about the same entity.
Consequently, a major challenge for data integration is to derive the most
complete and accurate integrated records from diverse and sometimes conflicting
sources. We term this challenge the truth finding problem. We observe that some
sources are generally more reliable than others, and therefore a good model of
source quality is the key to solving the truth finding problem. In this work,
we propose a probabilistic graphical model that can automatically infer true
records and source quality without any supervision. In contrast to previous
methods, our principled approach leverages a generative process of two types of
errors (false positive and false negative) by modeling two different aspects of
source quality. In so doing, ours is also the first approach designed to merge
multi-valued attribute types. Our method is scalable, due to an efficient
sampling-based inference algorithm that needs very few iterations in practice
and enjoys linear time complexity, with an even faster incremental variant.
Experiments on two real world datasets show that our new method outperforms
existing state-of-the-art approaches to the truth finding problem.Comment: VLDB201
Incrementally Learned Mixture Models for GNSS Localization
GNSS localization is an important part of today's autonomous systems,
although it suffers from non-Gaussian errors caused by non-line-of-sight
effects. Recent methods are able to mitigate these effects by including the
corresponding distributions in the sensor fusion algorithm. However, these
approaches require prior knowledge about the sensor's distribution, which is
often not available. We introduce a novel sensor fusion algorithm based on
variational Bayesian inference, that is able to approximate the true
distribution with a Gaussian mixture model and to learn its parametrization
online. The proposed Incremental Variational Mixture algorithm automatically
adapts the number of mixture components to the complexity of the measurement's
error distribution. We compare the proposed algorithm against current
state-of-the-art approaches using a collection of open access real world
datasets and demonstrate its superior localization accuracy.Comment: 8 pages, 5 figures, published in proceedings of IEEE Intelligent
Vehicles Symposium (IV) 201
Inferring causal relations from multivariate time series : a fast method for large-scale gene expression data
Various multivariate time series analysis techniques have been developed with the aim of inferring causal relations between time series. Previously, these techniques have proved their effectiveness on economic and neurophysiological data, which normally consist of hundreds of samples. However, in their applications to gene regulatory inference, the small sample size of gene expression time series poses an obstacle. In this paper, we describe some of the most commonly used multivariate inference techniques and show the potential challenge related to gene expression analysis. In response, we propose a directed partial correlation (DPC) algorithm as an efficient and effective solution to causal/regulatory relations inference on small sample gene expression data. Comparative evaluations on the existing techniques and the proposed method are presented. To draw reliable conclusions, a comprehensive benchmarking on data sets of various setups is essential. Three experiments are designed to assess these methods in a coherent manner. Detailed analysis of experimental results not only reveals good accuracy of the proposed DPC method in large-scale prediction, but also gives much insight into all methods under evaluation
Deep Joint Entity Disambiguation with Local Neural Attention
We propose a novel deep learning model for joint document-level entity
disambiguation, which leverages learned neural representations. Key components
are entity embeddings, a neural attention mechanism over local context windows,
and a differentiable joint inference stage for disambiguation. Our approach
thereby combines benefits of deep learning with more traditional approaches
such as graphical models and probabilistic mention-entity maps. Extensive
experiments show that we are able to obtain competitive or state-of-the-art
accuracy at moderate computational costs.Comment: Conference on Empirical Methods in Natural Language Processing
(EMNLP) 2017 long pape
- …