Extensible Database Simulator for Fast Prototyping In-Database Algorithms
With the rapid growth of data scale, in-database analytics and learning has become one of the most studied topics in the data science community because of its significance in reducing the gap between the management and the analysis of data. By extending database capabilities to analytics and learning, data scientists can save substantial time otherwise spent exchanging data between databases and external analytic tools. Toward this goal, researchers are attempting to integrate more data science algorithms into databases. However, implementing these algorithms in mainstream databases is extremely time-consuming, especially when it requires a deep dive into the database kernel. There is therefore demand for an easy-to-extend database simulator that helps rapidly prototype and verify in-database algorithms before implementing them in real databases. In this demo, we present such an extensible relational database simulator, DBSim, to help data scientists prototype their in-database analytics and learning algorithms and verify the effectiveness of their ideas at minimal cost. DBSim simulates a real relational database by integrating all the major components of a mainstream RDBMS, including a SQL parser, relational operators, and a query optimizer. In addition, DBSim provides various interfaces through which users can flexibly plug custom extension modules into any of the major components without modifying the kernel. Through these interfaces, DBSim supports easy extensions to SQL syntax, relational operators, query optimizer rules and cost models, and physical plan execution. Furthermore, DBSim provides utilities to facilitate development and debugging, such as a query plan visualizer and an interactive analyzer for optimization rules. We develop DBSim in pure Python so that most data science algorithms, many of which are written in Python, can be implemented in it seamlessly.
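As a rough illustration of the extension idea described above, the Python sketch below shows how a custom relational operator might be plugged into a simulator of this kind. The Operator, TableScan, and TopKByScore classes and their methods are hypothetical stand-ins invented for this example, not DBSim's actual interfaces.

    # Hypothetical plugin sketch; class names and interfaces are invented for
    # illustration and are not DBSim's actual API.
    class Operator:
        """Base relational operator: consumes rows from a child, yields rows."""
        def __init__(self, child=None):
            self.child = child

        def execute(self):
            raise NotImplementedError

    class TableScan(Operator):
        """Leaf operator scanning an in-memory table (a list of dict rows)."""
        def __init__(self, table):
            super().__init__()
            self.table = table

        def execute(self):
            return iter(self.table)

    class TopKByScore(Operator):
        """Custom analytics-style operator: keep the k rows with the highest
        value in a given column."""
        def __init__(self, child, column, k):
            super().__init__(child)
            self.column, self.k = column, k

        def execute(self):
            rows = sorted(self.child.execute(),
                          key=lambda r: r[self.column], reverse=True)
            return rows[:self.k]

    if __name__ == "__main__":
        data = [{"id": i, "score": i * 0.1} for i in range(100)]
        plan = TopKByScore(TableScan(data), column="score", k=3)
        print(plan.execute())
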
Query-Driven Sampling for Collective Entity Resolution
Probabilistic databases play a preeminent role in the processing and
management of uncertain data. Recently, many database research efforts have
integrated probabilistic models into databases to support tasks such as
information extraction and labeling. Many of these efforts rely on batch-oriented inference, which inhibits a real-time workflow. One important task is
entity resolution (ER). ER is the process of determining records (mentions) in
a database that correspond to the same real-world entity. Traditional pairwise
ER methods can lead to inconsistencies and low accuracy due to localized
decisions. Leading ER systems solve this problem by collectively resolving all
records using a probabilistic graphical model and Markov chain Monte Carlo
(MCMC) inference. However, for large datasets this is an extremely expensive
process. One key observation is that such an exhaustive ER process incurs a huge up-front cost, which is wasteful in practice because most users are interested in only a small subset of entities. In this paper, we advocate pay-as-you-go
entity resolution by developing a number of query-driven collective ER
techniques. We introduce two classes of SQL queries that involve ER operators
--- selection-driven ER and join-driven ER. We implement novel variations of the Metropolis-Hastings MCMC algorithm to generate biased samples, and selectivity-based scheduling algorithms to support the two classes of ER
queries. Finally, we show that query-driven ER algorithms can converge and
return results within minutes over a database populated with the extraction
from a newswire dataset containing 71 million mentions.
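To make the sampling idea concrete, here is a minimal, generic Python sketch of query-biased Metropolis-Hastings over entity assignments. The similarity-based scoring function and the bias scheme are illustrative stand-ins under simple assumptions, not the paper's actual model or proposal distribution.

    # Generic sketch of query-biased Metropolis-Hastings sampling over entity
    # assignments; the scoring and bias scheme are illustrative stand-ins.
    import math
    import random

    def mh_er(mentions, similarity, relevant, steps=10000, bias=0.8):
        """mentions: list of mention ids; similarity(i, j) in (0, 1];
        relevant: set of mention ids touched by the query predicate."""
        assign = {m: m for m in mentions}  # start with singleton entities

        def log_score(m, entity):
            # Log-score of placing mention m in `entity`: sum of log pairwise
            # similarities with the entity's other members (a naive stand-in).
            members = [x for x, e in assign.items() if e == entity and x != m]
            return sum(math.log(similarity(m, x) + 1e-9) for x in members)

        for _ in range(steps):
            # Bias the proposal toward mentions the query actually touches.
            pool = list(relevant) if relevant and random.random() < bias else mentions
            m = random.choice(pool)
            current = assign[m]
            proposed = assign[random.choice(mentions)]  # move m to another entity
            if proposed == current:
                continue
            log_ratio = log_score(m, proposed) - log_score(m, current)
            if math.log(random.random() + 1e-12) < log_ratio:
                assign[m] = proposed
        return assign

    # Toy usage: mentions 0-2 and 3-5 are similar within each group; the query
    # only touches mentions 0 and 1.
    sim = lambda i, j: 0.9 if (i < 3) == (j < 3) else 0.1
    print(mh_er(list(range(6)), sim, relevant={0, 1}, steps=2000))
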
LIDER: An Efficient High-dimensional Learned Index for Large-scale Dense Passage Retrieval
Many recent passage retrieval approaches use dense embeddings generated by deep neural models, an approach called "dense passage retrieval". The
state-of-the-art end-to-end dense passage retrieval systems normally deploy a
deep neural model followed by an approximate nearest neighbor (ANN) search
module. The model generates embeddings of the corpus and queries, which are
then indexed and searched by the high-performance ANN module. As the data scale grows, the ANN module unavoidably becomes the efficiency bottleneck. An alternative is the learned index, which achieves high search efficiency by learning the data distribution and predicting the location of the target data. However, most existing learned indexes are designed for low-dimensional data and are therefore not suitable for dense passage retrieval with
high-dimensional dense embeddings. In this paper, we propose LIDER, an
efficient high-dimensional Learned Index for large-scale DEnse passage
Retrieval. LIDER has a clustering-based hierarchical architecture formed by two
layers of core models. As the basic unit of LIDER to index and search data, a
core model includes an adapted recursive model index (RMI) and a dimension
reduction component which consists of an extended SortingKeys-LSH (SK-LSH) and
a key re-scaling module. The dimension reduction component reduces the
high-dimensional dense embeddings to one-dimensional keys and sorts them in a specific order, which the RMI then uses to make fast predictions. Experiments show that LIDER achieves higher search speed with high retrieval quality compared to state-of-the-art ANN indexes on passage retrieval tasks; e.g., on large-scale data it achieves 1.2x the search speed of the fastest baseline in our evaluation with significantly higher retrieval quality. Furthermore, LIDER offers a better speed-quality trade-off.
Comment: Accepted by VLDB 202
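The following Python sketch illustrates the "reduce to a one-dimensional key, then learn key-to-position" idea in miniature. A single random projection stands in for the extended SK-LSH component and a linear fit stands in for the recursive model index; both are simplifications made for illustration, not LIDER's actual components.

    # Toy sketch: a random projection stands in for the extended SK-LSH keys and
    # a linear fit stands in for the recursive model index (RMI).
    import numpy as np

    rng = np.random.default_rng(0)
    corpus = rng.normal(size=(10000, 128))   # dense passage embeddings
    proj = rng.normal(size=128)              # random direction (LSH-like stand-in)

    keys = corpus @ proj                     # one-dimensional keys
    order = np.argsort(keys)
    sorted_keys = keys[order]

    # "Learned index": predict a key's position in the sorted key array.
    a, b = np.polyfit(sorted_keys, np.arange(len(sorted_keys)), 1)

    def search(query, radius=200, top_k=10):
        pos = int(np.clip(a * (query @ proj) + b, 0, len(order) - 1))
        lo, hi = max(0, pos - radius), min(len(order), pos + radius)
        cand = order[lo:hi]                  # candidates near the predicted position
        sims = corpus[cand] @ query          # re-rank candidates by inner product
        return cand[np.argsort(-sims)[:top_k]]

    print(search(rng.normal(size=128)))
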
MythQA: Query-Based Large-Scale Check-Worthy Claim Detection through Multi-Answer Open-Domain Question Answering
Check-worthy claim detection aims to provide plausible misinformation to downstream fact-checking systems or human experts to check. This is a crucial step toward accelerating the fact-checking process. Much effort has gone into identifying check-worthy claims from small sets of pre-collected claims, but how to efficiently detect check-worthy claims directly from a
large-scale information source, such as Twitter, remains underexplored. To fill
this gap, we introduce MythQA, a new multi-answer open-domain question
answering (QA) task that involves contradictory stance mining for query-based
large-scale check-worthy claim detection. The idea behind this is that
contradictory claims are a strong indicator of misinformation that merits
scrutiny by the appropriate authorities. To study this task, we construct
TweetMythQA, an evaluation dataset containing 522 factoid multi-answer
questions based on controversial topics. Each question is annotated with
multiple answers. Moreover, we collect relevant tweets for each distinct
answer, then classify them into three categories: "Supporting", "Refuting", and
"Neutral". In total, we annotated 5.3K tweets. Contradictory evidence is
collected for all answers in the dataset. Finally, we present a baseline system
for MythQA and evaluate existing NLP models for each system component using the
TweetMythQA dataset. We provide initial benchmarks and identify key challenges
for future models to improve upon. Code and data are available at:
https://github.com/TonyBY/Myth-QA
Comment: Accepted by SIGIR 202
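The record structure implied by the annotations can be sketched in Python as follows; the field and class names are illustrative, not the dataset's exact schema.

    # Illustrative record structure; names are not the dataset's exact schema.
    from dataclasses import dataclass, field
    from typing import Dict, List

    STANCES = ("Supporting", "Refuting", "Neutral")

    @dataclass
    class AnswerEvidence:
        answer: str
        tweets: Dict[str, List[str]] = field(
            default_factory=lambda: {s: [] for s in STANCES})

    @dataclass
    class MythQAQuestion:
        question: str
        answers: List[AnswerEvidence]

        def contested_answers(self):
            # Answers with both supporting and refuting tweets signal the
            # contradictory claims that merit fact-checking.
            return [a.answer for a in self.answers
                    if a.tweets["Supporting"] and a.tweets["Refuting"]]
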
ChronoR: Rotation Based Temporal Knowledge Graph Embedding
Despite the importance and abundance of temporal knowledge graphs, most of
the current research has been focused on reasoning on static graphs. In this
paper, we study the challenging problem of inference over temporal knowledge
graphs, in particular the task of temporal link prediction. In general, this is a difficult task due to data non-stationarity, data heterogeneity, and complex temporal dependencies. We propose Chronological Rotation embedding
(ChronoR), a novel model for learning representations for entities, relations,
and time. Learning dense representations is frequently used as an efficient and
versatile method to perform reasoning on knowledge graphs. The proposed model
learns a k-dimensional rotation transformation parametrized by relation and
time, such that after each fact's head entity is transformed using the
rotation, it falls near its corresponding tail entity. By using high-dimensional rotation as its transformation operator, ChronoR captures rich
interaction between the temporal and multi-relational characteristics of a
Temporal Knowledge Graph. Experimentally, we show that ChronoR is able to
outperform many of the state-of-the-art methods on the benchmark datasets for
temporal knowledge graph link prediction.
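A toy Python sketch of the rotation-based scoring idea is given below; treating the embedding as coordinate pairs rotated in 2-D planes is an illustrative simplification of the paper's k-dimensional rotation operator, not its exact formulation.

    # Toy rotation-based scoring; pairs of coordinates rotated in 2-D planes are
    # an illustrative simplification of the paper's k-dimensional rotation.
    import numpy as np

    def rotate(x, theta):
        """Rotate each consecutive coordinate pair of x by the matching angle."""
        x = x.reshape(-1, 2)
        c, s = np.cos(theta), np.sin(theta)
        out = np.stack([c * x[:, 0] - s * x[:, 1],
                        s * x[:, 0] + c * x[:, 1]], axis=1)
        return out.reshape(-1)

    def score(head, rel_angles, time_angles, tail):
        # Relation and time jointly parameterize the rotation; a fact scores
        # well when the rotated head lands near the tail (higher is better).
        return -np.linalg.norm(rotate(head, rel_angles + time_angles) - tail)

    d = 8                                    # embedding dimension (must be even)
    rng = np.random.default_rng(1)
    h, t = rng.normal(size=d), rng.normal(size=d)
    print(score(h, rng.uniform(0, 2 * np.pi, d // 2),
                rng.uniform(0, 2 * np.pi, d // 2), t))
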
Can Knowledge Graphs Simplify Text?
Knowledge Graph (KG)-to-Text Generation has seen recent improvements in
generating fluent and informative sentences that describe a given KG. KGs are widespread across multiple domains and contain important entity-relation information, while text simplification aims to reduce the complexity of a text while preserving its meaning. We therefore propose KGSimple, a novel approach to unsupervised text simplification that infuses KG-established techniques to construct a simplified KG path and generate a concise text that preserves the original input's meaning. Through an iterative, sampling-based KG-first approach, our model simplifies text starting from a KG by learning to retain important information while harnessing KG-to-text generation to output fluent and descriptive sentences. We evaluate
various settings of the KGSimple model on currently-available KG-to-text
datasets, demonstrating its effectiveness compared to unsupervised text
simplification models that start with a given complex text. Our code is available on GitHub.
Comment: Accepted as a Main Conference Long Paper at CIKM 202
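A minimal Python sketch of an iterative, sampling-based "simplify the KG, then generate" loop is shown below; the triple-dropping operation and the generate_text and score hooks are placeholders for illustration, not the KGSimple model itself.

    # Placeholder hooks (generate_text, score) stand in for the model components.
    import random

    def simplify(kg_triples, generate_text, score, iterations=50):
        """kg_triples: list of (head, relation, tail); generate_text: triples -> str;
        score: str -> float, rewarding fluency/brevity and penalizing meaning loss."""
        best_kg = list(kg_triples)
        best_text = generate_text(best_kg)
        best_score = score(best_text)

        for _ in range(iterations):
            if len(best_kg) <= 1:
                break
            # Sample a candidate simplification: drop one triple from the path.
            candidate_kg = list(best_kg)
            candidate_kg.pop(random.randrange(len(candidate_kg)))
            candidate_text = generate_text(candidate_kg)
            candidate_score = score(candidate_text)
            # Keep the candidate only if the generated text improves the trade-off
            # between conciseness and preserved content.
            if candidate_score > best_score:
                best_kg, best_text, best_score = candidate_kg, candidate_text, candidate_score
        return best_text
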