Distributed Holistic Clustering on Linked Data
Link discovery is an active field of research to support data integration in
the Web of Data. Due to the huge size and number of available data sources,
efficient and effective link discovery is a very challenging task. Common
pairwise link discovery approaches do not scale to many sources with very large
entity sets. We here propose a distributed holistic approach to link many data
sources based on a clustering of entities that represent the same real-world
object. Our clustering approach provides a compact and fused representation of
entities, and can identify errors in existing links as well as many new links.
We support a distributed execution of the clustering approach to achieve faster
execution times and scalability for large real-world data sets. We provide a
novel gold standard for multi-source clustering, and evaluate our methods with
respect to effectiveness and efficiency for large data sets from the geographic
and music domains.
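To make the idea of clustering linked entities concrete, here is a minimal sketch that groups entity ids into clusters by taking connected components of the link graph with a union-find structure. It is not the method of the paper above: the function names, the `source:id` naming scheme, and the rule that a cluster holding two entities from the same source is suspicious (which assumes each source is duplicate-free) are all illustrative assumptions.

```python
from collections import defaultdict


def cluster_links(links):
    """Group entity ids into clusters: connected components of the link graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    clusters = defaultdict(set)
    for entity in list(parent):
        clusters[find(entity)].add(entity)
    return sorted(sorted(members) for members in clusters.values())


def has_source_conflict(cluster):
    """Assumption: ids look like 'source:local_id' and sources are duplicate-free,
    so two members from the same source hint at an incorrect link."""
    sources = [entity.split(":", 1)[0] for entity in cluster]
    return len(sources) != len(set(sources))


# Hypothetical links between three sources a, b and c.
links = [("a:1", "b:7"), ("b:7", "c:3"), ("a:2", "c:9"), ("c:9", "a:5")]
clusters = cluster_links(links)
flagged = [c for c in clusters if has_source_conflict(c)]
```

In this toy input the second cluster contains both `a:2` and `a:5`, so it is flagged for review, while the first cluster fuses one entity from each source.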
Cloud-Scale Entity Resolution: Current State and Open Challenges
Entity resolution (ER) is the process of identifying records in information systems that refer to the same real-world entity. Because data volumes have grown so large over the past two decades, parallel techniques are needed to meet ER's requirements for high performance and scalability. The development of parallel ER has reached a relatively mature stage and has found its way into several applications. In this work, we first comprehensively survey the state of the art of parallel ER approaches. From this overview, we then extract classification criteria for parallel ER, and classify and compare the approaches based on these criteria. Finally, we identify open research questions and challenges, and discuss potential solutions and directions for further research in this field.
mARC: Memory by Association and Reinforcement of Contexts
This paper introduces the memory by Association and Reinforcement of Contexts
(mARC). mARC is a novel data modeling technology rooted in the second
quantization formulation of quantum mechanics. It is an all-purpose incremental
and unsupervised data storage and retrieval system which can be applied to all
types of signal or data, structured or unstructured, textual or not. mARC can
be applied to a wide range of information classification and retrieval
problems like e-Discovery or contextual navigation. It can also be formulated in
the artificial life framework, a.k.a. Conway's "Game of Life" theory. In contrast
to Conway's approach, the objects evolve in a massively multidimensional space.
In order to start evaluating the potential of mARC we have built a mARC-based
Internet search engine demonstrator with contextual functionality. We compare
the behavior of the mARC demonstrator with Google search both in terms of
performance and relevance. In the study we find that the mARC search engine
demonstrator outperforms Google search by an order of magnitude in response
time while providing more relevant results for some classes of queries.
Scalable Methods for Adaptively Seeding a Social Network
In recent years, social networking platforms have developed into
extraordinary channels for spreading and consuming information. Along with the
rise of such infrastructure, there is continuous progress on techniques for
spreading information effectively through influential users. In many
applications, one is restricted to select influencers from a set of users who
engaged with the topic being promoted, and due to the structure of social
networks, these users often rank low in terms of their influence potential. An
alternative approach one can consider is an adaptive method which selects users
in a manner which targets their influential neighbors. The advantage of such an
approach is that it leverages the friendship paradox in social networks: while
users are often not influential, they often know someone who is.
Despite the various complexities in such optimization problems, we show that
scalable adaptive seeding is achievable. In particular, we develop algorithms
for linear influence models with provable approximation guarantees that can be
gracefully parallelized. To show the effectiveness of our methods we collected
data from various verticals that social network users follow. For each vertical, we
collected data on the users who responded to a certain post as well as their
neighbors, and applied our methods on this data. Our experiments show that
adaptive seeding is scalable, and importantly, that it obtains dramatic
improvements over standard approaches of information dissemination.
Comment: Full version of the paper appearing in WWW 201
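The friendship paradox that the abstract above relies on can be checked on a toy graph. The sketch below is only an illustration, not the paper's algorithm: it uses node degree as a stand-in for influence (an assumption), and the graph and function names are hypothetical.

```python
def mean_degree(adj):
    """Average number of friends per user."""
    return sum(len(friends) for friends in adj.values()) / len(adj)


def mean_friend_degree(adj):
    """Average, over users with at least one friend, of their friends' mean degree."""
    per_user = [
        sum(len(adj[f]) for f in friends) / len(friends)
        for friends in adj.values()
        if friends
    ]
    return sum(per_user) / len(per_user)


# Hypothetical undirected friendship graph: one hub and several low-degree users.
adj = {
    "hub": {"u1", "u2", "u3", "u4"},
    "u1": {"hub"},
    "u2": {"hub"},
    "u3": {"hub"},
    "u4": {"hub", "u5"},
    "u5": {"u4"},
}

own = mean_degree(adj)
friends = mean_friend_degree(adj)
```

On this graph `friends` exceeds `own`: most users have a single friend, yet that friend is usually the hub. This is the effect an adaptive strategy exploits when it seeds low-influence users in order to reach their better-connected neighbors.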
End-to-End Entity Resolution for Big Data: A Survey
One of the most important tasks for improving data quality and the
reliability of data analytics results is Entity Resolution (ER). ER aims to
identify different descriptions that refer to the same real-world entity, and
remains a challenging problem. While previous works have studied specific
aspects of ER (and mostly in traditional settings), in this survey, we provide
for the first time an end-to-end view of modern ER workflows, and of the novel
aspects of entity indexing and matching methods in order to cope with more than
one of the Big Data characteristics simultaneously. We present the basic
concepts, processing steps and execution strategies that have been proposed by
different communities, i.e., database, semantic Web and machine learning, in
order to cope with the loose structuredness, extreme diversity, high speed and
large scale of entity descriptions used by real-world applications. Finally, we
provide a synthetic discussion of the existing approaches, and conclude with a
detailed presentation of open research directions.