Web Document Clustering Using Document Index Graph
Document clustering is an important tool for many Information Retrieval (IR) tasks. The huge increase in the amount of information on the web poses new challenges for clustering, regarding both the underlying data model and the nature of the clustering algorithm. Document clustering techniques mostly rely on single-term analysis of the document data set. To achieve more accurate document clustering, more informative features such as phrases are important. Hence the first part of the paper presents a phrase-based model, the Document Index Graph (DIG), which allows incremental phrase-based encoding of documents and efficient phrase matching. It emphasizes the effectiveness of the phrase-based similarity measure over traditional single-term similarities. In the second part, a Document Index Graph based Clustering (DIGBC) algorithm is proposed to enhance the DIG model for incremental and soft clustering. This algorithm incrementally clusters documents based on the proposed cluster-document similarity measure and allows a document to be assigned to more than one cluster. The DIGBC algorithm is more efficient than existing clustering algorithms such as single pass, K-NN, and Hierarchical Agglomerative Clustering (HAC).
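The idea of incremental, phrase-aware matching can be illustrated with a much simpler structure than the DIG itself. The sketch below is an assumption-laden toy, not the paper's model: it treats every pair of consecutive words as an edge in a word graph, records which documents contain each edge, and scores a new document against earlier ones by Jaccard overlap of shared word pairs. The class name `BigramIndex` and its methods are hypothetical.

```python
from collections import defaultdict


class BigramIndex:
    """Toy word graph: each edge (w1, w2) stores the documents in which w1 is directly followed by w2."""

    def __init__(self):
        self.edge_docs = defaultdict(set)  # (w1, w2) -> ids of documents containing that word pair
        self.doc_edges = {}                # doc id -> set of word pairs in that document

    def add(self, doc_id, text):
        """Index a document and return its similarity to every earlier document sharing a word pair."""
        words = text.lower().split()
        edges = set(zip(words, words[1:]))
        overlap = defaultdict(int)
        for edge in edges:
            # Earlier documents that contain this word pair share a two-word phrase with the new one.
            for other in self.edge_docs[edge]:
                overlap[other] += 1
            self.edge_docs[edge].add(doc_id)
        self.doc_edges[doc_id] = edges
        sims = {}
        for other, shared in overlap.items():
            union = len(edges) + len(self.doc_edges[other]) - shared
            sims[other] = shared / union  # Jaccard overlap of word pairs
        return sims


index = BigramIndex()
index.add("a", "river rafting trips")
print(index.add("b", "wild river rafting"))  # shares the phrase "river rafting" with "a"
print(index.add("c", "rafting river"))       # same words, different order: no shared phrase
```

Because word order is part of each edge, the third document matches nothing even though it reuses the same words, which is the kind of distinction a single-term model cannot make.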
Incremental Entity Resolution from Linked Documents
In many government applications we often find that information about
entities, such as persons, is available in disparate data sources such as
passports, driving licences, bank accounts, and income tax records. Similar
scenarios are commonplace in large enterprises having multiple customer,
supplier, or partner databases. Each data source maintains different aspects of
an entity, and resolving entities based on these attributes is a well-studied
problem. However, in many cases documents in one source reference those in
others; e.g., a person may provide his driving-licence number while applying
for a passport, or vice-versa. These links define relationships between
documents of the same entity (as opposed to inter-entity relationships, which
are also often used for resolution). In this paper we describe an algorithm to
cluster documents that are highly likely to belong to the same entity by
exploiting inter-document references in addition to attribute similarity. Our
technique uses a combination of iterative graph-traversal, locality-sensitive
hashing, iterative match-merge, and graph-clustering to discover unique
entities based on a document corpus. A unique feature of our technique is that
new sets of documents can be added incrementally while having to re-resolve
only a small subset of a previously resolved entity-document collection. We
present performance and quality results on two data-sets: a real-world database
of companies and a large synthetically generated `population' database. We also
demonstrate the benefit of using inter-document references for clustering in the
form of enhanced recall of documents for resolution.
Comment: 15 pages, 8 figures, patented wor
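A minimal sketch of why inter-document references help resolution, under strong simplifying assumptions: documents are merged with a union-find structure whenever one references another, or whenever they share an exact normalized (name, date of birth) key. The exact-key match is a stand-in for the locality-sensitive hashing and match-merge steps named in the abstract, and the field names `name`, `dob`, and `refs` are hypothetical.

```python
def resolve(docs):
    """Group document ids into entities using reference links plus an exact attribute key."""
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_with_key = {}
    for doc_id, doc in docs.items():
        # 1) Inter-document references: a passport that quotes a licence number, for example.
        for ref in doc.get("refs", []):
            if ref in parent:
                union(doc_id, ref)
        # 2) Attribute similarity, reduced here to an exact normalized key (assumption).
        key = (doc["name"].strip().lower(), doc["dob"])
        if key in first_with_key:
            union(doc_id, first_with_key[key])
        else:
            first_with_key[key] = doc_id

    groups = {}
    for doc_id in docs:
        groups.setdefault(find(doc_id), set()).add(doc_id)
    return sorted(groups.values(), key=sorted)


docs = {
    "P1": {"name": "A. Kumar", "dob": "1980-01-01", "refs": ["D1"]},
    "D1": {"name": "Anil Kumar", "dob": "1980-01-01"},
    "B1": {"name": " anil kumar", "dob": "1980-01-01"},
    "T1": {"name": "Rita Shah", "dob": "1975-05-05"},
}
print(resolve(docs))
```

Here the passport `P1` abbreviates the name, so attribute matching alone would miss it; the reference to `D1` is what pulls it into the same entity as `D1` and `B1`.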
Political Text Scaling Meets Computational Semantics
During the last fifteen years, automatic text scaling has become one of the
key tools of the Text as Data community in political science. Prominent text
scaling algorithms, however, rely on the assumption that latent positions can
be captured just by leveraging the information about word frequencies in
documents under study. We challenge this traditional view and present a new,
semantically aware text scaling algorithm, SemScale, which combines recent
developments in the area of computational linguistics with unsupervised
graph-based clustering. We conduct an extensive quantitative analysis over a
collection of speeches from the European Parliament in five different languages
and from two different legislative terms, and show that a scaling approach
relying on semantic document representations is often better at capturing known
underlying political dimensions than the established frequency-based (i.e.,
symbolic) scaling method. We further validate our findings through a series of
experiments focused on text preprocessing and feature selection, document
representation, scaling of party manifestos, and a supervised extension of our
algorithm. To catalyze further research on this new branch of text scaling
methods, we release a Python implementation of SemScale with all included data
sets and evaluation procedures.
Comment: Updated version - accepted for Transactions on Data Science (TDS
Growing Story Forest Online from Massive Breaking News
We describe our experience of implementing a news content organization system
at Tencent that discovers events from vast streams of breaking news and evolves
news story structures in an online fashion. Our real-world system has distinct
requirements in contrast to previous studies on topic detection and tracking
(TDT) and event timeline or graph generation, in that we 1) need to accurately
and quickly extract distinguishable events from massive streams of long text
documents that cover diverse topics and contain highly redundant information,
and 2) must develop the structures of event stories in an online manner,
without repeatedly restructuring previously formed stories, in order to
guarantee a consistent user viewing experience. In solving these challenges, we
propose Story Forest, a set of online schemes that automatically clusters
streaming documents into events, while connecting related events in growing
trees to tell evolving stories. We conducted an extensive evaluation based on 60
GB of real-world Chinese news data (although our ideas are not
language-dependent and can easily be extended to other languages) and through
detailed pilot user experience studies. The results demonstrate the superior
capability of Story Forest to accurately identify events and organize news text
into a logical structure that is appealing to human readers, compared to
multiple existing algorithm frameworks.
Comment: Accepted by CIKM 2017, 9 page
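The online constraint described in this abstract (assign each incoming document once, never restructure earlier results) can be illustrated with a generic single-pass clusterer. This is not the Story Forest pipeline; it is a sketch that assumes each article has already been reduced to a keyword set, and the class name `OnlineEventClusterer` and the threshold value are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class OnlineEventClusterer:
    """Single-pass clustering: each document joins the best existing event or starts a new one."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.events = []  # each event: {"keywords": set of str, "docs": list of doc ids}

    def add(self, doc_id, keywords):
        keywords = set(keywords)
        best_index, best_sim = None, 0.0
        for i, event in enumerate(self.events):
            sim = jaccard(keywords, event["keywords"])
            if sim > best_sim:
                best_index, best_sim = i, sim
        if best_index is not None and best_sim >= self.threshold:
            # Extend the matched event; earlier assignments are never revisited.
            self.events[best_index]["docs"].append(doc_id)
            self.events[best_index]["keywords"] |= keywords
            return best_index
        self.events.append({"keywords": keywords, "docs": [doc_id]})
        return len(self.events) - 1


stream = OnlineEventClusterer(threshold=0.3)
stream.add("n1", {"earthquake", "tokyo", "magnitude"})
stream.add("n2", {"earthquake", "tokyo", "rescue"})
stream.add("n3", {"election", "vote"})
print([event["docs"] for event in stream.events])
```

Since an event's index never changes once it is created, a reader who has already seen an event keeps seeing it in the same place as new articles arrive.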
Adaptive Graph via Multiple Kernel Learning for Nonnegative Matrix Factorization
Nonnegative Matrix Factorization (NMF) has been continuously evolving in
several areas like pattern recognition and information retrieval methods. It
factorizes a matrix into a product of two low-rank nonnegative matrices that
define a parts-based, linear representation of nonnegative data.
Recently, Graph regularized NMF (GrNMF) was proposed to find a compact
representation, which uncovers the hidden semantics and simultaneously respects
the intrinsic geometric structure. In GrNMF, an affinity graph is constructed
from the original data space to encode the geometrical information. In this
paper, we propose a novel idea which engages a Multiple Kernel Learning
approach in refining the graph structure so that it reflects the factorization of
the matrix and the new data space. The GrNMF is improved by utilizing the graph
refined by the kernel learning, and then a novel kernel learning method is
introduced under the GrNMF framework. Our approach shows encouraging results
in comparison to state-of-the-art clustering
algorithms like NMF, GrNMF, and SVD.
Comment: This paper has been withdrawn by the author due to the terrible writin
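For orientation, graph-regularized NMF is commonly written as the objective below. The notation is an assumption, since the abstract defines no symbols: X is the nonnegative data matrix with one column per data point, U and V are the nonnegative low-rank factors, row v_i of V is the representation of point i, W is the symmetric affinity matrix of the graph, and λ ≥ 0 weights the graph term.

```latex
\min_{U \ge 0,\; V \ge 0} \;
  \lVert X - U V^{\top} \rVert_F^2
  + \lambda \, \operatorname{Tr}\!\left( V^{\top} L V \right),
\qquad L = D - W, \quad D_{ii} = \sum_j W_{ij}

\operatorname{Tr}\!\left( V^{\top} L V \right)
  = \frac{1}{2} \sum_{i,j} W_{ij} \, \lVert v_i - v_j \rVert^2
```

The second identity shows what the regularizer does: points joined by a heavy edge in the affinity graph are pushed toward similar low-dimensional representations, so the quality of W directly shapes the factorization.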