Document Clustering based on Topic Maps
The importance of document clustering is now widely acknowledged by researchers
for better management, smart navigation, efficient filtering, and concise
summarization of large document collections such as the World Wide Web (WWW). The
next challenge lies in performing clustering based on the semantic
contents of the documents. The problem of document clustering has two main
components: (1) representing the document in a form that inherently
captures the semantics of the text, which may also help reduce the dimensionality
of the document; and (2) defining a similarity measure, based on that semantic
representation, that assigns higher numerical values to document pairs
with a stronger semantic relationship. The feature space of documents can be
very challenging for document clustering: a document may contain multiple
topics, a large set of class-independent general words, and only a
handful of class-specific core words. With these features in mind, traditional
agglomerative clustering algorithms, which are based on either the Document Vector
Model (DVM) or the Suffix Tree Clustering (STC) model, are less effective at
producing results of high cluster quality. This paper introduces a new approach to
document clustering based on a Topic Map representation of the documents. Each
document is transformed into a compact form, and a similarity measure is proposed
based on the information inferred from the topic map's data and structures. The
suggested method is implemented using agglomerative hierarchical clustering and
tested on standard information retrieval (IR) datasets. Comparative
experiments reveal that the proposed approach is effective in improving
cluster quality.
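The agglomerative step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: each document is stood in for by a set of topic labels (a crude proxy for the compact topic-map form), and a Jaccard overlap stands in for the proposed semantic similarity measure.

```python
# Sketch: agglomerative clustering driven by a semantic similarity measure.
# The topic-set representation, Jaccard measure, and sample data are
# illustrative assumptions, not the paper's actual method.

def topic_similarity(a, b):
    """Jaccard overlap of two topic sets: higher for document pairs
    with more shared topics, as the paper's measure is designed to be."""
    return len(a & b) / len(a | b)

def agglomerative_cluster(docs, k):
    """Average-linkage agglomerative clustering down to k clusters."""
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > k:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # average pairwise similarity between the two clusters
                sims = [topic_similarity(docs[a], docs[b])
                        for a in clusters[i] for b in clusters[j]]
                score = sum(sims) / len(sims)
                if score > best:
                    best, pair = score, (i, j)
        i, j = pair
        clusters[i] += clusters.pop(j)  # merge the most similar pair
    return clusters

docs = [
    {"politics", "election"}, {"election", "vote"},   # political topics
    {"soccer", "league"}, {"league", "cup"},          # sports topics
]
print(agglomerative_cluster(docs, 2))  # two semantically coherent groups
```

A real topic-map pipeline would derive the topic sets and the similarity measure from the topic map's structure rather than from raw label overlap.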
Descriptive document clustering via discriminant learning in a co-embedded space of multilevel similarities
Descriptive document clustering aims at discovering clusters of semantically interrelated documents together with meaningful labels that summarize the content of each document cluster. In this work, we propose a novel descriptive clustering framework, referred to as CEDL. It relies on the formulation and generation of two types of heterogeneous objects, corresponding to documents and candidate phrases, using multilevel similarity information. CEDL is composed of five main processing stages. First, it simultaneously maps the documents and candidate phrases into a common co-embedded space that preserves higher-order, neighbor-based proximities between the combined sets of documents and phrases. Then, it discovers an approximate cluster structure of documents in the common space. The third stage extracts promising topic phrases by constructing a discriminant model in which documents, along with their cluster memberships, are used as training instances. Subsequently, the final cluster labels are selected from the topic phrases using a ranking scheme with multiple scores based on the extracted co-embedding information and the discriminant output. The final stage polishes the initial clusters to reduce noise and accommodate the multitopic nature of documents. The effectiveness and competitiveness of CEDL are demonstrated qualitatively and quantitatively in experiments using document databases from different application fields.
Development of novel fuzzy clustering techniques in the context of e-learning
This thesis investigates the performance of fuzzy clustering for dynamically discovering content relationships in e-Learning material based on document metadata descriptions. This form of knowledge representation is exploited to enable flexible content navigation in e-Learning environments; however, the methods and tools developed in this thesis have wider applicability. The purpose of clustering techniques is to determine underlying structures and relations in data sets, usually based on distance or proximity measures. A number of clustering methods suited to particular applications have been developed over the years. This thesis specifically considers the well-known Fuzzy c-Means (FCM) clustering technique as the basis for document clustering. Initially, novel expressions are developed to extend the FCM algorithm, which is based on the Euclidean metric, to an algorithm based on other proximity measures more appropriate for quantifying document relationships. These include the cosine, Jaccard, and overlap similarity coefficients. This novel algorithm works with normalised k-dimensional data vectors that lie on a hyper-sphere of unit radius and has hence been named Hyper-Spherical Fuzzy c-Means (H-FCM). Subsequently, the performance of the H-FCM algorithm is compared to that of FCM as well as conventional hard (i.e., non-fuzzy) clustering algorithms on four test document collections. Both the impact of different proximity measures and the impact of pre-processing the document vector representations for dimensionality reduction are thoroughly investigated. Results demonstrate that the H-FCM clustering method outperforms both the conventional FCM method and hard clustering techniques. This thesis also considers the integration of fuzzy clustering techniques in an end-to-end e-Learning system.
In particular, a tool to convert the H-FCM document clustering outcome into a knowledge representation, based on the Topic Maps standard, suitable for Web-based environments is developed. Moreover, a tool to enable flexible navigation of e-Learning material based on the fuzzy knowledge space is also developed. This tool is deployed in a real e-Learning environment where user trials are carried out. Finally, this thesis considers the important problem of defining a suitable number of clusters for appropriately capturing the concepts of the knowledge space. In particular, a hierarchical H-FCM algorithm is developed where the sought granularity level defines the number of clusters. In this algorithm, a novel heuristic based on asymmetric similarity measures is exploited to link document clusters hierarchically and to form a topic hierarchy.
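The H-FCM idea described above can be sketched in a few lines: standard FCM membership and centroid updates, but with cosine dissimilarity on unit-normalized vectors in place of the Euclidean metric. The data, fuzzifier, and iteration count below are illustrative choices, not the thesis's implementation.

```python
# Sketch of Hyper-Spherical Fuzzy c-Means (H-FCM): FCM-style updates
# using cosine dissimilarity on vectors projected onto the unit
# hyper-sphere. Parameters and data are illustrative assumptions.
import numpy as np

def h_fcm(X, c, m=2.0, iters=50, seed=0):
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit hyper-sphere
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                                 # memberships sum to 1
    for _ in range(iters):
        W = U ** m                                     # fuzzified weights
        V = W @ X                                      # weighted centroids
        V /= np.linalg.norm(V, axis=1, keepdims=True)  # re-normalize centroids
        D = 1.0 - V @ X.T + 1e-12                      # cosine dissimilarity
        U = D ** (-1.0 / (m - 1))                      # FCM-style membership update
        U /= U.sum(axis=0)
    return U

# two clearly separated directional groups of "document vectors"
X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
U = h_fcm(X, c=2)
print(U.argmax(axis=0))  # hard cluster assignment per document
```

The thesis's actual updates are derived expressions for each proximity measure; this sketch only conveys the shape of the algorithm.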
Growing Story Forest Online from Massive Breaking News
We describe our experience of implementing a news content organization system
at Tencent that discovers events from vast streams of breaking news and evolves
news story structures in an online fashion. Our real-world system has distinct
requirements in contrast to previous studies on topic detection and tracking
(TDT) and event timeline or graph generation, in that we 1) need to accurately
and quickly extract distinguishable events from massive streams of long text
documents that cover diverse topics and contain highly redundant information,
and 2) must develop the structures of event stories in an online manner,
without repeatedly restructuring previously formed stories, in order to
guarantee a consistent user viewing experience. In solving these challenges, we
propose Story Forest, a set of online schemes that automatically clusters
streaming documents into events, while connecting related events in growing
trees to tell evolving stories. We conducted extensive evaluations, including
detailed pilot user experience studies, on 60 GB of real-world Chinese news
data, although our ideas are not language-dependent and can easily be extended
to other languages. The results demonstrate the superior capability of Story
Forest to accurately identify events and organize news text into a logical
structure that is appealing to human readers, compared to multiple existing
algorithmic frameworks.
Comment: Accepted by CIKM 2017, 9 pages
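The online event-clustering step described above can be caricatured as follows. This toy sketch is not the Story Forest pipeline: documents are stood in for by keyword sets, and a fixed Jaccard threshold decides whether an arriving document joins an existing event or opens a new one, so previously formed groups are never restructured.

```python
# Toy sketch of online event clustering over a document stream.
# Keyword-set representation and threshold are illustrative assumptions.

def assign_online(stream, threshold=0.3):
    events = []  # each event is a list of keyword sets
    for doc in stream:
        sim = lambda e: max(len(doc & d) / len(doc | d) for d in e)
        best = max(events, key=sim, default=None)
        if best is not None and sim(best) >= threshold:
            best.append(doc)      # extend an existing event; never restructure
        else:
            events.append([doc])  # open a new event
    return events

stream = [
    {"earthquake", "japan", "tsunami"},
    {"japan", "earthquake", "aftershock"},
    {"election", "debate", "poll"},
]
print(len(assign_online(stream)))  # number of events discovered
```

The real system additionally links related events into growing story trees; this sketch covers only the append-only clustering invariant that keeps the user's view consistent.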
Patent Analytics Based on Feature Vector Space Model: A Case of IoT
The number of approved patents worldwide increases rapidly each year, which
requires new patent analytics to efficiently mine the valuable information
attached to these patents. Vector space model (VSM) represents documents as
high-dimensional vectors, where each dimension corresponds to a unique term.
While originally proposed for information retrieval systems, VSM has also seen
wide applications in patent analytics, and used as a fundamental tool to map
patent documents to structured data. However, the VSM method suffers from
several limitations when applied to patent analysis tasks, such as the loss of
sentence-level semantics and the curse of dimensionality. To address these
limitations, we propose a patent analytics approach based on a feature vector
space model (FVSM), where the FVSM is constructed by mapping patent documents
to feature vectors extracted by convolutional neural networks (CNNs). The
applications of FVSM to three typical patent analysis tasks, i.e., patent
similarity comparison, patent clustering, and patent map generation, are
discussed. A case study using patents related to Internet of Things (IoT)
technology demonstrates the performance and effectiveness of FVSM. The proposed
FVSM can be adopted by other patent analysis studies to replace VSM, based on
which various big data learning tasks can be performed.
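The similarity-comparison task above reduces to a vector comparison once patents are mapped to fixed-length feature vectors. In this sketch the feature vectors are hand-made stand-ins for CNN outputs; the patent names and values are hypothetical.

```python
# Sketch: patent similarity in a feature vector space via cosine
# similarity. The feature vectors here are illustrative stand-ins for
# CNN-extracted features; names and values are hypothetical.
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# pretend CNN features for three patents: two IoT-related, one unrelated
feat = {
    "patent_A": [0.9, 0.1, 0.3],
    "patent_B": [0.8, 0.2, 0.4],
    "patent_C": [0.1, 0.9, 0.0],
}
print(cosine(feat["patent_A"], feat["patent_B"]))  # high: related patents
print(cosine(feat["patent_A"], feat["patent_C"]))  # low: unrelated
```

Patent clustering and map generation in the paper build on the same vector space; the gain over plain VSM comes from the learned, low-dimensional features, not from the similarity function itself.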
The Extraction of Community Structures from Publication Networks to Support Ethnographic Observations of Field Differences in Scientific Communication
The scientific community of researchers in a research specialty is an
important unit of analysis for understanding the field specific shaping of
scientific communication practices. These scientific communities are, however,
a challenging unit of analysis to capture and compare because they overlap,
have fuzzy boundaries, and evolve over time. We describe a network analytic
approach that reveals the complexities of these communities through examination
of their publication networks in combination with insights from ethnographic
field studies. We suggest that the structures revealed indicate overlapping
sub-communities within a research specialty, and we provide evidence that they
differ in disciplinary orientation and research practices. By mapping the
community structures of scientific fields we aim to increase confidence about
the domain of validity of ethnographic observations as well as of collaborative
patterns extracted from publication networks thereby enabling the systematic
study of field differences. The network analytic methods presented include
methods to optimize the delineation of a bibliographic data set in order to
adequately represent a research specialty, and methods to extract community
structures from this data. We demonstrate the application of these methods in a
case study of two research specialties in the physical and chemical sciences.
Comment: Accepted for publication in JASIS
Document Clustering Based On Max-Correntropy Non-Negative Matrix Factorization
Nonnegative matrix factorization (NMF) has been successfully applied in many
areas for classification and clustering. Commonly used NMF algorithms mainly
target minimizing the Euclidean distance or the Kullback-Leibler (KL)
divergence, which may not be suitable for nonlinear cases. In this paper, we
propose a new decomposition method for document clustering that maximizes the
correntropy between the original matrix and the product of two low-rank
matrices. This method also allows us to learn new basis vectors of the
semantic feature space from the data. To our knowledge, no prior work has
clustered high-dimensional document data by maximizing correntropy in NMF. Our
experimental results show the superiority of the proposed method over other
variants of the NMF algorithm on the Reuters-21578 and TDT2 datasets.
Comment: International Conference of Machine Learning and Cybernetics (ICMLC) 201
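One common route to correntropy-maximizing NMF is half-quadratic optimization: each outer step computes Gaussian-kernel weights from the current reconstruction residuals, then applies weighted multiplicative NMF updates. The sketch below follows that general recipe; the kernel width, data, and iteration counts are illustrative assumptions, not the paper's settings.

```python
# Sketch: correntropy-style NMF via iteratively reweighted multiplicative
# updates. Entries with large residuals receive small Gaussian-kernel
# weights, down-weighting outliers. Parameters are illustrative.
import numpy as np

def correntropy_nmf(X, rank, sigma=1.0, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], rank)) + 0.1
    H = rng.random((rank, X.shape[1])) + 0.1
    eps = 1e-9
    for _ in range(iters):
        R = X - W @ H
        A = np.exp(-(R ** 2) / (2 * sigma ** 2))           # correntropy weights
        W *= ((A * X) @ H.T) / (((A * (W @ H)) @ H.T) + eps)
        H *= (W.T @ (A * X)) / ((W.T @ (A * (W @ H))) + eps)
    return W, H

X = np.abs(np.random.default_rng(1).random((6, 8)))        # toy document-term matrix
W, H = correntropy_nmf(X, rank=2)
err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(round(err, 3))  # relative reconstruction error
```

In the document-clustering setting, the columns of W (or rows of H, depending on orientation) serve as the learned semantic basis, and documents are clustered by their strongest factor.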