113,994 research outputs found
Clustering cliques for graph-based summarization of the biomedical research literature
BACKGROUND: Graph-based notions are increasingly used in biomedical data mining and knowledge discovery tasks. In this paper, we present a clique-clustering method to automatically summarize graphs of semantic predications produced from PubMed citations (titles and abstracts). RESULTS: SemRep is used to extract semantic predications from the citations returned by a PubMed search. Cliques were identified from frequently occurring predications with highly connected arguments filtered by degree centrality. Themes contained in the summary were identified with a hierarchical clustering algorithm based on common arguments shared among cliques. The validity of the clusters in the summaries produced was compared to the Silhouette-generated baseline for cohesion, separation and overall validity. The theme labels were also compared to a reference standard produced with major MeSH headings. CONCLUSIONS: For 11 topics in the testing data set, the overall validity of clusters from the system summary was 10% better than the baseline (43% versus 33%). While compared to the reference standard from MeSH headings, the results for recall, precision and F-score were 0.64, 0.65, and 0.65 respectively
Substructure Discovery Using Minimum Description Length and Background Knowledge
The ability to identify interesting and repetitive substructures is an
essential component to discovering knowledge in structural data. We describe a
new version of our SUBDUE substructure discovery system based on the minimum
description length principle. The SUBDUE system discovers substructures that
compress the original data and represent structural concepts in the data. By
replacing previously-discovered substructures in the data, multiple passes of
SUBDUE produce a hierarchical description of the structural regularities in the
data. SUBDUE uses a computationally-bounded inexact graph match that identifies
similar, but not identical, instances of a substructure and finds an
approximate measure of closeness of two substructures when under computational
constraints. In addition to the minimum description length principle, other
background knowledge can be used by SUBDUE to guide the search towards more
appropriate substructures. Experiments in a variety of domains demonstrate
SUBDUE's ability to find substructures capable of compressing the original data
and to discover structural concepts important to the domain. Description of
Online Appendix: This is a compressed tar file containing the SUBDUE discovery
system, written in C. The program accepts as input databases represented in
graph form, and will output discovered substructures with their corresponding
value.Comment: See http://www.jair.org/ for an online appendix and other files
accompanying this articl
Fast Robust PCA on Graphs
Mining useful clusters from high dimensional data has received significant
attention of the computer vision and pattern recognition community in the
recent years. Linear and non-linear dimensionality reduction has played an
important role to overcome the curse of dimensionality. However, often such
methods are accompanied with three different problems: high computational
complexity (usually associated with the nuclear norm minimization),
non-convexity (for matrix factorization methods) and susceptibility to gross
corruptions in the data. In this paper we propose a principal component
analysis (PCA) based solution that overcomes these three issues and
approximates a low-rank recovery method for high dimensional datasets. We
target the low-rank recovery by enforcing two types of graph smoothness
assumptions, one on the data samples and the other on the features by designing
a convex optimization problem. The resulting algorithm is fast, efficient and
scalable for huge datasets with O(nlog(n)) computational complexity in the
number of data samples. It is also robust to gross corruptions in the dataset
as well as to the model parameters. Clustering experiments on 7 benchmark
datasets with different types of corruptions and background separation
experiments on 3 video datasets show that our proposed model outperforms 10
state-of-the-art dimensionality reduction models. Our theoretical analysis
proves that the proposed model is able to recover approximate low-rank
representations with a bounded error for clusterable data
Community Structure Detection in Complex Networks with Partial Background Information
Constrained clustering has been well-studied in the unsupervised learning
society. However, how to encode constraints into community structure detection,
within complex networks, remains a challenging problem. In this paper, we
propose a semi-supervised learning framework for community structure detection.
This framework implicitly encodes the must-link and cannot-link constraints by
modifying the adjacency matrix of network, which can also be regarded as
de-noising the consensus matrix of community structures. Our proposed method
gives consideration to both the topology and the functions (background
information) of complex network, which enhances the interpretability of the
results. The comparisons performed on both the synthetic benchmarks and the
real-world networks show that the proposed framework can significantly improve
the community detection performance with few constraints, which makes it an
attractive methodology in the analysis of complex networks
- …