Optimal construction of k-nearest neighbor graphs for identifying noisy clusters
We study clustering algorithms based on neighborhood graphs on a random
sample of data points. The question we ask is how such a graph should be
constructed in order to obtain optimal clustering results. Which type of
neighborhood graph should one choose, mutual k-nearest neighbor or symmetric
k-nearest neighbor? What is the optimal parameter k? In our setting, clusters
are defined as connected components of the t-level set of the underlying
probability distribution. Clusters are said to be identified in the
neighborhood graph if connected components in the graph correspond to the true
underlying clusters. Using techniques from random geometric graph theory, we
prove bounds on the probability that clusters are identified successfully, both
in a noise-free and in a noisy setting. Those bounds lead to several
conclusions. First, k has to be chosen surprisingly high (rather of the order n
than of the order log n) to maximize the probability of cluster identification.
Secondly, the major difference between the mutual and the symmetric k-nearest
neighbor graph occurs when one attempts to detect the most significant cluster
only.
Comment: 31 pages, 2 figures
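The two graph types compared in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the brute-force neighbor search, and the sample points are assumptions made for illustration. The symmetric graph joins i and j if either is among the other's k nearest neighbors; the mutual graph requires both.

```python
from math import dist

def knn_lists(points, k):
    """For each point index, return the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(p, points[j]))
        neighbors.append(set(others[:k]))
    return neighbors

def knn_graph(points, k, mutual):
    """Build an undirected kNN graph as an adjacency dict.

    mutual=True keeps edge {i, j} only if each point is in the other's list;
    mutual=False (symmetric) keeps it if at least one is in the other's list.
    """
    nn = knn_lists(points, k)
    adj = {i: set() for i in range(len(points))}
    for i in range(len(points)):
        for j in nn[i]:
            if not mutual or i in nn[j]:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def connected_components(adj):
    """Return the connected components of the graph as a list of node sets."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u] - seen:
                seen.add(v)
                stack.append(v)
        components.append(comp)
    return components

# Hypothetical sample: two tight groups plus one stray point at (4, 0).
points = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0), (4, 0)]
mutual_parts = connected_components(knn_graph(points, 2, mutual=True))
symmetric_parts = connected_components(knn_graph(points, 2, mutual=False))
```

On this toy sample with k = 2, the mutual graph leaves the stray point isolated (three components), while the symmetric graph attaches it to the nearer group (two components).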
Recurrence-based time series analysis by means of complex network methods
Complex networks are an important paradigm of modern complex systems sciences
which allows quantitatively assessing the structural properties of systems
composed of different interacting entities. During the last years, intensive
efforts have been spent on applying network-based concepts also for the
analysis of dynamically relevant higher-order statistical properties of time
series. Notably, many corresponding approaches are closely related with the
concept of recurrence in phase space. In this paper, we review recent
methodological advances in time series analysis based on complex networks, with
a special emphasis on methods founded on recurrence plots. The potentials and
limitations of the individual methods are discussed and illustrated for
paradigmatic examples of dynamical systems as well as for real-world time
series. Complex network measures are shown to provide information about
structural features of dynamical systems that are complementary to those
characterized by other methods of time series analysis and, hence,
substantially enrich the knowledge gathered from other existing (linear as well
as nonlinear) approaches.
Comment: To be published in International Journal of Bifurcation and Chaos
(2011)
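As a rough illustration of the recurrence-based approach this review covers, the sketch below links two embedded states whenever they lie within a threshold of each other in phase space. The time-delay embedding, the Euclidean norm, and all names (`delay_embed`, `recurrence_network`, `eps`) are assumptions chosen for the example, not definitions taken from the paper.

```python
from math import dist

def delay_embed(series, dim, delay):
    """Time-delay embedding: map a scalar series to dim-dimensional state vectors."""
    n = len(series) - (dim - 1) * delay
    return [tuple(series[i + d * delay] for d in range(dim)) for i in range(n)]

def recurrence_network(states, eps):
    """Adjacency matrix with A[i][j] = 1 if states i != j are closer than eps."""
    n = len(states)
    return [[1 if i != j and dist(states[i], states[j]) <= eps else 0
             for j in range(n)]
            for i in range(n)]

def edge_density(adj):
    """Fraction of possible links present (self-loops excluded)."""
    n = len(adj)
    return sum(map(sum, adj)) / (n * (n - 1))

# Hypothetical period-2 signal: states alternate between (0, 1) and (1, 0).
series = [0, 1, 0, 1, 0, 1]
states = delay_embed(series, dim=2, delay=1)
adj = recurrence_network(states, eps=0.1)
density = edge_density(adj)
```

Standard network measures (degree, clustering, path lengths) can then be computed on `adj` like on any other undirected graph.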
Algorithms for Stable Matching and Clustering in a Grid
We study a discrete version of a geometric stable marriage problem originally
proposed in a continuous setting by Hoffman, Holroyd, and Peres, in which
points in the plane are stably matched to cluster centers, as prioritized by
their distances, so that each cluster center is apportioned a set of points of
equal area. We show that, for a discretization of the problem to an
grid of pixels with centers, the problem can be solved in time , and we experiment with two slower but more practical algorithms and
a hybrid method that switches from one of these algorithms to the other to gain
greater efficiency than either algorithm alone. We also show how to combine
geometric stable matchings with a k-means clustering algorithm, so as to
provide a geometric political-districting algorithm that views distance in
economic terms, and we experiment with weighted versions of stable k-means in
order to improve the connectivity of the resulting clusters.
Comment: 23 pages, 12 figures. To appear (without the appendices) at the 18th
International Workshop on Combinatorial Image Analysis, June 19-21, 2017,
Plovdiv, Bulgaria
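The problem statement can be made concrete with a naive baseline. The sketch below is not one of the paper's algorithms: it simply sorts every pixel-center pair by squared distance and assigns greedily under equal quotas, which yields a stable matching when both sides rank partners by distance. The function name, the tie-breaking by sort order, and the assumption that the pixel count divides evenly among the centers are all illustrative.

```python
def stable_grid_matching(n, centers):
    """Greedily match the pixels of an n-by-n grid to centers with equal quotas.

    Pairs are processed in order of increasing squared distance, so no pixel
    and center both prefer each other over what they end up with.
    Assumes n * n is divisible by len(centers).
    """
    quota = (n * n) // len(centers)
    pairs = sorted(
        ((x - cx) ** 2 + (y - cy) ** 2, (x, y), c)
        for x in range(n)
        for y in range(n)
        for c, (cx, cy) in enumerate(centers)
    )
    assignment, load = {}, [0] * len(centers)
    for _, pixel, c in pairs:
        if pixel not in assignment and load[c] < quota:
            assignment[pixel] = c
            load[c] += 1
    return assignment

# Hypothetical 4-by-4 grid with two centers in opposite corners.
assignment = stable_grid_matching(4, [(0, 0), (3, 3)])
sizes = [list(assignment.values()).count(c) for c in range(2)]
```

Sorting all pairs costs time proportional to the number of pixels times the number of centers (up to a logarithmic factor), so this only serves to fix ideas on small grids.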
Outlier Edge Detection Using Random Graph Generation Models and Applications
Outliers are samples that are generated by different mechanisms from other
normal data samples. Graphs, in particular social network graphs, may contain
nodes and edges that are made by scammers, malicious programs or mistakenly by
normal users. Detecting outlier nodes and edges is important for data mining
and graph analytics. However, previous research in the field has merely focused
on detecting outlier nodes. In this article, we study the properties of edges
and propose outlier edge detection algorithms using two random graph generation
models. We found that the edge-ego-network, which can be defined as the induced
graph that contains two end nodes of an edge, their neighboring nodes and the
edges that link these nodes, contains critical information to detect outlier
edges. We evaluated the proposed algorithms by injecting outlier edges into
some real-world graph data. Experiment results show that the proposed
algorithms can effectively detect outlier edges. In particular, the algorithm
based on the Preferential Attachment Random Graph Generation model consistently
gives good performance regardless of the test graph data. Furthermore, the
proposed algorithms are not limited to the area of outlier edge detection. We
demonstrate three different applications that benefit from the proposed
algorithms: 1) a preprocessing tool that improves the performance of graph
clustering algorithms; 2) an outlier node detection algorithm; and 3) a novel
noisy data clustering algorithm. These applications show the great potential of
the proposed outlier edge detection techniques.
Comment: 14 pages, 5 figures, journal paper
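The edge-ego-network described in this abstract can be extracted in a few lines. The sketch below covers only that extraction step, following the definition given in the abstract; it does not implement the authors' outlier scores. The adjacency-dict representation and the toy graph are assumptions.

```python
def edge_ego_network(adj, u, v):
    """Return (nodes, edges) of the subgraph induced by u, v and their neighbors.

    adj maps each node to the set of its neighbors (undirected graph).
    Edges are returned as frozensets so that {a, b} and {b, a} coincide.
    """
    nodes = {u, v} | adj[u] | adj[v]
    edges = {frozenset((a, b)) for a in nodes for b in adj[a] if b in nodes}
    return nodes, edges

# Hypothetical toy graph: a triangle 1-2-3 with a tail 2-4-5.
adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2}, 4: {2, 5}, 5: {4}}
ego_nodes, ego_edges = edge_ego_network(adj, 1, 2)
```

For the edge between nodes 1 and 2, the result contains nodes 1 to 4 and the four edges among them; node 5 and the edge 4-5 fall outside the ego network.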
Analysis of a large-scale weighted network of one-to-one human communication
We construct a connected network of 3.9 million nodes from mobile phone call
records, which can be regarded as a proxy for the underlying human
communication network at the societal level. We assign two weights on each edge
to reflect the strength of social interaction, which are the aggregate call
duration and the cumulative number of calls placed between the individuals over
a period of 18 weeks. We present a detailed analysis of this weighted network
by examining its degree, strength, and weight distributions, as well as its
topological assortativity and weighted assortativity, clustering and weighted
clustering, together with correlations between these quantities. We give an
account of motif intensity and coherence distributions and compare them to a
randomized reference system. We also use the concept of link overlap to measure
the number of common neighbors any two adjacent nodes have, which serves as a
useful local measure for identifying the interconnectedness of communities. We
report a positive correlation between the overlap and weight of a link, thus
providing strong quantitative evidence for the weak ties hypothesis, a central
concept in social network analysis. The percolation properties of the network
are found to depend on the type and order of removed links, and they can help
understand how the local structure of the network manifests itself at the
global level. We hope that our results will contribute to modeling weighted
large-scale social networks, and believe that the systematic approach followed
here can be adopted to study other weighted networks.
Comment: 25 pages, 17 figures, 2 tables
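The link overlap mentioned in this abstract can be sketched as follows. The abstract describes it only as a measure of the common neighbors of two adjacent nodes; the normalization used here, O_ij = n_ij / ((k_i - 1) + (k_j - 1) - n_ij), is an assumption, as are the function name and the toy graph.

```python
def link_overlap(adj, i, j):
    """Overlap of the link (i, j): shared neighbors over all other neighbors.

    adj maps each node to the set of its neighbors (undirected graph).
    Returns 0.0 when neither endpoint has any other neighbor.
    """
    common = len(adj[i] & adj[j])
    denom = (len(adj[i]) - 1) + (len(adj[j]) - 1) - common
    return common / denom if denom > 0 else 0.0

# Hypothetical toy graph: a triangle 1-2-3 with a tail 2-4-5.
adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2}, 4: {2, 5}, 5: {4}}
inside_triangle = link_overlap(adj, 1, 2)
on_tail = link_overlap(adj, 4, 5)
```

Under this normalization, a link inside a tightly knit group scores high and a bridge-like link scores near zero, which is the quantity one would correlate with link weight when testing the weak ties hypothesis.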
One-class classifiers based on entropic spanning graphs
One-class classifiers offer valuable tools to assess the presence of outliers
in data. In this paper, we propose a design methodology for one-class
classifiers based on entropic spanning graphs. Our approach takes into account
the possibility to process also non-numeric data by means of an embedding
procedure. The spanning graph is learned on the embedded input data and the
outcoming partition of vertices defines the classifier. The final partition is
derived by exploiting a criterion based on mutual information minimization.
Here, we compute the mutual information by using a convenient formulation
provided in terms of the α-Jensen difference. Once training is
completed, in order to associate a confidence level with the classifier
decision, a graph-based fuzzy model is constructed. The fuzzification process
is based only on topological information of the vertices of the entropic
spanning graph. As such, the proposed one-class classifier is suitable also for
data characterized by complex geometric structures. We provide experiments on
well-known benchmarks containing both feature vectors and labeled graphs. In
addition, we apply the method to the protein solubility recognition problem by
considering several representations for the input samples. Experimental results
demonstrate the effectiveness and versatility of the proposed method with
respect to other state-of-the-art approaches.
Comment: Extended and revised version of the paper "One-Class Classification
Through Mutual Information Minimization" presented at the 2016 IEEE IJCNN,
Vancouver, Canada