The Beylkin-Cramer Summation Rule and A New Fast Algorithm of Cosmic Statistics for Large Data Sets
Based on the Beylkin-Cramer summation rule, we introduce a new fast algorithm
that enables us to explore high-order statistics efficiently in large data
sets. Central to this technique is decomposing both fields and operators
within the framework of multi-resolution analysis (MRA) and realizing their
discrete representations. Accordingly, a homogeneous point process can be
described equivalently by the action of a Toeplitz matrix on a vector, which
is accomplished using the fast Fourier transform. The algorithm can be applied
widely in cosmic statistics to tackle large data sets. In particular, we
demonstrate this technique using spherical, cubic and cylindrical
counts-in-cells. Numerical tests show that the algorithm produces excellent
agreement with the expected results.
Moreover, the algorithm naturally introduces a sharp filter, which is capable
of suppressing shot noise in weak signals. In its numerical procedures, the
algorithm is somewhat similar to particle-mesh (PM) methods in N-body
simulations. Because its cost is dominated by fast Fourier transforms, it is
significantly faster than current particle-based methods, and its
computational cost does not depend on the shape or size of the sampling cells.
In addition, based on this technique, we further propose a simple fast scheme
to compute second-order statistics of cosmic density fields and validate it
on simulation samples. We hope the technique developed here allows a
comprehensive study of the non-Gaussianity of cosmic fields in high-precision
cosmology. A specific implementation of the algorithm is publicly available
upon request to the author.
Comment: 27 pages, 9 figures included. Revised version; changes include (a) a
new fast algorithm for second-order statistics, (b) more numerical tests,
including counts in asymmetric cells, the two-point correlation function and
second-order variances, and (c) more discussion of the technique.
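The abstract's central trick (a homogeneous point process acted on by a Toeplitz matrix, evaluated with FFTs) can be illustrated with a minimal 1-D sketch. This is my own toy illustration, not the paper's MRA decomposition: the grid size, periodic boundary and top-hat cell window are all assumptions I am making for the example.

```python
import numpy as np

def counts_in_cells_fft(positions, box_size, n_grid, cell_radius):
    """1-D toy counts-in-cells: grid the points, then convolve with a
    top-hat cell window via a single FFT multiplication (circulant/
    Toeplitz action, as in the abstract)."""
    density, _ = np.histogram(positions, bins=n_grid, range=(0.0, box_size))
    dx = box_size / n_grid
    half = int(round(cell_radius / dx))
    # Periodic top-hat window of half-width `half` grid cells.
    window = np.zeros(n_grid)
    window[0] = 1.0
    for i in range(1, half + 1):
        window[i] = 1.0
        window[-i] = 1.0
    # Circular convolution via FFT: O(N log N) instead of O(N^2).
    counts = np.fft.irfft(np.fft.rfft(density) * np.fft.rfft(window), n=n_grid)
    return np.rint(counts).astype(int)
```

The same one-multiplication-in-Fourier-space structure carries over to 2-D and 3-D grids, which is why the cost is insensitive to the cell's shape or size.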
Dynamic load balancing in parallel KD-tree k-means
One of the most influential and popular data mining methods is the k-Means algorithm for cluster analysis.
Techniques for improving the efficiency of k-Means have largely been
explored in two main directions. First, the amount of computation can be significantly reduced by adopting geometric constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree); these techniques reduce the number of distance computations the algorithm performs at each iteration. The second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance; this issue has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing
issue. Three solutions have been developed and tested: two
approaches are based on a static partitioning of the data set, and a third incorporates a dynamic load balancing policy.
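A minimal sketch of how a KD-Tree cuts the distance computations in k-Means: build a tree over the current centroids and answer all nearest-centroid assignments as one batched query. This is an assumption-laden toy (sequential, using `scipy.spatial.cKDTree`), not the paper's distributed-memory formulation or its load-balancing schemes.

```python
import numpy as np
from scipy.spatial import cKDTree

def kdtree_kmeans(points, k, n_iter=20, seed=0):
    """k-Means whose assignment step is a batched KD-tree
    nearest-neighbour query against the current centroids."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Assignment: one tree over k centroids replaces n*k
        # explicit point-to-centroid distance evaluations.
        _, labels = cKDTree(centroids).query(points)
        # Update: recentre each cluster on the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```

In a parallel setting, the load-imbalance problem the abstract describes arises because the number of tree nodes visited per partition of the data is irregular, unlike the uniform cost of the brute-force assignment above.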
DSL: Discriminative Subgraph Learning via Sparse Self-Representation
The goal in network state prediction (NSP) is to classify the global state
(label) associated with features embedded in a graph. This graph structure
encoding feature relationships is the key distinctive aspect of NSP compared to
classical supervised learning. NSP arises in various applications: gene
expression samples embedded in a protein-protein interaction (PPI) network,
temporal snapshots of infrastructure or sensor networks, and fMRI coherence
network samples from multiple subjects, to name a few. Instances from these
domains are typically ``wide'' (more features than samples), and thus, feature
sub-selection is required for robust and generalizable prediction. How to best
employ the network structure in order to learn succinct connected subgraphs
encompassing the most discriminative features becomes a central challenge in
NSP. Prior work employs connected subgraph sampling or graph smoothing within
optimization frameworks, resulting in either large variance of quality or weak
control over the connectivity of selected subgraphs.
In this work we propose an optimization framework for discriminative subgraph
learning (DSL) which simultaneously enforces (i) sparsity, (ii) connectivity
and (iii) high discriminative power of the resulting subgraphs of features. Our
optimization algorithm is a single-step solution for the NSP and the associated
feature selection problem. It is rooted in the rich literature on
maximal-margin optimization, spectral graph methods and sparse subspace
self-representation. DSL simultaneously ensures solution interpretability and
superior predictive power (up to 16% improvement in challenging instances
compared to baselines), with execution times up to an hour for large instances.
Comment: 9 pages
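The interplay of the three criteria — discriminative power, connectivity over the feature graph, and sparsity — can be sketched with a deliberately simplified heuristic: score features by class separation, smooth the scores over the graph Laplacian so connected features reinforce each other, then keep the top few. This is my own toy stand-in for intuition only, not the paper's single-step maximal-margin/self-representation optimization; all names and parameters here are hypothetical.

```python
import numpy as np

def laplacian_smoothed_selection(X, y, adj, lam=1.0, n_keep=3):
    """Toy subgraph-aware feature selection: per-feature class-mean
    gap, smoothed over the feature graph, then top-k thresholding."""
    # Discriminative score: absolute gap between class means.
    score = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    # Graph Laplacian L = D - A of the feature-feature adjacency.
    L = np.diag(adj.sum(axis=1)) - adj
    # Solving (I + lam*L) s = score spreads mass to graph neighbours,
    # biasing selection toward connected sets of features.
    smooth = np.linalg.solve(np.eye(len(score)) + lam * L, score)
    # Sparsity: retain only the n_keep highest smoothed scores.
    return np.argsort(smooth)[::-1][:n_keep]
```

Unlike this heuristic, the DSL formulation enforces connectivity inside the optimization itself, which is what gives it tighter control than sampling- or smoothing-based prior work.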