The Data Big Bang and the Expanding Digital Universe: High-Dimensional, Complex and Massive Data Sets in an Inflationary Epoch
Recent and forthcoming advances in instrumentation, and giant new surveys,
are creating astronomical data sets that are not amenable to the methods of
analysis familiar to astronomers. Traditional methods are often inadequate not
merely because of the size in bytes of the data sets, but also because of the
complexity of modern data sets. Mathematical limitations of familiar algorithms
and techniques in dealing with such data sets create a critical need for new
paradigms for the representation, analysis and scientific visualization (as
opposed to illustrative visualization) of heterogeneous, multiresolution data
across application domains. Some of the problems presented by the new data sets
have been addressed by other disciplines such as applied mathematics,
statistics and machine learning and have been utilized by other sciences such
as space-based geosciences. Unfortunately, valuable results pertaining to these
problems are mostly to be found only in publications outside of astronomy. Here
we offer brief overviews of a number of concepts, techniques and developments,
some "old" and some new. These are generally unknown to most of the
astronomical community, but are vital to the analysis and visualization of
complex datasets and images. In order for astronomers to take advantage of the
richness and complexity of the new era of data, and to be able to identify,
adopt, and apply new solutions, the astronomical community needs a certain
degree of awareness and understanding of the new concepts. One of the goals of
this paper is to help bridge the gap between applied mathematics, artificial
intelligence and computer science on the one side and astronomy on the other.
Comment: 24 pages, 8 Figures, 1 Table. Accepted for publication: "Advances in
Astronomy, special issue "Robotic Astronomy
Dimensionality reduction of clustered data sets
We present a novel probabilistic latent variable model to perform linear dimensionality reduction on data sets which contain clusters. We prove that the maximum likelihood solution of the model is an unsupervised generalisation of linear discriminant analysis. This provides a completely new approach to one of the most established and widely used classification algorithms. The performance of the model is then demonstrated on a number of real and artificial data sets.
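For orientation, classical (supervised) linear discriminant analysis, which the abstract says the model generalises, can be sketched as below. This is a minimal illustration of the textbook Fisher criterion only, not the probabilistic latent variable model of the paper; the function name `fisher_lda_directions` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def fisher_lda_directions(X, y, n_components=1):
    """Return unit-norm projection directions from classical Fisher LDA.

    Hypothetical helper: maximises between-class scatter relative to
    within-class scatter using known labels y (the supervised case).
    """
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Leading eigenvectors of Sw^{-1} Sb give the discriminant directions.
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-vals.real)
    W = vecs[:, order[:n_components]].real
    return W / np.linalg.norm(W, axis=0)

# Synthetic data (assumed): two clusters separated along the first axis,
# with much larger spread along the second axis.
rng = np.random.default_rng(0)
class0 = np.column_stack([rng.normal(0.0, 0.5, 100), rng.normal(0.0, 5.0, 100)])
class1 = np.column_stack([rng.normal(4.0, 0.5, 100), rng.normal(0.0, 5.0, 100)])
X = np.vstack([class0, class1])
y = np.array([0] * 100 + [1] * 100)

W = fisher_lda_directions(X, y)
print(W.ravel())
```

The recovered direction should lie mostly along the first axis, where the clusters separate, rather than along the high-variance second axis that a variance-based method such as PCA would pick.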
Possible thermodynamic structure underlying the laws of Zipf and Benford
We show that the laws of Zipf and Benford, obeyed by scores of numerical data
generated by many and diverse kinds of natural phenomena and human activity, are
related to the focal expression of a generalized thermodynamic structure. This
structure is obtained from a deformed type of statistical mechanics that arises
when configurational phase space is incompletely visited in a severe way.
Specifically, the restriction is that the accessible fraction of this space has
fractal properties. The focal expression is an (incomplete) Legendre transform
between two entropy (or Massieu) potentials that when particularized to first
digits leads to a previously existing generalization of Benford's law. The
inverse functional of this expression leads to Zipf's law; but it naturally
includes the bends or tails observed in real data for small and large rank.
Remarkably, we find that the entire problem is analogous to the transition to
chaos via intermittency exhibited by low-dimensional nonlinear maps. Our
results also explain the generic form of the degree distribution of scale-free
networks.
Comment: To be published in European Physical Journal
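As a point of reference, the classical forms of the two laws can be checked numerically. The sketch below uses the standard Benford first-digit probability log10(1 + 1/d) and an ideal Zipf rank-frequency law proportional to 1/rank; it does not implement the generalised expressions derived in the paper. The function names and the choice of powers of 2 as test data are assumptions made for illustration.

```python
import math
from collections import Counter

def benford_probability(d):
    """Classical Benford probability that the first digit equals d (1..9)."""
    return math.log10(1 + 1 / d)

def first_digit(n):
    """Leading decimal digit of a positive integer."""
    return int(str(n)[0])

def zipf_frequencies(n_ranks, exponent=1.0):
    """Ideal Zipf frequencies, proportional to rank**(-exponent), normalised to 1."""
    weights = [r ** -exponent for r in range(1, n_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# First digits of 2**1 .. 2**1000, a sequence known to follow Benford's law.
counts = Counter(first_digit(2 ** k) for k in range(1, 1001))
observed = {d: counts[d] / 1000 for d in range(1, 10)}
expected = {d: benford_probability(d) for d in range(1, 10)}

for d in range(1, 10):
    print(d, round(observed[d], 3), round(expected[d], 3))

zipf = zipf_frequencies(5)
```

The pure power law returned by `zipf_frequencies` has no bends at small or large rank; those are the features of real data that the abstract says the generalised expression reproduces.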
Kernel Truncated Regression Representation for Robust Subspace Clustering
Subspace clustering aims to group data points into multiple clusters of which
each corresponds to one subspace. Most existing subspace clustering approaches
assume that input data lie on linear subspaces. In practice, however, this
assumption usually does not hold. To achieve nonlinear subspace clustering, we
propose a novel method, called kernel truncated regression representation. Our
method consists of the following four steps: 1) projecting the input data into
a hidden space, where each data point can be linearly represented by other data
points; 2) calculating the linear representation coefficients of the data
representations in the hidden space; 3) truncating the trivial coefficients to
achieve robustness and block-diagonality; and 4) executing the graph cutting
operation on the coefficient matrix by solving a graph Laplacian problem. Our
method has the advantages of a closed-form solution and the capacity of
clustering data points that lie on nonlinear subspaces. The first advantage
makes our method efficient in handling large-scale datasets, and the second one
enables the proposed method to conquer the nonlinear subspace clustering
challenge. Extensive experiments on six benchmarks demonstrate the
effectiveness and the efficiency of the proposed method in comparison with
current state-of-the-art approaches.
Comment: 14 page
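The four steps listed in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes an RBF kernel for the hidden space, a ridge (L2-regularised) regression for the representation coefficients, a keep-the-largest-k rule for truncation, and a small hand-written k-means on the Laplacian eigenvectors. The names `rbf_kernel`, `simple_kmeans`, `ktrr_sketch`, `gamma`, `lam` and `keep` are hypothetical.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Step 1 (assumed kernel): implicit mapping to a hidden space via an RBF kernel."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def simple_kmeans(U, k, n_iter=50):
    """Plain Lloyd iterations with deterministic farthest-point initialisation."""
    centers = [U[0]]
    for _ in range(1, k):
        d2 = np.min([np.sum((U - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels

def ktrr_sketch(X, n_clusters, gamma=1.0, lam=0.1, keep=10):
    n = X.shape[0]
    K = rbf_kernel(X, gamma)
    # Step 2: closed-form ridge solution of min ||Phi - Phi C||^2 + lam ||C||^2,
    # which depends on the data only through K: C = (K + lam I)^{-1} K.
    C = np.linalg.solve(K + lam * np.eye(n), K)
    np.fill_diagonal(C, 0.0)  # discard self-representation
    # Step 3: truncation, keeping the `keep` largest magnitudes in each column.
    A = np.abs(C)
    thresh = np.sort(A, axis=0)[-keep, :]
    A[A < thresh[None, :]] = 0.0
    W = A + A.T  # symmetric affinity matrix
    # Step 4: graph cut via the normalised graph Laplacian.
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    U = vecs[:, :n_clusters]
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    return simple_kmeans(U, n_clusters)

# Toy data (assumed): two well-separated groups of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
               rng.normal(0.0, 0.3, (20, 2)) + 10.0])
labels = ktrr_sketch(X, n_clusters=2)
print(labels)
```

The closed-form solve in step 2 is what the abstract credits for efficiency: no iterative optimisation is needed to obtain the coefficient matrix. The toy data here are linearly separable and serve only to show the pipeline running end to end.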
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.