5,120 research outputs found
The Data Big Bang and the Expanding Digital Universe: High-Dimensional, Complex and Massive Data Sets in an Inflationary Epoch
Recent and forthcoming advances in instrumentation, and giant new surveys,
are creating astronomical data sets that are not amenable to the methods of
analysis familiar to astronomers. Traditional methods are often inadequate not
merely because of the size in bytes of the data sets, but also because of the
complexity of modern data sets. Mathematical limitations of familiar algorithms
and techniques in dealing with such data sets create a critical need for new
paradigms for the representation, analysis and scientific visualization (as
opposed to illustrative visualization) of heterogeneous, multiresolution data
across application domains. Some of the problems presented by the new data sets
have been addressed by other disciplines such as applied mathematics,
statistics and machine learning and have been utilized by other sciences such
as space-based geosciences. Unfortunately, valuable results pertaining to these
problems are mostly to be found only in publications outside of astronomy. Here
we offer brief overviews of a number of concepts, techniques and developments,
some "old" and some new. These are generally unknown to most of the
astronomical community, but are vital to the analysis and visualization of
complex datasets and images. In order for astronomers to take advantage of the
richness and complexity of the new era of data, and to be able to identify,
adopt, and apply new solutions, the astronomical community needs a certain
degree of awareness and understanding of the new concepts. One of the goals of
this paper is to help bridge the gap between applied mathematics, artificial
intelligence and computer science on the one side and astronomy on the other.Comment: 24 pages, 8 Figures, 1 Table. Accepted for publication: "Advances in
Astronomy, special issue "Robotic Astronomy
Clustering by compression
We present a new method for clustering based on compression. The method
doesn't use subject-specific features or background knowledge, and works as
follows: First, we determine a universal similarity distance, the normalized
compression distance or NCD, computed from the lengths of compressed data files
(singly and in pairwise concatenation). Second, we apply a hierarchical
clustering method. The NCD is universal in that it is not restricted to a
specific application area, and works across application area boundaries. A
theoretical precursor, the normalized information distance, co-developed by one
of the authors, is provably optimal but uses the non-computable notion of
Kolmogorov complexity. We propose precise notions of similarity metric, normal
compressor, and show that the NCD based on a normal compressor is a similarity
metric that approximates universality. To extract a hierarchy of clusters from
the distance matrix, we determine a dendrogram (binary tree) by a new quartet
method and a fast heuristic to implement it. The method is implemented and
available as public software, and is robust under choice of different
compressors. To substantiate our claims of universality and robustness, we
report evidence of successful application in areas as diverse as genomics,
virology, languages, literature, music, handwritten digits, astronomy, and
combinations of objects from completely different domains, using statistical,
dictionary, and block sorting compressors. In genomics we presented new
evidence for major questions in Mammalian evolution, based on
whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta
hypothesis against the Theria hypothesis.Comment: LaTeX, 27 pages, 20 figure
Morphonette: a morphological network of French
This paper describes in details the first version of Morphonette, a new
French morphological resource and a new radically lexeme-based method of
morphological analysis. This research is grounded in a paradigmatic conception
of derivational morphology where the morphological structure is a structure of
the entire lexicon and not one of the individual words it contains. The
discovery of this structure relies on a measure of morphological similarity
between words, on formal analogy and on the properties of two morphological
paradigms
- …