Exploration of Parameter Spaces in a Virtual Observatory
Like every other field of intellectual endeavor, astronomy is being
revolutionised by the advances in information technology. There is an ongoing
exponential growth in the volume, quality, and complexity of astronomical data
sets, mainly through large digital sky surveys and archives. The Virtual
Observatory (VO) concept represents a scientific and technological framework
needed to cope with this data flood. Systematic exploration of the observable
parameter spaces, covered by large digital sky surveys spanning a range of
wavelengths, will be one of the primary modes of research with a VO. This is
where the truly new discoveries will be made, and new insights be gained about
the already known astronomical objects and phenomena. We review some of the
methodological challenges posed by the analysis of large and complex data sets
expected in VO-based research. The challenges are driven by the size
and the complexity of the data sets (billions of data vectors in parameter
spaces of tens or hundreds of dimensions), by the heterogeneity of the data and
measurement errors, including differences in basic survey parameters for the
federated data sets (e.g., in the positional accuracy and resolution,
wavelength coverage, time baseline, etc.), various selection effects, as well
as the intrinsic clustering properties (functional form, topology) of the data
distributions in the parameter spaces of observed attributes. Answering these
challenges will require substantial collaborative efforts and partnerships
between astronomers, computer scientists, and statisticians.
Comment: Invited review, 10 pages, LaTeX file with 4 eps figures, style files
included. To appear in Proc. SPIE, v. 4477 (2001).
Data-Mining a Large Digital Sky Survey: From the Challenges to the Scientific Results
The analysis and an efficient scientific exploration of the Digital Palomar
Observatory Sky Survey (DPOSS) represents a major technical challenge. The
input data set consists of 3 Terabytes of pixel information, and contains a few
billion sources. We describe some of the specific scientific problems posed by
the data, including searches for distant quasars and clusters of galaxies, and
the data-mining techniques we are exploring in addressing them.
Machine-assisted discovery methods may become essential for the analysis of
such multi-Terabyte data sets. New and future approaches involve unsupervised
classification and clustering analysis in the Giga-object data space, including
various Bayesian techniques. In addition to the searches for known types of
objects in this data base, these techniques may also offer the possibility of
discovering previously unknown, rare types of astronomical objects.
Comment: Invited paper, to appear in Applications of Digital Image Processing
XX, ed. A. Tescher, Proc. S.P.I.E. vol. 3164, in press; 10 pages, a
self-contained TeX file, and 3 separate postscript figures.
Distances and classification of amino acids for different protein secondary structures
Window profiles of amino acids in protein sequences are taken as a
description of the amino acid environment. The relative entropy or
Kullback-Leibler distance derived from profiles is used as a measure of
dissimilarity for comparison of amino acids and secondary structure
conformations. Distance matrices of amino acid pairs at different conformations
are obtained, which display a non-negligible dependence of amino acid
similarity on conformations. Based on the conformation-specific distances,
clustering analysis for amino acids is conducted.
Comment: 15 pages, 8 figures.
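
The relative-entropy comparison described in this abstract can be illustrated with a minimal sketch. It assumes each profile is a table of residue counts at a single window position, smoothed with a pseudocount, and it symmetrises the Kullback-Leibler distance by averaging both directions; the abstract does not state these details, so treat them as assumptions. The names `normalise`, `relative_entropy`, `symmetric_distance` and the count tables are hypothetical, not taken from the paper.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def normalise(counts, pseudocount=1.0):
    """Turn raw residue counts into a probability profile.

    The pseudocount (an assumption of this sketch) keeps every
    probability strictly positive so the logarithm is defined.
    """
    total = sum(counts.get(a, 0.0) + pseudocount for a in AMINO_ACIDS)
    return {a: (counts.get(a, 0.0) + pseudocount) / total for a in AMINO_ACIDS}


def relative_entropy(p, q):
    """Kullback-Leibler distance D(p || q), in bits."""
    return sum(p[a] * math.log2(p[a] / q[a]) for a in AMINO_ACIDS)


def symmetric_distance(p, q):
    """Average of D(p || q) and D(q || p); an assumed symmetrisation."""
    return 0.5 * (relative_entropy(p, q) + relative_entropy(q, p))


# Made-up counts of neighbouring residues in two conformations.
helix_counts = {"A": 30, "L": 25, "E": 20, "K": 10, "G": 2}
sheet_counts = {"V": 28, "I": 22, "Y": 12, "T": 10, "A": 5}

p = normalise(helix_counts)
q = normalise(sheet_counts)
d = symmetric_distance(p, q)
print(f"symmetric KL distance: {d:.3f} bits")
```

Filling a matrix with `symmetric_distance` for every pair of profiles would give the kind of distance matrix that a clustering step could take as input.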
Exploration of Large Digital Sky Surveys
We review some of the scientific opportunities and technical challenges posed
by the exploration of the large digital sky surveys, in the context of a
Virtual Observatory (VO). The VO paradigm will profoundly change the way
observational astronomy is done. Clustering analysis techniques can be used to
discover samples of rare, unusual, or even previously unknown types of
astronomical objects and phenomena. Exploration of the previously poorly probed
portions of the observable parameter space is especially promising. We
illustrate some of the possible types of studies with examples drawn from
DPOSS; much more complex and interesting applications are forthcoming.
Development of the new tools needed for an efficient exploration of these vast
data sets requires a synergy between astronomy and information sciences, with
great potential returns for both fields.
Comment: To appear in: Mining the Sky, eds. A. Banday et al., ESO Astrophysics
Symposia, Berlin: Springer Verlag, in press (2001). LaTeX file, 18 pages, 6
encapsulated postscript figures, style files included.
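
As a rough illustration of searching a parameter space for rare objects, the sketch below scores each source by the distance to its k-th nearest neighbour, so that isolated points stand out. This is a simple stand-in chosen for brevity, not the clustering analysis used by the authors, and the catalogue values, the function name `knn_outlier_scores`, and the choice of two colour indices as parameters are all hypothetical.

```python
import math


def knn_outlier_scores(points, k=2):
    """Return, for each point, the distance to its k-th nearest neighbour.

    Larger scores mean the point sits in a sparsely populated part of
    the parameter space. Brute force, O(n^2); fine for a toy example only.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Hypothetical catalogue: two colour indices per source.
sources = [
    (0.50, 0.30),
    (0.52, 0.31),
    (0.49, 0.29),
    (0.51, 0.33),
    (0.48, 0.30),
    (2.10, -0.40),  # deliberately placed far from the main locus
]

scores = knn_outlier_scores(sources, k=2)
rarest = max(range(len(sources)), key=scores.__getitem__)
print("candidate rare object:", sources[rarest])
```

A catalogue of billions of sources would need a spatial index or an approximate neighbour search instead of this brute-force loop.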
QUEST: Content-based Access to Geophysical Databases
A major challenge facing geophysical science today is the unavailability of high-level analysis tools with which to study the massive amount of data produced by sensors or long simulations of climate models. As part of a NASA HPCC Grand Challenge effort [Mun92], we have developed a prototype environment called QUEST to provide content-based query access to massive datasets used in geophysical applications. QUEST employs work stations as well as massively parallel processors to produce spatio-temporal features that are used as high-level indexes into terabyte datasets. This paper discusses our continued development of the QUEST environment.
1 Introduction
A critical challenge facing geophysical science today is the unavailability of high-level analysis tools with which to study the massive amount of information captured by sensors onboard orbiting satellites or produced by climate models. To address this challenge, we must develop a new generation of systems for scientific data managem..
Automatic detection of conserved RNA structure elements in complete RNA virus genomes.
We propose a new method for detecting conserved RNA secondary structures in a family of related RNA sequences. Our method is based on a combination of thermodynamic structure prediction and phylogenetic comparison. In contrast to purely phylogenetic methods, our algorithm can be used for small data sets of approximately 10 sequences, efficiently exploiting the information contained in the sequence variability. The procedure constructs a prediction only for those parts of sequences that are consistent with a single conserved structure. Our implementation produces reasonable consensus structures without user interference. As an example we have analysed the complete HIV-1 and hepatitis C virus (HCV) genomes as well as the small segment of hantavirus. Our method confirms the known structures in HIV-1 and predicts previously unknown conserved RNA secondary structures in HCV.
Fast Spatio-Temporal Data Mining of Large Geophysical Datasets
The important scientific challenge of understanding global climate change is one that clearly requires the application of knowledge discovery and data-mining techniques on a massive scale. Advances in parallel supercomputing technology, enabling high-resolution modeling, as well as in sensor technology, allowing data capture on an unprecedented scale, conspire to overwhelm present-day analysis approaches. We present here early experiences with a prototype exploratory data analysis environment, CONQUEST, designed to provide content-based access to such massive scientific datasets. CONQUEST (CONtent-based Querying in Space and Time) employs a combination of workstations and massively parallel processors (MPPs) to mine geophysical datasets possessing a prominent temporal component. It is designed to enable complex multi-modal interactive querying and knowledge discovery, while simultaneously coping with the extraordinary computational demands posed by the scope of the datasets involved. A..