29 research outputs found

    Exploration of Parameter Spaces in a Virtual Observatory

    Like every other field of intellectual endeavor, astronomy is being revolutionised by the advances in information technology. There is an ongoing exponential growth in the volume, quality, and complexity of astronomical data sets, mainly through large digital sky surveys and archives. The Virtual Observatory (VO) concept represents a scientific and technological framework needed to cope with this data flood. Systematic exploration of the observable parameter spaces, covered by large digital sky surveys spanning a range of wavelengths, will be one of the primary modes of research with a VO. This is where the truly new discoveries will be made, and new insights gained about already known astronomical objects and phenomena. We review some of the methodological challenges posed by the analysis of the large and complex data sets expected in VO-based research. The challenges are driven by the size and complexity of the data sets (billions of data vectors in parameter spaces of tens or hundreds of dimensions), by the heterogeneity of the data and measurement errors, including differences in basic survey parameters for the federated data sets (e.g., in the positional accuracy and resolution, wavelength coverage, time baseline, etc.), by various selection effects, and by the intrinsic clustering properties (functional form, topology) of the data distributions in the parameter spaces of observed attributes. Answering these challenges will require substantial collaborative efforts and partnerships between astronomers, computer scientists, and statisticians. Comment: Invited review, 10 pages, LaTeX file with 4 eps figures, style files included. To appear in Proc. SPIE, v. 4477 (2001).

    Data-Mining a Large Digital Sky Survey: From the Challenges to the Scientific Results

    The analysis and efficient scientific exploration of the Digital Palomar Observatory Sky Survey (DPOSS) represent a major technical challenge. The input data set consists of 3 Terabytes of pixel information, and contains a few billion sources. We describe some of the specific scientific problems posed by the data, including searches for distant quasars and clusters of galaxies, and the data-mining techniques we are exploring in addressing them. Machine-assisted discovery methods may become essential for the analysis of such multi-Terabyte data sets. New and future approaches involve unsupervised classification and clustering analysis in the Giga-object data space, including various Bayesian techniques. In addition to the searches for known types of objects in this database, these techniques may also offer the possibility of discovering previously unknown, rare types of astronomical objects. Comment: Invited paper, to appear in Applications of Digital Image Processing XX, ed. A. Tescher, Proc. S.P.I.E. vol. 3164, in press; 10 pages, a self-contained TeX file, and 3 separate PostScript figures.

    Distances and classification of amino acids for different protein secondary structures

    Window profiles of amino acids in protein sequences are taken as a description of the amino acid environment. The relative entropy, or Kullback-Leibler distance, derived from the profiles is used as a measure of dissimilarity for comparing amino acids and secondary structure conformations. Distance matrices of amino acid pairs at different conformations are obtained, which display a non-negligible dependence of amino acid similarity on conformation. Based on the conformation-specific distances, a clustering analysis of amino acids is conducted. Comment: 15 pages, 8 figures.
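
    The relative entropy used in the abstract above as a dissimilarity measure can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the pseudocount smoothing, the choice of base-2 logarithms, and the symmetrized variant are all assumptions made for illustration, and each profile is assumed to be a plain list of non-negative frequencies over the same alphabet.

```python
import math


def kl_divergence(p, q, pseudocount=1e-9):
    """Relative entropy D(p || q), in bits, of two discrete distributions.

    `p` and `q` are sequences of non-negative weights over the same
    categories (e.g. amino acid frequencies at one window position).
    The pseudocount is an assumption of this sketch: it avoids log(0)
    when a category is absent from one profile.
    """
    if len(p) != len(q):
        raise ValueError("profiles must have the same length")
    p_smoothed = [x + pseudocount for x in p]
    q_smoothed = [x + pseudocount for x in q]
    p_total = sum(p_smoothed)
    q_total = sum(q_smoothed)
    divergence = 0.0
    for a, b in zip(p_smoothed, q_smoothed):
        pa = a / p_total  # renormalize after smoothing
        qb = b / q_total
        divergence += pa * math.log2(pa / qb)
    return divergence


def symmetric_kl(p, q):
    """Symmetrized variant (hypothetical choice): mean of D(p||q) and D(q||p)."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


if __name__ == "__main__":
    # Two toy three-category profiles; the values are invented for illustration.
    profile_a = [0.5, 0.3, 0.2]
    profile_b = [0.2, 0.3, 0.5]
    print(kl_divergence(profile_a, profile_b))
    print(symmetric_kl(profile_a, profile_b))
```

    The plain relative entropy is not symmetric in its arguments, so a pairwise distance matrix built from it depends on argument order; the symmetrized form is one common way around that, though the abstract does not say which form was used.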

    Exploration of Large Digital Sky Surveys

    We review some of the scientific opportunities and technical challenges posed by the exploration of the large digital sky surveys, in the context of a Virtual Observatory (VO). The VO paradigm will profoundly change the way observational astronomy is done. Clustering analysis techniques can be used to discover samples of rare, unusual, or even previously unknown types of astronomical objects and phenomena. Exploration of the previously poorly probed portions of the observable parameter space is especially promising. We illustrate some of the possible types of studies with examples drawn from DPOSS; much more complex and interesting applications are forthcoming. Development of the new tools needed for an efficient exploration of these vast data sets requires a synergy between astronomy and information sciences, with great potential returns for both fields. Comment: To appear in: Mining the Sky, eds. A. Banday et al., ESO Astrophysics Symposia, Berlin: Springer Verlag, in press (2001). LaTeX file, 18 pages, 6 encapsulated PostScript figures, style files included.

    Mining scientific data


    QUEST: Content-based Access to Geophysical Databases

    A major challenge facing geophysical science today is the unavailability of high-level analysis tools with which to study the massive amount of data produced by sensors or long simulations of climate models. As part of a NASA HPCC Grand Challenge effort [Mun92], we have developed a prototype environment called QUEST to provide content-based query access to massive datasets used in geophysical applications. QUEST employs workstations as well as massively parallel processors to produce spatio-temporal features that are used as high-level indexes into terabyte datasets. This paper discusses our continued development of the QUEST environment.

    1 Introduction

    A critical challenge facing geophysical science today is the unavailability of high-level analysis tools with which to study the massive amount of information captured by sensors onboard orbiting satellites or produced by climate models. To address this challenge, we must develop a new generation of systems for scientific data managem..

    Automatic detection of conserved RNA structure elements in complete RNA virus genomes.

    We propose a new method for detecting conserved RNA secondary structures in a family of related RNA sequences. Our method is based on a combination of thermodynamic structure prediction and phylogenetic comparison. In contrast to purely phylogenetic methods, our algorithm can be used for small data sets of approximately 10 sequences, efficiently exploiting the information contained in the sequence variability. The procedure constructs a prediction only for those parts of sequences that are consistent with a single conserved structure. Our implementation produces reasonable consensus structures without user interference. As an example we have analysed the complete HIV-1 and hepatitis C virus (HCV) genomes as well as the small segment of hantavirus. Our method confirms the known structures in HIV-1 and predicts previously unknown conserved RNA secondary structures in HCV.

    Fast Spatio-Temporal Data Mining of Large Geophysical Datasets

    The important scientific challenge of understanding global climate change is one that clearly requires the application of knowledge discovery and data-mining techniques on a massive scale. Advances in parallel supercomputing technology, enabling high-resolution modeling, as well as in sensor technology, allowing data capture on an unprecedented scale, conspire to overwhelm present-day analysis approaches. We present here early experiences with a prototype exploratory data analysis environment, CONQUEST, designed to provide content-based access to such massive scientific datasets. CONQUEST (CONtent-based Querying in Space and Time) employs a combination of workstations and massively parallel processors (MPPs) to mine geophysical datasets possessing a prominent temporal component. It is designed to enable complex multi-modal interactive querying and knowledge discovery, while simultaneously coping with the extraordinary computational demands posed by the scope of the datasets involved. A..