534 research outputs found

    The Data Big Bang and the Expanding Digital Universe: High-Dimensional, Complex and Massive Data Sets in an Inflationary Epoch

    Get PDF
    Recent and forthcoming advances in instrumentation, and giant new surveys, are creating astronomical data sets that are not amenable to the methods of analysis familiar to astronomers. Traditional methods are often inadequate not merely because of the size in bytes of the data sets, but also because of the complexity of modern data sets. Mathematical limitations of familiar algorithms and techniques in dealing with such data sets create a critical need for new paradigms for the representation, analysis and scientific visualization (as opposed to illustrative visualization) of heterogeneous, multiresolution data across application domains. Some of the problems presented by the new data sets have been addressed by other disciplines such as applied mathematics, statistics and machine learning and have been utilized by other sciences such as space-based geosciences. Unfortunately, valuable results pertaining to these problems are mostly to be found only in publications outside of astronomy. Here we offer brief overviews of a number of concepts, techniques and developments, some "old" and some new. These are generally unknown to most of the astronomical community, but are vital to the analysis and visualization of complex datasets and images. In order for astronomers to take advantage of the richness and complexity of the new era of data, and to be able to identify, adopt, and apply new solutions, the astronomical community needs a certain degree of awareness and understanding of the new concepts. One of the goals of this paper is to help bridge the gap between applied mathematics, artificial intelligence and computer science on the one side and astronomy on the other.Comment: 24 pages, 8 Figures, 1 Table. Accepted for publication: "Advances in Astronomy, special issue "Robotic Astronomy

    Recommender Systems

    Get PDF
    The ongoing rapid expansion of the Internet greatly increases the necessity of effective recommender systems for filtering the abundant information. Extensive research for recommender systems is conducted by a broad range of communities including social and computer scientists, physicists, and interdisciplinary researchers. Despite substantial theoretical and practical achievements, unification and comparison of different approaches are lacking, which impedes further advances. In this article, we review recent developments in recommender systems and discuss the major challenges. We compare and evaluate available algorithms and examine their roles in the future developments. In addition to algorithms, physical aspects are described to illustrate macroscopic behavior of recommender systems. Potential impacts and future directions are discussed. We emphasize that recommendation has a great scientific depth and combines diverse research fields which makes it of interests for physicists as well as interdisciplinary researchers.Comment: 97 pages, 20 figures (To appear in Physics Reports

    Navigating Diverse Datasets in the Face of Uncertainty

    Get PDF
    When exploring big volumes of data, one of the challenging aspects is their diversity of origin. Multiple files that have not yet been ingested into a database system may contain information of interest to a researcher, who must curate, understand and sieve their content before being able to extract knowledge. Performance is one of the greatest difficulties in exploring these datasets. On the one hand, examining non-indexed, unprocessed files can be inefficient. On the other hand, any processing before its understanding introduces latency and potentially un- necessary work if the chosen schema matches poorly the data. We have surveyed the state-of-the-art and, fortunately, there exist multiple proposal of solutions to handle data in-situ performantly. Another major difficulty is matching files from multiple origins since their schema and layout may not be compatible or properly documented. Most surveyed solutions overlook this problem, especially for numeric, uncertain data, as is typical in fields like astronomy. The main objective of our research is to assist data scientists during the exploration of unprocessed, numerical, raw data distributed across multiple files based solely on its intrinsic distribution. In this thesis, we first introduce the concept of Equally-Distributed Dependencies, which provides the foundations to match this kind of dataset. We propose PresQ, a novel algorithm that finds quasi-cliques on hypergraphs based on their expected statistical properties. The probabilistic approach of PresQ can be successfully exploited to mine EDD between diverse datasets when the underlying populations can be assumed to be the same. Finally, we propose a two-sample statistical test based on Self-Organizing Maps (SOM). This method can outperform, in terms of power, other classifier-based two- sample tests, being in some cases comparable to kernel-based methods, with the advantage of being interpretable. Both PresQ and the SOM-based statistical test can provide insights that drive serendipitous discoveries

    Hypergraph models of biological networks to identify genes critical to pathogenic viral response

    Get PDF
    BACKGROUND: Representing biological networks as graphs is a powerful approach to reveal underlying patterns, signatures, and critical components from high-throughput biomolecular data. However, graphs do not natively capture the multi-way relationships present among genes and proteins in biological systems. Hypergraphs are generalizations of graphs that naturally model multi-way relationships and have shown promise in modeling systems such as protein complexes and metabolic reactions. In this paper we seek to understand how hypergraphs can more faithfully identify, and potentially predict, important genes based on complex relationships inferred from genomic expression data sets. RESULTS: We compiled a novel data set of transcriptional host response to pathogenic viral infections and formulated relationships between genes as a hypergraph where hyperedges represent significantly perturbed genes, and vertices represent individual biological samples with specific experimental conditions. We find that hypergraph betweenness centrality is a superior method for identification of genes important to viral response when compared with graph centrality. CONCLUSIONS: Our results demonstrate the utility of using hypergraphs to represent complex biological systems and highlight central important responses in common to a variety of highly pathogenic viruses
    corecore