
    Partial Replica Location And Selection For Spatial Datasets

    As the size of scientific datasets continues to grow, we will not be able to store enormous datasets on a single grid node, but must distribute them across many grid nodes. The implementation of partial or incomplete replicas, which represent only a subset of a larger dataset, has been an active topic of research. Partial spatial replicas extend this functionality to spatial data, allowing us to distribute a spatial dataset in pieces over several locations. We investigate solutions to the partial spatial replica location and selection problems. First, we describe and develop two designs for a Spatial Replica Location Service (SRLS), which must return the set of replicas that intersect a query region. Integrating a relational database, a spatial data structure and grid computing software, we build a scalable solution that works well even for several million replicas. In our SRLS, we improve performance by designing an R-tree structure in the backend database and by aggregating several queries into one larger query, which reduces overhead. We also use the Morton space-filling curve during R-tree construction, which improves spatial locality. In addition, we describe R-tree Prefetching (RTP), which effectively utilizes modern multi-processor architectures. Second, we present and implement a fast replica selection algorithm in which a set of partial replicas is chosen from a set of candidates so that retrieval performance is maximized. Using an R-tree based heuristic algorithm, we achieve O(n log n) complexity for this NP-complete problem. We describe a model for disk access performance that takes filesystem prefetching into account and is sufficiently accurate for spatial replica selection. Making a few simplifying assumptions, we present a fast replica selection algorithm for partial spatial replicas.
The algorithm uses a greedy approach that attempts to maximize performance by choosing a collection of replica subsets that allow fast data retrieval by a client machine. Experiments show that the solution found by our algorithm achieves on average at least 91% and 93.4% of the performance of the optimal solution in 4-node and 8-node tests, respectively.
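
The Morton (Z-order) ordering this abstract credits for improved spatial locality can be sketched in a few lines. The snippet below is an illustrative example of interleaving coordinate bits and sorting bounding-box centroids before R-tree insertion, not the SRLS implementation; `morton_key` and the sample `replicas` list are hypothetical names.

```python
def morton_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two non-negative integer coordinates into a
    Z-order (Morton) key: x occupies the even bit slots, y the odd ones."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x bit -> even slot
        key |= ((y >> i) & 1) << (2 * i + 1)    # y bit -> odd slot
    return key

# Sorting replica centroids by Morton key keeps spatially close pieces
# adjacent in the insertion order, which improves R-tree locality.
replicas = [(5, 3), (1, 1), (4, 4), (0, 7)]
ordered = sorted(replicas, key=lambda p: morton_key(*p))
```

Bulk-loading an R-tree from this Z-ordered sequence tends to produce leaves whose bounding boxes overlap less than insertion in arbitrary order.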

    A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing

    Data Grids have been adopted as the platform for scientific communities that need to share, access, transport, process and manage large data collections distributed worldwide. They combine high-end computing technologies with high-performance networking and wide-area storage management techniques. In this paper, we discuss the key concepts behind Data Grids and compare them with other data sharing and distribution paradigms such as content delivery networks, peer-to-peer networks and distributed databases. We then provide comprehensive taxonomies that cover various aspects of architecture, data transportation, data replication and resource allocation and scheduling. Finally, we map the proposed taxonomy to various Data Grid systems, not only to validate the taxonomy but also to identify areas for future exploration. Through this taxonomy, we aim to categorise existing systems to better understand their goals and their methodology, which helps evaluate their applicability to similar problems. The taxonomy also provides a "gap analysis" of the area through which researchers can identify new issues for investigation. Finally, we hope that the proposed taxonomy and mapping provide an easy way for new practitioners to understand this complex area of research.

    An iterative approach for generating statistically realistic populations of households

    Background: Many different simulation frameworks, in different topics, need realistic datasets to initialize and calibrate the system. A precise reproduction of initial states is extremely important for obtaining reliable forecasts from the model. Methodology/Principal Findings: This paper proposes an algorithm to create an artificial population in which individuals are described by their age and are gathered into households respecting a variety of statistical constraints (distribution of household types, sizes, age of household head, difference of age between partners and among parents and children). Such a population is often the initial state of microsimulation or (agent) individual-based models. Obtaining a realistic distribution of households is often very important, because this distribution has an impact on the demographic evolution. Usual microsimulation techniques cross different sources of aggregated data to generate individuals. In our case, the number of combinations of different households (types, sizes, ages of participants) makes it computationally difficult to use such methods directly. Hence we developed a specific algorithm to make the problem more tractable. Conclusions/Significance: We generate the populations of two pilot municipalities in the Auvergne region (France) to illustrate the approach. The generated populations show good agreement with the available statistical datasets (not used for the generation) and are obtained in a reasonable computational time.
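
The core idea of drawing household sizes from a target distribution and then filling each household with age-consistent members can be sketched as below. This is a toy illustration under invented constraints (adult head, partners within five years of the head, children 20 to 40 years younger), not the paper's calibrated algorithm or its French census data; `generate_population` and all distributions are hypothetical.

```python
import random

def generate_population(n_households, size_dist, seed=0):
    """Draw each household's size from a target distribution, then fill
    it with members whose ages satisfy simple placeholder constraints."""
    rng = random.Random(seed)
    sizes = list(size_dist)
    weights = [size_dist[s] for s in sizes]
    households = []
    for _ in range(n_households):
        size = rng.choices(sizes, weights=weights)[0]
        head_age = rng.randint(18, 90)            # household head is an adult
        ages = [head_age]
        for _ in range(size - 1):
            if head_age > 40 and rng.random() < 0.5:
                ages.append(head_age - rng.randint(20, 40))          # a child
            else:
                ages.append(max(18, head_age + rng.randint(-5, 5)))  # partner
        households.append(ages)
    return households

# Illustrative size distribution: 35% singles, 35% couples, 30% larger.
pop = generate_population(1000, {1: 0.35, 2: 0.35, 3: 0.15, 4: 0.15})
```

A real generator would then iterate, comparing the generated marginals against the observed ones and resampling until the statistical constraints are met.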

    A Framework for Developing Real-Time OLAP algorithm using Multi-core processing and GPU: Heterogeneous Computing

    The rapidly increasing amount of stored data has spurred researchers to seek methods that take optimal advantage of it, most of which face a response-time problem caused by the enormous size of the data. Most solutions suggest materialization as the favoured approach. However, such a solution cannot attain real-time answers. In this paper we propose a framework illustrating the barriers, and suggested solutions, on the way to achieving real-time OLAP answers, which are widely used in decision support systems and data warehouses.

    Linear chemically sensitive electron tomography using DualEELS and dictionary-based compressed sensing

    We have investigated the use of DualEELS in elementally sensitive tilt series tomography in the scanning transmission electron microscope. A procedure is implemented using deconvolution to remove the effects of multiple scattering, followed by normalisation by the zero loss peak intensity. This is performed to produce a signal that is linearly dependent on the projected density of the element in each pixel. This method is compared with one that does not include deconvolution (although normalisation by the zero loss peak intensity is still performed). Additionally, we compare the 3D reconstruction using a new compressed sensing algorithm, DLET, with the well-established SIRT algorithm. VC precipitates, which are extracted from a steel on a carbon replica, are used in this study. It is found that the use of this linear signal results in a very even density throughout the precipitates. However, when deconvolution is omitted, a slight density reduction is observed in the cores of the precipitates (a so-called cupping artefact). Additionally, it is clearly demonstrated that the 3D morphology is much better reproduced using the DLET algorithm, with very little elongation in the missing wedge direction. It is therefore concluded that reliable elementally sensitive tilt tomography using EELS requires the appropriate use of DualEELS together with a suitable reconstruction algorithm, such as the compressed sensing based reconstruction algorithm used here, to make the best use of the limited data volume and signal-to-noise ratio inherent in core-loss EELS.
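
The SIRT baseline the abstract compares against can be illustrated compactly. The sketch below runs SIRT on a toy two-pixel, two-projection system in plain Python; it is a minimal didactic example of the iteration, not the paper's tomography code or the DLET algorithm, and the `sirt` function name is ours.

```python
def sirt(A, b, iters=200):
    """Simultaneous Iterative Reconstruction Technique: repeatedly
    back-project the row-normalised residual, scaled by column sums,
    with the non-negativity clamp usual in tomography.
    A is the projection matrix (list of rows), b the measured values."""
    m, n = len(A), len(A[0])
    row_sum = [sum(A[i]) or 1.0 for i in range(m)]
    col_sum = [sum(A[i][j] for i in range(m)) or 1.0 for j in range(n)]
    x = [0.0] * n
    for _ in range(iters):
        # residual of each projection, normalised by its row sum
        r = [(b[i] - sum(A[i][j] * x[j] for j in range(n))) / row_sum[i]
             for i in range(m)]
        # back-project, normalise by column sums, clamp to non-negative
        x = [max(x[j] + sum(A[i][j] * r[i] for i in range(m)) / col_sum[j], 0.0)
             for j in range(n)]
    return x

# Two "projections" of a two-pixel object with true densities (1, 2):
x = sirt([[1.0, 1.0], [1.0, 0.0]], [3.0, 1.0])   # converges to ~[1.0, 2.0]
```

With few tilt angles and a missing wedge the system is underdetermined, which is why compressed-sensing reconstructions such as DLET can outperform this baseline.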

    CrY2H-seq: a massively multiplexed assay for deep-coverage interactome mapping.

    Broad-scale protein-protein interaction mapping is a major challenge given the cost, time, and sensitivity constraints of existing technologies. Here, we present a massively multiplexed yeast two-hybrid method, CrY2H-seq, which uses a Cre recombinase interaction reporter to intracellularly fuse the coding sequences of two interacting proteins and next-generation DNA sequencing to identify these interactions en masse. We applied CrY2H-seq to investigate sparsely annotated Arabidopsis thaliana transcription factor interactions. By performing ten independent screens testing a total of 36 million binary interaction combinations, and uncovering a network of 8,577 interactions among 1,453 transcription factors, we demonstrate CrY2H-seq's improved screening capacity, efficiency, and sensitivity over those of existing technologies. The deep-coverage network resource, which we call AtTFIN-1, recapitulates one-third of previously reported interactions derived from diverse methods, expands the number of known plant transcription factor interactions three-fold, and reveals previously unknown family-specific interaction module associations with plant reproductive development, root architecture, and circadian coordination.

    eSPRESSO: topological clustering of single-cell transcriptomics data to reveal informative genes for spatio–temporal architectures of cells

    [Background] Bioinformatics capability to analyze spatio–temporal dynamics of gene expression is essential in understanding animal development. Animal cells are spatially organized as functional tissues where cellular gene expression data contain information that governs morphogenesis during the developmental process. Although several computational tissue reconstruction methods using transcriptomics data have been proposed, those methods have been ineffective in arranging cells in their correct positions in tissues or organs unless spatial information is explicitly provided. [Results] This study demonstrates that stochastic self-organizing map clustering with Markov chain Monte Carlo calculations for optimizing informative genes effectively reconstructs any spatio–temporal topology of cells from their transcriptome profiles with only a coarse topological guideline. The method, eSPRESSO (enhanced SPatial REconstruction by Stochastic Self-Organizing Map), provides a powerful in silico spatio–temporal tissue reconstruction capability, as confirmed using human embryonic heart and mouse embryo, brain, embryonic heart, and liver lobule with generally high reproducibility (average max. accuracy = 92.0%), while revealing topologically informative genes, or spatial discriminator genes. Furthermore, eSPRESSO was used for temporal analysis of human pancreatic organoids to infer rational developmental trajectories with several candidate 'temporal' discriminator genes responsible for various cell type differentiations. [Conclusions] eSPRESSO provides a novel strategy for analyzing mechanisms underlying the spatio–temporal formation of cellular organizations.
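
The Markov chain Monte Carlo search over "informative gene" subsets can be sketched as a Metropolis walk. The code below is a generic illustration of that search pattern only: the `score` callable stands in for eSPRESSO's SOM-based reconstruction quality, which is not reproduced here, and `mcmc_select`, the temperature, and the toy objective are all our own placeholders.

```python
import math
import random

def mcmc_select(genes, score, steps=500, temp=0.1, seed=0):
    """Metropolis search over gene subsets: flip one gene's membership
    per step, always accept improvements, accept worse subsets with
    Boltzmann probability exp(delta/temp), and remember the best subset
    visited."""
    rng = random.Random(seed)
    current = set(rng.sample(genes, k=max(1, len(genes) // 2)))
    cur_s = score(current)
    best, best_s = set(current), cur_s
    for _ in range(steps):
        candidate = set(current) ^ {rng.choice(genes)}  # flip one gene
        if not candidate:
            continue                                    # keep >= 1 gene
        s = score(candidate)
        if s >= cur_s or rng.random() < math.exp((s - cur_s) / temp):
            current, cur_s = candidate, s
            if s > best_s:
                best, best_s = set(candidate), s
    return best

# Toy objective: reward three "informative" genes, penalise the rest.
informative = {0, 1, 2}
found = mcmc_select(list(range(10)),
                    lambda s: len(s & informative) - 0.1 * len(s - informative))
```

In the real method the accepted subsets are the genes whose expression best preserves the cells' topology in the self-organizing map, i.e. the spatial discriminator genes.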