4,940 research outputs found
Partial Replica Location And Selection For Spatial Datasets
As the size of scientific datasets continues to grow, we will not be able to store enormous datasets on a single grid node, but must distribute them across many grid nodes. The implementation of partial or incomplete replicas, which represent only a subset of a larger dataset, has been an active topic of research. Partial Spatial Replicas extend this functionality to spatial data, allowing us to distribute a spatial dataset in pieces over several locations. We investigate solutions to the partial spatial replica selection problem. First, we describe and develop two designs for a Spatial Replica Location Service (SRLS), which must return the set of replicas that intersect with a query region. Integrating a relational database, a spatial data structure and grid computing software, we build a scalable solution that works well even for several million replicas. In our SRLS, we have improved performance by designing an R-tree structure in the backend database, and by aggregating several queries into one larger query, which reduces overhead. We also use the Morton space-filling curve during R-tree construction, which improves spatial locality. In addition, we describe R-tree Prefetching (RTP), which effectively utilizes modern multi-processor architectures. Second, we present and implement a fast replica selection algorithm in which a set of partial replicas is chosen from a set of candidates so that retrieval performance is maximized. Using an R-tree based heuristic algorithm, we achieve O(n log n) complexity for this NP-complete problem. We describe a model for disk access performance that takes filesystem prefetching into account and is sufficiently accurate for spatial replica selection. Making a few simplifying assumptions, we present a fast replica selection algorithm for partial spatial replicas.
The algorithm uses a greedy approach that attempts to maximize performance by choosing a collection of replica subsets that allow fast data retrieval by a client machine. Experiments show that the performance of the solution found by our algorithm is always at least 91% and 93.4% of the performance of the optimal solution in 4-node and 8-node tests, respectively.
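The Morton-curve ordering the abstract mentions interleaves the bits of a point's coordinates so that nearby points tend to get nearby codes; sorting replica centres by this code before bulk-loading an R-tree improves spatial locality. The following Python is an illustrative sketch of the idea, not the paper's implementation; the centre coordinates are hypothetical:

```python
def morton_code(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y to form a Z-order (Morton) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return code

# Hypothetical replica bounding-box centres. Sorting them by Morton code
# keeps spatially close replicas adjacent in the sorted order, which is
# the locality property exploited during bottom-up R-tree construction.
replicas = [(3, 5), (0, 0), (7, 7), (1, 1)]
ordered = sorted(replicas, key=lambda p: morton_code(p[0], p[1]))
```

Points that are close in the plane mostly share high-order Morton bits, so consecutive entries in `ordered` land in the same or neighbouring R-tree leaves.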
A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing
Data Grids have been adopted as the platform for scientific communities that
need to share, access, transport, process and manage large data collections
distributed worldwide. They combine high-end computing technologies with
high-performance networking and wide-area storage management techniques. In
this paper, we discuss the key concepts behind Data Grids and compare them with
other data sharing and distribution paradigms such as content delivery
networks, peer-to-peer networks and distributed databases. We then provide
comprehensive taxonomies that cover various aspects of architecture, data
transportation, data replication and resource allocation and scheduling.
Finally, we map the proposed taxonomy to various Data Grid systems not only to
validate the taxonomy but also to identify areas for future exploration.
Through this taxonomy, we aim to categorise existing systems to better
understand their goals and their methodology. This would help evaluate their
applicability for solving similar problems. This taxonomy also provides a "gap
analysis" of this area through which researchers can potentially identify new
issues for investigation. Finally, we hope that the proposed taxonomy and
mapping also helps to provide an easy way for new practitioners to understand
this complex area of research. Comment: 46 pages, 16 figures, Technical Report
An iterative approach for generating statistically realistic populations of households
Background: Many different simulation frameworks, in different topics, need
to treat realistic datasets to initialize and calibrate the system. A precise
reproduction of initial states is extremely important to obtain reliable
forecast from the model. Methodology/Principal Findings: This paper proposes an
algorithm to create an artificial population where individuals are described by
their age, and are gathered in households respecting a variety of statistical
constraints (distribution of household types, sizes, age of household head,
difference of age between partners and among parents and children). Such a
population is often the initial state of microsimulation or (agent)
individual-based models. To get a realistic distribution of households is often
very important, because this distribution has an impact on the demographic
evolution. Usual techniques from the microsimulation approach cross different
sources of aggregated data to generate individuals. In our case the number
of combinations of different households (types, sizes, age of participants)
makes it computationally difficult to use directly such methods. Hence we
developed a specific algorithm to make the problem more easily tractable.
Conclusions/Significance: We generate the populations of two pilot
municipalities in the Auvergne region (France) to illustrate the approach. The
generated populations show good agreement with the available statistical
datasets (not used for the generation) and are obtained in a reasonable
computational time. Comment: 16 pages, 11 figures
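The core idea of household synthesis can be sketched as: draw a household type from its distribution, then draw member ages subject to the stated constraints (partner age difference, parent-child age gap). The Python below is a minimal sketch with made-up marginal distributions and constraint bounds; the paper's algorithm iteratively fits these to real census data:

```python
import random

random.seed(42)

# Hypothetical household-type distribution standing in for census marginals.
HOUSEHOLD_TYPES = {"single": 0.35, "couple": 0.40, "couple+child": 0.25}

def draw(dist):
    """Sample a key from a {value: probability} distribution."""
    r, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r < acc:
            return value
    return value  # guard against floating-point shortfall

def generate_household():
    """One household: a type plus member ages under simple constraints."""
    htype = draw(HOUSEHOLD_TYPES)
    head_age = random.randint(20, 90)
    ages = [head_age]
    if htype != "single":
        # Partner age: head age plus a bounded difference (constraint).
        ages.append(max(18, head_age + random.randint(-5, 5)))
    if htype == "couple+child":
        # A child must be at least 18 years younger than the head.
        ages.append(random.randint(0, head_age - 18))
    return htype, ages

population = [generate_household() for _ in range(1000)]
```

By construction every generated household satisfies its size and age constraints exactly; matching the joint distributions of the real data is the part that needs the paper's iterative adjustment.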
A Framework for Developing Real-Time OLAP algorithm using Multi-core processing and GPU: Heterogeneous Computing
The rapidly growing volume of stored data has spurred researchers to seek
methods for exploiting it optimally, most of which face a response-time
problem caused by the enormous size of the data. Most proposed solutions
favour materialization; however, materialization alone cannot attain
real-time answers. In this paper we propose a framework that lays out the
barriers, and suggested solutions, on the way to achieving the real-time
OLAP answers that are heavily used in decision support systems and data
warehouses.
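The alternative to a pre-materialized cube is to aggregate on demand, partitioning the fact table across cores and merging partial results. The Python below is a toy sketch of that pattern (not the paper's framework); the fact table, column names, and two-worker split are all invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

# Hypothetical fact table: (region, sales) rows. A real-time answer
# aggregates these on demand instead of reading a precomputed cube.
rows = [("EU", 10), ("US", 20), ("EU", 5), ("US", 15), ("APAC", 7)]

def partial_rollup(chunk):
    """Aggregate one partition; partitions can run on separate cores."""
    acc = Counter()
    for region, sales in chunk:
        acc[region] += sales
    return acc

def parallel_rollup(rows, workers=2):
    """Split rows into chunks, roll up each in parallel, merge the partials."""
    size = (len(rows) + workers - 1) // workers
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_rollup, chunks))
    total = Counter()
    for p in partials:
        total += p
    return dict(total)
```

The merge step is cheap because partial aggregates are small; the same partition-then-merge shape is what lets the work spread across CPU cores or a GPU.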
Linear chemically sensitive electron tomography using DualEELS and dictionary-based compressed sensing
We have investigated the use of DualEELS in elementally sensitive tilt series tomography in the scanning transmission electron microscope. A procedure is implemented using deconvolution to remove the effects of multiple scattering, followed by normalisation by the zero loss peak intensity. This is performed to produce a signal that is linearly dependent on the projected density of the element in each pixel. This method is compared with one that does not include deconvolution (although normalisation by the zero loss peak intensity is still performed). Additionally, we compare the 3D reconstruction using a new compressed sensing algorithm, DLET, with the well-established SIRT algorithm. VC precipitates, which are extracted from a steel on a carbon replica, are used in this study. It is found that the use of this linear signal results in a very even density throughout the precipitates. However, when deconvolution is omitted, a slight density reduction is observed in the cores of the precipitates (a so-called cupping artefact). Additionally, it is clearly demonstrated that the 3D morphology is much better reproduced using the DLET algorithm, with very little elongation in the missing wedge direction. It is therefore concluded that reliable elementally sensitive tilt tomography using EELS requires the appropriate use of DualEELS together with a suitable reconstruction algorithm, such as the compressed sensing based reconstruction algorithm used here, to make the best use of the limited data volume and signal-to-noise ratio inherent in core-loss EELS.
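The deconvolve-then-normalise step can be illustrated with a Fourier-ratio-style sketch: divide out the low-loss response in Fourier space to remove plural scattering, then scale by the zero-loss peak (ZLP) intensity so the result is linear in projected density. The NumPy code below is a heavily simplified toy stand-in for the paper's DualEELS procedure; the ZLP estimate, the regularisation constant `eps`, and the test spectra are all invented, and real data would need windowing, proper ZLP extraction, and noise handling:

```python
import numpy as np

def linearise_eels(core_loss, low_loss, eps=1e-6):
    """Toy Fourier-ratio deconvolution: remove plural scattering from the
    core-loss spectrum using the simultaneously acquired low-loss spectrum,
    then normalise by the zero-loss peak intensity."""
    zlp_intensity = low_loss.max()  # crude ZLP proxy for this sketch only
    CL = np.fft.fft(core_loss)
    LL = np.fft.fft(low_loss)
    # Dividing by the low-loss transform undoes the repeated convolution
    # with the low-loss response; eps guards against division blow-up.
    single = np.real(np.fft.ifft(CL * zlp_intensity / (LL + eps)))
    return single / zlp_intensity
```

When the low-loss spectrum is a pure delta (no plural scattering at all), the deconvolution should be a no-op and return the core-loss spectrum unchanged, which is the sanity check used below.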
CrY2H-seq: a massively multiplexed assay for deep-coverage interactome mapping.
Broad-scale protein-protein interaction mapping is a major challenge given the cost, time, and sensitivity constraints of existing technologies. Here, we present a massively multiplexed yeast two-hybrid method, CrY2H-seq, which uses a Cre recombinase interaction reporter to intracellularly fuse the coding sequences of two interacting proteins and next-generation DNA sequencing to identify these interactions en masse. We applied CrY2H-seq to investigate sparsely annotated Arabidopsis thaliana transcription factor interactions. By performing ten independent screens testing a total of 36 million binary interaction combinations, and uncovering a network of 8,577 interactions among 1,453 transcription factors, we demonstrate CrY2H-seq's improved screening capacity, efficiency, and sensitivity over those of existing technologies. The deep-coverage network resource, which we call AtTFIN-1, recapitulates one-third of previously reported interactions derived from diverse methods, expands the number of known plant transcription factor interactions three-fold, and reveals previously unknown family-specific interaction module associations with plant reproductive development, root architecture, and circadian coordination.
eSPRESSO: topological clustering of single-cell transcriptomics data to reveal informative genes for spatio–temporal architectures of cells
[Background] Bioinformatics capability to analyze spatio–temporal dynamics of gene expression is essential in understanding animal development. Animal cells are spatially organized as functional tissues where cellular gene expression data contain information that governs morphogenesis during the developmental process. Although several computational tissue reconstruction methods using transcriptomics data have been proposed, those methods have been ineffective in arranging cells in their correct positions in tissues or organs unless spatial information is explicitly provided. [Results] This study demonstrates that stochastic self-organizing map clustering, with Markov chain Monte Carlo calculations for optimizing informative genes, effectively reconstructs any spatio–temporal topology of cells from their transcriptome profiles with only a coarse topological guideline. The method, eSPRESSO (enhanced SPatial REconstruction by Stochastic Self-Organizing Map), provides a powerful in silico spatio–temporal tissue reconstruction capability, as confirmed by using human embryonic heart and mouse embryo, brain, embryonic heart, and liver lobule with generally high reproducibility (average max. accuracy = 92.0%), while revealing topologically informative genes, or spatial discriminator genes. Furthermore, eSPRESSO was used for temporal analysis of human pancreatic organoids to infer rational developmental trajectories with several candidate 'temporal' discriminator genes responsible for various cell type differentiations. [Conclusions] eSPRESSO provides a novel strategy for analyzing mechanisms underlying the spatio–temporal formation of cellular organizations.
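The self-organizing map (SOM) at the heart of this approach arranges high-dimensional expression profiles along a coarse topology by repeatedly pulling the best-matching map unit, and its topological neighbours, toward each sample. The NumPy sketch below is a minimal 1-D SOM as a toy analogue of that step only; the eSPRESSO-specific MCMC gene-selection layer is omitted, and the data, map size, and learning parameters are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def som_fit(data, n_units=5, epochs=50, lr=0.5, sigma=1.0):
    """Minimal 1-D self-organizing map: arrange samples along a chain of
    map units so that similar profiles land on nearby units."""
    dim = data.shape[1]
    units = rng.normal(size=(n_units, dim))
    for _ in range(epochs):
        for x in data:
            # Best-matching unit: the closest prototype to this sample.
            bmu = np.argmin(((units - x) ** 2).sum(axis=1))
            # Gaussian neighbourhood on the 1-D map topology.
            dist = np.abs(np.arange(n_units) - bmu)
            h = np.exp(-dist ** 2 / (2 * sigma ** 2))
            # Pull the BMU strongly, its neighbours weakly, toward x.
            units += lr * h[:, None] * (x - units)
    return units

# Two well-separated synthetic "cell groups" (3 genes each): they should
# end up mapped to different units on the chain.
data = np.vstack([rng.normal(0, 0.1, (20, 3)), rng.normal(5, 0.1, (20, 3))])
units = som_fit(data)
```

The neighbourhood kernel is what preserves topology: because neighbours move together, the trained chain orders the groups along the map, which is the property the tissue-reconstruction method exploits (with MCMC choosing which genes make that ordering informative).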