485 research outputs found
A Survey on Array Storage, Query Languages, and Systems
Since scientific investigation is one of the most important providers of
massive amounts of ordered data, there is a renewed interest in array data
processing in the context of Big Data. To the best of our knowledge, a unified
resource that summarizes and analyzes array processing research over its long
existence is currently missing. In this survey, we provide a guide for past,
present, and future research in array processing. The survey is organized along
three main topics. Array storage discusses all the aspects related to array
partitioning into chunks. The identification of a reduced set of array
operators to form the foundation for an array query language is analyzed across
multiple such proposals. Lastly, we survey real systems for array processing.
The result is a thorough survey on array data storage and processing that
should be consulted by anyone interested in this research topic, independent of
experience level. The survey is not complete though. We greatly appreciate
pointers towards any work we might have forgotten to mention. Comment: 44 pages
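As a minimal illustration of the array partitioning into chunks that the survey's storage topic covers, the sketch below maps cell coordinates to regular (fixed-size) chunks. Regular chunking is only one possible strategy, and the function names, the 4x4 array, and the 2x2 chunk shape are hypothetical rather than taken from the survey.

```python
from collections import defaultdict

def chunk_id(coord, chunk_shape):
    """Return the index of the regular chunk that contains a cell coordinate."""
    return tuple(c // s for c, s in zip(coord, chunk_shape))

def partition(cells, chunk_shape):
    """Group (coordinate, value) pairs by the chunk they fall into."""
    chunks = defaultdict(list)
    for coord, value in cells:
        chunks[chunk_id(coord, chunk_shape)].append((coord, value))
    return dict(chunks)

# Hypothetical 4x4 array split into 2x2 chunks.
cells = [((i, j), i * 4 + j) for i in range(4) for j in range(4)]
chunks = partition(cells, (2, 2))
```

Each chunk can then be stored or assigned to a node as a unit, which is what makes the choice of chunk shape matter for query performance.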
MPI-Vector-IO: Parallel I/O and Partitioning for Geospatial Vector Data
Geospatial datasets have recently been growing in size, complexity, and heterogeneity. High-performance systems are needed to analyze such data and produce actionable insights efficiently. For polygonal (a.k.a. vector) datasets, operations such as I/O, data partitioning, communication, and load balancing become challenging in a cluster environment. In this work, we present MPI-Vector-IO, a parallel I/O library that we have designed using MPI-IO specifically for partitioning and reading irregular vector data formats such as Well Known Text. It makes MPI aware of spatial data and spatial primitives, and provides support for spatial data types embedded within collective computation and communication using the MPI message-passing library. These abstractions, along with parallel I/O support, are useful for parallel Geographic Information System (GIS) application development on HPC platforms.
Study of Scalable Declustering Algorithms for Parallel Grid Files
Efficient storage and retrieval of large multidimensional datasets is
an important concern for large-scale scientific computations such as
long-running time-dependent simulations which periodically generate
snapshots of the state.
The main challenge for efficiently handling such datasets
is to minimize response time for multidimensional range queries.
The grid file is one of the well known access methods for
multidimensional and spatial data.
We investigate effective and scalable declustering techniques
for grid files with the primary goal of minimizing response time
and the secondary goal of maximizing the fairness of data distribution.
The main contributions of this paper are (1) analytic and experimental
evaluation of existing index-based declustering techniques and their
extensions for grid files, and (2) development of a proximity-based
declustering algorithm called minimax, which is experimentally
shown to scale and to consistently achieve better response time
compared to available algorithms while maintaining perfect disk distribution.
(Also cross-referenced as UMIACS-TR-96-4)
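To make the idea of index-based declustering concrete, here is a minimal sketch of the disk modulo scheme, a well-known index-based method. It is not the minimax algorithm the abstract proposes. As simplifying assumptions, each grid cell is treated as its own bucket and response time is measured as the largest number of cells any single disk must read; the function names are hypothetical.

```python
from collections import Counter

def disk_modulo(cell, num_disks):
    """Disk modulo scheme: assign a grid cell to disk (sum of its indices) mod M."""
    return sum(cell) % num_disks

def response_time(query, num_disks):
    """Largest number of cells any one disk must read for a 2-D range query.

    query is ((row_lo, row_hi), (col_lo, col_hi)), bounds inclusive.
    """
    (r_lo, r_hi), (c_lo, c_hi) = query
    load = Counter(
        disk_modulo((r, c), num_disks)
        for r in range(r_lo, r_hi + 1)
        for c in range(c_lo, c_hi + 1)
    )
    return max(load.values())

# A 2x4 query touches 8 cells; with 4 disks the ideal is 8 / 4 = 2 cells per disk.
wide = response_time(((0, 1), (0, 3)), num_disks=4)
# A 2x2 query touches 4 cells; the ideal with 4 disks would be 1 cell per disk.
square = response_time(((0, 1), (0, 1)), num_disks=4)
```

In this sketch the 2x4 query is balanced perfectly, while the 2x2 query places two cells on the same disk, which illustrates why query shape matters when comparing declustering schemes.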
Soil Spatial Scaling: Modelling variability of soil properties across scales using legacy data
Understanding how soil variability changes with spatial scale is critical to our ability to understand and model soil processes at scales relevant to decision makers. This thesis uses legacy data to address the ongoing challenge of understanding soil spatial variability in a number of complementary ways. We use a range of information: precision agriculture studies; compiled point datasets; and remotely observed raster datasets. We use classical geostatistics, but introduce a new framework for comparing variability of spatial properties across scales. My thesis considers soil spatial variability from a number of geostatistical angles. We find the following:
• Field-scale variograms show differing variance across several magnitudes. Further work is required to ensure consistency between survey design, experimental methodology and statistical methodology if these results are to become useful for comparison.
• Declustering is a useful tool to deal with the patchy design of legacy data. It is not a replacement for an evenly distributed dataset, but it does allow the use of legacy data which would otherwise have limited utility.
• A framework which allows ‘roughness’ to be expressed as a continuous variable appears to fit the data better than the mono-fractal or multi-fractal framework generally associated with multi-scale modelling of soil spatial variability.
• Soil appears to have a similar degree of stochasticity to short-range topographic variability, and a higher degree of stochasticity at short ranges (less than 10km and 100km) than vegetation and radiometrics respectively.
• At longer ranges of variability (i.e. around 100km) only rainfall and height above sea level show distinctly different stochasticity.
• Global variograms show strong isotropy, unlike the variograms for the Australian continent.
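For readers unfamiliar with variograms, the sketch below computes an empirical semivariogram with the classical Matheron estimator, gamma(h) = sum((z_i - z_j)^2) / (2 N(h)), where N(h) is the number of point pairs whose separation falls in lag bin h. It is a generic illustration of classical geostatistics, not the framework introduced in the thesis; the function name, binning scheme, and sample data are assumptions.

```python
from math import hypot

def empirical_variogram(points, values, bin_width, num_bins):
    """Matheron estimator per lag bin; bins with no pairs yield None."""
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            # Euclidean separation distance between the two sample locations.
            h = hypot(points[i][0] - points[j][0], points[i][1] - points[j][1])
            b = int(h // bin_width)
            if b < num_bins:
                sums[b] += (values[i] - values[j]) ** 2
                counts[b] += 1
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]

# Hypothetical transect: three samples one unit apart.
points = [(0, 0), (1, 0), (2, 0)]
values = [1.0, 2.0, 4.0]
gamma = empirical_variogram(points, values, bin_width=1.0, num_bins=3)
```

The all-pairs loop is quadratic in the number of samples, so a large legacy dataset would need spatial indexing or subsampling in practice.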
Load Balancing Algorithms for Parallel Spatial Join on HPC Platforms
Geospatial datasets are growing in volume, complexity, and heterogeneity. For efficient execution of geospatial computations and analytics on large-scale datasets, parallel processing is necessary. To exploit fine-grained parallel processing on large-scale compute clusters, partitioning of skewed datasets in a load-balanced way is challenging. The workload in spatial join is data dependent and highly irregular. Moreover, wide variation in the size and density of geometries from one region of the map to another further exacerbates the load imbalance. This dissertation focuses on the spatial join operation used in Geographic Information Systems (GIS) and spatial databases, where the inputs are two layers of geospatial data, and the output is a combination of the two layers according to the join predicate. This dissertation introduces a novel spatial data partitioning algorithm geared towards load balancing the parallel spatial join processing. Unlike existing partitioning techniques, the proposed partitioning algorithm divides the spatial join workload instead of partitioning the individual datasets separately to provide better load balancing. This workload partitioning algorithm has been evaluated on a high-performance computing system using real-world datasets. An intermediate output-sensitive duplication avoidance technique is proposed that decreases the external memory space requirement for storing spatial join candidates across the partitions. GPU acceleration is used to further reduce the spatial partitioning runtime. For dynamic load balancing in spatial join, a novel framework for fine-grained work stealing is presented. This framework is efficient and NUMA-aware. Performance improvements are demonstrated on shared and distributed memory architectures using threads and message passing. Experimental results show effective mitigation of data skew. 
The framework supports a variety of spatial join predicates and spatial overlay using partitioned and un-partitioned datasets
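When geometries are replicated into every grid cell they overlap, the same candidate pair can be found in several partitions. The sketch below shows one common way to avoid such duplicates, the reference-point method: a pair is reported only by the cell that contains the lower-left corner of the intersection of the two bounding boxes. This is a swapped-in, widely used technique and not the output-sensitive method the dissertation proposes; the uniform grid, the function names, and the box format (xmin, ymin, xmax, ymax) are assumptions.

```python
from collections import defaultdict
from itertools import product

def cells_for(box, cell_size):
    """Return every grid cell (ix, iy) overlapped by box = (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    xs = range(int(xmin // cell_size), int(xmax // cell_size) + 1)
    ys = range(int(ymin // cell_size), int(ymax // cell_size) + 1)
    return product(xs, ys)

def intersects(a, b):
    """True when two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def grid_join(layer_r, layer_s, cell_size):
    """Filter step of a spatial join on bounding boxes, without duplicate pairs."""
    grid_r, grid_s = defaultdict(list), defaultdict(list)
    for i, box in enumerate(layer_r):
        for cell in cells_for(box, cell_size):
            grid_r[cell].append(i)
    for j, box in enumerate(layer_s):
        for cell in cells_for(box, cell_size):
            grid_s[cell].append(j)

    pairs = []
    for cell, r_ids in grid_r.items():
        for i, j in product(r_ids, grid_s.get(cell, [])):
            a, b = layer_r[i], layer_s[j]
            if not intersects(a, b):
                continue
            # Reference point: lower-left corner of the intersection rectangle.
            ref = (max(a[0], b[0]), max(a[1], b[1]))
            if (int(ref[0] // cell_size), int(ref[1] // cell_size)) == cell:
                pairs.append((i, j))
    return pairs

# Two overlapping boxes that share four grid cells are reported once.
result = grid_join([(0, 0, 3, 3)], [(1, 1, 5, 5)], cell_size=2)
```

Because each cell decides locally whether it owns a pair, no cross-partition communication or post-hoc deduplication pass is needed in this sketch.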
Utilizing query logs for data replication and placement in big data applications
Ankara: The Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent University, 2012. Thesis (Ph. D.) -- Bilkent University, 2012. Includes bibliographical references. The growth in the amount of data in today's computing problems and the level
of parallelism dictated by large-scale computing economics necessitate high-level
parallelism for many applications. This parallelism is generally achieved
via data-parallel solutions that require effective data clustering (partitioning) or
declustering schemes (depending on the application requirements). In addition
to data partitioning/declustering, data replication, which is used for data availability
and increased performance, has also become an inherent feature of many
applications. The data partitioning/declustering and data replication problems
are generally addressed separately. This thesis is centered around the idea of
performing data replication and data partitioning/declustering simultaneously to
obtain replicated data distributions that yield better parallelism. To this end,
we utilize query-logs to propose replicated data distribution solutions and extend
the well known Fiduccia-Mattheyses (FM) iterative improvement algorithm
so that it can be used to generate replicated partitioning/declustering of data.
For the replicated declustering problem, we propose a novel replicated declustering
scheme that utilizes query logs to improve the performance of a parallel
database system. We also extend our replicated declustering scheme and propose
a novel replicated re-declustering scheme such that in the face of drastic
query pattern changes or server additions/removals from the parallel database
system, new declustering solutions that require low migration overheads can be
computed. For the replicated partitioning problem, we show how to utilize an
effective single-phase replicated partitioning solution in two well-known applications
(keyword-based search and Twitter). For these applications, we provide the
algorithmic solutions we had to devise for solving the problems that replication
brings, the engineering decisions we made so as to obtain the greatest benefits
from the proposed data distribution, and the implementation details for realistic
systems. Obtained results indicate that utilizing query-logs and performing replication and partitioning/declustering in a single phase improves parallel performance. Türk, Ata. Ph.D.
SIN-dependent phosphoinhibition of formin multimerization controls fission yeast cytokinesis
Many eukaryotes accomplish cell division by building and constricting a medial actomyosin-based cytokinetic ring (CR). In Schizosaccharomyces pombe, a Hippo-related signaling pathway termed the septation initiation network (SIN) controls CR formation, maintenance, and constriction. However, how the SIN regulates integral CR components was unknown. Here, we identify the essential cytokinetic formin Cdc12 as a key CR substrate of SIN kinase Sid2. Eliminating Sid2-mediated Cdc12 phosphorylation leads to persistent Cdc12 clustering, which prevents CR assembly in the absence of anillin-like Mid1 and causes CRs to collapse when cytokinesis is delayed. Molecularly, Sid2 phosphorylation of Cdc12 abrogates multimerization of a previously unrecognized Cdc12 domain that confers F-actin bundling activity. Taken together, our findings identify a SIN-triggered oligomeric switch that modulates cytokinetic formin function, revealing a novel mechanism of actin cytoskeleton regulation during cell division. © 2013 Bohnert et al
Simulated Annealing
The book contains 15 chapters presenting recent contributions of top researchers working with Simulated Annealing (SA). Although it represents a small sample of the research activity on SA, the book will certainly serve as a valuable tool for researchers interested in getting involved in this multidisciplinary field. In fact, one of the salient features is that the book is highly multidisciplinary in terms of application areas since it assembles experts from the fields of Biology, Telecommunications, Geology, Electronics and Medicine
Partial Replica Location And Selection For Spatial Datasets
As the size of scientific datasets continues to grow, we will not be able to store enormous datasets on a single grid node, but must distribute them across many grid nodes. The implementation of partial or incomplete replicas, which represent only a subset of a larger dataset, has been an active topic of research. Partial Spatial Replicas extend this functionality to spatial data, allowing us to distribute a spatial dataset in pieces over several locations. We investigate solutions to the partial spatial replica selection problems. First, we describe and develop two designs for a Spatial Replica Location Service (SRLS), which must return the set of replicas that intersect with a query region. Integrating a relational database, a spatial data structure and grid computing software, we build a scalable solution that works well even for several million replicas. In our SRLS, we have improved performance by designing an R-tree structure in the backend database, and by aggregating several queries into one larger query, which reduces overhead. We also use the Morton Space-filling Curve during R-tree construction, which improves spatial locality. In addition, we describe R-tree Prefetching (RTP), which effectively utilizes the modern multi-processor architecture. Second, we present and implement a fast replica selection algorithm in which a set of partial replicas is chosen from a set of candidates so that retrieval performance is maximized. Using an R-tree based heuristic algorithm, we achieve O(n log n) complexity for this NP-complete problem. We describe a model for disk access performance that takes filesystem prefetching into account and is sufficiently accurate for spatial replica selection. Making a few simplifying assumptions, we present a fast replica selection algorithm for partial spatial replicas. 
The algorithm uses a greedy approach that attempts to maximize performance by choosing a collection of replica subsets that allow fast data retrieval by a client machine. Experiments show that the performance of the solution found by our algorithm is, on average, at least 91% and 93.4% of the performance of the optimal solution in 4-node and 8-node tests, respectively.
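The Morton space-filling curve mentioned in the abstract can be illustrated with a short sketch: interleaving the bits of two integer coordinates gives a one-dimensional key, and sorting by that key tends to keep spatially nearby objects adjacent, which is useful when packing R-tree leaves. How the authors apply the curve during R-tree construction is not stated in the abstract; the function name, the 16-bit default, and the sample points are assumptions.

```python
def morton_2d(x, y, bits=16):
    """Interleave the bits of non-negative integers x and y into a Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x occupies the even bit positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y occupies the odd bit positions
    return key

# Hypothetical replica centroids on an integer grid, ordered along the curve.
points = [(3, 3), (0, 0), (2, 1), (1, 0), (0, 1), (1, 1)]
ordered = sorted(points, key=lambda p: morton_2d(*p))
```

Real-valued coordinates would first need to be scaled and quantized to integers within the chosen bit width.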