15 research outputs found
Partial Replica Location And Selection For Spatial Datasets
As the size of scientific datasets continues to grow, we will not be able to store enormous datasets on a single grid node, but must distribute them across many grid nodes. The implementation of partial or incomplete replicas, which represent only a subset of a larger dataset, has been an active topic of research. Partial spatial replicas extend this functionality to spatial data, allowing us to distribute a spatial dataset in pieces over several locations. We investigate solutions to the partial spatial replica selection problem. First, we describe and develop two designs for a Spatial Replica Location Service (SRLS), which must return the set of replicas that intersect a query region. Integrating a relational database, a spatial data structure and grid computing software, we build a scalable solution that works well even for several million replicas. In our SRLS, we improve performance by designing an R-tree structure in the backend database and by aggregating several queries into one larger query, which reduces overhead. We also use the Morton space-filling curve during R-tree construction, which improves spatial locality. In addition, we describe R-tree Prefetching (RTP), which effectively utilizes modern multi-processor architectures. Second, we present and implement a fast replica selection algorithm in which a set of partial replicas is chosen from a set of candidates so that retrieval performance is maximized. Using an R-tree-based heuristic algorithm, we achieve O(n log n) complexity for this NP-complete problem. We describe a model for disk access performance that takes filesystem prefetching into account and is sufficiently accurate for spatial replica selection. Making a few simplifying assumptions, we present a fast replica selection algorithm for partial spatial replicas.
The algorithm uses a greedy approach that attempts to maximize performance by choosing a collection of replica subsets that allow fast data retrieval by a client machine. Experiments show that the performance of the solution found by our algorithm is on average at least 91% and 93.4% of the performance of the optimal solution in 4-node and 8-node tests, respectively.
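The Morton space-filling curve mentioned above can be illustrated with a minimal sketch. The function name and the 2-D, 16-bit-coordinate assumptions are ours, not the abstract's; the idea is simply that interleaving coordinate bits gives nearby points nearby codes, so sorting by Morton code before R-tree construction improves spatial locality:

```python
def morton_2d(x: int, y: int) -> int:
    """Interleave the bits of two 16-bit coordinates into a Morton code.

    Points close together in 2-D space tend to receive nearby Morton
    codes, which is why inserting rectangles into an R-tree in Morton
    order clusters spatially adjacent entries in the same subtrees.
    """
    code = 0
    for bit in range(16):
        code |= ((x >> bit) & 1) << (2 * bit)      # even code bits from x
        code |= ((y >> bit) & 1) << (2 * bit + 1)  # odd code bits from y
    return code

# Hypothetical replica bounding-box centroids, sorted into Morton order
# before bulk-loading an R-tree.
replicas = [(5, 9), (6, 9), (100, 2), (5, 10)]
replicas.sort(key=lambda p: morton_2d(*p))
```

Note how the distant point (100, 2) sorts last, after the three nearby points, even though a plain lexicographic sort would have kept (5, 9) and (5, 10) adjacent but split them from (6, 9).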
Introducing distributed dynamic data-intensive (D3) science: Understanding applications and infrastructure
A common feature across many science and engineering applications is the
amount and diversity of data and computation that must be integrated to yield
insights. Data sets are growing larger and becoming distributed; and their
location, availability and properties are often time-dependent. Collectively,
these characteristics give rise to dynamic distributed data-intensive
applications. While "static" data applications have received significant
attention, the characteristics, requirements, and software systems for the
analysis of large volumes of dynamic, distributed data, and data-intensive
applications have received relatively less attention. This paper surveys
several representative dynamic distributed data-intensive application
scenarios, provides a common conceptual framework to understand them, and
examines the infrastructure used in support of these applications.
Distributed Multidimensional Indexing for Scientific Data Analysis Applications
Scientific data analysis applications require large scale computing power to
effectively service client queries and also require large storage repositories
for datasets that are generated continually from sensors and simulations.
These scientific datasets are growing in size every day, and are becoming truly
enormous. The goal of this dissertation is to provide efficient multidimensional
indexing techniques that aid in navigating distributed scientific datasets.
In this dissertation, we show significant improvements in accessing
distributed large scientific datasets.
The first approach we took to improve access to subsets of large
multidimensional scientific datasets was data chunking. The contents of
scientific data files are typically a collection of multidimensional arrays,
along with the corresponding metadata. Since it is not efficient to index
individual data elements of large scientific datasets, data chunking groups
data elements into small chunks of a fixed, but data-specific, size to take
advantage of spatio-temporal locality.
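The chunking idea can be sketched in a few lines. The function name and the chunk shape below are illustrative assumptions, not the dissertation's actual values:

```python
def chunk_index(coord, chunk_shape):
    """Map an element coordinate to the coordinate of its containing chunk.

    Grouping elements into fixed-size chunks means the index only stores
    one entry (e.g. a bounding box) per chunk rather than one per element,
    while preserving spatio-temporal locality within each chunk.
    """
    return tuple(c // s for c, s in zip(coord, chunk_shape))

# A 3-D element at (130, 45, 7) with 64x64x8 chunks lives in chunk (2, 0, 0).
assert chunk_index((130, 45, 7), (64, 64, 8)) == (2, 0, 0)
```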
The second approach was the design of an efficient multidimensional index for
scientific datasets. This work investigates how existing multidimensional
indexing structures perform on chunked scientific datasets, and compares their
performance with that of our own indexing structure, SH-trees. Since the
introduction of R-trees, many multidimensional indexing structures have been proposed.
However, there are a relatively small number of studies focused on improving
the performance of indexing geographically distributed datasets, especially
across heterogeneous machines. As a third approach, in an attempt to
accelerate indexing performance for distributed datasets, we proposed several
distributed multidimensional indexing schemes: replicated centralized indexing,
hierarchical two level indexing, and decentralized two level indexing.
Our experimental results show that great performance improvements
are gained from distributing the multidimensional index. However, the design
choices for distributed indexing, such as replication, partitioning, and
decentralization, must be carefully considered since they may decrease the overall
performance in certain situations. Therefore, this work provides performance
guidelines to aid in selecting the best distributed multidimensional indexing
scheme for various systems and applications. Finally, we describe how a
distributed multidimensional indexing scheme can be used by a distributed
multiple query optimization middleware as a case-study application to
generate better query plans by leveraging information about the contents of
remote caches.
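The hierarchical two-level scheme can be sketched roughly as follows; the class and method names are ours, and the dissertation's actual structures (e.g. SH-trees) differ in detail. A coarse global index maps a query region to candidate nodes, and only those nodes' local indexes are then searched:

```python
def intersects(a, b):
    """Axis-aligned bounding-box overlap test; a box is ((lo...), (hi...))."""
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a[0], a[1], b[0], b[1]))

class TwoLevelIndex:
    """Global level: one bounding box per node, covering all its chunks.
    Local level: per-node list of (chunk_box, chunk_id) entries."""

    def __init__(self):
        self.global_index = {}   # node -> overall bounding box
        self.local_index = {}    # node -> [(box, chunk_id), ...]

    def insert(self, node, box, chunk_id):
        self.local_index.setdefault(node, []).append((box, chunk_id))
        lo, hi = self.global_index.get(node, (box[0], box[1]))
        lo = tuple(min(a, b) for a, b in zip(lo, box[0]))
        hi = tuple(max(a, b) for a, b in zip(hi, box[1]))
        self.global_index[node] = (lo, hi)

    def query(self, box):
        # Consult the coarse global index first, then descend into the
        # local indexes of only those nodes whose box overlaps the query.
        hits = []
        for node, nbox in self.global_index.items():
            if intersects(nbox, box):
                hits += [cid for cbox, cid in self.local_index[node]
                         if intersects(cbox, box)]
        return hits
```

The design trade-off discussed above shows up directly here: the global level prunes remote lookups, but if it is replicated it must be kept consistent, and if it is too coarse it forwards queries to nodes that hold no matching chunks.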
From online social network analysis to a user-centric private sharing system
Online social networks (OSNs) have become a massive repository of data constructed
from individuals' inputs: posts, photos, feedback, locations, etc. By
analyzing such data, meaningful knowledge is generated that can affect individuals'
beliefs, desires, happiness and choices: a data circulation that starts from
individuals and ends in individuals. The OSN owners, as the sole authority with full
control over the stored data, make the data available for research, advertisement and
other purposes. However, the individuals themselves are left out of this circle, even
though they generate the data and shape the OSN structure.
In this thesis, we started by introducing approximation algorithms for finding
the most influential individuals in a social graph and modeling the spread of
information. To do so, we considered the communities of individuals that form
in a social graph. The social graph is extracted from data stored and
controlled centrally, which can cause privacy breaches and lead to individuals'
concerns. Therefore, we introduced UPSS: a user-centric private sharing system
that treats the individuals as the real data owners and provides secure
and private data sharing on untrusted servers.
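The influence-maximization step above can be illustrated with the standard greedy marginal-gain heuristic. This is not the thesis's actual algorithm; it is a simplified stand-in in which a seed "influences" itself and its direct neighbours, and all names are ours:

```python
def greedy_seeds(graph, k):
    """Pick k seed nodes by greedy marginal coverage gain.

    Each seed covers itself and its direct neighbours; at every step we
    pick the node adding the most newly covered nodes. For submodular
    objectives like this coverage function, the classic greedy strategy
    gives a (1 - 1/e) approximation to the optimum.
    """
    covered, seeds = set(), []
    for _ in range(k):
        best, best_gain = None, -1
        for v, nbrs in graph.items():
            gain = len(({v} | set(nbrs)) - covered)  # marginal gain of v
            if gain > best_gain:
                best, best_gain = v, gain
        seeds.append(best)
        covered |= {best} | set(graph[best])
    return seeds

# Toy graph: "a" dominates one community, "d" another.
graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"], "d": ["e"], "e": ["d"]}
```

Running `greedy_seeds(graph, 2)` picks "a" first (marginal gain 3) and then "d" (marginal gain 2), one influential node per community, which mirrors the community-aware intuition described above.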
The UPSS public API allows application developers to implement applications
as diverse as OSNs, document redaction systems with integrity properties,
censorship-resistant systems, health care auditing systems, distributed version control
systems with flexible access controls and a filesystem in userspace. Accessing
users' data is possible only with explicit user consent. We implemented the
latter two cases to show the applicability of UPSS.
By supporting different storage models, UPSS enables a local, remote
or global filesystem in userspace with a single core filesystem implementation,
mounted over different block stores.
By designing and implementing UPSS, we show that security and privacy
can be addressed at the same time in systems that need selective, secure and collaborative information sharing, without requiring complete trust.
Using space and attribute partitioned partial replicas for data subsetting and aggregation queries
Partial replication is one type of optimization to speed up execution of queries submitted to large datasets. In partial replication, a portion of the dataset is extracted, re-organized, and re-distributed across the storage system. In this paper we investigate methods for efficient execution of queries when replicas of a dataset exist; we assume the replicas have already been created and do not target the replica creation problem. We propose a cost model and algorithm for combined use of space partitioned and attribute partitioned replicas for executing data subsetting range queries. We extend the cost model and propose a greedy algorithm to address range queries with aggregation operations. The extended replica selection algorithm allows uneven partitioning of replicas across storage nodes. Different replicas can be partitioned across different subsets of storage nodes. We have implemented these techniques as part of an automatic data virtualization system and have evaluated the benefits of our techniques using this system. We demonstrate the efficacy of the algorithms on parallel machines using queries on datasets from oil reservoir simulation studies and satellite data processing applications.
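A cost-model-driven replica choice of the kind described above can be sketched as follows. The cost formula and all constants here are deliberately simple illustrations in the spirit of such models, not the paper's actual model:

```python
def retrieval_cost(num_seeks, bytes_read, seek_ms=8.0, mb_per_s=100.0):
    """Estimated cost (ms) of reading a replica subset from disk.

    Toy model: each non-contiguous extent pays one seek, and the payload
    is read at the disk's sequential bandwidth. A space-partitioned
    replica may read contiguously but fetch unneeded attributes; an
    attribute-partitioned replica reads only the needed columns but may
    pay more seeks. The constants are illustrative assumptions.
    """
    return num_seeks * seek_ms + bytes_read / (mb_per_s * 1e6) * 1e3

def pick_replica(candidates):
    """Greedy choice: the candidate replica with the lowest estimated cost."""
    return min(candidates, key=lambda c: retrieval_cost(c["seeks"], c["bytes"]))
```

For example, with these hypothetical candidates a space-partitioned replica wins despite more seeks, because it reads fewer bytes:

```python
candidates = [{"name": "space", "seeks": 10, "bytes": 50e6},
              {"name": "attr",  "seeks": 2,  "bytes": 60e6}]
pick_replica(candidates)  # chooses the "space" replica (580 ms vs 616 ms)
```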
Sensor web geoprocessing on the grid
Recent standardisation initiatives in the fields of grid computing and geospatial sensor middleware provide an exciting opportunity for the composition of large scale geospatial monitoring and prediction systems from existing components. Sensor middleware standards are paving the way for the emerging sensor web, which is envisioned to make millions of geospatial sensors and their data publicly accessible by providing discovery, tasking and query functionality over the internet. In a similar fashion, concurrent development is taking place in the field of grid computing, whereby the virtualisation of computational and data storage resources using middleware abstraction provides a framework for sharing computing resources. The sensor web and grid computing share a common vision of world-wide connectivity, and in their current form both are realised using web services as the underlying technological framework. The integration of sensor web and grid computing middleware using open standards is expected to facilitate interoperability and scalability in near real-time geoprocessing systems. The aim of this thesis is to develop an appropriate conceptual and practical framework in which open standards in grid computing, sensor web and geospatial web services can be combined as a technological basis for the monitoring and prediction of geospatial phenomena in the earth systems domain, to facilitate real-time decision support. The primary topic of interest is how real-time sensor data can be processed on a grid computing architecture. This is addressed by creating a simple typology of real-time geoprocessing operations with respect to grid computing architectures. A geoprocessing system exemplar of each geoprocessing operation in the typology is implemented using contemporary tools and techniques, which provides a basis from which to validate the standards frameworks and highlight issues of scalability and interoperability.
It was found that it is possible to combine standardised web services from each of these aforementioned domains, despite issues of interoperability resulting from differences in web service style and security between specifications. A novel integration method for the continuous processing of a sensor observation stream is suggested, in which a perpetual processing job is submitted as a single continuous compute job. Although this method was found to be successful, two key challenges remain: a mechanism for consistently scheduling real-time jobs within an acceptable time-frame must be devised, and the tradeoff between efficient grid resource utilisation and processing latency must be balanced. The lack of actual implementations of distributed geoprocessing systems built using sensor web and grid computing has hindered the development of standards, tools and frameworks in this area. This work contributes to the small number of existing implementations in this field by identifying potential workflow bottlenecks in such systems and gaps in the existing specifications. Furthermore, it sets out a typology of real-time geoprocessing operations that is anticipated to facilitate the development of real-time geoprocessing software.
EThOS - Electronic Theses Online Service. Engineering and Physical Sciences Research Council (EPSRC): School of Civil Engineering & Geosciences, Newcastle University, United Kingdom.
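The "perpetual processing job" pattern described above can be sketched as a single long-running loop that consumes an observation stream, rather than submitting one grid job per observation. All names are illustrative assumptions; a real deployment would pull observations via sensor web services:

```python
import queue
import threading

def perpetual_job(observations, process, stop):
    """One continuous compute job consuming a sensor observation stream.

    Instead of paying grid scheduling overhead per observation, the job
    runs for the lifetime of the stream, which trades efficient resource
    utilisation against tying up the grid node, as the text notes.
    """
    while not stop.is_set():
        try:
            obs = observations.get(timeout=0.1)  # poll the stream
        except queue.Empty:
            continue                             # no observation yet
        process(obs)
```

A usage sketch: feed a few observations into a queue, let a worker thread drain them, then signal shutdown with the stop event.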
Proceedings of the 4th International Conference on Principles and Practices of Programming in Java
This book contains the proceedings of the 4th International Conference on Principles and Practices of Programming in Java. The conference focuses on the different aspects of the Java programming language and its applications.
Combining SOA and BPM Technologies for Cross-System Process Automation
This paper summarizes the results of an industry case study that introduced a cross-system business process automation solution based on a combination of SOA and BPM standard technologies (i.e., BPMN, BPEL, WSDL). Besides discussing major weaknesses of the existing, custom-built solution and comparing them against experiences with the developed prototype, the paper presents a course of action for transforming the current solution into the proposed one. This includes a general approach, consisting of four distinct steps, as well as specific action items to be performed for every step. The discussion also covers language and tool support, and challenges arising from the transformation.