A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing
Data Grids have been adopted as the platform for scientific communities that
need to share, access, transport, process and manage large data collections
distributed worldwide. They combine high-end computing technologies with
high-performance networking and wide-area storage management techniques. In
this paper, we discuss the key concepts behind Data Grids and compare them with
other data sharing and distribution paradigms such as content delivery
networks, peer-to-peer networks and distributed databases. We then provide
comprehensive taxonomies that cover various aspects of architecture, data
transportation, data replication and resource allocation and scheduling.
Finally, we map the proposed taxonomy to various Data Grid systems not only to
validate the taxonomy but also to identify areas for future exploration.
Through this taxonomy, we aim to categorise existing systems to better
understand their goals and their methodology. This would help evaluate their
applicability for solving similar problems. This taxonomy also provides a "gap
analysis" of this area through which researchers can potentially identify new
issues for investigation. Finally, we hope that the proposed taxonomy and
mapping also help to provide an easy way for new practitioners to understand
this complex area of research. Comment: 46 pages, 16 figures, Technical Report
Exploring heterogeneity of unreliable machines for p2p backup
P2P architecture is a viable option for enterprise backup. In contrast to
dedicated backup servers, the current standard solution, making backups directly
on an organization's workstations should be cheaper (as existing hardware is
used), more efficient (as there is no single bottleneck server) and more
reliable (as the machines are geographically dispersed).
We present the architecture of a p2p backup system that uses pairwise
replication contracts between a data owner and a replicator. In contrast to
standard p2p storage systems that use a DHT directly, the contracts allow our
system to optimize replicas' placement depending on a specific optimization
strategy, and so to take advantage of the heterogeneity of the machines and the
network. Such optimization is particularly appealing in the context of backup:
replicas can be geographically dispersed, the load sent over the network can be
minimized, or the optimization goal can be to minimize the backup/restore time.
However, managing the contracts, keeping them consistent and adjusting them in
response to a dynamically changing environment is challenging.
We built a scientific prototype and ran the experiments on 150 workstations
in the university's computer laboratories and, separately, on 50 PlanetLab
nodes. We found that the main factor affecting the quality of the system is
the availability of the machines. Yet, our main conclusion is that it is
possible to build an efficient and reliable backup system on highly unreliable
machines (our computers had just 13% average availability).
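To give a rough sense of what 13% average availability implies for replication, the sketch below estimates how many replicas are needed before at least one is online with a given probability. It is only a back-of-the-envelope illustration, not the model used in the paper: it assumes that machines are available independently of one another and that a restore needs just one online replica. The function names are hypothetical.

```python
import math


def prob_at_least_one_available(availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` machines is online.

    Assumption (not from the paper): each machine is online independently
    with probability `availability`.
    """
    return 1.0 - (1.0 - availability) ** replicas


def replicas_needed(availability: float, target: float) -> int:
    """Smallest replica count k with 1 - (1 - availability)^k >= target."""
    if not (0.0 < availability < 1.0 and 0.0 < target < 1.0):
        raise ValueError("availability and target must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - availability))


if __name__ == "__main__":
    availability = 0.13  # average availability reported in the abstract
    for target in (0.9, 0.99):
        k = replicas_needed(availability, target)
        p = prob_at_least_one_available(availability, k)
        print(f"target {target}: {k} replicas, P(at least one online) = {p:.4f}")
```

Real availability is usually correlated (for example, laboratory machines switched off overnight), so the independence assumption tends to make these counts optimistic.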
Implications of query caching for JXTA peers
This dissertation studies the caching of queries and how to cache them efficiently, so that retrieving previously accessed data does not need any intermediary nodes between the data-source peer and the querying peer in a super-peer P2P network. A precise algorithm was devised that demonstrated how queries can be deconstructed to provide greater flexibility for reusing their constituent elements. It showed how subsequent queries can make use of more than one previous query, and any part of those queries, to reconstruct direct data communication with one or more source peers that have supplied data previously. In effect, a new query can search and exploit the entire cached list of queries to construct the list of the data locations it requires that might match any locations previously accessed. The new method increases the likelihood of repeat queries being able to reuse earlier queries and provides a viable way of bypassing shared data indexes in structured networks. It could also increase the efficiency of unstructured networks by reducing traffic and the propensity for network flooding. In addition, a method for predicting query routing performance by using a UML sequence diagram is introduced. This new method of performance evaluation provides designers with information about when it is most beneficial to use caching and how the peer connections can optimize its exploitation.
The Future of Networking is the Future of Big Data
2019 Summer. Includes bibliographical references. Scientific domains such as Climate Science, High Energy Particle Physics (HEP), Genomics, Biology, and many others are increasingly moving towards data-oriented workflows where each of these communities generates, stores and uses massive datasets that reach into terabytes and petabytes, and are projected to soon reach exabytes. These communities are also increasingly moving towards a global collaborative model where scientists routinely exchange a significant amount of data. The sheer volume of data and the complexities associated with maintaining, transferring, and using them continue to push the limits of current technologies in multiple dimensions: storage, analysis, networking, and security. This thesis tackles the networking aspect of big-data science. Networking is the glue that binds all the components of modern scientific workflows, and these communities are becoming increasingly dependent on high-speed, highly reliable networks. The network, as the common layer across big-science communities, provides an ideal place for implementing common services. Big-science applications also need to work closely with the network to ensure optimal usage of resources and intelligent routing of requests and data. Finally, as more communities move towards data-intensive, connected workflows, adopting a service model where the network provides some of the common services reduces not only application complexity but also the necessity of duplicate implementations. Named Data Networking (NDN) is a new network architecture whose service model aligns better with the needs of these data-oriented applications. NDN's name-based paradigm makes it easier to provide intelligent features at the network layer rather than at the application layer. This thesis shows that NDN can push several standard features to the network.
This work is the first attempt to apply NDN in the context of large scientific data; in the process, this thesis touches upon scientific data naming, name discovery, real-world deployment of NDN for scientific data, feasibility studies, and the designs of in-network protocols for big-data science.
The ViP2P Platform: XML Views in P2P
The growing volumes of XML data sources on the Web or produced by
enterprises, organizations etc. raise many performance challenges for data
management applications. In this work, we are concerned with the distributed,
peer-to-peer management of large corpora of XML documents, based on distributed
hash table (or DHT, in short) overlay networks. We present ViP2P (standing for
Views in Peer-to-Peer), a distributed platform for sharing XML documents based
on a structured P2P network infrastructure (DHT). At the core of ViP2P stand
distributed materialized XML views, defined by arbitrary XML queries, filled in
with data published anywhere in the network, and exploited to efficiently
answer queries issued by any network peer. ViP2P allows user queries to be
evaluated over XML documents published by peers in two modes. First, a
long-running subscription mode, when a query can be registered in the system
and receive answers incrementally when and if published data matches the query.
Second, queries can also be asked in an ad-hoc, snapshot mode, where results
are required immediately and must be computed based on the results of other
long-running, subscription queries. ViP2P innovates over other similar
DHT-based XML sharing platforms by using a very expressive structured XML query
language. This expressivity leads to a very flexible distribution of XML
content in the ViP2P network, and to efficient snapshot query execution. ViP2P
has been tested in real deployments of hundreds of computers. We present the
platform architecture, its internal algorithms, and demonstrate its efficiency
and scalability through a set of experiments. Our experimental results exceed
those of similar competitor systems by orders of magnitude in terms of data
volumes, network size and data dissemination throughput. Comment: RR-7812 (2011)
Managing scientific data with named data networking
Many scientific domains, such as climate science and High Energy Physics (HEP), have data management requirements that are not well supported by the IP network architecture. Named Data Networking (NDN) is a new network architecture whose service model is better aligned with the needs of data-oriented applications. NDN provides features such as best-location retrieval, caching, load sharing, and transparent failover that would otherwise be painstakingly (re-)implemented by each application using point-to-point semantics in an IP network.
We present the first scientific data management application designed and implemented on top of NDN. We use this application to manage climate and HEP data over a dedicated, high-performance testbed. Our application has two main components: a UI for dataset discovery queries and a federation of synchronized name catalogs. We show how NDN primitives can be used to implement common data management operations such as publishing, search, efficient retrieval, and publication access control.
Optimal Content Prefetching in NDN Vehicle-to-Infrastructure Scenario
Data replication and in-network storage are two basic principles of the Information Centric Networking (ICN) framework, in which caches spread out in the network can be used to store the most popular contents. This work shows how one of the ICN architectures, Named Data Networking (NDN), combined with content pre-fetching can maximize the probability that a user retrieves the desired content in a Vehicle-to-Infrastructure scenario. We give an ILP formulation of the problem of optimally distributing content in the network nodes while accounting for the available storage capacity and the available link capacity. The optimization framework is then leveraged to evaluate the impact on content retrievability of topology- and network-related parameters such as the number and mobility models of moving users, the size of the content catalog and the location of the available caches. Moreover, we show how the proposed model can be modified to find the minimum storage occupancy needed to achieve a given content retrievability level. The results obtained from the optimization model are finally validated against a Named Data Networking architecture through simulations in ndnSIM.