
    A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing

    Data Grids have been adopted as the platform for scientific communities that need to share, access, transport, process and manage large data collections distributed worldwide. They combine high-end computing technologies with high-performance networking and wide-area storage management techniques. In this paper, we discuss the key concepts behind Data Grids and compare them with other data sharing and distribution paradigms such as content delivery networks, peer-to-peer networks and distributed databases. We then provide comprehensive taxonomies that cover various aspects of architecture, data transportation, data replication, and resource allocation and scheduling. Finally, we map the proposed taxonomy to various Data Grid systems, both to validate the taxonomy and to identify areas for future exploration. Through this taxonomy, we aim to categorise existing systems to better understand their goals and their methodology. This would help evaluate their applicability for solving similar problems. The taxonomy also provides a "gap analysis" of this area through which researchers can potentially identify new issues for investigation. Finally, we hope that the proposed taxonomy and mapping also help new practitioners understand this complex area of research. Comment: 46 pages, 16 figures, Technical Report

    MOON: MapReduce On Opportunistic eNvironments

    MapReduce offers a flexible programming model for processing and generating large data sets on dedicated resources, where only a small fraction of such resources is ever unavailable at any given time. In contrast, when MapReduce is run on volunteer computing systems, which opportunistically harness idle desktop computers via frameworks like Condor, it performs poorly because of the volatility of the resources, in particular the high rate of node unavailability. Specifically, the data and task replication scheme adopted by existing MapReduce implementations is inadequate for resources with high unavailability. To address this, we propose MOON, short for MapReduce On Opportunistic eNvironments. MOON extends Hadoop, an open-source implementation of MapReduce, with adaptive task and data scheduling algorithms in order to offer reliable MapReduce services on a hybrid resource architecture, where volunteer computing systems are supplemented by a small set of dedicated nodes. The adaptive task and data scheduling algorithms in MOON distinguish between (1) different types of MapReduce data and (2) different types of node outages in order to strategically place tasks and data on both volatile and dedicated nodes. Our tests demonstrate that MOON can deliver a 3-fold performance improvement over Hadoop in volatile, volunteer computing environments.

    A Prediction-Based Replication Algorithm for Improving Data Availability in Grid Environment

    Data replication is a key optimization technique for reducing access latency and managing large data sets by placing replicas of data judiciously. In this paper, we propose a data replication algorithm, called the Prediction-Based Dynamic Replication (PBDR) algorithm, that improves file access time. Because storage capacity is limited, an effective replica replacement strategy is essential. PBDR chooses files to delete by considering four factors: the predicted number of future requests for the replica, its availability, its size, and the last time it was requested. It also minimizes access latency by selecting the best replica when several sites hold replicas of a dataset. The algorithm is simulated using OptorSim, a data grid simulator developed by the European Data Grid project. The experimental results show that the PBDR strategy performs better than the other algorithms and prevents unnecessary replica creation, which leads to efficient storage usage.
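
A minimal sketch of how a four-factor replacement decision of this kind might look. The abstract names the factors but not how they are combined, so the linear score, the weights, the sign given to each factor, and all names (`Replica`, `retention_score`, `select_victims`) are illustrative assumptions rather than the PBDR formula.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    predicted_requests: float  # forecast of future requests (assumed to be given)
    availability: float        # fraction in [0, 1]
    size_mb: float
    last_access: float         # timestamp in seconds


def retention_score(r, now, w_req=1.0, w_avail=1.0, w_size=1.0, w_age=1.0):
    """Higher score = more worth keeping. Weights and signs are hypothetical."""
    age_hours = max(now - r.last_access, 0.0) / 3600.0
    return (w_req * r.predicted_requests
            + w_avail * r.availability
            - w_size * r.size_mb / 1024.0
            - w_age * age_hours)


def select_victims(replicas, needed_mb, now):
    """Pick the lowest-scoring replicas until at least needed_mb is freed."""
    victims, freed = [], 0.0
    for r in sorted(replicas, key=lambda r: retention_score(r, now)):
        if freed >= needed_mb:
            break
        victims.append(r.name)
        freed += r.size_mb
    return victims


now = 100_000.0
store = [
    Replica("a", predicted_requests=10, availability=0.9, size_mb=100, last_access=now - 60),
    Replica("b", predicted_requests=0, availability=0.5, size_mb=500, last_access=now - 7200),
    Replica("c", predicted_requests=5, availability=0.9, size_mb=200, last_access=now - 600),
]
print(select_victims(store, needed_mb=600, now=now))
```

In this sketch a replica that is rarely requested, large, and long unused is deleted first; a real strategy would calibrate the weights against simulation results.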

    Design and Implementation of a Distributed Middleware for Parallel Execution of Legacy Enterprise Applications

    A typical enterprise uses a local area network of computers to conduct its business. Outside working hours, the computational capacity of these networked computers is underused or unused. To utilize this capacity, an application has to be recoded to exploit the concurrency inherent in its computation, which is not possible for legacy applications whose source code is unavailable. This thesis presents the design and implementation of a distributed middleware that automatically executes a legacy application on multiple networked computers by parallelizing it. The middleware runs multiple copies of the binary executable in parallel on different hosts in the network. It wraps the binary executable of the legacy application in order to capture kernel-level data access system calls and perform them in a distributed fashion over multiple computers in a safe and conflict-free manner. The middleware also incorporates a dynamic scheduling technique that executes the target application in minimum time by scavenging the available CPU cycles of the hosts in the network. The scheduler tolerates changes in host CPU availability over time and reschedules the replicas performing the computation accordingly to minimize execution time. A prototype implementation of this middleware has been developed as a proof of concept of the design. The implementation has been evaluated with a few typical case studies, and the test results confirm that the middleware works as expected.

    Replica maintenance strategy for data grid

    A Data Grid is an infrastructure that manages huge numbers of data files and provides intensive computational resources across geographically distributed collaborations. The performance of such a system can be increased by improving overall resource usage, which includes network and storage resources. Network resource usage is improved through good utilization of network bandwidth, an important factor affecting job execution time. Storage resource usage is improved through good utilization of storage space. Data replication is one of the methods used to improve data access performance in distributed systems by placing multiple copies of data files at the distributed sites. Once the replicas have been distributed to various locations, they need to be monitored. As a result of dynamic changes in the data grid environment, some of the replicas need to be relocated. In this paper we propose a replica maintenance strategy, termed the Unwanted Replica Deletion Strategy (URDS), as part of a replica maintenance service. The main purpose of the proposed strategy is to locate unwanted replicas to be deleted. OptorSim is used to evaluate the performance of the proposed strategy. The simulation results show that URDS requires less execution time, consumes fewer network resources, and makes better use of storage space than existing approaches.

    Implementation of Sub-Grid-Federation Model for Performance Improvement in Federated Data Grid

    In this work, a new model for federated data grid systems, called Sub-Grid-Federation, was designed to improve access latency by accessing data from the nearest possible sites. The data access optimisation strategy is based on searching within an area identified as the ‘Network Core Area’ (NCA). The access latency of Sub-Grid-Federation was analysed mathematically and simulated using the OptorSim simulator. Four case studies were carried out and tested with both the Optimal Downloading Replication Strategy (ODRS) and Sub-Grid-Federation. The results show that Sub-Grid-Federation is 20% better in terms of access latency and 21% better in terms of reducing remote site accesses compared to ODRS. The results indicate that Sub-Grid-Federation is a better alternative for implementing collaboration and data sharing in data grid systems.
    Keywords: Data grid, replication, scheduling, access latency
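
The idea of searching the Network Core Area first can be sketched roughly as follows. The abstract does not define how the NCA is built or what happens when it holds no replica, so the set-based NCA, the fallback to all sites, and the names `pick_replica_site`, `latency_ms`, and `nca_sites` are assumptions made for illustration only.

```python
def pick_replica_site(replica_sites, latency_ms, nca_sites):
    """Return the lowest-latency site holding a replica, preferring NCA sites.

    replica_sites: iterable of site names that hold the requested file
    latency_ms:    dict mapping site name -> measured latency from the requester
    nca_sites:     set of site names assumed to form the Network Core Area
    """
    sites = list(replica_sites)
    in_nca = [s for s in sites if s in nca_sites]
    # Assumed fallback: if no replica lives inside the NCA, search all sites.
    candidates = in_nca or sites
    if not candidates:
        return None
    return min(candidates, key=lambda s: latency_ms[s])


holders = ["s1", "s2", "s3"]
latency = {"s1": 5, "s2": 20, "s3": 80}
print(pick_replica_site(holders, latency, nca_sites={"s1", "s2"}))
```

Restricting the first search to nearby sites keeps the lookup small; the global fallback only runs when the core area has no copy.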