20 research outputs found

    A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing

    Full text link
    Data Grids have been adopted as the platform for scientific communities that need to share, access, transport, process and manage large data collections distributed worldwide. They combine high-end computing technologies with high-performance networking and wide-area storage management techniques. In this paper, we discuss the key concepts behind Data Grids and compare them with other data sharing and distribution paradigms such as content delivery networks, peer-to-peer networks and distributed databases. We then provide comprehensive taxonomies that cover various aspects of architecture, data transportation, data replication and resource allocation and scheduling. Finally, we map the proposed taxonomy to various Data Grid systems not only to validate the taxonomy but also to identify areas for future exploration. Through this taxonomy, we aim to categorise existing systems to better understand their goals and their methodology. This would help evaluate their applicability for solving similar problems. This taxonomy also provides a "gap analysis" of this area through which researchers can potentially identify new issues for investigation. Finally, we hope that the proposed taxonomy and mapping also helps to provide an easy way for new practitioners to understand this complex area of research.Comment: 46 pages, 16 figures, Technical Repor

    Building A High Performance Parallel File System Using Grid Datafarm and ROOT I/O

    Full text link
    Sheer amount of petabyte scale data foreseen in the LHC experiments require a careful consideration of the persistency design and the system design in the world-wide distributed computing. Event parallelism of the HENP data analysis enables us to take maximum advantage of the high performance cluster computing and networking when we keep the parallelism both in the data processing phase, in the data management phase, and in the data transfer phase. A modular architecture of FADS/ Goofy, a versatile detector simulation framework for Geant4, enables an easy choice of plug-in facilities for persistency technologies such as Objectivity/DB and ROOT I/O. The framework is designed to work naturally with the parallel file system of Grid Datafarm (Gfarm). FADS/Goofy is proven to generate 10^6 Geant4-simulated Atlas Mockup events using a 512 CPU PC cluster. The data in ROOT I/O files is replicated using Gfarm file system. The histogram information is collected from the distributed ROOT files. During the data replication it has been demonstrated to achieve more than 2.3 Gbps data transfer rate between the PC clusters over seven participating PC clusters in the United States and in Japan.Comment: Talk from the 2003 Computing in High Energy and Nuclear Physics (CHEP03), La Jolla, Ca, USA, March 2003, 4 pages, PDF. PSN TUDT01

    Replica Creation Algorithm for Data Grids

    Get PDF
    Data grid system is a data management infrastructure that facilitates reliable access and sharing of large amount of data, storage resources, and data transfer services that can be scaled across distributed locations. This thesis presents a new replication algorithm that improves data access performance in data grids by distributing relevant data copies around the grid. The new Data Replica Creation Algorithm (DRCM) improves performance of data grid systems by reducing job execution time and making the best use of data grid resources (network bandwidth and storage space). Current algorithms focus on number of accesses in deciding which file to replicate and where to place them, which ignores resources’ capabilities. DRCM differs by considering both user and resource perspectives; strategically placing replicas at locations that provide the lowest transfer cost. The proposed algorithm uses three strategies: Replica Creation and Deletion Strategy (RCDS), Replica Placement Strategy (RPS), and Replica Replacement Strategy (RRS). DRCM was evaluated using network simulation (OptorSim) based on selected performance metrics (mean job execution time, efficient network usage, average storage usage, and computing element usage), scenarios, and topologies. Results revealed better job execution time with lower resource consumption than existing approaches. This research contributes replication strategies embodied in one algorithm that enhances data grid performance, capable of making a decision on creating or deleting more than one file during same decision. Furthermore, dependency-level-between-files criterion was utilized and integrated with the exponential growth/decay model to give an accurate file evaluation

    A queueing theory approach to Pareto-optimal bags-of-tasks scheduling on clouds

    Get PDF
    Cloud hosting services offer computing resources which can scale along with the needs of users. When access to data is limited by the network capacity this scalability also becomes limited. To investigate the impact of this limitation we focus on bags{of{tasks where task data is stored outside the cloud and has to be transferred across the network before task execution can commence. The existing bags-of-tasks estimation tools are not able to provide accurate estimates in such a case. We introduce a queuing{network inspired model which successfully models the limited network resources. Based on the Mean{Value Analysis of this model we derive an efficient procedure that results with an estimate of the makespan and the executions costs for a given configuration of cloud virtual machines. We compare the calculated Pareto set with measurements performed in a number of experiments for real-world bags-of-tasks and validate the proposed model and the accuracy of the estimated configurations

    An autonomic framework for enhancing the quality of data grid services

    Get PDF
    Data grid services have been used to deal with the increasing needs of applications in terms of data volume and throughput. The large scale, heterogeneity and dynamism of grid environments often make management and tuning of these data services very complex. Furthermore, current high-performance I/O approaches are characterized by their high complexity and specific features that usually require specialized administrator skills. Autonomic computing can help manage this complexity. The present paper describes an autonomic subsystem intended to provide self-management features aimed at efficiently reducing the I/O problem in a grid environment, thereby enhancing the quality of service (QoS) of data access and storage services in the grid. Our proposal takes into account that data produced in an I/O system is not usually immediately required. Therefore, performance improvements are related not only to current but also to any future I/O access, as the actual data access usually occurs later on. Nevertheless, the exact time of the next I/O operations is unknown. Thus, our approach proposes a long-term prediction designed to forecast the future workload of grid components. This enables the autonomic subsystem to determine the optimal data placement to improve both current and future I/O operations

    New approaches to data access in large-scale distributed system

    Get PDF
    Mención Internacional en el título de doctorA great number of scientific projects need supercomputing resources, such as, for example, those carried out in physics, astrophysics, chemistry, pharmacology, etc. Most of them generate, as well, a great amount of data; for example, a some minutes long experiment in a particle accelerator generates several terabytes of data. In the last years, high-performance computing environments have evolved towards large-scale distributed systems such as Grids, Clouds, and Volunteer Computing environments. Managing a great volume of data in these environments means an added huge problem since the data have to travel from one site to another through the internet. In this work a novel generic I/O architecture for large-scale distributed systems used for high-performance and high-throughput computing will be proposed. This solution is based on applying parallel I/O techniques to remote data access. Novel replication and data search schemes will also be proposed; schemes that, combined with the above techniques, will allow to improve the performance of those applications that execute in these environments. In addition, it will be proposed to develop simulation tools that allow to test these and other ideas without needing to use real platforms due to their technical and logistic limitations. An initial prototype of this solution has been evaluated and the results show a noteworthy improvement regarding to data access compared to existing solutions.Un gran número de proyectos científicos necesitan recursos de supercomputación como, por ejemplo, los llevados a cabo en física, astrofísica, química, farmacología, etc. Muchos de ellos generan, además, una gran cantidad de datos; por ejemplo, un experimento de unos minutos de duración en un acelerador de partículas genera varios terabytes de datos. Los entornos de computación de altas prestaciones han evolucionado en los últimos años hacia sistemas distribuidos a gran escala tales como Grids, Clouds y entornos de computación voluntaria. En estos entornos gestionar un gran volumen de datos supone un problema añadido de importantes dimensiones ya que los datos tienen que viajar de un sitio a otro a través de internet. En este trabajo se propondrá una nueva arquitectura de E/S genérica para sistemas distribuidos a gran escala usados para cómputo de altas prestaciones y de alta productividad. Esta solución se basa en la aplicación de técnicas de E/S paralela al acceso remoto a los datos. Así mismo, se estudiarán y propondrán nuevos esquemas de replicación y búsqueda de datos que, en combinación con las técnicas anteriores, permitan mejorar las prestaciones de aquellas aplicaciones que ejecuten en este tipo de entornos. También se propone desarrollar herramientas de simulación que permitan probar estas y otras ideas sin necesidad de recurrir a una plataforma real debido a las limitaciones técnicas y logísticas que ello supone. Se ha evaluado un prototipo inicial de esta solución y los resultados muestran una mejora significativa en el acceso a los datos sobre las soluciones existentes.Programa Oficial de Doctorado en Ciencia y Tecnología InformáticaPresidente: David Expósito Singh.- Secretario: María de los Santos Pérez Hernández.- Vocal: Juan Manuel Tirado Mart

    Dynamic replication strategies in data grid systems: A survey

    Get PDF
    In data grid systems, data replication aims to increase availability, fault tolerance, load balancing and scalability while reducing bandwidth consumption, and job execution time. Several classification schemes for data replication were proposed in the literature, (i) static vs. dynamic, (ii) centralized vs. decentralized, (iii) push vs. pull, and (iv) objective function based. Dynamic data replication is a form of data replication that is performed with respect to the changing conditions of the grid environment. In this paper, we present a survey of recent dynamic data replication strategies. We study and classify these strategies by taking the target data grid architecture as the sole classifier. We discuss the key points of the studied strategies and provide feature comparison of them according to important metrics. Furthermore, the impact of data grid architecture on dynamic replication performance is investigated in a simulation study. Finally, some important issues and open research problems in the area are pointed out

    BitDew: A Programmable Environment for Large-Scale Data Management and Distribution

    Get PDF
    Desktop Grids use the computing, network and storage resources from idle desktop PC's distributed over multiple-LAN's or the Internet to compute a large variety of resource-demanding distributed applications. While these applications need to access, compute, store and circulate large volumes of data, little attention has been paid to data management in such large-scale, dynamic, heterogeneous, volatile and highly distributed Grids. In most cases, data management relies on ad-hoc solutions, and providing general approach is still a challenging issue. To address this problem, we propose the BitDew framework, a programmable environment for automatic and transparent data management on computational Desktop Grids. This paper describes the BitDew programming interface, its architecture, and the performance evaluation of its runtime components. BitDew relies on a specific set of meta-data to drive key data management operations, namely life cycle, distribution, placement, replication and fault-tolerance with a high level of abstraction. The Bitdew runtime environment is a flexible distributed service architecture that integrates modular P2P components such as DHT's for a distributed data catalog and collaborative transport protocols for data distribution. Through several examples, we describe how application programmers and Bitdew users can exploit Bitdew's features. The performance evaluation demonstrates that the high level of abstraction and transparency is obtained with a reasonable overhead, while offering the benefit of scalability, performance and fault tolerance with little programming cost
    corecore