2,507 research outputs found

    A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing

    Full text link
    Data Grids have been adopted as the platform for scientific communities that need to share, access, transport, process and manage large data collections distributed worldwide. They combine high-end computing technologies with high-performance networking and wide-area storage management techniques. In this paper, we discuss the key concepts behind Data Grids and compare them with other data sharing and distribution paradigms such as content delivery networks, peer-to-peer networks and distributed databases. We then provide comprehensive taxonomies that cover various aspects of architecture, data transportation, data replication and resource allocation and scheduling. Finally, we map the proposed taxonomy to various Data Grid systems, not only to validate the taxonomy but also to identify areas for future exploration. Through this taxonomy, we aim to categorise existing systems to better understand their goals and their methodology. This would help evaluate their applicability for solving similar problems. This taxonomy also provides a "gap analysis" of this area through which researchers can potentially identify new issues for investigation. Ultimately, we hope that the proposed taxonomy and mapping also provide an easy way for new practitioners to understand this complex area of research.
    Comment: 46 pages, 16 figures, Technical Report
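
The mapping exercise the abstract describes can be pictured as a simple data structure: taxonomy dimensions with their categories on one side, surveyed systems tagged with the categories they occupy on the other, and the "gap analysis" falling out as the untagged pairs. The sketch below is illustrative only; the dimension names follow the abstract, but the sub-categories and the example system are invented placeholders, not the paper's actual taxonomy.

```python
# Illustrative encoding of a taxonomy-to-systems mapping.
# Dimension names follow the abstract; sub-categories and the
# example system are placeholders, not the paper's content.
TAXONOMY = {
    "architecture": ["hierarchical", "federated", "hybrid"],
    "data_transport": ["protocol", "security", "fault_tolerance"],
    "data_replication": ["topology", "update_propagation", "catalog"],
    "scheduling": ["application_model", "data_locality", "utility_function"],
}

SYSTEMS = {
    "ExampleGrid": {"architecture": "hierarchical",
                    "data_replication": "topology"},
}

def gap_analysis(taxonomy, systems):
    """List (dimension, category) pairs no surveyed system occupies."""
    covered = {(d, c) for tags in systems.values() for d, c in tags.items()}
    return [(d, c) for d, cats in taxonomy.items() for c in cats
            if (d, c) not in covered]

print(gap_analysis(TAXONOMY, SYSTEMS))
```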

    Disk storage management for LHCb based on Data Popularity estimator

    Full text link
    This paper presents an algorithm that provides recommendations for optimizing LHCb data storage. The LHCb data storage system is a hybrid system: all datasets are kept as archives on magnetic tape, while the most popular datasets are also kept on disk. The algorithm takes the dataset usage history and metadata (size, type, configuration, etc.) and generates a recommendation report. This article shows how we use machine learning algorithms to predict future data popularity. Using these predictions, it is possible to estimate which datasets should be removed from disk. We use regression algorithms and time series analysis to find the optimal number of replicas for datasets that are kept on disk. Based on the data popularity and the replica-count optimization, the algorithm minimizes a loss function to find the optimal data distribution. The loss function represents all requirements for data distribution in the data storage system. We demonstrate how our algorithm helps to save disk space and to reduce waiting times for jobs that use these data.
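
A minimal sketch of the pipeline described above, assuming a generic regression model: train on past usage features, predict future popularity, and flag the least popular datasets as candidates for removal from disk. The features, the synthetic data, and the 10th-percentile cutoff are illustrative assumptions, not LHCb's actual choices.

```python
# Sketch: predict future dataset popularity with regression, then
# flag low-popularity datasets for removal from disk (tape archives
# are untouched). Features and thresholds are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 500
# Per-dataset features: recent weekly accesses, size, age in weeks.
X = np.column_stack([
    rng.poisson(5, n),       # accesses, week t-1
    rng.poisson(5, n),       # accesses, week t-2
    rng.uniform(1, 500, n),  # size in GB
    rng.uniform(0, 200, n),  # age in weeks
])
# Toy target: future accesses grow with recent use, decay with age.
y = 0.6 * X[:, 0] + 0.3 * X[:, 1] - 0.01 * X[:, 3] + rng.normal(0, 1, n)

model = GradientBoostingRegressor().fit(X, y)
predicted = model.predict(X)

# Datasets predicted to be unpopular become removal candidates.
candidates = np.where(predicted < np.percentile(predicted, 10))[0]
print(f"{len(candidates)} datasets suggested for removal from disk")
```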

    Crucial File Selection Strategy (CFSS) for Enhanced Download Response Time in Cloud Replication Environments

    Get PDF
    Cloud computing is a massive platform for serving high-volume data across diverse devices and technologies. Cloud tenants demand faster access to their data without disruption, so cloud providers strive to ensure that every tenant's data is secure and always accessible. An appropriate replication strategy capable of selecting essential data is therefore required in cloud replication environments. This paper proposes a Crucial File Selection Strategy (CFSS) to address poor response time in a cloud replication environment. A cloud simulator called CloudSim is used to conduct the necessary experiments, and results are presented as evidence of the improvement in replication performance. The resulting analytical graphs are discussed thoroughly, and the proposed CFSS algorithm outperformed an existing algorithm with a 10.47% improvement in average response time for multiple jobs per round.
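
The abstract does not spell out how CFSS scores files, so the following is only a sketch of the general idea: rank files by an importance score and replicate the top few. The score used here (access count weighted by file size, as a rough proxy for response-time impact) is a placeholder assumption, not the paper's actual criterion.

```python
# Sketch: select "crucial" files for replication by ranking on a
# placeholder score; the real CFSS criteria are not reproduced here.
from dataclasses import dataclass

@dataclass
class FileStats:
    name: str
    accesses: int   # requests observed in the current round
    size_mb: float  # transfer-cost proxy

def crucial_files(stats, top_fraction=0.2):
    """Return files whose replication should cut response time most."""
    ranked = sorted(stats, key=lambda f: f.accesses * f.size_mb,
                    reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return ranked[:k]

catalog = [FileStats("f1", 120, 700), FileStats("f2", 15, 50),
           FileStats("f3", 80, 1200), FileStats("f4", 5, 2000)]
for f in crucial_files(catalog):
    print("replicate:", f.name)
```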

    Replica maintenance strategy for data grid

    Get PDF
    Data Grid is an infrastructure that manages huge amounts of data files and provides intensive computational resources across geographically distributed collaborations. The performance of such a system can be increased by improving overall resource usage, which includes network and storage resources. Better network resource usage is achieved by good utilization of network bandwidth, an important factor affecting job execution time; better storage resource usage is achieved by good utilization of storage space. Data replication is one of the methods used to improve data access performance in distributed systems by placing multiple copies of data files at distributed sites. Having distributed the replicas to various locations, they need to be monitored; as a result of dynamic changes in the data grid environment, some of the replicas need to be relocated. In this paper we propose a maintenance replica placement strategy, termed the Unwanted Replica Deletion Strategy (URDS), as part of a replica maintenance service. The main purpose of the proposed strategy is to identify the unwanted replicas to be deleted. OptorSim is used to evaluate the performance of the proposed strategy. The simulation results show that URDS requires less execution time, consumes less network bandwidth, and makes better use of storage space than existing approaches.
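
A minimal sketch of a maintenance pass in the spirit of URDS, under assumptions the abstract does not confirm: when a site exceeds its storage budget, delete its least recently accessed replicas, never removing a file's last remaining copy grid-wide.

```python
# Sketch: free storage at one site by deleting stale replicas.
# The least-recently-accessed rule and the "keep at least one copy"
# safeguard are assumptions, not URDS's published criteria.
def unwanted_replicas(site_replicas, global_copy_count, capacity, used):
    """Pick replicas to delete from one site until usage fits capacity.

    site_replicas: list of (file_name, size, last_access_time)
    global_copy_count: file_name -> number of copies across the grid
    """
    to_delete = []
    # Oldest-accessed first; never delete the last remaining copy.
    for name, size, _ in sorted(site_replicas, key=lambda r: r[2]):
        if used <= capacity:
            break
        if global_copy_count[name] > 1:
            to_delete.append(name)
            global_copy_count[name] -= 1
            used -= size
    return to_delete

replicas = [("a", 40, 100), ("b", 30, 900), ("c", 50, 200)]
copies = {"a": 3, "b": 1, "c": 2}
print(unwanted_replicas(replicas, copies, capacity=60, used=120))
```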

    Replica Creation Algorithm for Data Grids

    Get PDF
    A data grid system is a data management infrastructure that facilitates reliable access to and sharing of large amounts of data, storage resources, and data transfer services, and that can be scaled across distributed locations. This thesis presents a new replication algorithm that improves data access performance in data grids by distributing relevant data copies around the grid. The new Data Replica Creation Algorithm (DRCM) improves the performance of data grid systems by reducing job execution time and making the best use of data grid resources (network bandwidth and storage space). Current algorithms focus on the number of accesses when deciding which file to replicate and where to place it, which ignores resources' capabilities. DRCM differs by considering both user and resource perspectives, strategically placing replicas at locations that provide the lowest transfer cost. The proposed algorithm uses three strategies: a Replica Creation and Deletion Strategy (RCDS), a Replica Placement Strategy (RPS), and a Replica Replacement Strategy (RRS). DRCM was evaluated using network simulation (OptorSim) based on selected performance metrics (mean job execution time, effective network usage, average storage usage, and computing element usage), scenarios, and topologies. Results revealed better job execution times with lower resource consumption than existing approaches. This research contributes replication strategies embodied in one algorithm that enhances data grid performance and is capable of deciding to create or delete more than one file in a single decision. Furthermore, a dependency-level-between-files criterion was integrated with an exponential growth/decay model to give an accurate file valuation.
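
Two of the ingredients named above lend themselves to a compact sketch: valuing a file with an exponential growth/decay model of its access history, and placing a new replica at the candidate site with the lowest transfer cost. The half-life, the cost model (file size divided by link bandwidth), and all numbers below are illustrative assumptions rather than DRCM's actual parameters.

```python
# Sketch: exponential-decay file valuation plus lowest-transfer-cost
# replica placement. Constants and the cost model are assumptions.
import math

def file_value(access_times, now, half_life=24.0):
    """Exponentially decayed access count: recent hits weigh more."""
    lam = math.log(2) / half_life
    return sum(math.exp(-lam * (now - t)) for t in access_times)

def best_site(candidate_sites, source_site, bandwidth, file_size):
    """Pick the site reachable from the source at the lowest cost."""
    return min(candidate_sites,
               key=lambda s: file_size / bandwidth[(source_site, s)])

value = file_value([10, 40, 70], now=72)
site = best_site(["B", "C"], "A",
                 {("A", "B"): 100.0, ("A", "C"): 250.0}, file_size=500)
print(f"value={value:.2f}, replicate to {site}")
```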