10 research outputs found

    GeoLoc: Robust Resource Allocation Method for Query Optimization in Data Grid Systems

    Resource allocation (RA) is one of the key stages of distributed query processing in the Data Grid environment. Over the last decade a number of works dealing with different aspects of this problem have been published. We believe that those studies paid insufficient attention to two important aspects: the definition of the allocation space and the criterion for determining the degree of parallelism. In this paper we propose an RA method that extends existing solutions on these two points and solves the problem under the specific conditions of the large-scale, heterogeneous environment of Data Grids. Firstly, we propose to use the geographical proximity of nodes to data sources to define the Allocation Space (AS). Secondly, we present the principle of execution-time parity between scan and join (build and probe) operations to determine the degree of parallelism and to generate load-balanced query execution plans. We conducted an experiment that showed the superiority of our GeoLoc method, in terms of response time, over the RA method chosen for comparison. The present study also provides a brief description of existing methods and their qualitative comparison with the proposed method.
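    The parity principle can be illustrated with a minimal sketch: pick the smallest join degree whose estimated build-and-probe time does not exceed the scan time, so neither stage becomes a bottleneck. The cost model, rates and function names below are invented for illustration and are not the paper's algorithm.

```python
# Illustrative sketch of execution-time parity between scan and join stages.
# All cost figures (scan_rate_mb_s, join_rate_mb_s) are assumed values.

def scan_time(relation_size_mb: float, nodes: int, scan_rate_mb_s: float = 50.0) -> float:
    """Estimated time for `nodes` nodes to scan a relation in parallel."""
    return relation_size_mb / (nodes * scan_rate_mb_s)

def join_time(build_mb: float, probe_mb: float, nodes: int, join_rate_mb_s: float = 20.0) -> float:
    """Estimated build + probe time when the join is split over `nodes` nodes."""
    return (build_mb + probe_mb) / (nodes * join_rate_mb_s)

def parallelism_degree(build_mb: float, probe_mb: float,
                       scan_nodes: int, max_join_nodes: int) -> int:
    """Smallest join degree whose join time does not exceed the scan time."""
    target = scan_time(build_mb + probe_mb, scan_nodes)
    for degree in range(1, max_join_nodes + 1):
        if join_time(build_mb, probe_mb, degree) <= target:
            return degree
    return max_join_nodes

if __name__ == "__main__":
    # With these assumed sizes, 10 join nodes match the scan time of 4 scan nodes.
    print(parallelism_degree(build_mb=400, probe_mb=1600, scan_nodes=4, max_join_nodes=32))
```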

    Data and Task Scheduling in Distributed Computing Environments, Journal of Telecommunications and Information Technology, 2014, nr 4

    Data and task scheduling in distributed computing environments has become a major research and engineering issue. Data Grids (DGs), Data Clouds (DCs) and Data Centers are designed to support the processing and analysis of massive data, which can be generated by distributed users, devices and computing centers. Data scheduling must be considered jointly with the application scheduling process. This generates a wide family of global optimization problems with new scheduling criteria, including data transmission time, data access and processing times, reliability of the data servers, and security in the data processing and data access processes. In this paper, a new version of the Expected Time to Compute Matrix (ETC Matrix) model is defined for independent batch scheduling in the physical network of DG and DC environments. In this model, the completion times of the computing nodes are estimated based on the standard ETC Matrix and data transmission times. The proposed model has been empirically evaluated on a static grid scheduling benchmark using simple genetic-based schedulers. A comparison of the achieved results for two basic scheduling metrics, namely makespan and average flowtime, with the results generated when the data scheduling phase is ignored shows the significant impact of the data processing model on the schedule execution times.
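    The core idea, a completion-time estimate that adds data transmission time to the ETC entry, can be sketched as follows. The data shapes, bandwidth model and names are assumptions; the paper evaluates genetic-based schedulers, whereas the sketch drives the estimate with a plain Min-Min loop only to show how it would be used.

```python
# Minimal sketch (not the authors' code) of a data-aware completion-time estimate:
# ready time + expected compute time (ETC entry) + estimated data transfer time.

def completion_time(task, machine, etc, ready, data_size_mb, bandwidth_mb_s):
    """Estimated finish time of `task` on `machine`, including data transfer."""
    transfer = data_size_mb[task] / bandwidth_mb_s[machine]
    return ready[machine] + etc[task][machine] + transfer

def min_min_schedule(tasks, machines, etc, data_size_mb, bandwidth_mb_s):
    """Classic Min-Min batch scheduler using the data-aware completion times."""
    ready = {m: 0.0 for m in machines}
    schedule = {}
    unscheduled = set(tasks)
    while unscheduled:
        # Commit the (task, machine) pair with the globally smallest completion time.
        task, machine, finish = min(
            ((t, m, completion_time(t, m, etc, ready, data_size_mb, bandwidth_mb_s))
             for t in unscheduled for m in machines),
            key=lambda x: x[2],
        )
        schedule[task] = machine
        ready[machine] = finish
        unscheduled.remove(task)
    return schedule

if __name__ == "__main__":
    etc = {"t1": {"m1": 4.0, "m2": 6.0}, "t2": {"m1": 3.0, "m2": 2.0}}
    sizes = {"t1": 500.0, "t2": 100.0}
    bw = {"m1": 100.0, "m2": 25.0}
    print(min_min_schedule(["t1", "t2"], ["m1", "m2"], etc, sizes, bw))
```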

    An SCP-based Heuristic Approach for Scheduling Distributed Data-Intensive Applications on Global Grids

    Data-intensive Grid applications need access to large datasets that may each be replicated on different resources. Minimizing the overhead of transferring these datasets to the resources where the applications are executed requires that appropriate computational and data resources be selected. In this paper, we consider the problem of scheduling an application composed of a set of independent tasks, each of which requires multiple datasets that are each replicated on multiple resources. We break this problem into two parts: first, matching each task (or job) to one compute resource for executing the job and to one storage resource for each dataset the job requires; and second, assigning the set of tasks to the selected resources. We model the first part as an instance of the well-known Set Covering Problem (SCP) and apply a known heuristic for SCP to match jobs to resources. The second part is tackled by extending the existing MinMin and Sufferage algorithms to schedule the set of distributed data-intensive tasks. Through simulation, we experimentally compare the SCP-based matching heuristic to others in conjunction with the task scheduling algorithms and present the results.
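    The matching step can be pictured with the textbook greedy set-cover heuristic: the universe is the set of datasets a job needs, each candidate storage resource covers the datasets it replicates, and we repeatedly pick the resource that covers the most still-uncovered datasets per unit cost. This is a generic sketch under an assumed cost model, not the specific heuristic used in the paper.

```python
# Greedy set-cover sketch of matching a job's required datasets to storage resources.
# Assumes every required dataset has at least one replica somewhere.

def greedy_set_cover(required, replicas, cost):
    """
    required: set of dataset ids the job needs
    replicas: {resource: set of dataset ids stored there}
    cost:     {resource: assumed access/transfer cost estimate}
    Returns the list of resources chosen to cover all required datasets.
    """
    uncovered = set(required)
    chosen = []
    while uncovered:
        # Pick the resource with the best cost per newly covered dataset.
        resource = min(
            (r for r in replicas if replicas[r] & uncovered),
            key=lambda r: cost[r] / len(replicas[r] & uncovered),
        )
        chosen.append(resource)
        uncovered -= replicas[resource]
    return chosen

if __name__ == "__main__":
    print(greedy_set_cover(
        {"d1", "d2", "d3"},
        {"s1": {"d1", "d2"}, "s2": {"d2", "d3"}, "s3": {"d3"}},
        {"s1": 2.0, "s2": 2.5, "s3": 1.0},
    ))
```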

    Data Placement And Task Mapping Optimization For Big Data Workflows In The Cloud

    Data-centric workflows naturally process and analyze huge volumes of datasets. In this new era of Big Data there is a growing need to enable data-centric workflows to perform computations at a scale far exceeding a single workstation's capabilities. Therefore, this type of application can benefit from distributed high performance computing (HPC) infrastructures like cluster, grid or cloud computing. Although data-centric workflows have been applied extensively to structure complex scientific data analysis processes, they fail to address the big data challenges and to leverage the capability of dynamic resource provisioning in the Cloud. The concept of “big data workflows” is proposed by our research group as the next generation of data-centric workflow technologies to address the limitations of existing workflow technologies in meeting big data challenges. Executing big data workflows in the Cloud is a challenging problem, as workflow tasks and data must be partitioned, distributed and assigned to the cloud execution sites (multiple virtual machines). When running such big data workflows in a cloud distributed across several physical locations, the workflow execution time and the cloud resource utilization efficiency depend strongly on the initial placement and distribution of the workflow tasks and datasets across the multiple virtual machines in the Cloud. Several workflow management systems have been developed to facilitate the use of workflows by scientists; however, the data and workflow task placement issue has not yet been sufficiently addressed. In this dissertation, I propose the BDAP strategy (Big Data Placement strategy) for data placement and TPS (Task Placement Strategy) for task placement, which improve workflow performance by minimizing data movement across multiple virtual machines in the Cloud during workflow execution. In addition, I propose CATS (Cultural Algorithm Task Scheduling) for workflow scheduling, which improves workflow performance by minimizing workflow execution cost. In this dissertation, I 1) formalize the data and task placement problems in workflows, 2) propose a data placement algorithm that considers both the initial input datasets and the intermediate datasets obtained during the workflow run, 3) propose a task placement algorithm that considers the placement of workflow tasks before the workflow run, 4) propose a workflow scheduling strategy to minimize the workflow execution cost once the deadline is provided by the user, and 5) perform extensive experiments in a distributed environment to validate that the proposed strategies provide an effective data and task placement solution to distribute and place big datasets and tasks onto the appropriate virtual machines in the Cloud within reasonable time.
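    The general locality idea behind such placement strategies can be sketched with a simple greedy rule: assign each task to the virtual machine that already stores the largest volume of its input data, so less data has to move across VMs at runtime. This is an illustrative sketch only; it is not the BDAP, TPS or CATS algorithms, and the names, sizes and locality metric are assumptions.

```python
# Illustrative greedy task placement driven by data locality (not BDAP/TPS/CATS).

def place_tasks(tasks, dataset_sizes, initial_placement, vms):
    """
    tasks:             {task: set of dataset ids it reads}
    dataset_sizes:     {dataset: size in MB}
    initial_placement: {dataset: vm currently holding it}
    vms:               list of vm ids
    Returns {task: vm}, greedily maximizing locally available input data.
    """
    assignment = {}
    for task, inputs in tasks.items():
        def local_bytes(vm):
            # Volume of this task's inputs already resident on `vm`.
            return sum(dataset_sizes[d] for d in inputs
                       if initial_placement.get(d) == vm)
        # Everything not on the chosen VM would have to be transferred.
        assignment[task] = max(vms, key=local_bytes)
    return assignment

if __name__ == "__main__":
    print(place_tasks(
        {"t1": {"d1", "d2"}, "t2": {"d2", "d3"}},
        {"d1": 100, "d2": 500, "d3": 50},
        {"d1": "vm1", "d2": "vm2", "d3": "vm1"},
        ["vm1", "vm2"],
    ))
```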

    Resource Allocation for Query Optimization in Data Grid Systems

    Data grid systems are used more and more thanks to their storage and computing capacities. One of the important problems of these systems is resource allocation for SQL query optimization. Recently, the scientific community has published numerous approaches and methods of resource allocation, striving to take into account the different peculiarities of data grid systems: heterogeneity, system instability and large scale. A centralized management structure predominates in the proposed methods, in spite of the risks this solution incurs in large-scale systems. In this thesis we adopt a hybrid approach to resource allocation for query optimization: we first perform a static resource allocation at query compile time, and then reallocate the resources dynamically at query runtime. As opposed to the previously proposed methods, we use a decentralized management structure. The static part of our method is the strategy of initial resource allocation by a query 'broker'. As for the dynamic part, we propose a strategy that uses cooperation between autonomous mobile relational operations and stationary node coordinators in order to decentralize the process of dynamic resource reallocation. The key elements of our method are: (i) limiting the search space to address the problems caused by the large scale, (ii) the principle of distributing resources among the operations of a query to determine the degree of parallelism of the operations and to balance the load dynamically, and (iii) decentralizing the dynamic allocation process. The results of the performance evaluation show the efficiency of our proposals. Our initial resource allocation strategy gives results superior to the reference method used for comparison. The dynamic resource reallocation strategy considerably reduces the response time in the presence of system instability and load imbalance.

    Greedy Single User and Fair Multiple Users Replica Selection Decision in Data Grid

    Replication in data grids increases data availability, accessibility and reliability. Replicas of datasets are usually distributed to different sites, and the choice of replica locations has a significant impact. Replica selection algorithms decide the best replica locations based on some criteria. To this end, a family of efficient replica selection systems, RsDGrid, has been proposed. The problem addressed in this thesis is how to select the replica location that achieves shorter access time, higher QoS, consistency with users' preferences and nearly equal user satisfaction. RsDGrid consists of three systems: the A-system, the D-system, and the M-system. Each of them has its own scope and specifications. RsDGrid switches among these systems according to the decision maker.
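    A greedy single-user selection rule of this kind can be sketched as picking, among the sites holding a replica, the one with the smallest estimated delivery time. The cost model (size / bandwidth + latency) and all names below are illustrative assumptions, not the A-/D-/M-system decision logic of RsDGrid.

```python
# Minimal sketch of greedy single-user replica selection under an assumed cost model.

def select_replica(replica_sites, file_size_mb, bandwidth_mb_s, latency_s):
    """Return the site with the lowest estimated time to deliver the file."""
    def estimated_time(site):
        return file_size_mb / bandwidth_mb_s[site] + latency_s[site]
    return min(replica_sites, key=estimated_time)

if __name__ == "__main__":
    print(select_replica(
        ["siteA", "siteB", "siteC"],
        file_size_mb=2048,
        bandwidth_mb_s={"siteA": 40, "siteB": 120, "siteC": 80},
        latency_s={"siteA": 0.5, "siteB": 2.0, "siteC": 1.0},
    ))
```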