10,596 research outputs found

    A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures

    Full text link
    Scientific problems that depend on processing large amounts of data require overcoming challenges in multiple areas: managing large-scale data distribution, co-placing and scheduling data with compute resources, and storing and transferring large volumes of data. We analyze the ecosystems of the two prominent paradigms for data-intensive applications, hereafter referred to as the high-performance computing and the Apache Hadoop paradigms. We propose a basis, a common terminology and functional factors upon which to analyze the two paradigms. We discuss the concept of "Big Data Ogres" and their facets as a means of understanding and characterizing the most common application workloads found across the two paradigms. We then discuss the salient features of the two paradigms and compare and contrast the two approaches. Specifically, we examine common implementations of these paradigms, shed light upon the reasons for their current architectures and discuss some typical workloads that utilize them. In spite of the significant software distinctions, we believe there is architectural similarity. We discuss the potential integration of different implementations across the different levels and components. Our comparison progresses from a fully qualitative examination of the two paradigms to a semi-quantitative methodology. We use a simple and broadly used Ogre (K-means clustering) and characterize its performance on a range of representative platforms, covering several implementations from both paradigms. Our experiments provide insight into the relative strengths of the two paradigms. We propose that the set of Ogres will serve as a benchmark to evaluate the two paradigms along different dimensions. Comment: 8 pages, 2 figures
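
    The paper uses K-means clustering as its representative Ogre. As a point of reference only, a minimal NumPy sketch of the core K-means iteration is given below; it is an illustrative assumption, not any of the paper's benchmarked implementations.

        # Minimal K-means sketch (illustrative only, not the paper's benchmark code).
        # Assumes `data` is an (n_points, n_dims) NumPy array and `k` is the cluster count.
        import numpy as np

        def kmeans(data, k, iters=100, seed=0):
            rng = np.random.default_rng(seed)
            # Initialise centroids by picking k distinct points at random.
            centroids = data[rng.choice(len(data), size=k, replace=False)]
            for _ in range(iters):
                # Assign each point to its nearest centroid (Euclidean distance).
                dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
                labels = dists.argmin(axis=1)
                # Recompute each centroid as the mean of its assigned points.
                new_centroids = np.array([
                    data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                    for j in range(k)
                ])
                if np.allclose(new_centroids, centroids):
                    break
                centroids = new_centroids
            return centroids, labels

        # Example usage on synthetic 2-D data.
        points = np.random.default_rng(1).normal(size=(500, 2))
        centres, labels = kmeans(points, k=3)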

    Algorithms for Hierarchical and Semi-Partitioned Parallel Scheduling

    Get PDF
    We propose a model for scheduling jobs in a parallel machine setting that takes into account the cost of migrations by assuming that the processing time of a job may depend on the specific set of machines among which the job is migrated. For the makespan minimization objective, the model generalizes classical scheduling problems such as unrelated parallel machine scheduling, as well as novel ones such as semi-partitioned and clustered scheduling. In the case of a hierarchical family of machines, we derive a compact integer linear programming formulation of the problem and leverage its fractional relaxation to obtain a polynomial-time 2-approximation algorithm. Extensions that incorporate memory capacity constraints are also discussed.
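
    The abstract refers to a compact integer linear programming formulation whose fractional relaxation yields a 2-approximation. The sketch below is not the paper's hierarchical formulation; it is a simplified, assumed illustration of the classical unrelated-parallel-machine makespan ILP, written with the PuLP library.

        # Illustrative makespan ILP for unrelated parallel machines, written with PuLP
        # (a simplified stand-in; the paper's hierarchical formulation is not reproduced).
        from pulp import LpProblem, LpMinimize, LpVariable, lpSum, value

        # p[i][j]: processing time of job i on machine j (toy data).
        p = [[3, 5], [2, 4], [6, 3]]
        jobs, machines = range(len(p)), range(len(p[0]))

        prob = LpProblem("makespan_min", LpMinimize)
        x = LpVariable.dicts("x", [(i, j) for i in jobs for j in machines], cat="Binary")
        cmax = LpVariable("Cmax", lowBound=0)
        prob += cmax  # objective: minimise the makespan

        for i in jobs:        # every job is assigned to exactly one machine
            prob += lpSum(x[(i, j)] for j in machines) == 1
        for j in machines:    # each machine's total load is bounded by the makespan
            prob += lpSum(p[i][j] * x[(i, j)] for i in jobs) <= cmax

        prob.solve()
        print("optimal makespan:", value(cmax))
        # Relaxing the binary variables to the interval [0, 1] gives the fractional LP
        # that rounding-based 2-approximation arguments of this kind build on.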

    A Three-Level Parallelisation Scheme and Application to the Nelder-Mead Algorithm

    Get PDF
    We consider a three-level parallelisation scheme. The second and third levels define a classical two-level parallelisation scheme, and a load balancing algorithm is used to distribute tasks among processes. It is well known that for many applications the efficiency of parallel algorithms at the second and third levels starts to drop after some critical parallelisation degree is reached. This weakness of the two-level template is addressed by the introduction of one additional parallelisation level. On this level, some new or modified algorithms are considered as alternatives to the basic solver. The idea of the proposed methodology is to increase the parallelisation degree by using algorithms that are less efficient than the basic solver. As an example we investigate two modified Nelder-Mead methods. For the selected application, a few partial differential equations are solved numerically on the second level, and on the third level Wang's parallel algorithm is used to solve systems of linear equations with tridiagonal matrices. A greedy workload balancing heuristic is proposed, oriented to the case of a large number of available processors. The complexity estimates of the computational tasks are model-based, i.e. they use empirical computational data.
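
    The abstract mentions a greedy workload balancing heuristic driven by model-based cost estimates. A generic greedy sketch of this kind of heuristic, an assumption for illustration rather than the paper's algorithm, assigns the costliest remaining task to the currently least-loaded processor:

        # Generic greedy workload-balancing sketch (illustration only; the paper's
        # heuristic and its model-based complexity estimates are not reproduced).
        import heapq

        def greedy_balance(task_costs, n_procs):
            """Assign tasks with estimated costs to processors, largest cost first,
            always placing the next task on the currently least-loaded processor."""
            loads = [(0.0, p) for p in range(n_procs)]   # (current load, processor id)
            heapq.heapify(loads)
            assignment = {}
            for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
                load, proc = heapq.heappop(loads)        # least-loaded processor
                assignment[task] = proc
                heapq.heappush(loads, (load + cost, proc))
            return assignment

        # Example: six tasks with estimated costs, balanced over three processors.
        print(greedy_balance({"t1": 9, "t2": 7, "t3": 5, "t4": 4, "t5": 3, "t6": 1}, 3))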

    Global Grids and Software Toolkits: A Study of Four Grid Middleware Technologies

    Full text link
    The Grid is an infrastructure that involves the integrated and collaborative use of computers, networks, databases and scientific instruments owned and managed by multiple organizations. Grid applications often involve large amounts of data and/or computing resources that require secure resource sharing across organizational boundaries. This makes Grid application management and deployment a complex undertaking. Grid middlewares provide users with seamless computing ability and uniform access to resources in the heterogeneous Grid environment. Several software toolkits and systems, most of which are the results of academic research projects around the world, have been developed. This chapter focuses on four of these middlewares: UNICORE, Globus, Legion and Gridbus. It also presents our implementation of a resource broker for UNICORE, as this functionality was not previously supported. A comparison of these systems on the basis of their architecture, implementation model and several other features is included. Comment: 19 pages, 10 figures

    Managing Uncertainty: A Case for Probabilistic Grid Scheduling

    Get PDF
    Grid technology is evolving into a global, service-oriented architecture, a universal platform for delivering future high-demand computational services. Strong adoption of the Grid and the utility computing concept is leading to an increasing number of Grid installations running a wide range of applications of different size and complexity. In this paper we address the problem of delivering deadline/economy-based scheduling in a heterogeneous application environment using the statistical properties of historical job executions and their associated meta-data. This approach is motivated by a study of six months of computational load generated by Grid applications in a multi-purpose Grid cluster serving a community of twenty e-Science projects. The observed job statistics, resource utilisation and user behaviour are discussed in the context of the management approaches and models most suitable for supporting a probabilistic and autonomous scheduling architecture.
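
    As a rough illustration of scheduling from the statistical properties of historical job executions, the sketch below estimates the probability of meeting a deadline from the empirical distribution of past runtimes. It is an assumed simplification, not the statistical model used in the paper.

        # Illustrative estimate of the probability that a job meets its deadline, based on
        # the empirical distribution of past runtimes (an assumed simplification, not the
        # paper's model).
        def deadline_probability(historical_runtimes, deadline):
            """Fraction of previously observed runtimes that finished within `deadline`."""
            if not historical_runtimes:
                return None  # no history: the scheduler must fall back to a default policy
            within = sum(1 for t in historical_runtimes if t <= deadline)
            return within / len(historical_runtimes)

        # Example: past runtimes (in hours) of one application class on one resource.
        history = [1.2, 1.5, 0.9, 2.8, 1.1, 1.7, 3.4, 1.3]
        print(deadline_probability(history, deadline=2.0))  # -> 0.75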

    Efficient Resource Matching in Heterogeneous Grid Using Resource Vector

    Full text link
    In this paper, a method for efficient scheduling to obtain optimum job throughput in a distributed campus grid environment is presented. Traditional job schedulers determine job scheduling using user and job resource attributes. User attributes are related to current usage, historical usage, user priority and project access. Job resource attributes mainly comprise soft requirements (compilers, libraries) and hard requirements such as memory, storage and interconnect. A job scheduler dispatches a job to a resource if the job's hard and soft requirements are met by that resource. Currently, if a resource becomes unavailable during execution of a job, schedulers are left with limited options, namely re-queuing the job or migrating it to a different resource. Both options are expensive in terms of data and compute time. These situations can be avoided if the often-ignored factor of a resource's availability time in a grid environment is considered. We propose a resource rank approach, in which a job is dispatched to the resource with the highest rank among all resources that match the job's requirements. The results show that our approach can increase the throughput of many serial/monolithic jobs. Comment: 10 pages
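
    The resource rank idea described above can be illustrated with a small sketch: filter resources that satisfy the job's hard and soft requirements, then rank the candidates by their remaining availability time. The field names and ranking rule are assumptions made for illustration, not the paper's implementation.

        # Illustrative sketch of resource matching plus availability-based ranking
        # (field names and ranking rule are assumptions, not the paper's implementation).
        def match_and_rank(job, resources):
            def satisfies(res):
                return (res["memory_gb"] >= job["memory_gb"]
                        and res["storage_gb"] >= job["storage_gb"]
                        and job["software"].issubset(res["software"]))
            candidates = [r for r in resources if satisfies(r)]
            # Rank candidates by remaining availability time; dispatch to the highest-ranked.
            return max(candidates, key=lambda r: r["availability_hours"], default=None)

        job = {"memory_gb": 8, "storage_gb": 50, "software": {"gcc", "mpi"}}
        resources = [
            {"name": "clusterA", "memory_gb": 16, "storage_gb": 100,
             "software": {"gcc", "mpi"}, "availability_hours": 12},
            {"name": "clusterB", "memory_gb": 8, "storage_gb": 60,
             "software": {"gcc", "mpi", "python"}, "availability_hours": 48},
        ]
        print(match_and_rank(job, resources)["name"])  # -> clusterB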