Improving Memory Hierarchy Utilisation for Stencil Computations on Multicore Machines
Although modern supercomputers are composed of multicore machines, scientists
still execute legacy applications that were developed for single-core clusters,
where the memory hierarchy was dedicated to a sole core. The main objective of
this paper is to propose and evaluate an algorithm that identifies an efficient
block size for MPI stencil computations on multicore machines. In the light of
an extensive experimental analysis, this work shows the benefits of identifying
block sizes that divide the data across the various cores and suggests a
methodology that exploits the memory hierarchy available in modern machines.
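The idea of matching a block size to the cache available to each core can be sketched as follows. This is an illustrative reading only: the sizing rule, the parameter values, and the 5-point Jacobi kernel are assumptions for the sketch, not the algorithm proposed in the paper.

```python
import math

def pick_block_size(cache_bytes, elem_bytes=8, arrays=2, fill=0.5):
    """Largest square tile B such that `arrays` B x B tiles of `elem_bytes`
    elements fit in a `fill` fraction of the cache (a common rule of thumb)."""
    return max(1, math.isqrt(int(cache_bytes * fill) // (arrays * elem_bytes)))

def blocked_jacobi(grid, B):
    """One 5-point Jacobi sweep over the interior of a square grid,
    tiled in B x B blocks so each tile stays cache-resident."""
    n = len(grid)
    out = [row[:] for row in grid]
    for ii in range(1, n - 1, B):
        for jj in range(1, n - 1, B):
            for i in range(ii, min(ii + B, n - 1)):
                for j in range(jj, min(jj + B, n - 1)):
                    out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                        + grid[i][j - 1] + grid[i][j + 1])
    return out
```

For a 32 KB cache, 8-byte elements and two arrays, the rule above yields a 32 x 32 tile; the tiled sweep produces the same values as an untiled one, only with a different memory access order.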
Versatile Communication Cost Modelling for Multicomputer Task Scheduling
Institute for Computing Systems Architecture
Programmers face daunting problems when attempting to design portable
programs for multicomputers. This is mainly due to the huge variation
in communication performance on the range of multicomputer platforms
currently in use. These programmers require a computational model that
is sufficiently abstract to allow them to ignore machine-specific
performance features, and yet is sufficiently versatile to allow the
computational structure to be mapped efficiently to a wide range of
multicomputer platforms.
This dissertation focusses on parallel computations that can be
expressed as task graphs: tasks that must be scheduled on the
multicomputer's processors. In the past, scheduling models have only
considered the message delay as the predominant communication
parameter. In the current generation of parallel machines, however,
latency is negligible compared to the CPU penalty of
communication-related activity associated with inter-processor
communication. This CPU penalty cannot be modelled by a latency
parameter because the CPU activity consumes time otherwise available
for useful computation. In view of this, we consider a model in which
the CPU penalty is significant and is associated with communication
events that are incurred when applications execute in parallel.
In this dissertation a new multi-stage scheduling approach that takes
into account these communication parameters is proposed. Initially, in
the first stage, the input task graph is transformed into a new
structure that can be scheduled with a smaller number of communication
events. Task replication is incorporated to produce clusters of tasks.
However, a different view of clusters is adopted. Tasks are clustered
so that messages are bundled and consequently, the number of
communication events is decreased. The communication event tasks are
associated with the relationship between the clusters. More
specifically, this stage comprises a family of scheduling heuristics
that can be customised to classes of parallel machines, according to
their communication performance characteristics, through
parameterisation and by varying the order in which the heuristics are
applied. A second stage is necessary, where the actual schedule on
the target machine is defined. The mechanisms implemented analyse
carefully the clusters and their relationship so that communication
costs are minimised and the degree of parallelism is exploited.
Therefore, the aim of the proposed approach is to tackle the min-max
problem, considering realistic architectural issues.
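The benefit of bundling messages under a model with a per-event CPU penalty can be illustrated with a toy cost function; the overhead and per-byte parameters below are invented for illustration, not taken from the dissertation:

```python
def comm_cost(events, total_bytes, o=5.0, per_byte=0.01):
    """CPU time lost to communication: a fixed overhead `o` charged per
    communication event, plus a per-byte transfer cost."""
    return events * o + total_bytes * per_byte

# Ten separate 1 KB messages vs. the same 10 KB bundled into one event:
separate = comm_cost(10, 10 * 1024)
bundled = comm_cost(1, 10 * 1024)
```

Bundling removes nine of the ten event overheads while the per-byte cost is unchanged, which is exactly why reducing the number of communication events, rather than the message volume, pays off when the CPU penalty dominates.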
TOWARDS OPTIMAL STATIC TASK SCHEDULING FOR REALISTIC MACHINE MODELS: THEORY AND PRACTICE
Task scheduling is a key element in achieving high performance from multicomputer systems. Efficient scheduling algorithms reduce interprocessor communication and improve processor utilization. To do so effectively, such algorithms must be based on a communication cost model appropriate for the computing systems in use. The optimal scheduling of tasks is NP-hard, and a large number of heuristic algorithms have been proposed for a range of differing scheduling conditions (graph types, granularities, and cost or architectural models). Unfortunately, due both to the variety of systems available and the rate at which these systems evolve, an appropriate representative cost model has yet to be established. In this paper we study the problem of task scheduling under …
On the Scope of Applicability of the ETF Algorithm
Superficially, the Earliest Task First (ETF) heuristic [1] is attractive because it models heterogeneous messages passing through a heterogeneous network. On closer inspection, however, this is precisely the set of circumstances that can cause ETF to produce seriously sub-optimal schedules. In this paper we analyze the scope of applicability of ETF. We show that ETF has a good performance if messages are short and the links are fast, and a poor performance otherwise. For the first application we choose the Diamond DAG with unit execution time for each task and a multiprocessor system in the form of a fully connected network. We show that ETF partitions the DAG into lines, each of which is scheduled on the same processor. The analysis reveals that if the communication times between pairs of adjacent tasks in a precedence relation are all less than or equal to unit then the schedule is optimal. If the communication time is equal to the processing time needed to evaluate a row then the …
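A minimal sketch of the ETF idea for a homogeneous, fully connected machine: among the ready tasks, start the one that can begin earliest on any processor, charging the edge communication cost only when a predecessor sits on a different processor. This is a simplified reading of the heuristic, not the algorithm of [1]; task and parameter names are illustrative.

```python
def etf_schedule(succ, exec_time, comm, nprocs):
    """Earliest Task First (simplified): repeatedly start the ready task
    that can begin soonest on any processor.  comm[(u, v)] is charged
    only when u and v are placed on different processors."""
    preds = {t: set() for t in exec_time}
    for u, vs in succ.items():
        for v in vs:
            preds[v].add(u)
    proc_free = [0.0] * nprocs       # when each processor becomes idle
    finish, place = {}, {}           # finish time and processor per task
    unscheduled = set(exec_time)
    while unscheduled:
        ready = [t for t in sorted(unscheduled) if preds[t] <= finish.keys()]
        best = None                  # (earliest start, task, processor)
        for t in ready:
            for p in range(nprocs):
                est = proc_free[p]
                for u in preds[t]:
                    delay = 0.0 if place[u] == p else comm.get((u, t), 0.0)
                    est = max(est, finish[u] + delay)
                if best is None or est < best[0]:
                    best = (est, t, p)
        est, t, p = best
        finish[t] = est + exec_time[t]
        place[t] = p
        proc_free[p] = finish[t]
        unscheduled.remove(t)
    return finish, place
```

On a 4-task diamond DAG with unit execution and unit communication times and two processors, the sketch keeps the critical path on one processor, matching the "lines" behaviour described in the abstract.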
Towards an Effective Task Clustering Heuristic for LogP Machines
This paper describes a task scheduling algorithm, based on a LogP-type model, for allocating arbitrary task graphs to fully connected networks of processors. This problem is known to be NP-complete even under the delay model (a special case of the LogP model). The strategy exploits the replication and clustering of tasks to minimise the ill effects of communication overhead on the makespan. The quality of the schedules produced by this LogP-based algorithm, initially under delay model conditions, is compared with that of other good delay model-based approaches.
Harnessing Low-Cost Virtual Machines on the Spot
Public cloud providers offer computing resources through a plethora of Virtual Machine (VM) instances of different capacities. Each instance is composed of a pre-determined set of virtualized hardware components of different types and/or quantities (number of cores, memory, storage and bandwidth capacities, etc.), in an attempt to satisfy the demands of a diverse range of user applications. Typically, cloud providers offer these instances under several contract models that differ in terms of availability guarantees and prices (On-demand, Spot, Reserved). This chapter provides an overview of how users might utilize and benefit from the variety of instances and different contract models on offer from public cloud providers to reduce their financial outlays. A methodology to dynamically schedule applications with deadline constraints on both hibernation-prone Spot VMs and On-demand instances, in order to lower costs in relation to a pure On-demand solution, is described. Independent of the chosen contract model, identifying the appropriate instance type for applications is also important when attempting to trim expenses. Since it may not be obvious, a short discussion motivates why this decision is not solely related to defining the required resource capacities the chosen instances should have. Finally, given that some cloud providers have recently introduced the concept of Burstable Instances that can boost their performance for a limited period of time, the chapter closes with a summary of approaches that exploit the discounted rates afforded by this new instance class.
Optimizing computational costs of Spark for SARS-CoV-2 sequences comparisons on a commercial cloud
Cloud computing is currently one of the prime choices in the computing infrastructure landscape. In addition to advantages such as the pay-per-use billing model and resource elasticity, there are technical benefits regarding heterogeneity and large-scale configuration. Alongside the classical need for performance, for example, time, space, and energy, there is an interest in the financial cost that might come from budget constraints. Based on scalability considerations and the pricing model of traditional public clouds, a reasonable output of an optimization strategy could be the most suitable configuration of virtual machines to run a specific workload. From the perspective of runtime and monetary cost optimizations, we provide an adaptation of a Hadoop application execution cost model from the literature, aimed at Spark applications modeled with the MapReduce paradigm. We evaluate our optimizer model by executing an improved version of the Diff Sequences Spark application to perform SARS-CoV-2 coronavirus pairwise sequence comparisons using AWS EC2's virtual machine instances. The experimental results with our model outperformed 80% of the random resource selection scenarios. By only employing spot worker nodes exposed to revocation scenarios rather than on-demand workers, we obtained an average monetary cost reduction of 35.66% with a slight runtime increase of 3.36%.
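The trade-off reported above (cheaper spot workers at the price of a slightly longer run) can be expressed as a simple calculation. The prices and runtimes below are invented for illustration and are not the paper's measurements:

```python
def spot_vs_ondemand(od_price, spot_price, od_hours, spot_hours):
    """Relative monetary saving and runtime penalty of a spot-based run
    compared with the equivalent on-demand run (per-hour prices)."""
    saving = 1.0 - (spot_price * spot_hours) / (od_price * od_hours)
    slowdown = spot_hours / od_hours - 1.0
    return saving, slowdown

# e.g. spot at 30% of the on-demand price, with a 4% longer runtime:
saving, slowdown = spot_vs_ondemand(1.00, 0.30, 10.0, 10.4)
```

Even with the slowdown folded into the bill, a large per-hour discount dominates, which is why the paper's spot-only configuration still cuts costs despite revocation-induced delays.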
COMBINATORIAL OPTIMIZATION TECHNIQUES APPLIED TO A PARALLEL PRECONDITIONER BASED ON THE SPIKE ALGORITHM
Abstract. Parallel algorithms capable of efficiently using thousands of multi-core processors are the trend in High Performance Computing. To achieve high scalability, hybrid solvers are suitable candidates, since they can combine the robustness of direct methods and the low computational cost of iterative methods. The parallel hybrid SPIKE algorithm …