2,963 research outputs found
Parallel Sort-Based Matching for Data Distribution Management on Shared-Memory Multiprocessors
In this paper we consider the problem of identifying intersections between
two sets of d-dimensional axis-parallel rectangles. This is a common problem
that arises in many agent-based simulation studies, and is of central
importance in the context of High Level Architecture (HLA), where it is at the
core of the Data Distribution Management (DDM) service. Several realizations of
the DDM service have been proposed; however, many of them are either
inefficient or inherently sequential. These are serious limitations since
multicore processors are now ubiquitous, and DDM algorithms -- being
CPU-intensive -- could benefit from additional computing power. We propose a
parallel version of the Sort-Based Matching algorithm for shared-memory
multiprocessors. Sort-Based Matching is one of the most efficient serial
algorithms for the DDM problem, but is quite difficult to parallelize due to
data dependencies. We describe the algorithm and compute its asymptotic running
time; we complete the analysis by assessing its performance and scalability
through extensive experiments on two commodity multicore systems based on a
dual socket Intel Xeon processor, and a single socket Intel Core i7 processor.Comment: Proceedings of the 21-th ACM/IEEE International Symposium on
Distributed Simulation and Real Time Applications (DS-RT 2017). Best Paper
Award @DS-RT 201
Revisiting Matrix Product on Master-Worker Platforms
This paper is aimed at designing efficient parallel matrix-product algorithms
for heterogeneous master-worker platforms. While matrix-product is
well-understood for homogeneous 2D-arrays of processors (e.g., Cannon algorithm
and ScaLAPACK outer product algorithm), there are three key hypotheses that
render our work original and innovative:
- Centralized data. We assume that all matrix files originate from, and must
be returned to, the master.
- Heterogeneous star-shaped platforms. We target fully heterogeneous
platforms, where computational resources have different computing powers.
- Limited memory. Because we investigate the parallelization of large
problems, we cannot assume that full matrix panels can be stored in the worker
memories and re-used for subsequent updates (as in ScaLAPACK).
We have devised efficient algorithms for resource selection (deciding which
workers to enroll) and communication ordering (both for input and result
messages), and we report a set of numerical experiments on various platforms at
Ecole Normale Superieure de Lyon and the University of Tennessee. However, we
point out that in this first version of the report, experiments are limited to
homogeneous platforms
Asymptotic Analysis of Plausible Tree Hash Modes for SHA-3
Discussions about the choice of a tree hash mode of operation for a
standardization have recently been undertaken. It appears that a single tree
mode cannot address adequately all possible uses and specifications of a
system. In this paper, we review the tree modes which have been proposed, we
discuss their problems and propose remedies. We make the reasonable assumption
that communicating systems have different specifications and that software
applications are of different types (securing stored content or live-streamed
content). Finally, we propose new modes of operation that address the resource
usage problem for the three most representative categories of devices and we
analyse their asymptotic behavior
Heuristics Techniques for Scheduling Problems with Reducing Waiting Time Variance
In real computational world, scheduling is a decision making process. This is nothing but a systematic schedule through which a large numbers of tasks are assigned to the processors. Due to the resource limitation, creation of such schedule is a real challenge. This creates the interest of developing a qualitative scheduler for the processors. These processors are either single or parallel. One of the criteria for improving the efficiency of scheduler is waiting time variance (WTV). Minimizing the WTV of a task is a NP-hard problem. Achieving the quality of service (QoS) in a single or parallel processor by minimizing the WTV is a problem of task scheduling. To enhance the performance of a single or parallel processor, it is required to develop a stable and none overlap scheduler by minimizing WTV. An automated scheduler\u27s performance is always measured by the attributes of QoS. One of the attributes of QoS is âTimelinessâ. First, this chapter presents the importance of heuristics with five heuristic-based solutions. Then applies these heuristics on 1âWTV minimization problem and three heuristics with a unique task distribution mechanism on Qm|prec|WTV minimization problem. The experimental result shows the performance of heuristic in the form of graph for consonant problems
Energy-Efficient Multiprocessor Scheduling for Flow Time and Makespan
We consider energy-efficient scheduling on multiprocessors, where the speed
of each processor can be individually scaled, and a processor consumes power
when running at speed , for . A scheduling algorithm
needs to decide at any time both processor allocations and processor speeds for
a set of parallel jobs with time-varying parallelism. The objective is to
minimize the sum of the total energy consumption and certain performance
metric, which in this paper includes total flow time and makespan. For both
objectives, we present instantaneous parallelism clairvoyant (IP-clairvoyant)
algorithms that are aware of the instantaneous parallelism of the jobs at any
time but not their future characteristics, such as remaining parallelism and
work. For total flow time plus energy, we present an -competitive
algorithm, which significantly improves upon the best known non-clairvoyant
algorithm and is the first constant competitive result on multiprocessor speed
scaling for parallel jobs. In the case of makespan plus energy, which is
considered for the first time in the literature, we present an
-competitive algorithm, where is the total number of
processors. We show that this algorithm is asymptotically optimal by providing
a matching lower bound. In addition, we also study non-clairvoyant scheduling
for total flow time plus energy, and present an algorithm that achieves -competitive for jobs with arbitrary release time and
-competitive for jobs with identical release time. Finally,
we prove an lower bound on the competitive ratio of
any non-clairvoyant algorithm, matching the upper bound of our algorithm for
jobs with identical release time
Empirical Evaluation of the Parallel Distribution Sweeping Framework on Multicore Architectures
In this paper, we perform an empirical evaluation of the Parallel External
Memory (PEM) model in the context of geometric problems. In particular, we
implement the parallel distribution sweeping framework of Ajwani, Sitchinava
and Zeh to solve batched 1-dimensional stabbing max problem. While modern
processors consist of sophisticated memory systems (multiple levels of caches,
set associativity, TLB, prefetching), we empirically show that algorithms
designed in simple models, that focus on minimizing the I/O transfers between
shared memory and single level cache, can lead to efficient software on current
multicore architectures. Our implementation exhibits significantly fewer
accesses to slow DRAM and, therefore, outperforms traditional approaches based
on plane sweep and two-way divide and conquer.Comment: Longer version of ESA'13 pape
Automated problem scheduling and reduction of synchronization delay effects
It is anticipated that in order to make effective use of many future high performance architectures, programs will have to exhibit at least a medium grained parallelism. A framework is presented for partitioning very sparse triangular systems of linear equations that is designed to produce favorable preformance results in a wide variety of parallel architectures. Efficient methods for solving these systems are of interest because: (1) they provide a useful model problem for use in exploring heuristics for the aggregation, mapping and scheduling of relatively fine grained computations whose data dependencies are specified by directed acrylic graphs, and (2) because such efficient methods can find direct application in the development of parallel algorithms for scientific computation. Simple expressions are derived that describe how to schedule computational work with varying degrees of granularity. The Encore Multimax was used as a hardware simulator to investigate the performance effects of using the partitioning techniques presented in shared memory architectures with varying relative synchronization costs
- âŠ