Search CORE

645 research outputs found

Recommended from our members

Parallelizing non-vectorizable loops for MIMD machines

Author: Kim Ki-chang
Nicolau Alexandru
Publication venue: eScholarship, University of California
Publication date: 01/01/1990
Field of study

Parallelizing a loop for MIMD machines can be described as a process of partitioning it into a number of relatively independent subloops. Previous approaches to partitioning non-vectorizable loops were mainly based on iteration pipelining which partitioned a loop based on iteration number and exploited parallelism by overlapping the execution of iterations. However, the amount of parallelism exploited this way is limited because the parallelism inside iterations has been ignored. In this paper, we present a new loop partitioning technique which can exploit both forms of parallelism - inside and across iterations. While inspired by the VLIW approach, our method is designed for more general, asynchronous, MIMD machines. In particular, our schedule takes the cost of communication into account, and attempts to balance it with respect to parallelism. We show our method is correct, efficient, and produces better schedules than previous iteration level approaches

eScholarship - University of California

Performance and evaluation of real-time multicomputer control systems

Author: Shin K. G.
Publication venue
Publication date
Field of study

New performance measures, detailed examples, modeling of error detection process, performance evaluation of rollback recovery methods, experiments on FTMP, and optimal size of an NMR cluster are discussed

NASA Technical Reports Server

Quasirandom Load Balancing

Author: Lovász L.
Martin Gairing
S.
Subramanian R.
Thomas Sauerwald
Tobias Friedrich
Publication venue: 'Society for Industrial & Applied Mathematics (SIAM)'
Publication date: 01/01/2012
Field of study

We propose a simple distributed algorithm for balancing indivisible tokens on graphs. The algorithm is completely deterministic, though it tries to imitate (and enhance) a random algorithm by keeping the accumulated rounding errors as small as possible. Our new algorithm surprisingly closely approximates the idealized process (where the tokens are divisible) on important network topologies. On d-dimensional torus graphs with n nodes it deviates from the idealized process only by an additive constant. In contrast to that, the randomized rounding approach of Friedrich and Sauerwald (2009) can deviate up to Omega(polylog(n)) and the deterministic algorithm of Rabani, Sinclair and Wanka (1998) has a deviation of Omega(n^{1/d}). This makes our quasirandom algorithm the first known algorithm for this setting which is optimal both in time and achieved smoothness. We further show that also on the hypercube our algorithm has a smaller deviation from the idealized process than the previous algorithms.Comment: 25 page

arXiv.org e-Print Archive

CiteSeerX

Crossref

MPG.PuRe

Jigsaw: Scalable software-defined caches

Author: Beckmann Nathan Zachary
Sanchez Daniel
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/09/2013
Field of study

Shared last-level caches, widely used in chip-multi-processors (CMPs), face two fundamental limitations. First, the latency and energy of shared caches degrade as the system scales up. Second, when multiple workloads share the CMP, they suffer from interference in shared cache accesses. Unfortunately, prior research addressing one issue either ignores or worsens the other: NUCA techniques reduce access latency but are prone to hotspots and interference, and cache partitioning techniques only provide isolation but do not reduce access latency.United States. Defense Advanced Research Projects Agency (DARPA PERFECT contract HR0011-13-2-0005)Quanta Computer (Firm

DSpace@MIT

Crossref

Stochastic Analysis of Power-Aware Scheduling

Author: Andrew Lachlan L. H.
Tang Ao
Wierman Adam
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2008
Field of study

Energy consumption in a computer system can be reduced by dynamic speed scaling, which adapts the processing speed to the current load. This paper studies the optimal way to adjust speed to balance mean response time and mean energy consumption, when jobs arrive as a Poisson process and processor sharing scheduling is used. Both bounds and asymptotics for the optimal speeds are provided. Interestingly, a simple scheme that halts when the system is idle and uses a static rate while the system is busy provides nearly the same performance as the optimal dynamic speed scaling. However, dynamic speed scaling which allocates a higher speed when more jobs are present significantly improves robustness to bursty traffic and mis-estimation of workload parameters

CiteSeerX

Crossref

Caltech Authors

Swinburne Research Bank

Speedup stacks: identifying scaling Bottlenecks in multi-threaded applications

Author: Du Bois Kristof
Eeckhout Lieven
Eyerman Stijn
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2012
Field of study

Multi-threaded workloads typically show sublinear speedup on multi-core hardware, i.e., the achieved speedup is not proportional to the number of cores and threads. Sublinear scaling may have multiple causes, such as poorly scalable synchronization leading to spinning and/or yielding, and interference in shared resources such as the lastlevel cache (LLC) as well as the main memory subsystem. It is vital for programmers and processor designers to understand scaling bottlenecks in existing and emerging workloads in order to optimize application performance and design future hardware. In this paper, we propose the speedup stack, which quantifies the impact of the various scaling delimiters on multithreaded application speedup in a single stack. We describe a mechanism for computing speedup stacks on a multi-core processor, and we find speedup stacks to be accurate within 5.1% on average for sixteen-threaded applications. We present several use cases: we discuss how speedup stacks can be used to identify scaling bottlenecks, classify benchmarks, optimize performance, and understand LLC performance

Crossref

Ghent University Academic Bibliography

Archivsystem Ask23