Parallelizing non-vectorizable loops for MIMD machines
Parallelizing a loop for MIMD machines can be described as a process of partitioning it into a number of relatively independent subloops. Previous approaches to partitioning non-vectorizable loops were mainly based on iteration pipelining, which partitions a loop by iteration number and exploits parallelism by overlapping the execution of iterations. However, the amount of parallelism exploited this way is limited because the parallelism inside iterations is ignored. In this paper, we present a new loop partitioning technique which can exploit both forms of parallelism: inside and across iterations. While inspired by the VLIW approach, our method is designed for more general, asynchronous MIMD machines. In particular, our schedule takes the cost of communication into account and attempts to balance it against parallelism. We show that our method is correct, efficient, and produces better schedules than previous iteration-level approaches.
Performance and evaluation of real-time multicomputer control systems
New performance measures, detailed examples, modeling of the error detection process, performance evaluation of rollback recovery methods, experiments on FTMP, and the optimal size of an NMR cluster are discussed.
Quasirandom Load Balancing
We propose a simple distributed algorithm for balancing indivisible tokens on
graphs. The algorithm is completely deterministic, though it tries to imitate
(and enhance) a random algorithm by keeping the accumulated rounding errors as
small as possible.
Our new algorithm surprisingly closely approximates the idealized process
(where the tokens are divisible) on important network topologies. On
d-dimensional torus graphs with n nodes it deviates from the idealized process
only by an additive constant. In contrast to that, the randomized rounding
approach of Friedrich and Sauerwald (2009) can deviate up to Omega(polylog(n))
and the deterministic algorithm of Rabani, Sinclair and Wanka (1998) has a
deviation of Omega(n^{1/d}). This makes our quasirandom algorithm the first
known algorithm for this setting which is optimal both in time and achieved
smoothness. We further show that also on the hypercube our algorithm has a
smaller deviation from the idealized process than the previous algorithms.
Comment: 25 pages
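The rounding idea in this abstract can be illustrated with a small simulation. The following is a minimal sketch, not the paper's exact algorithm: it assumes a cycle (a 1-dimensional torus), a diffusion weight `alpha = 1/4` per edge, and hypothetical function names `balance_round` and `simulate`. On each edge it rounds the flow that a divisible-token step would send, up or down, so that the accumulated rounding error on that edge stays as small as possible. The sketch does not guard against a node's load becoming negative.

```python
from fractions import Fraction
import math


def balance_round(load, err, alpha=Fraction(1, 4)):
    """One synchronous round on a cycle; returns the new integer loads.

    load[i] is the token count at node i. err[i] is the accumulated
    rounding error on the edge from node i to node (i + 1) % n; it is
    updated in place. The weight alpha is an assumption of this sketch.
    """
    n = len(load)
    new_load = list(load)
    for i in range(n):
        j = (i + 1) % n
        # Flow a divisible-token diffusion step would send, given current loads.
        desired = alpha * (load[i] - load[j])
        lo, hi = math.floor(desired), math.ceil(desired)
        # Round up or down, whichever keeps the accumulated error smaller.
        flow = min((lo, hi), key=lambda f: abs(err[i] + desired - f))
        err[i] += desired - flow
        new_load[i] -= flow
        new_load[j] += flow
    return new_load


def simulate(load, rounds):
    """Run several rounds starting with zero accumulated error on every edge."""
    err = [Fraction(0)] * len(load)
    for _ in range(rounds):
        load = balance_round(load, err)
    return load, err
```

Exact rational arithmetic (`Fraction`) is used so that the per-edge error bookkeeping is not disturbed by floating-point rounding. With this rule the accumulated error on each edge never exceeds 1/2 in absolute value, and the total number of tokens is conserved.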
Jigsaw: Scalable software-defined caches
Shared last-level caches, widely used in chip-multi-processors (CMPs), face two fundamental limitations. First, the latency and energy of shared caches degrade as the system scales up. Second, when multiple workloads share the CMP, they suffer from interference in shared cache accesses. Unfortunately, prior research addressing one issue either ignores or worsens the other: NUCA techniques reduce access latency but are prone to hotspots and interference, and cache partitioning techniques only provide isolation but do not reduce access latency.
United States. Defense Advanced Research Projects Agency (DARPA PERFECT contract HR0011-13-2-0005); Quanta Computer (Firm)
Stochastic Analysis of Power-Aware Scheduling
Energy consumption in a computer system can be reduced by dynamic speed scaling, which adapts the processing speed to the current load. This paper studies the optimal way to adjust speed to balance mean response time and mean energy consumption, when jobs arrive as a Poisson process and processor sharing scheduling is used. Both bounds and asymptotics for the optimal speeds are provided. Interestingly, a simple scheme that halts when the system is idle and uses a static rate while the system is busy provides nearly the same performance as the optimal dynamic speed scaling. However, dynamic speed scaling which allocates a higher speed when more jobs are present significantly improves robustness to bursty traffic and mis-estimation of workload parameters.
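The trade-off behind the "halt when idle, static rate when busy" scheme can be sketched numerically. This is an illustration under stated assumptions, not the paper's model or results: jobs have unit mean size, power while busy is `s ** alpha` with `alpha > 1`, no power is drawn while idle, and the cost is mean response time plus energy per job divided by a weight `beta`. The function names and the parameters `alpha`, `beta`, and `span` are hypothetical.

```python
def mean_response_time(lam, s):
    """Mean response time of processor sharing at speed s, unit mean job size."""
    if s <= lam:
        raise ValueError("speed must exceed the arrival rate for stability")
    return 1.0 / (s - lam)


def energy_per_job(s, alpha):
    """Busy fraction lam/s times power s**alpha, divided by arrival rate lam."""
    return s ** (alpha - 1)


def cost(lam, s, alpha, beta):
    """Assumed objective: response time plus energy per job weighted by 1/beta."""
    return mean_response_time(lam, s) + energy_per_job(s, alpha) / beta


def best_static_speed(lam, alpha, beta, span=100.0, iters=200):
    """Ternary search for the static speed minimizing the assumed cost."""
    lo, hi = lam + 1e-9, lam + span
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if cost(lam, m1, alpha, beta) < cost(lam, m2, alpha, beta):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2
```

For `alpha = 2` the assumed cost is `1 / (s - lam) + s / beta`, whose minimizer is `s = lam + sqrt(beta)`, which gives a simple check on the search.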
Speedup stacks: identifying scaling bottlenecks in multi-threaded applications
Multi-threaded workloads typically show sublinear speedup on multi-core hardware, i.e., the achieved speedup is not proportional to the number of cores and threads. Sublinear scaling may have multiple causes, such as poorly scalable synchronization leading to spinning and/or yielding, and interference in shared resources such as the last-level cache (LLC) as well as the main memory subsystem. It is vital for programmers and processor designers to understand scaling bottlenecks in existing and emerging workloads in order to optimize application performance and design future hardware. In this paper, we propose the speedup stack, which quantifies the impact of the various scaling delimiters on multi-threaded application speedup in a single stack. We describe a mechanism for computing speedup stacks on a multi-core processor, and we find speedup stacks to be accurate within 5.1% on average for sixteen-threaded applications. We present several use cases: we discuss how speedup stacks can be used to identify scaling bottlenecks, classify benchmarks, optimize performance, and understand LLC performance.
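The idea of a stack whose height equals the number of threads can be sketched with simple cycle accounting. This is a simplified illustration, not the measurement mechanism described in the paper: it assumes that each thread's lost cycles per delimiter are already known, and the function name `speedup_stack` and the delimiter labels are hypothetical.

```python
def speedup_stack(parallel_time, overheads):
    """Build a speedup stack from per-thread overhead cycles.

    parallel_time: execution time of the multi-threaded run, in cycles.
    overheads: one dict per thread mapping a delimiter name to cycles lost.
    Returns a dict whose values sum to the number of threads.
    """
    # Estimated single-threaded work: each thread's time minus its overheads.
    useful = sum(parallel_time - sum(o.values()) for o in overheads)
    stack = {"speedup": useful / parallel_time}
    # Each delimiter's component is the speedup it costs across all threads.
    for o in overheads:
        for name, cycles in o.items():
            stack[name] = stack.get(name, 0.0) + cycles / parallel_time
    return stack
```

For two threads running for 100 cycles, with 15 and 25 cycles lost respectively, the estimated speedup is (85 + 75) / 100 = 1.6 and the delimiter components account for the remaining 0.4.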