1,250 research outputs found
Empirical Evaluation of the Parallel Distribution Sweeping Framework on Multicore Architectures
In this paper, we perform an empirical evaluation of the Parallel External
Memory (PEM) model in the context of geometric problems. In particular, we
implement the parallel distribution sweeping framework of Ajwani, Sitchinava
and Zeh to solve batched 1-dimensional stabbing max problem. While modern
processors consist of sophisticated memory systems (multiple levels of caches,
set associativity, TLB, prefetching), we empirically show that algorithms
designed in simple models, that focus on minimizing the I/O transfers between
shared memory and single level cache, can lead to efficient software on current
multicore architectures. Our implementation exhibits significantly fewer
accesses to slow DRAM and, therefore, outperforms traditional approaches based
on plane sweep and two-way divide and conquer.Comment: Longer version of ESA'13 pape
An Efficient Multiway Mergesort for GPU Architectures
Sorting is a primitive operation that is a building block for countless
algorithms. As such, it is important to design sorting algorithms that approach
peak performance on a range of hardware architectures. Graphics Processing
Units (GPUs) are particularly attractive architectures as they provides massive
parallelism and computing power. However, the intricacies of their compute and
memory hierarchies make designing GPU-efficient algorithms challenging. In this
work we present GPU Multiway Mergesort (MMS), a new GPU-efficient multiway
mergesort algorithm. MMS employs a new partitioning technique that exposes the
parallelism needed by modern GPU architectures. To the best of our knowledge,
MMS is the first sorting algorithm for the GPU that is asymptotically optimal
in terms of global memory accesses and that is completely free of shared memory
bank conflicts.
We realize an initial implementation of MMS, evaluate its performance on
three modern GPU architectures, and compare it to competitive implementations
available in state-of-the-art GPU libraries. Despite these implementations
being highly optimized, MMS compares favorably, achieving performance
improvements for most random inputs. Furthermore, unlike MMS, state-of-the-art
algorithms are susceptible to bank conflicts. We find that for certain inputs
that cause these algorithms to incur large numbers of bank conflicts, MMS can
achieve up to a 37.6% speedup over its fastest competitor. Overall, even though
its current implementation is not fully optimized, due to its efficient use of
the memory hierarchy, MMS outperforms the fastest comparison-based sorting
implementations available to date
Performance Debugging and Tuning using an Instruction-Set Simulator
Instruction-set simulators allow programmers a detailed level of insight into,
and control over, the execution of a program, including parallel programs and
operating systems. In principle, instruction set simulation can model any
target computer and gather any statistic. Furthermore, such simulators are
usually portable, independent of compiler tools, and deterministic-allowing
bugs to be recreated or measurements repeated. Though often viewed as being
too slow for use as a general programming tool, in the last several years
their performance has improved considerably.
We describe SIMICS, an instruction set simulator of SPARC-based
multiprocessors developed at SICS, in its rôle as a general programming tool.
We discuss some of the benefits of using a tool such as SIMICS to support
various tasks in software engineering, including debugging, testing, analysis,
and performance tuning. We present in some detail two test cases, where we've
used SimICS to support analysis and performance tuning of two applications,
Penny and EQNTOTT. This work resulted in improved parallelism in, and
understanding of, Penny, as well as a performance improvement for EQNTOTT of
over a magnitude. We also present some early work on analyzing SPARC/Linux,
demonstrating the ability of tools like SimICS to analyze operating systems
- …