11,256 research outputs found
Boosting Multi-Core Reachability Performance with Shared Hash Tables
This paper focuses on data structures for multi-core reachability, which is a
key component in model checking algorithms and other verification methods. A
cornerstone of an efficient solution is the storage of visited states. In
related work, static partitioning of the state space was combined with
thread-local storage and resulted in reasonable speedups, but left open whether
improvements are possible. In this paper, we present a scaling solution for
shared state storage which is based on a lockless hash table implementation.
The solution is specifically designed for the cache architecture of modern
CPUs. Because model checking algorithms impose loose requirements on the hash
table operations, their design can be streamlined substantially compared to
related work on lockless hash tables. Still, an implementation of the hash
table presented here has dozens of sensitive performance parameters (bucket
size, cache line size, data layout, probing sequence, etc.). We analyzed their
impact and compared the resulting speedups with related tools. Our
implementation outperforms two state-of-the-art multi-core model checkers (SPIN
and DiVinE) by a substantial margin, while placing fewer constraints on the
load balancing and search algorithms.Comment: preliminary repor
Optimizing the MapReduce Framework on Intel Xeon Phi Coprocessor
With the ease-of-programming, flexibility and yet efficiency, MapReduce has
become one of the most popular frameworks for building big-data applications.
MapReduce was originally designed for distributed-computing, and has been
extended to various architectures, e,g, multi-core CPUs, GPUs and FPGAs. In
this work, we focus on optimizing the MapReduce framework on Xeon Phi, which is
the latest product released by Intel based on the Many Integrated Core
Architecture. To the best of our knowledge, this is the first work to optimize
the MapReduce framework on the Xeon Phi.
In our work, we utilize advanced features of the Xeon Phi to achieve high
performance. In order to take advantage of the SIMD vector processing units, we
propose a vectorization friendly technique for the map phase to assist the
auto-vectorization as well as develop SIMD hash computation algorithms.
Furthermore, we utilize MIMD hyper-threading to pipeline the map and reduce to
improve the resource utilization. We also eliminate multiple local arrays but
use low cost atomic operations on the global array for some applications, which
can improve the thread scalability and data locality due to the coherent L2
caches. Finally, for a given application, our framework can either
automatically detect suitable techniques to apply or provide guideline for
users at compilation time. We conduct comprehensive experiments to benchmark
the Xeon Phi and compare our optimized MapReduce framework with a
state-of-the-art multi-core based MapReduce framework (Phoenix++). By
evaluating six real-world applications, the experimental results show that our
optimized framework is 1.2X to 38X faster than Phoenix++ for various
applications on the Xeon Phi
Scalability of broadcast performance in wireless network-on-chip
Networks-on-Chip (NoCs) are currently the paradigm of choice to interconnect the cores of a chip multiprocessor. However, conventional NoCs may not suffice to fulfill the on-chip communication requirements of processors with hundreds or thousands of cores. The main reason is that the performance of such networks drops as the number of cores grows, especially in the presence of multicast and broadcast traffic. This not only limits the scalability of current multiprocessor architectures, but also sets a performance wall that prevents the development of architectures that generate moderate-to-high levels of multicast. In this paper, a Wireless Network-on-Chip (WNoC) where all cores share a single broadband channel is presented. Such design is conceived to provide low latency and ordered delivery for multicast/broadcast traffic, in an attempt to complement a wireline NoC that will transport the rest of communication flows. To assess the feasibility of this approach, the network performance of WNoC is analyzed as a function of the system size and the channel capacity, and then compared to that of wireline NoCs with embedded multicast support. Based on this evaluation, preliminary results on the potential performance of the proposed hybrid scheme are provided, together with guidelines for the design of MAC protocols for WNoC.Peer ReviewedPostprint (published version
On the Impact of Memory Allocation on High-Performance Query Processing
Somewhat surprisingly, the behavior of analytical query engines is crucially
affected by the dynamic memory allocator used. Memory allocators highly
influence performance, scalability, memory efficiency and memory fairness to
other processes. In this work, we provide the first comprehensive experimental
analysis on the impact of memory allocation for high-performance query engines.
We test five state-of-the-art dynamic memory allocators and discuss their
strengths and weaknesses within our DBMS. The right allocator can increase the
performance of TPC-DS (SF 100) by 2.7x on a 4-socket Intel Xeon server
Computing Petaflops over Terabytes of Data: The Case of Genome-Wide Association Studies
In many scientific and engineering applications, one has to solve not one but
a sequence of instances of the same problem. Often times, the problems in the
sequence are linked in a way that allows intermediate results to be reused. A
characteristic example for this class of applications is given by the
Genome-Wide Association Studies (GWAS), a widely spread tool in computational
biology. GWAS entails the solution of up to trillions () of correlated
generalized least-squares problems, posing a daunting challenge: the
performance of petaflops ( floating-point operations) over terabytes
of data.
In this paper, we design an algorithm for performing GWAS on multi-core
architectures. This is accomplished in three steps. First, we show how to
exploit the relation among successive problems, thus reducing the overall
computational complexity. Then, through an analysis of the required data
transfers, we identify how to eliminate any overhead due to input/output
operations. Finally, we study how to decompose computation into tasks to be
distributed among the available cores, to attain high performance and
scalability. With our algorithm, a GWAS that currently requires the use of a
supercomputer may now be performed in matter of hours on a single multi-core
node.
The discussion centers around the methodology to develop the algorithm rather
than the specific application. We believe the paper contributes valuable
guidelines of general applicability for computational scientists on how to
develop and optimize numerical algorithms
BriskStream: Scaling Data Stream Processing on Shared-Memory Multicore Architectures
We introduce BriskStream, an in-memory data stream processing system (DSPSs)
specifically designed for modern shared-memory multicore architectures.
BriskStream's key contribution is an execution plan optimization paradigm,
namely RLAS, which takes relative-location (i.e., NUMA distance) of each pair
of producer-consumer operators into consideration. We propose a branch and
bound based approach with three heuristics to resolve the resulting nontrivial
optimization problem. The experimental evaluations demonstrate that BriskStream
yields much higher throughput and better scalability than existing DSPSs on
multi-core architectures when processing different types of workloads.Comment: To appear in SIGMOD'1
- …