72,037 research outputs found
Using fast and accurate simulation to explore hardware/software trade-offs in the multi-core era
Writing well-performing parallel programs is challenging in the multi-core processor era. In addition to achieving good per-thread performance, which in itself is a balancing act between instruction-level parallelism, pipeline effects and good memory performance, multi-threaded programs complicate matters even further. These programs require synchronization, and are affected by the interactions between threads through sharing of both processor resources and the cache hierarchy.
At the Intel Exascience Lab, we are developing an architectural simulator called Sniper for simulating future exascale-era multi-core processors. Its goal is twofold: Sniper should assist hardware designers to make design decisions, while simultaneously providing software designers with a tool to gain insight into the behavior of their algorithms and allow for optimization. By taking architectural features into account, our simulator can provide more insight into parallel programs than what can be obtained from existing performance analysis tools. This unique combination of hardware simulator and software performance analysis tool makes Sniper a useful tool for a simultaneous exploration of the hardware and software design space for future high-performance multi-core systems
Computing Petaflops over Terabytes of Data: The Case of Genome-Wide Association Studies
In many scientific and engineering applications, one has to solve not one but
a sequence of instances of the same problem. Often times, the problems in the
sequence are linked in a way that allows intermediate results to be reused. A
characteristic example for this class of applications is given by the
Genome-Wide Association Studies (GWAS), a widely spread tool in computational
biology. GWAS entails the solution of up to trillions () of correlated
generalized least-squares problems, posing a daunting challenge: the
performance of petaflops ( floating-point operations) over terabytes
of data.
In this paper, we design an algorithm for performing GWAS on multi-core
architectures. This is accomplished in three steps. First, we show how to
exploit the relation among successive problems, thus reducing the overall
computational complexity. Then, through an analysis of the required data
transfers, we identify how to eliminate any overhead due to input/output
operations. Finally, we study how to decompose computation into tasks to be
distributed among the available cores, to attain high performance and
scalability. With our algorithm, a GWAS that currently requires the use of a
supercomputer may now be performed in matter of hours on a single multi-core
node.
The discussion centers around the methodology to develop the algorithm rather
than the specific application. We believe the paper contributes valuable
guidelines of general applicability for computational scientists on how to
develop and optimize numerical algorithms
Solving Sequences of Generalized Least-Squares Problems on Multi-threaded Architectures
Generalized linear mixed-effects models in the context of genome-wide
association studies (GWAS) represent a formidable computational challenge: the
solution of millions of correlated generalized least-squares problems, and the
processing of terabytes of data. We present high performance in-core and
out-of-core shared-memory algorithms for GWAS: By taking advantage of
domain-specific knowledge, exploiting multi-core parallelism, and handling data
efficiently, our algorithms attain unequalled performance. When compared to
GenABEL, one of the most widely used libraries for GWAS, on a 12-core processor
we obtain 50-fold speedups. As a consequence, our routines enable genome
studies of unprecedented size
Characterizing and Subsetting Big Data Workloads
Big data benchmark suites must include a diversity of data and workloads to
be useful in fairly evaluating big data systems and architectures. However,
using truly comprehensive benchmarks poses great challenges for the
architecture community. First, we need to thoroughly understand the behaviors
of a variety of workloads. Second, our usual simulation-based research methods
become prohibitively expensive for big data. As big data is an emerging field,
more and more software stacks are being proposed to facilitate the development
of big data applications, which aggravates hese challenges. In this paper, we
first use Principle Component Analysis (PCA) to identify the most important
characteristics from 45 metrics to characterize big data workloads from
BigDataBench, a comprehensive big data benchmark suite. Second, we apply a
clustering technique to the principle components obtained from the PCA to
investigate the similarity among big data workloads, and we verify the
importance of including different software stacks for big data benchmarking.
Third, we select seven representative big data workloads by removing redundant
ones and release the BigDataBench simulation version, which is publicly
available from http://prof.ict.ac.cn/BigDataBench/simulatorversion/.Comment: 11 pages, 6 figures, 2014 IEEE International Symposium on Workload
Characterizatio
- …