RPPM: Rapid Performance Prediction of Multithreaded workloads on multicore processors
Analytical performance modeling is a useful complement to detailed cycle-level simulation for quickly exploring the design space in an early design stage. Mechanistic analytical modeling is particularly interesting as it provides deep insight and does not require the expensive offline profiling that empirical modeling does. Previous work in mechanistic analytical modeling, unfortunately, is limited to single-threaded applications running on single-core processors.
This work proposes RPPM, a mechanistic analytical performance model for multi-threaded applications on multicore hardware. RPPM collects microarchitecture-independent characteristics of a multi-threaded workload to predict performance on a previously unseen multicore architecture. The profile needs to be collected only once to predict performance across a range of processor architectures. We evaluate RPPM's accuracy against simulation and report an average performance prediction error of 11.2% (23% max). We demonstrate RPPM's usefulness for conducting design space exploration experiments as well as for analyzing parallel application performance.
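The flavor of such a model can be illustrated with a deliberately simplified sketch (this is not RPPM itself): a microarchitecture-independent per-thread profile is combined with target-machine parameters, and a barrier-synchronized program is assumed to finish with its slowest thread. All metric names and numbers below are hypothetical.

```python
# Toy sketch of profile-based multicore performance prediction.
# The profile (instruction count, cache miss rate) is machine-independent;
# the machine parameters (dispatch width, miss penalty) describe the target.

def predict_thread_cycles(instructions, miss_rate, width, miss_penalty):
    """Base cycles from dispatch width plus stall cycles from cache misses."""
    base = instructions / width
    stalls = instructions * miss_rate * miss_penalty
    return base + stalls

def predict_multithreaded_cycles(profiles, width, miss_penalty):
    """A barrier-synchronized program finishes when its slowest thread does."""
    return max(predict_thread_cycles(p["insns"], p["miss_rate"],
                                     width, miss_penalty)
               for p in profiles)

# Hypothetical two-thread profile, collected once...
profiles = [{"insns": 1_000_000, "miss_rate": 0.01},
            {"insns": 1_200_000, "miss_rate": 0.02}]

# ...and reused to compare candidate target architectures.
estimates = {width: predict_multithreaded_cycles(profiles, width, miss_penalty=100)
             for width in (2, 4)}
```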
A methodology for analyzing commercial processor performance numbers
The wealth of performance numbers provided by benchmarking corporations makes it difficult to detect trends across commercial machines. A proposed methodology, based on statistical data analysis, simplifies the exploration of these machines' large datasets.
Mechanistic analytical modeling of superscalar in-order processor performance
Superscalar in-order processors form an interesting alternative to out-of-order processors because of their energy efficiency and lower design complexity. However, despite the reduced design complexity, it is nontrivial to get performance estimates or insight into the application--microarchitecture interaction without running slow, detailed cycle-level simulations, because performance highly depends on the order of instructions within the application’s dynamic instruction stream, as in-order processors stall on interinstruction dependences and functional unit contention. To limit the number of detailed cycle-level simulations needed during design space exploration, we propose a mechanistic analytical performance model that is built from understanding the internal mechanisms of the processor.
The mechanistic performance model for superscalar in-order processors is shown to be accurate with an average performance prediction error of 3.2% compared to detailed cycle-accurate simulation using gem5. We also validate the model against hardware, using the ARM Cortex-A8 processor, and show that it is accurate within 10% on average. We further demonstrate the usefulness of the model through three case studies: (1) design space exploration, identifying the optimum number of functional units for achieving a given performance target; (2) program--machine interactions, providing insight into microarchitecture bottlenecks; and (3) compiler--architecture interactions, visualizing the impact of compiler optimizations on performance.
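The general shape of a mechanistic model can be sketched as follows. This is a toy illustration, not the paper's model: the event classes, counts, and penalties are invented; the idea is that total cycles decompose into ideal issue cycles plus a penalty term per stall event class.

```python
# Toy mechanistic (interval-style) cycle decomposition for an in-order core:
# total cycles = ideal issue cycles + sum over event classes of
# (event count from the profile) x (penalty from the target machine).

def mechanistic_cycles(n_insns, issue_width, events, penalties):
    ideal = n_insns / issue_width          # cycles with no stalls at all
    stall = sum(events[e] * penalties[e] for e in events)
    return ideal + stall

# Hypothetical profile counts and machine penalties (cycles per event).
events = {"L1_miss": 5_000, "branch_mispredict": 2_000, "dep_stall": 20_000}
penalties = {"L1_miss": 10, "branch_mispredict": 8, "dep_stall": 1}

total = mechanistic_cycles(1_000_000, 2, events, penalties)
cpi = total / 1_000_000   # cycles per instruction for this toy workload
```

A decomposition of this shape is what makes the model mechanistic: each stall term corresponds to an identifiable microarchitectural mechanism, which is also what enables the bottleneck-analysis case studies.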
Identifying, Quantifying, Extracting and Enhancing Implicit Parallelism
The shift of the microprocessor industry towards multicore architectures has placed a huge burden on programmers by requiring explicit parallelization for performance. Implicit Parallelization is an alternative that could ease the burden on programmers by parallelizing applications "under the covers" while maintaining sequential semantics externally. This thesis develops a novel approach for thinking about parallelism by casting the problem of parallelization in terms of instruction criticality. Using this approach, parallelism in a program region is readily identified when certain conditions about fetch-criticality are satisfied by the region. The thesis formalizes this approach by developing a criticality-driven model of task-based parallelization. The model can accurately predict the parallelism that would be exposed by potential task choices by capturing a wide set of sources of parallelism as well as costs to parallelization.
The criticality-driven model enables the development of two key components for Implicit Parallelization: a task selection policy and a bottleneck analysis tool. The task selection policy can partition a single-threaded program into tasks that will profitably execute concurrently on a multicore architecture in spite of the costs associated with enforcing data-dependences and with task-related actions. The bottleneck analysis tool gives feedback to programmers about data-dependences that limit parallelism. In particular, there are several "accidental dependences" that can be easily removed with large improvements in parallelism. These tools combine into a systematic methodology for performance tuning in Implicit Parallelization. Finally, armed with the criticality-driven model, the thesis revisits several architectural design decisions and finds several encouraging ways forward to increase the scope of Implicit Parallelization.
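The critical-path view of parallelism that underlies this kind of analysis can be illustrated with a toy dependence DAG: available parallelism is total work divided by critical-path length, and any dependence on the critical path (e.g. an "accidental" one) directly limits it. The graph and costs below are made up for illustration.

```python
# Toy critical-path analysis of a dependence DAG.
# Each node maps to (cost, list of nodes it depends on).
from functools import lru_cache

dag = {"a": (3, []), "b": (2, ["a"]), "c": (4, ["a"]), "d": (1, ["b", "c"])}

def critical_path(dag):
    """Length of the longest cost-weighted dependence chain."""
    @lru_cache(maxsize=None)
    def finish(node):
        cost, deps = dag[node]
        return cost + max((finish(d) for d in deps), default=0)
    return max(finish(n) for n in dag)

total_work = sum(cost for cost, _ in dag.values())   # work if run sequentially
parallelism = total_work / critical_path(dag)        # upper bound on speedup
```

Removing a limiting dependence (say, "d" no longer depending on "c") shortens the critical path and raises the parallelism bound, which is exactly the kind of feedback a bottleneck analysis tool would surface.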
ShenZhen transportation system (SZTS): a novel big data benchmark suite
Data analytics is at the core of the supply chain for both products and services in modern economies and societies. Big data workloads, however, are placing unprecedented demands on computing technologies, calling for a deep understanding and characterization of these emerging workloads. In this paper, we propose ShenZhen Transportation System (SZTS), a novel big data Hadoop benchmark suite comprising real-life transportation analysis applications with real-life input data sets from Shenzhen in China. SZTS uniquely focuses on a specific and real-life application domain, whereas other existing Hadoop benchmark suites, such as HiBench and CloudRank-D, consist of generic algorithms with synthetic inputs. We perform a cross-layer workload characterization at the microarchitecture level, the operating system (OS) level, and the job level, revealing unique characteristics of SZTS compared to existing Hadoop benchmarks as well as general-purpose multi-core PARSEC benchmarks. We also study the sensitivity of workload behavior with respect to input data size, and we propose a methodology for identifying representative input data sets.
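One simple way a representative-input methodology might look is sketched below; this is a hedged illustration, not necessarily SZTS's actual methodology, and the metrics and values are invented: characterize each input by a vector of normalized metrics and pick the input closest to the centroid of all characterizations.

```python
# Toy representative-input selection: each candidate input data set is
# described by a vector of normalized characterization metrics
# (e.g., IPC, cache miss rate, I/O rate -- all values invented here).
import math

inputs = {
    "small":  [0.9, 0.30, 0.10],
    "medium": [1.1, 0.20, 0.15],
    "large":  [1.2, 0.18, 0.20],
}

def centroid(vecs):
    n, dim = len(vecs), len(vecs[0])
    return [sum(v[i] for v in vecs) / n for i in range(dim)]

def representative(inputs):
    """The input whose characterization is closest to the overall centroid."""
    c = centroid(list(inputs.values()))
    return min(inputs, key=lambda k: math.dist(inputs[k], c))
```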
Porting and tuning of the Mont-Blanc benchmarks to the multicore ARM 64bit architecture
This project is about porting and tuning the Mont-Blanc benchmarks to the multicore ARM 64-bit architecture. The Mont-Blanc benchmarks are part of the Mont-Blanc European project and were developed internally at BSC (Barcelona Supercomputing Center).
The project will explore the possibilities that an ARM architecture can offer in an HPC (High Performance Computing) setup; this includes learning how to tune and adapt a parallelized computer program and analyze its execution behavior.
As part of the project, we will analyze the performance of each benchmark using instrumentation tools such as Extrae and Paraver. Each benchmark will be adapted, tuned, and executed mainly on the three new Mont-Blanc mini-clusters, Thunder (custom ARMv8), Merlin (custom ARMv8), and Jetson TX (ARMv8 Cortex-A57), using the OmpSs programming model. The evolution of the performance obtained will be shown, followed by a brief analysis of the results after each optimization.
Processor design space exploration and performance prediction
The use of simulation is well established in processor design research to evaluate architectural design trade-offs. In particular, cycle-accurate simulation is widely used to evaluate new designs because of its accurate and detailed performance measurement. In practice, however, only configurations in a subspace of the design space can be simulated due to long simulation times and limited resources, which can lead to suboptimal conclusions that do not generalize to the larger design space. In this thesis, we propose a performance prediction approach that employs state-of-the-art techniques from experimental design, machine learning, and data mining. Our model is trained initially on cycle-accurate simulation results and can then predict processor performance across the entire design space. According to our experiments, our model predicts the performance of a single-core processor with a median percentage error ranging from 0.32% to 3.01% for a design space of about 15 million points, using only 5,000 independently sampled design points as a training set. For chip multiprocessors (CMPs), the median percentage error ranges from 0.50% to 1.47% for a design space of about 9.7 million points, again using only 5,000 independently sampled design points as a training set. In addition, the model provides quantitative interpretation tools such as variable importance and partial dependence of the design parameters.
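The overall workflow can be sketched as follows. This is not the thesis's actual model: a plain linear regression stands in for the machine-learning technique, the design space is tiny, and the "simulator" is a stub with an invented formula; only the shape of the approach (simulate a sample, fit, predict the rest) is the point.

```python
# Sketch: train a predictor on a handful of simulated design points, then
# predict performance for the whole design space without further simulation.
import itertools
import numpy as np

def simulate(width, cache_kb, freq_ghz):
    """Stub standing in for slow cycle-accurate simulation (formula invented)."""
    return width * 0.4 + np.log2(cache_kb) * 0.2 + freq_ghz * 0.5

# A small stand-in design space: issue width x cache size (KB) x frequency (GHz).
design_space = list(itertools.product([1, 2, 4, 8],
                                      [16, 32, 64, 128],
                                      [1.0, 2.0, 3.0]))

# "Simulate" only a deterministic sample of points and fit a linear model.
train = design_space[::5]
X = np.array([[w, np.log2(c), f, 1.0] for w, c, f in train])
y = np.array([simulate(*p) for p in train])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict every point in the full space without further simulation.
X_all = np.array([[w, np.log2(c), f, 1.0] for w, c, f in design_space])
pred = X_all @ coef
```

Because the stub is linear in the chosen features, the fit here is exact; a real design space needs a more flexible learner (the thesis uses techniques from machine learning and data mining) and careful sampling of the training points.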
- …