Search CORE

696 research outputs found

Beyond Reuse Distance Analysis: Dynamic Analysis for Characterization of Data Locality Potential

Author: Elango Venmugil
Fauzia Naznin
Pouchet Louis-Noël
Ramanujam J.
Rastello Fabrice
Ravishankar Mahesh
Rountev Atanas
Sadayappan P.
Publication venue
Publication date: 01/12/2013
Field of study

Emerging computer architectures will feature drastically decreased flops/byte (ratio of peak processing rate to memory bandwidth) as highlighted by recent studies on Exascale architectural trends. Further, flops are getting cheaper while the energy cost of data movement is increasingly dominant. The understanding and characterization of data locality properties of computations is critical in order to guide efforts to enhance data locality. Reuse distance analysis of memory address traces is a valuable tool to perform data locality characterization of programs. A single reuse distance analysis can be used to estimate the number of cache misses in a fully associative LRU cache of any size, thereby providing estimates on the minimum bandwidth requirements at different levels of the memory hierarchy to avoid being bandwidth bound. However, such an analysis only holds for the particular execution order that produced the trace. It cannot estimate potential improvement in data locality through dependence preserving transformations that change the execution schedule of the operations in the computation. In this article, we develop a novel dynamic analysis approach to characterize the inherent locality properties of a computation and thereby assess the potential for data locality enhancement via dependence preserving transformations. The execution trace of a code is analyzed to extract a computational directed acyclic graph (CDAG) of the data dependences. The CDAG is then partitioned into convex subsets, and the convex partitioning is used to reorder the operations in the execution trace to enhance data locality. The approach enables us to go beyond reuse distance analysis of a single specific order of execution of the operations of a computation in characterization of its data locality properties. It can serve a valuable role in identifying promising code regions for manual transformation, as well as assessing the effectiveness of compiler transformations for data locality enhancement. We demonstrate the effectiveness of the approach using a number of benchmarks, including case studies where the potential shown by the analysis is exploited to achieve lower data movement costs and better performance.Comment: Transaction on Architecture and Code Optimization (2014

arXiv.org e-Print Archive

HAL-ENS-LYON

INRIA a CCSD electronic archive server

Hal-Diderot

On Characterizing the Data Access Complexity of Programs

Author: Bilardi G.
Bilardi G.
Elango V.
Fuller S. H.
Savage J. E.
Scquizzato M.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 09/11/2014
Field of study

Technology trends will cause data movement to account for the majority of energy expenditure and execution time on emerging computers. Therefore, computational complexity will no longer be a sufficient metric for comparing algorithms, and a fundamental characterization of data access complexity will be increasingly important. The problem of developing lower bounds for data access complexity has been modeled using the formalism of Hong & Kung's red/blue pebble game for computational directed acyclic graphs (CDAGs). However, previously developed approaches to lower bounds analysis for the red/blue pebble game are very limited in effectiveness when applied to CDAGs of real programs, with computations comprised of multiple sub-computations with differing DAG structure. We address this problem by developing an approach for effectively composing lower bounds based on graph decomposition. We also develop a static analysis algorithm to derive the asymptotic data-access lower bounds of programs, as a function of the problem size and cache size

arXiv.org e-Print Archive

Crossref

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

Recommended from our members

Accurate modeling of core and memory locality for proxy generation targeting emerging applications and architectures

Author: Panda Reena
Publication venue
Publication date: 21/03/2018
Field of study

Designing optimal computer systems for improved performance and energy efficiency requires architects and designers to have a deep understanding of the end-user workloads. However, many end-users (e.g., large corporations, banks, defense organizations, etc.) are apprehensive to share their applications with designers due to the confidential nature of software code and data. In addition, emerging applications pose significant challenges to early design space exploration due to their long-running nature and the highly complex nature of their software stack that cannot be supported on many early performance models. The above challenges can be overcome by using a proxy benchmark. A miniaturized proxy benchmark can be used as a substitute of the original workload to perform early computer performance evaluation. The process of generating a proxy benchmark consists of extracting a set of key statistics to summarize the behavior of end-user applications through profiling and using the collected statistics to synthesize a representative proxy benchmark. Using such proxy benchmarks can help designers to understand the behavior of end-user’s workloads in a reasonable time without the users having to disclose sensitive information about their workloads. Prior proxy benchmarking schemes leverage micro-architecture independent metrics, derived from detailed simulation tools, to generate proxy benchmarks. However, many emerging workloads do not work reliably with many profiling or simulation tools, in which case it becomes impossible to apply prior proxy generation techniques to generate proxy benchmarks for such complex applications. Furthermore, these techniques model instruction pipeline-level locality in great detail, but abstract out memory locality modeling using simple stride-based models. This results in poor cloning accuracy especially for emerging applications, which have larger memory footprints and complex access patterns. A few detailed cache and memory locality modeling techniques have also been proposed in literature. However, these techniques either model limited locality metrics and suffer from poor cloning accuracy or are fairly accurate, but at the expense of significant metadata overhead. Finally, none of the prior proxy benchmarking techniques model both core and memory locality with high accuracy. As a result, they are not useful for studying system-level performance behavior. Keeping the above key limitations and shortcomings of prior work in mind, this dissertation presents several techniques that expand the frontiers of workload proxy benchmarking, thereby enabling computer designers to gain a better and faster understanding of end-user application behavior without compromising the privileged nature of software or data. This dissertation first presents a core-level proxy benchmark generation methodology that leverages performance metrics derived from hardware performance counter measurements to create miniature proxy benchmarks targeting emerging big-data applications. The presented performance counter based characterization and associated extrapolation into generic parameters for proxy generation enables faster analysis (runs almost at native hardware speeds, unlike prior workload cloning proposals) and proxy generation for emerging applications that do not work with simulators or profiling tools. The generated proxy benchmarks are representative of the performance of the real-world big-data applications, including operating system and run-time effects, and yet converge to results quickly without needing any complex software stack support. Next, to improve upon the accuracy and efficiency of prior memory proxy benchmarking techniques, this dissertation presents a novel memory locality modeling technique that leverages localized pattern detection to create miniature memory proxy benchmarks. The presented technique models memory reference locality by decomposing an application’s memory accesses into a set of independent streams (localized by using address region based localization property), tracking fine-grained patterns within the localized streams and, finally, chaining or interleaving accesses from different localized memory streams to create an ordered proxy memory access sequence. This dissertation further extends the workload cloning approach to Graphics Processing Units (GPUs) and presents a novel proxy generation methodology to model the inherent memory access locality of GPU applications, while also accounting for the GPU’s parallel execution model. The generated memory proxy benchmarks help to enable fast and efficient design space exploration of futuristic memory hierarchies. Finally, this dissertation presents a novel technique to integrate accurate core and memory locality models to create system-level proxy benchmarks targeting emerging applications. This is a new capability that can facilitate efficient overall system (core, cache and memory subsystem) design-space exploration. This dissertation further presents a novel methodology that exploits the synthetic benchmark generation framework to create hypothetical workloads with performance behavior that does not currently exist. Such proxies can be generated to cover anticipated code trends and can represent futuristic workloads before the workloads even exist.Electrical and Computer Engineerin

Texas ScholarWorks

On Characterizing the Data Movement Complexity of Computational DAGs for Parallel Execution

Author: Elango Venmugil
Pouchet Louis-Noël
Ramanujam J.
Rastello Fabrice
Sadayappan P.
Publication venue
Publication date: 01/01/2014
Field of study

Technology trends are making the cost of data movement increasingly dominant, both in terms of energy and time, over the cost of performing arithmetic operations in computer systems. The fundamental ratio of aggregate data movement bandwidth to the total computational power (also referred to the machine balance parameter) in parallel computer systems is decreasing. It is there- fore of considerable importance to characterize the inherent data movement requirements of parallel algorithms, so that the minimal architectural balance parameters required to support it on future systems can be well understood. In this paper, we develop an extension of the well-known red-blue pebble game to develop lower bounds on the data movement complexity for the parallel execution of computational directed acyclic graphs (CDAGs) on parallel systems. We model multi-node multi-core parallel systems, with the total physical memory distributed across the nodes (that are connected through some interconnection network) and in a multi-level shared cache hierarchy for processors within a node. We also develop new techniques for lower bound characterization of non-homogeneous CDAGs. We demonstrate the use of the methodology by analyzing the CDAGs of several numerical algorithms, to develop lower bounds on data movement for their parallel execution

arXiv.org e-Print Archive

CiteSeerX

Crossref

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

Investigating Single Precision Floating General Matrix Multiply in Heterogeneous Hardware

Author: Harris Steven
Publication venue: Washington University Open Scholarship
Publication date: 01/08/2020
Field of study

The fundamental operation of matrix multiplication is ubiquitous across a myriad of disciplines. Yet, the identification of new optimizations for matrix multiplication remains relevant for emerging hardware architectures and heterogeneous systems. Frameworks such as OpenCL enable computation orchestration on existing systems, and its availability using the Intel High Level Synthesis compiler allows users to architect new designs for reconfigurable hardware using C/C++. Using the HARPv2 as a vehicle for exploration, we investigate the utility of several of the most notable matrix multiplication optimizations to better understand the performance portability of OpenCL and the implications for such optimizations on this and future heterogeneous architectures. Our results give targeted insights into the applicability of best practices that were for existing architectures when used on emerging heterogeneous systems

Washington University St. Louis: Open Scholarship

Multi-Device Controllers: A Library To Simplify The Parallel Heterogeneous Programming

Author: González Escribano Arturo
Llanos Ferraris Diego Rafael
Moretón Fernández Ana
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017
Field of study

Producción CientíficaCurrent HPC clusters are composed by several machines with different computation capabilities and different kinds and families of accelerators. Programming efficiently for these heterogeneous systems has become an important challenge. There are many proposals to simplify the programming and management of accelerator devices, and the hybrid programming, mixing accelerators and CPU cores. However, in many cases, portability compromises the efficiency on different devices, and there are details concerning the coordination of different types of devices that should still be tackled by the programmer. In this work, we introduce the Multi-Controller, an abstract entity implemented in a library that coordinates the management of heterogeneous devices, including accelerators with different capabilities and sets of CPU-cores. Our proposal improves state-of-the-art solutions, simplifying data partition, mapping and the transparent deployment of both, simple generic kernels portable across different device types, and specialized implementations defined and optimized using specific native or vendor programming models (such as CUDA for NVIDIA’s GPUs, or OpenMP for CPU-cores). The run-time system automatically selects and deploys the most appropriate implementation of each kernel for each device, managing data movements and hiding the launch details. The results of an experimental study with five study cases indicates that our abstraction allows the development of flexible and highly efficient programs that adapt to the heterogeneous environment.2020-01-012020-01-01MICINN (Spain) and ERDF program of the European Union: HomProg-HetSys project (TIN2014-58876-P), CAPAP-H6 (TIN2016-81840-REDT), and COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS)

Repositorio Documental de la Universidad de Valladolid

Cache-conscious Splitting of MapReduce Tasks and its Application to Stencil Computations

Author: Magro Daniel Lobato Vieira
Publication venue
Publication date: 01/01/2015
Field of study

Modern cluster systems are typically composed by nodes with multiple processing units and memory hierarchies comprising multiple cache levels of various sizes. To leverage the full potential of these architectures it is necessary to explore concepts such as parallel programming and the layout of data onto the memory hierarchy. However, the inherent complexity of these concepts and the heterogeneity of the target architectures raises several challenges at application development and performance portability levels, respectively. In what concerns parallel programming, several model and frameworks are available, of which MapReduce [16] is one of the most popular. It was developed at Google [16] for the parallel and distributed processing of large amounts of data in large clusters of commodity machines. Although being very powerful tools, the reference MapReduce frameworks, such as Hadoop and Spark, do not leverage the characteristics of the underlying memory hierarchy. This shortcoming is particularly noticeable in computations that benefit from temporal locality, such as stencil computations. In this context, the goal of this thesis is to improve the performance of MapReduce computations that benefit from temporal locality. To that end we optimize the mapping of MapReduce computations in a machine’s cache memory hierarchy by applying cacheaware tiling techniques. We prototyped our solution on top of the framework Hadoop MapReduce, incorporating a cache-awareness in the splitting stage. To validate our solution and assess its benefits, we developed an API for expressing stencil computations on top the developed framework. The experimental results show that, for a typical stencil computation, our solution delivers an average speed-up of 1.77 while reaching a peek speed-up of 3.2. These findings allows us to conclude that cacheaware decomposition of MapReduce computations considerably boosts the execution of this class of MapReduce computations

Repositório da Universidade Nova de Lisboa