20 research outputs found

    Automatic Generation of Models of Microarchitectures

    Detailed microarchitectural models are necessary to predict, explain, or optimize the performance of software running on modern microprocessors. Building such models often requires a significant manual effort, as the documentation provided by hardware manufacturers is typically not precise enough. The goal of this thesis is to develop techniques for generating microarchitectural models automatically. In the first part, we focus on recent x86 microarchitectures. We implement a tool to accurately evaluate small microbenchmarks using hardware performance counters. We then describe techniques to automatically generate microbenchmarks for measuring the performance of individual instructions and for characterizing cache architectures. We apply our implementations to more than a dozen different microarchitectures. In the second part of the thesis, we study more general techniques to obtain models of hardware components. In particular, we propose the concept of gray-box learning, and we develop a learning algorithm for Mealy machines that exploits prior knowledge about the system to be learned. Finally, we show how this algorithm can be adapted to minimize incompletely specified Mealy machines, a well-known NP-complete problem. Our implementation outperforms existing exact minimization techniques by several orders of magnitude on a number of hard benchmarks; it is even competitive with state-of-the-art heuristic approaches.
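    For readers unfamiliar with the term, a Mealy machine is a finite-state machine whose output depends on both the current state and the current input. The sketch below is a toy illustration only and is not taken from the thesis: the `run_mealy` helper, the state names, and the one-entry "cache" it models are all hypothetical, chosen to show the kind of input/output behavior that a learning algorithm would have to reconstruct.

```python
from typing import Dict, List, Tuple

# (state, input) -> (next_state, output); all names here are illustrative.
Transitions = Dict[Tuple[str, str], Tuple[str, str]]


def run_mealy(transitions: Transitions, start: str, inputs: List[str]) -> List[str]:
    """Feed the inputs to the machine and collect one output per input."""
    state = start
    outputs = []
    for symbol in inputs:
        state, out = transitions[(state, symbol)]
        outputs.append(out)
    return outputs


# Hypothetical one-entry cache: it remembers only the last block accessed.
one_entry_cache: Transitions = {
    ("empty", "a"): ("has_a", "miss"),
    ("empty", "b"): ("has_b", "miss"),
    ("has_a", "a"): ("has_a", "hit"),
    ("has_a", "b"): ("has_b", "miss"),
    ("has_b", "a"): ("has_a", "miss"),
    ("has_b", "b"): ("has_b", "hit"),
}

trace = run_mealy(one_entry_cache, "empty", ["a", "a", "b", "a"])
print(trace)
```

    Here only the second access is a hit, because every other access touches a block that is not currently stored.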

    Heterogeneous system and application communication modeling

    With the end of Dennard scaling, high-performance computing increasingly relies on heterogeneous systems with specialized hardware to improve application performance. This trend has driven up the complexity of high-performance software development, as developers must manage multiple programming systems and develop system-tuned code to utilize specialized hardware. In addition, it has exacerbated existing challenges of data placement as the specialized hardware often has local memories to fuel its computational demands. In addition to using appropriate software resources to target application computation at the best hardware for the job, application developers now must manage data movement and placement within their application, which also must be specifically tuned to the target system. Instead of relying on the application developer to have specialized knowledge of system characteristics and specialized expertise in multiple programming systems, this work proposes a heterogeneous system communication library that automatically chooses data location and data movement for high-performance application development and execution on heterogeneous systems. This work presents the foundational components of that library: a systematic approach for characterization of system communication links and application communication demands

    Efficient Characterization of Hidden Processor Memory Hierarchies

    A processor's memory hierarchy has a major impact on the performance of running code. However, computing platforms in which the actual hardware characteristics are hidden from both the end user and the tools that mediate execution (such as a compiler, a JIT, and a runtime system) are used more and more, for example for large-scale computation in clouds and clusters. Even worse, in such environments, a single computation may use a collection of processors with dissimilar characteristics. Ignorance of the performance-critical parameters of the underlying system makes it difficult to improve performance by optimizing the code or adjusting runtime-system behaviors; it also makes application performance harder to understand. To address this problem, we have developed a suite of portable tools that can efficiently derive many of the parameters of processor memory hierarchies, such as levels, effective capacity, and latency of caches and TLBs, in a matter of seconds. The tools use a series of carefully considered experiments to produce and analyze cache response curves automatically. The tools are inexpensive enough to be used in a variety of contexts, which may include install-time, compile-time, or runtime adaptation, or performance-understanding tools. Comment: 14 pages, International Conference on Computational Science 201
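    As a rough illustration of what analyzing a cache response curve can mean, the sketch below scans a list of (working-set size, access latency) points and reports the sizes after which latency jumps sharply. It is an assumption-laden toy rather than the paper's method: the function name `find_capacity_steps`, the `jump_ratio` threshold, and the data points are all invented for this example.

```python
from typing import List, Tuple


def find_capacity_steps(curve: List[Tuple[int, float]], jump_ratio: float = 1.5) -> List[int]:
    """Return the working-set sizes after which latency rises by at least jump_ratio.

    `curve` must be sorted by increasing working-set size.
    """
    steps = []
    for (size, latency), (_next_size, next_latency) in zip(curve, curve[1:]):
        if next_latency >= jump_ratio * latency:
            # The last size before the jump approximates an effective capacity.
            steps.append(size)
    return steps


# Synthetic, made-up curve: (working-set size in KiB, latency in cycles).
curve = [(8, 4), (16, 4), (32, 4), (64, 12), (128, 12), (256, 12), (512, 40), (1024, 40)]

print(find_capacity_steps(curve))
```

    On this synthetic data the function reports 32 and 256, the last sizes before each latency plateau ends. Real measurements are noisy, so a practical tool would need smoothing and repeated experiments rather than a single fixed threshold.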

    On the Overhead of Topology Discovery for Locality-aware Scheduling in HPC

    The increasing complexity of parallel computing platforms requires a deep knowledge of the hardware and of the application needs. Locality is a key criterion for performance optimization. It involves software tools that expose information about the hardware topology to high-performance runtime libraries. We show that the overhead of gathering such information from the operating system is significant on large computing nodes that run Linux. This overhead also increases more than linearly with the number of processes that perform it simultaneously. We then study the actual needs of the HPC software ecosystem in terms of topology information. We propose some ways to avoid multiple expensive topology discoveries and to share topology information between components such as the resource manager or the runtime libraries

    Towards the Structural Modeling of the Topology of next-generation heterogeneous cluster Nodes with hwloc

    Parallel computing platforms are increasingly complex, with multiple cores, shared caches, and NUMA memory interconnects, as well as asymmetric I/O access. Upcoming architectures will add a heterogeneous memory subsystem with non-volatile and/or high-bandwidth memory banks. Parallel application developers have to take locality into account before they can expect good efficiency on these platforms. Thus there is a strong need for a portable tool that gathers and exposes this information. The Hardware Locality project (hwloc) offers a tree representation of the hardware based on the inclusion of CPU resources and localities of memory and I/O devices. It is already widely used for affinity-based task placement in high-performance computing. We present how hwloc represents parallel computing nodes, from the hierarchy of computing and memory resources to I/O device locality. It builds a structural model of the hardware to help applications find the resources that best fit their needs. hwloc also annotates objects to ease identification of resources from different programming points of view. We finally describe how it helps process managers and batch schedulers to deal with the topology of multiple cluster nodes, by offering different compression techniques for better management of thousands of nodes

    Exposing the Locality of Heterogeneous Memory Architectures to HPC Applications

    High-performance computing requires a deep knowledge of the hardware platform to fully exploit its computing power. The performance of data transfer between cores and memory is becoming critical. Therefore locality is a major area of optimization on the road to exascale. Indeed, tasks and data have to be carefully distributed on the computing and memory resources. We discuss the current way to expose processor and memory locality information in the Linux kernel and in user-space libraries such as the hwloc software project. The current de facto standard structural modeling of the platform as a tree is not perfect, but it offers a good compromise between precision and convenience for HPC runtimes. We present an in-depth study of the software view of the upcoming Intel Knights Landing processor. Its memory locality cannot be properly exposed to user-space applications without a significant rework of the current software stack. We propose an extension of the current hierarchical platform model in hwloc. It correctly exposes new heterogeneous architectures with high-bandwidth or non-volatile memories to applications, while still being convenient for affinity-aware HPC runtimes

    Software Performance Engineering using Virtual Time Program Execution

    In this thesis we introduce a novel approach to software performance engineering that is based on the execution of code in virtual time. Virtual time execution models the timing behaviour of unmodified applications by scaling observed method times or replacing them with results acquired from performance model simulation. This facilitates the investigation of "what-if" performance predictions of applications comprising an arbitrary combination of real code and performance models. The ability to analyse code and models in a single framework enables performance testing throughout the software lifecycle, without the need to extract performance models from code. This is accomplished by forcing thread scheduling decisions to take into account the hypothetical time-scaling or model-based performance specifications of each method. The virtual time execution of I/O operations or multicore targets is also investigated. We explore these ideas using a Virtual EXecution (VEX) framework, which provides performance predictions for multi-threaded applications. The language-independent VEX core is driven by an instrumentation layer that notifies it of thread state changes and method profiling events; it is then up to VEX to control the progress of application threads in virtual time on top of the operating system scheduler. We also describe a Java Instrumentation Environment (JINE), demonstrating the challenges involved in virtual time execution at the JVM level. We evaluate the VEX/JINE tools by executing client-side Java benchmarks in virtual time and identifying the causes of deviations from observed real times. Our results show that VEX and JINE transparently provide predictions for the response time of unmodified applications with typically good accuracy (within 5-10%) and low simulation overheads (25-50% additional time). We conclude this thesis with a case study that shows how models and code can be integrated, thus illustrating our vision of how virtual time execution can support performance testing throughout the software lifecycle

    The Servet 3.0 benchmark suite: characterization of network performance degradation

    This is a post-peer-review, pre-copyedit version of an article published in Computers & Electrical Engineering. The final authenticated version is available online at: https://doi.org/10.1016/j.compeleceng.2013.08.012. [Abstract] Servet is a suite of benchmarks focused on extracting a set of parameters with high influence on the overall performance of multicore clusters. These parameters can be used to optimize the performance of parallel applications by adapting part of their behavior to the characteristics of the machine. Up to now the tool considered network bandwidth as constant and independent of the communication pattern. Nevertheless, the inter-node communication bandwidth decreases on modern large supercomputers depending on the number of cores per node that simultaneously access the network and on the distance between the communicating nodes. This paper describes two new benchmarks that improve Servet by characterizing the network performance degradation depending on these factors. This work also shows the experimental results of these benchmarks on a Cray XE6 supercomputer and some examples of how real parallel codes can be optimized by using the information about network degradation. Ministerio de Ciencia e Innovación; TIN2010-16735. Ministerio de Educación; AP2008-01578. Ministerio de Educación; AP2010-4348. European Commission; HPC-Europa2 Programme; 22839

    Compilation techniques and language support to facilitate dependence-driven computation

    As the demand increases for high performance and power efficiency in modern computer runtime systems and architectures, programmers are left with the daunting challenge of fully exploiting these systems for efficiency, high-level expressibility, and portability across different computing architectures. Emerging programming models such as the task-based runtime StarPU and many-core architectures such as GPUs force programmers either to choose low-level programming languages or to put complete faith in the compiler. As has been previously studied in extensive detail, both development approaches have their own trade-offs. The goal of this thesis is to help make parallel programming easier. It addresses these challenges by providing new compilation techniques for high-level programming languages that conform to commonly accepted paradigms in order to leverage these emerging runtime systems and architectures. In particular, this dissertation uses the high-level programming language Chapel to efficiently map computation and data onto both the task-based runtime system StarPU and GPU-based accelerators. Different loop-based parallel programs and experiments are evaluated in order to measure the effectiveness of the proposed compiler algorithms and their optimizations, while also providing programmability metrics when leveraging high-level languages. In order to exploit additional performance when mapping onto shared-memory systems, this thesis proposes a set of compiler and runtime-based heuristics that determine the profitable processor tile shapes and sizes when mapping multiply-nested parallel loops. Finally, a new benchmark suite named P-Ray is presented. It provides machine characteristics in a portable manner that can be used by a compiler, an auto-tuning framework, or the programmer when optimizing applications