
    Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives

    This is a post-peer-review, pre-copyedit version of an article published in International Journal of Parallel Programming. The final authenticated version is available online at: https://doi.org/10.1007/s10766-015-0362-9

    [Abstract] The use of GPUs for general-purpose computation has increased dramatically in recent years due to the rising demand for computing power and their tremendous computing capacity at low cost. Hence, new programming models have been developed to integrate these accelerators with high-level programming languages, giving rise to heterogeneous computing systems. Unfortunately, this heterogeneity is also exposed to the programmer, complicating its exploitation. This paper presents a new technique to automatically rewrite sequential programs into a parallel counterpart targeting GPU-based heterogeneous systems. The original source code is analyzed through domain-independent computational kernels, which hide the complexity of the implementation details by presenting a non-statement-based, high-level, hierarchical representation of the application. Next, a locality-aware technique based on standard compiler transformations is applied to the original code through OpenHMPP directives. Two representative case studies from scientific applications have been selected: the three-dimensional discrete convolution and the single-precision general matrix multiplication. The effectiveness of our technique is corroborated by a performance evaluation on NVIDIA GPUs.

    Funding: Ministerio de Economía y Competitividad, TIN2010-16735 and TIN2013-42148-P; Galicia, Consellería de Cultura, Educación e Ordenación Universitaria, GRC2013-055; Ministerio de Educación, AP2008-0101

    Généralisation de l’analyse de performance décrémentale vers l’analyse différentielle

    A crucial step in the process of application performance analysis is the accurate detection of program bottlenecks. A bottleneck is any event that contributes to extending the execution time. Determining its causes is important for application developers, as it enables them to detect code design and generation flaws. Bottleneck detection is becoming a difficult art. Techniques such as event counts, which easily found bottlenecks in the past, have become less effective because of the increasing complexity of modern microprocessors and the introduction of parallelism at several levels. Consequently, there is a real need for new analysis approaches to face these challenges. Our work focuses on performance analysis and bottleneck detection of compute-intensive loops in scientific applications. We work on Decan, a performance analysis and bottleneck detection tool, which offers an interesting and promising approach called Decremental Analysis. The tool, which operates at the binary level, is based on the idea of performing controlled modifications on the instructions of a loop and comparing the new version (called a variant) to the original one. The goal is to assess the cost of specific events, and thus the existence or not of bottlenecks. Our first contribution consists of extending Decan with new variants that we designed, tested and validated. Based on these variants, we developed analysis methods which we used to characterize hot loops and find their bottlenecks. We later integrated the tool into a performance analysis methodology (Pamda) which coordinates several analysis tools in order to achieve a more efficient application performance analysis. Second, we introduce several improvements to the Decan tool. Techniques developed to preserve the control flow of the modified programs allowed us to use the tool on real applications instead of extracted kernels. Support for parallel programs (thread- and process-based) was also added.
    Finally, since our tool primarily relies on execution time as the main metric for its analysis process, we study the opportunity of also using other hardware-generated events, through a study of their stability, precision and overhead.

    Architecting, programming, and evaluating an on-chip incoherent multi-processor memory hierarchy

    New architectures for extreme-scale computing need to be designed for higher energy efficiency than current systems. The DOE-funded Traleika Glacier architecture is a recently proposed extreme-scale manycore that radically simplifies the architecture, and proposes a cluster-based on-chip memory hierarchy without hardware cache coherence. Programming for such an environment, which can use scratchpads or incoherent caches, is challenging. Hence, this thesis focuses on architecting, programming, and evaluating an on-chip incoherent multiprocessor memory hierarchy. This thesis starts by examining incoherent multiprocessor caches. It proposes ISA support for data movement in such an environment, and two relatively user-friendly programming approaches that use the ISA. The ISA support is largely based on writeback and self-invalidation instructions, while the programming approaches involve shared-memory programming either inside a cluster only, or across clusters. The thesis also includes compiler transformations for such an incoherent cache hierarchy. Our simulation results show that, with our approach, the execution of applications on incoherent cache hierarchies can deliver reasonable performance. For execution within a cluster, the average execution time of our applications is only 2% higher than with hardware cache coherence. For execution across multiple clusters, our applications run on average 20% faster than a naive scheme that pushes all the data to the last-level shared cache. Compiler transformations for both regular and irregular applications are shown to deliver substantial performance increases. This thesis then considers scratchpads. It takes the design in the Traleika Glacier architecture and performs a simulation-based evaluation. It shows how the hardware exploits available concurrency from parallel applications. However, it also shows the limitations of the current software stack, which lacks smart memory management and high-level hints for the scheduler.

    Combiner approches statique et dynamique pour modéliser la performance de boucles HPC

    The complexity of CPUs has increased considerably since their beginnings, introducing mechanisms such as register renaming, out-of-order execution, vectorization, prefetchers and multi-core environments to keep performance rising with each product generation. However, so has the difficulty in making proper use of all these mechanisms, or even in evaluating whether a program makes good use of a machine, whether users' needs match a CPU's design, or, for CPU architects, knowing how each feature really affects customers. This thesis focuses on increasing the observability of potential bottlenecks in HPC computational loops and how they relate to each other in modern microarchitectures. We first introduce a framework combining CQA and DECAN (respectively static and dynamic analysis tools) to get detailed performance metrics on small codelets in various execution scenarios. We then present PAMDA, a performance analysis methodology leveraging elements obtained from codelet analysis to detect potential performance problems in HPC applications and help resolve them. A work extending the Cape linear model to better cover Sandy Bridge and give it more flexibility for HW/SW codesign purposes is also described. It is directly used in VP3, a tool evaluating the performance gains that vectorizing loops could provide. Finally, we describe UFS, an approach combining static analysis and cycle-accurate simulation to very quickly estimate a loop's execution time while accounting for out-of-order limitations in modern CPUs.

    Matching non-uniformity for program optimizations on heterogeneous many-core systems

    As computing enters an era of heterogeneity and massive parallelism, it exhibits a distinct feature: the deepening non-uniform relations among the computing elements in both hardware and software. Besides traditional non-uniform memory accesses, much deeper non-uniformity shows in a processor, runtime, and application, exemplified by the asymmetric cache sharing, memory coalescing, and thread divergences on multicore and many-core processors. Being oblivious to the non-uniformity, current applications fail to tap into the full potential of modern computing devices. My research presents a systematic exploration of this emerging property. It examines the existence of such a property in modern computing, its influence on computing efficiency, and the challenges of establishing a non-uniformity-aware paradigm. I propose several techniques to translate the property into efficiency, including data reorganization to eliminate non-coalesced accesses, asynchronous data transformations for locality enhancement, and controllable scheduling for exploiting non-uniformity among thread blocks. The experiments show much promise for these techniques in maximizing computing throughput, especially for programs with complex data access patterns.

    Studies on automatic parallelization for heterogeneous and homogeneous multicore processors

    Degree system: new; report number: Kou 3537; degree type: Doctor of Engineering; date conferred: 2012/2/25; Waseda University degree record number: Shin 587

    Portability and performance in heterogeneous many core Systems

    Master's dissertation in Informatics. Current computing systems have a multiplicity of computational resources with different architectures, such as multi-core CPUs and GPUs. These platforms are known as heterogeneous many-core systems (HMS) and, as computational resources evolve, they are offering more parallelism as well as becoming more heterogeneous. Exploring these devices requires the programmer to be aware of the multiplicity of associated architectures, computing models and development frameworks. Portability issues, disjoint memory address spaces, work distribution and irregular workload patterns are major examples that need to be tackled in order to efficiently explore the computational resources of an HMS. This dissertation's goal is to design and evaluate a base architecture that enables the identification and preliminary evaluation of the potential bottlenecks and limitations of a runtime system that addresses HMS. It proposes a runtime system that eases the programmer's burden of handling all the devices available in a heterogeneous system. The runtime provides a programming and execution model with a unified address space managed by a data management system. An API is proposed to enable the programmer to express applications and data in an intuitive way. Four different scheduling approaches are evaluated that combine different data partitioning mechanisms with different work assignment policies, and a performance model is used to provide some performance insights to the scheduler. The runtime's efficiency was evaluated with three different applications - matrix multiplication, image convolution and n-body Barnes-Hut simulation - running on multicore CPUs and GPUs. In terms of productivity the results look promising; however, combining scheduling and data partitioning revealed some inefficiencies that compromise load balancing and need to be revised, as does the data management system, which plays a crucial role in such systems. Performance-model-driven decisions were also evaluated, which revealed that the accuracy of the performance model is also a compromising component.