Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives
This is a post-peer-review, pre-copyedit version of an article published in the International Journal of Parallel Programming. The final authenticated version is available online at: https://doi.org/10.1007/s10766-015-0362-9
[Abstract] The use of GPUs for general-purpose computation has increased dramatically in recent years, driven by the rising demand for computing power and by their tremendous computing capacity at low cost. New programming models have therefore been developed to integrate these accelerators with high-level programming languages, giving rise to heterogeneous computing systems. Unfortunately, this heterogeneity is also exposed to the programmer, complicating its exploitation. This paper presents a new technique to automatically rewrite sequential programs into parallel counterparts targeting GPU-based heterogeneous systems. The original source code is analyzed through domain-independent computational kernels, which hide the complexity of the implementation details by presenting a non-statement-based, high-level, hierarchical representation of the application. Next, a locality-aware technique based on standard compiler transformations is applied to the original code through OpenHMPP directives. Two representative case studies from scientific applications have been selected: the three-dimensional discrete convolution and the single-precision general matrix multiplication. The effectiveness of our technique is corroborated by a performance evaluation on NVIDIA GPUs.
Funding: Ministerio de Economía y Competitividad, TIN2010-16735; Ministerio de Economía y Competitividad, TIN2013-42148-P; Galicia, Consellería de Cultura, Educación e Ordenación Universitaria, GRC2013-055; Ministerio de Educación, AP2008-0101
Generalizing decremental performance analysis toward differential analysis
A crucial step in the process of application performance analysis is the accurate detection of program bottlenecks. A bottleneck is any event that contributes to extending the execution time. Determining its cause is important for application developers, as it enables them to detect code design and generation flaws. Bottleneck detection is becoming a difficult art. Techniques such as event counts, which used to find bottlenecks easily, have become less effective because of the increasing complexity of modern microprocessors and the introduction of parallelism at several levels. Consequently, there is a real need for new analysis approaches to face these challenges. Our work focuses on performance analysis and bottleneck detection in compute-intensive loops of scientific applications. We work on Decan, a performance analysis and bottleneck detection tool, which offers an interesting and promising approach called Decremental Analysis. The tool, which operates at the binary level, is based on the idea of performing controlled modifications on the instructions of a loop and comparing the new version (called a variant) to the original one. The goal is to assess the cost of specific events, and thus the existence or absence of bottlenecks. Our first contribution consists of extending Decan with new variants that we designed, tested and validated. Based on these variants, we developed analysis methods that we used to characterize hot loops and find their bottlenecks. We later integrated the tool into a performance analysis methodology (Pamda) which coordinates several analysis tools in order to achieve a more efficient application performance analysis. Second, we introduce several improvements to the Decan tool. Techniques developed to preserve the control flow of the modified programs allowed the tool to be used on real applications instead of extracted kernels. Support for parallel programs (thread- and process-based) was also added.
Finally, since our tool primarily relies on execution time as the main metric in its analysis process, we study the opportunity of also using other hardware-generated events, through a study of their stability, precision and overhead.
Architecting, programming, and evaluating an on-chip incoherent multi-processor memory hierarchy
New architectures for extreme-scale computing need to be designed for higher energy efficiency than current systems. The DOE-funded Traleika Glacier architecture is a recently-proposed extreme-scale manycore that radically simplifies the architecture, and proposes a cluster-based on-chip memory hierarchy without hardware cache coherence. Programming for such an environment, which can use scratchpads or incoherent caches, is challenging. Hence, this thesis focuses on architecting, programming, and evaluating an on-chip incoherent multiprocessor memory hierarchy.
This thesis starts by examining incoherent multiprocessor caches. It proposes ISA support for data movement in such an environment, and two relatively user-friendly programming approaches that use the ISA. The ISA support is largely based on writeback and self-invalidation instructions, while the programming approaches involve shared-memory programming either inside a cluster only, or across clusters. The thesis also includes compiler transformations for such an incoherent cache hierarchy.
Our simulation results show that, with our approach, the execution of applications on incoherent cache hierarchies can deliver reasonable performance. For execution within a cluster, the average execution time of our applications is only 2% higher than with hardware cache coherence. For execution across multiple clusters, our applications run on average 20% faster than a naive scheme that pushes all the data to the last-level shared cache. Compiler transformations for both regular and irregular applications are shown to deliver substantial performance increases.
This thesis then considers scratchpads. It takes the design in the Traleika Glacier architecture and performs a simulation-based evaluation. It shows how the hardware exploits available concurrency from parallel applications. However, it also shows the limitations of the current software stack, which lacks smart memory management and high-level hints for the scheduler.
Combining static and dynamic approaches to model the performance of HPC loops
The complexity of CPUs has increased considerably since their beginnings, introducing mechanisms such as register renaming, out-of-order execution, vectorization, prefetchers and multi-core environments to keep performance rising with each product generation. However, so has the difficulty in making proper use of all these mechanisms, or even in evaluating whether one's program makes good use of a machine, whether users' needs match a CPU's design, or, for CPU architects, knowing how each feature really affects customers. This thesis focuses on increasing the observability of potential bottlenecks in HPC computational loops and how they relate to each other in modern microarchitectures. We first introduce a framework combining CQA and DECAN (respectively static and dynamic analysis tools) to get detailed performance metrics on small codelets in various execution scenarios. We then present PAMDA, a performance analysis methodology leveraging elements obtained from codelet analysis to detect potential performance problems in HPC applications and help resolve them. A work extending the Cape linear model to better cover Sandy Bridge and give it more flexibility for HW/SW codesign purposes is also described; it is directly used in VP3, a tool evaluating the performance gains that vectorizing loops could provide. Finally, we describe UFS, an approach combining static analysis and cycle-accurate simulation to very quickly estimate a loop's execution time while accounting for out-of-order limitations in modern CPUs.
Matching non-uniformity for program optimizations on heterogeneous many-core systems
As computing enters an era of heterogeneity and massive parallelism, it exhibits a distinct feature: the deepening non-uniform relations among the computing elements in both hardware and software. Beyond traditional non-uniform memory accesses, much deeper non-uniformity appears within processors, runtimes, and applications, exemplified by asymmetric cache sharing, memory coalescing, and thread divergence on multicore and many-core processors. Being oblivious to this non-uniformity, current applications fail to tap into the full potential of modern computing devices. My research presents a systematic exploration of this emerging property. It examines the existence of such a property in modern computing, its influence on computing efficiency, and the challenges of establishing a non-uniformity-aware paradigm. I propose several techniques to translate the property into efficiency, including data reorganization to eliminate non-coalesced accesses, asynchronous data transformations for locality enhancement, and controllable scheduling for exploiting non-uniformity among thread blocks. The experiments show much promise for these techniques in maximizing computing throughput, especially for programs with complex data access patterns.
Studies on automatic parallelization for heterogeneous and homogeneous multicore processors
Degree metadata: system: new; report number: Ko 3537; degree type: Doctor of Engineering; date conferred: 2012/2/25; Waseda University degree register number: Shin 587
Portability and performance in heterogeneous many-core systems
Master's dissertation in Informatics.
Current computing systems have a multiplicity of computational resources with different architectures, such as multi-core CPUs and GPUs. These platforms are known as heterogeneous many-core systems (HMS), and as computational resources evolve they are offering more parallelism, as well as becoming more heterogeneous. Exploring these devices requires the programmer to be aware of the multiplicity of associated architectures, computing models and development frameworks. Portability issues, disjoint memory address spaces, work distribution and irregular workload patterns are major examples that need to be tackled in order to efficiently explore the computational resources of an HMS.
The goal of this dissertation is to design and evaluate a base architecture that enables the identification and preliminary evaluation of the potential bottlenecks and limitations of a runtime system that addresses HMS. It proposes a runtime system that eases the programmer's burden of handling all the devices available in a heterogeneous system. The runtime provides a programming and execution model with a unified address space managed by a data management system. An API is proposed to enable the programmer to express applications and data in an intuitive way. Four different scheduling approaches are evaluated that combine different data partitioning mechanisms with different work assignment policies, and a performance model is used to provide performance insights to the scheduler.
The runtime's efficiency was evaluated with three different applications - matrix multiplication, image convolution and n-body Barnes-Hut simulation - running on multicore CPUs and GPUs.
In terms of productivity the results look promising; however, combining scheduling and data partitioning revealed some inefficiencies that compromise load balancing and need to be revised, as does the data management system, which plays a crucial role in such systems. Performance-model-driven decisions were also evaluated, which revealed that the accuracy of the performance model is also a limiting component.