75 research outputs found
A quasi-cache-aware model for optimal domain partitioning in parallel geometric multigrid
Stencil computations form the heart of numerical simulations to solve Partial Differential Equations using Finite Difference, Finite Element, and Finite Volume methods. Geometric Multigrid is an optimal O(N), hierarchical tool employing stencil computations in its chief constituents, namely, smoothing, restriction, and interpolation. When Multigrid is parallelized over distributed-shared memory architectures, traditionally, the domain partitioning creates cubic partitions of the mesh to minimize overall communication. Thus, the orthodox approach considers only load-balancing and communication minimization for completely determining the domain partitioning. In this article, we show that these two factors are not sufficient to obtain optimal partitions for Parallel Geometric Multigrid. To this effect, we develop and validate a high level analytical model to show that "close to 2-D" partitions for Geometric Multigrid can give higher performance than the partitions returned by the MPI_Dims_create() function which minimizes the communication volume by default. We quantify sub-domain level cache-misses in Parallel Geometric Multigrid and obtain families of optimal domain partitions. We conclude that the sub-domain level cache-misses for the application-specific stencil computational kernel and communicated planes should be taken into account in addition to communication minimization/load-balance to obtain optimal partitions for Parallel Geometric Multigrid.
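To make the communication-only criterion concrete, the following is a minimal sketch, not the article's model. It counts the halo cells one sub-domain exchanges for a given process grid and picks the grid that minimizes that count, which is the kind of balanced factorization the abstract attributes to MPI_Dims_create(). The function names, the one-cell halo, the face-only exchange, and the mesh size are all assumptions made for illustration; cache behaviour is deliberately not modelled.

```python
def factorizations(p):
    """All ordered triples (px, py, pz) with px * py * pz == p."""
    return [(a, b, p // (a * b))
            for a in range(1, p + 1) if p % a == 0
            for b in range(1, p // a + 1) if (p // a) % b == 0]


def halo_cells(n, dims):
    """Cells exchanged per sweep by one sub-domain of an n^3 mesh.

    Assumes a one-cell-deep halo, face neighbours only, two neighbours
    along every split axis, and n divisible by each entry of dims.
    """
    sx, sy, sz = (n // d for d in dims)
    px, py, pz = dims
    return 2 * ((px > 1) * sy * sz + (py > 1) * sx * sz + (pz > 1) * sx * sy)


def min_comm_partition(n, p):
    """Factorization of p that minimizes halo_cells (communication only)."""
    return min(factorizations(p), key=lambda dims: halo_cells(n, dims))


if __name__ == "__main__":
    n, p = 256, 64  # hypothetical mesh edge length and process count
    for dims in (min_comm_partition(n, p), (1, 8, 8)):
        sub = tuple(n // d for d in dims)
        print(dims, "sub-domain", sub, "halo cells", halo_cells(n, dims))
```

Under these assumptions the cubic grid (4, 4, 4) exchanges fewer cells than the "close to 2-D" grid (1, 8, 8), yet the latter leaves one axis unsplit, so each sub-domain keeps the full mesh extent along it. A communication-only metric cannot see that difference, which is the gap the abstract's cache-miss model is meant to address.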
High-performance and hardware-aware computing: proceedings of the first International Workshop on New Frontiers in High-performance and Hardware-aware Computing (HipHaC'08)
The HipHaC workshop aims at combining new aspects of parallel, heterogeneous, and reconfigurable microprocessor technologies with concepts of high-performance computing and, particularly, numerical solution methods. Compute- and memory-intensive applications can only benefit from the full hardware potential if all features on all levels are taken into account in a holistic approach.
Precision analysis for hardware acceleration of numerical algorithms
The precision used in an algorithm affects the error and performance of individual computations, the
memory usage, and the potential parallelism for a fixed hardware budget. However, when migrating
an algorithm onto hardware, the potential improvements that can be obtained by tuning the precision
throughout an algorithm to meet a range or error specification are often overlooked; the major reason
is that it is hard to choose a number system which can guarantee any such specification can be met.
Instead, the problem is mitigated by opting to use IEEE standard double precision arithmetic so as to be
"no worse" than a software implementation. However, the flexibility in the number representation is one
of the key factors that can be exploited on reconfigurable hardware such as FPGAs, and hence ignoring
this potential significantly limits the performance achievable.
In order to optimise the performance of hardware reliably, we require a method that can tractably
calculate tight bounds for the error or range of any variable within an algorithm, but currently only a
handful of methods to calculate such bounds exist, and these either sacrifice tightness or tractability,
whilst simulation-based methods cannot guarantee the given error estimate. This thesis presents a new
method to calculate these bounds, taking into account both input ranges and finite precision effects,
which we show to be, in general, tighter in comparison to existing methods; this in turn can be used to
tune the hardware to the algorithm specifications.
We demonstrate the use of this software to optimise hardware for various algorithms to accelerate
the solution of a system of linear equations, which forms the basis of many problems in engineering
and science, and show that significant performance gains can be obtained by using this new approach in
conjunction with more traditional hardware optimisations.
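The trade-off between tightness and tractability mentioned above can be illustrated with plain interval arithmetic, which is one of the existing bounding approaches and not the method this thesis proposes. The sketch below is an assumption-laden illustration: the `Interval` class is hypothetical, it propagates input ranges only, and it ignores finite precision rounding effects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi]; rounding of the end points is not modelled."""
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))


# Guaranteed but loose bound on y = x*x - x for x in [0, 1].
x = Interval(0.0, 1.0)
bound = x * x - x

# Range observed by sampling: tight here, but not a guarantee in general.
values = [(i / 1000) ** 2 - i / 1000 for i in range(1001)]
sampled = (min(values), max(values))

if __name__ == "__main__":
    print("interval bound:", bound)
    print("sampled range: ", sampled)
```

The interval result is safe but wider than the sampled range because each occurrence of `x` is treated as independent, while the sampled range carries no guarantee. This mirrors the abstract's point that existing bounding methods give up tightness or tractability and that simulation cannot certify its estimate.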
Hardware counter based performance analysis, modelling, and improvement through thread migration in NUMA systems
[EN] Recent years have seen an important evolution in the computational resources available in science and engineering. Currently, most high-performance systems include several multicore processors and use a NUMA (Non-Uniform Memory Access) memory architecture. In this context, data locality becomes a highly important issue for the performance of parallel codes. It is foreseeable that the complexity of SMP (Symmetric Multiprocessing) NUMA systems will increase over the coming years, both in the number of cores and in memory complexity, including the various cache levels, which implies that memory access latency will depend increasingly on the proximity or affinity of the different threads to the memory modules where their data reside. Improving the performance and scalability of parallel codes on multicore architectures may be quite complex. Memory management in parallel codes will thus become more complicated, especially from the point of view of a programmer who wishes to obtain the best performance. The problem worsens in the usual case where different processes execute simultaneously. Automatically migrating executing threads among cores and processors, depending on their behaviour, may improve the performance of parallel programs. Furthermore, it may simplify their development, since the programmer no longer has to manage locality explicitly. Modern microprocessors include registers that give useful information at low cost, usually known as hardware counters (HCs). HCs are not commonly used due to a lack of tools to easily obtain their data. In modern processors, HCs make it possible to obtain the memory access latency during cache miss resolution, and even the memory address that leads to the event. This opens the door to the development of new performance improvement techniques based on this information.
This work presents a procedure to easily and automatically obtain data about the execution of a shared-memory parallel code on SMP multicore and NUMA systems, and to model it using the hardware counters of modern processors alongside additional information, such as the memory access latencies seen by different threads. This procedure is used at runtime, during the execution of a parallel program, to model its performance. This information is then used to improve the efficiency of the execution of such parallel codes automatically and transparently to the user.
[GL] Nowadays, most computing systems are multicore and even multiprocessor. In these systems, the memory access behaviour of each thread towards the different memory nodes is one of the aspects that most significantly affect the performance of any code. This becomes ever more relevant as the so-called "memory wall" grows.
In this work, the issue was addressed from two points of view. From the point of view of a programmer of parallel applications, tools and models were developed to characterise the behaviour of codes and to assist in their application. From the point of view of a user of parallel applications, a migration tool was developed to select and adapt, automatically during execution, the placement of threads in the system in order to improve its performance. All these tools make use of runtime performance data obtained from the Hardware Counters (HCs) present in Intel processors.
Compared with software profilers, HCs provide, with low overhead, detailed and rich performance information about functional units, caches, main memory accesses by the CPU, etc. Another advantage of using them is that no modification of the source code is required. However, the types and meanings of hardware counters vary from one architecture to another because of differences in hardware organisation.
In addition, it can be difficult to correlate low-level performance metrics with the original source code. The limited number of registers available to store the counters can often force users to perform multiple measurements in order to collect all the desired performance metrics.
Specifically, this work used Precise Event Based Sampling (PEBS) on modern Intel processors and the Event Address Registers (EARs) on Itanium 2 processors.
The Itanium 2 processor offers a set of registers, the EARs, that record the instruction and data addresses of cache misses, and the instruction and data addresses of TLB misses [25]. When used to capture cache misses, the EARs allow the detection of latencies greater than 4 cycles. Since floating-point accesses always cause a miss (floating-point data are always stored in the L2D), any access can potentially be detected. The EARs allow statistical sampling by configuring a performance counter to count the occurrences of a given event.
PEBS uses an interrupt mechanism together with the HCs to store a set of information about the architectural state of the processor. This information gives the architectural state of the instruction executed after the instruction that caused the event. Along with this information, which includes the state of all registers, Sandy Bridge processors have a memory latency measurement facility. This is a means of characterising the average load latency for the different levels of the memory hierarchy. The latency is measured from the issue of the instruction until the data are globally observable, that is, when they reach the processor. In addition to the latency, PEBS makes it possible to know the source of the data and the memory level from which they were read. Unlike the EARs, PEBS also allows measuring the latency of integer or data store operations
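As a rough illustration of how sampled latencies could drive thread placement, the sketch below aggregates latency samples per thread and memory node and proposes moving each thread next to the node where it spends most of its sampled latency. This is a hypothetical heuristic for illustration only, not the migration algorithm of the thesis; the sample tuple layout `(thread_id, memory_node, latency_cycles)` and both function names are assumptions, and no real hardware counter API is used.

```python
from collections import Counter, defaultdict


def preferred_nodes(samples):
    """Map each thread to the memory node with the largest summed latency.

    samples: iterable of (thread_id, memory_node, latency_cycles) tuples,
    a hypothetical format standing in for decoded hardware counter records.
    """
    totals = defaultdict(Counter)
    for thread_id, node, latency in samples:
        totals[thread_id][node] += latency
    return {t: per_node.most_common(1)[0][0] for t, per_node in totals.items()}


def proposed_migrations(samples, placement):
    """Threads whose current node differs from their preferred node."""
    target = preferred_nodes(samples)
    return {t: node for t, node in target.items() if placement.get(t) != node}


if __name__ == "__main__":
    # Made-up samples: thread 0 runs on node 0 but mostly loads from node 1.
    samples = [(0, 1, 300), (0, 1, 280), (0, 0, 90), (1, 0, 85), (1, 0, 95)]
    print(proposed_migrations(samples, placement={0: 0, 1: 0}))
```

A real tool would also have to weigh migration cost, contention on the target node, and interactions between threads, none of which this sketch considers.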
Efficient Domain Partitioning for Stencil-based Parallel Operators
Partial Differential Equations (PDEs) are used ubiquitously in modelling natural phenomena. It is generally not possible to obtain an analytical solution and hence they are commonly discretized using schemes such as the Finite Difference Method (FDM) and the Finite Element Method (FEM), converting the continuous PDE to a discrete system of sparse algebraic equations. The solution of this system can be approximated using iterative methods, which are better suited to many sparse systems than direct methods.
In this thesis we use the FDM to discretize linear, second order, Elliptic PDEs and consider parallel implementations of standard iterative solvers. The dominant paradigm in this field is distributed memory parallelism which requires the FDM grid to be partitioned across the available computational cores. The orthodox approach to domain partitioning aims to minimize only the communication volume and achieve perfect load-balance on each core. In this work, we re-examine and challenge this traditional method of domain partitioning and show that for well load-balanced problems, minimizing only the communication volume is insufficient for obtaining optimal domain partitions. To this effect we create a high-level, quasi-cache-aware mathematical model that quantifies cache-misses at the sub-domain level and minimizes them to obtain families of high performing domain decompositions. To our knowledge this is the first work that optimizes domain partitioning by analyzing cache misses, establishing a relationship between cache-misses and domain partitioning.
To place our model in its true context, we identify and qualitatively examine multiple other factors such as the Least Recently Used policy, Cache Line Utilization and Vectorization, that influence the choice of optimal sub-domain dimensions. Since the convergence rate of point iterative methods, such as Jacobi, for uniform meshes is not acceptable at a high mesh resolution, we extend the model to Parallel Geometric Multigrid (GMG). GMG is a multilevel, iterative, optimal algorithm for numerically solving Elliptic PDEs. Adaptive Mesh Refinement (AMR) is another multilevel technique that allows local refinement of a global mesh based on parameters such as error estimates or geometric importance. We study a massively parallel, multiphysics, multi-resolution AMR framework called BoxLib, and implement and discuss our model on single level and adaptively refined meshes, respectively.
We conclude that "close to 2-D" partitions are optimal for stencil-based codes on structured 3-D domains and that it is necessary to optimize for both minimizing cache-misses and communication. We advise that in light of the evolving hardware-software ecosystem, there is an imperative need to re-examine conventional domain partitioning strategies.
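The discretize-then-iterate workflow described in this abstract can be sketched on a much smaller problem than the thesis treats. The example below is a serial, 1-D simplification chosen for brevity: it applies the second-order finite difference stencil to -u'' = f on (0, 1) with zero boundary values and solves the resulting system with Jacobi sweeps. The function name, the constant right-hand side, and the tolerance are assumptions for illustration.

```python
def jacobi_poisson_1d(n, f=2.0, tol=1e-12, max_sweeps=100_000):
    """Jacobi iteration for -u'' = f on (0, 1) with u(0) = u(1) = 0.

    n is the number of interior grid points. Returns (u, sweeps), where u
    includes both boundary values.
    """
    h = 1.0 / (n + 1)
    u = [0.0] * (n + 2)
    for sweep in range(1, max_sweeps + 1):
        new = u[:]
        for i in range(1, n + 1):
            # Three-point stencil solved for the centre value.
            new[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f)
        change = max(abs(a - b) for a, b in zip(new, u))
        u = new
        if change < tol:
            return u, sweep
    return u, max_sweeps


def max_error(u):
    """Largest deviation from the exact solution x * (1 - x) for f = 2."""
    n = len(u) - 2
    h = 1.0 / (n + 1)
    return max(abs(u[i] - (i * h) * (1 - i * h)) for i in range(n + 2))


if __name__ == "__main__":
    for n in (9, 19):
        u, sweeps = jacobi_poisson_1d(n)
        print(f"n={n}: {sweeps} sweeps, max error {max_error(u):.2e}")
```

Doubling the resolution increases the number of sweeps needed for the same tolerance, which is the slow convergence of point iterative methods at high mesh resolution that motivates the move to Geometric Multigrid in the abstract.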
Efficient reconfigurable architectures for 3D medical image compression
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. Recently, the more widespread use of three-dimensional (3-D) imaging modalities,
such as magnetic resonance imaging (MRI), computed tomography (CT), positron
emission tomography (PET), and ultrasound (US), has generated a massive amount
of volumetric data. These have provided an impetus to the development of other
applications, in particular telemedicine and teleradiology. In these fields, medical
image compression is important since both efficient storage and transmission of data
through high-bandwidth digital communication lines are of crucial importance.
Despite their advantages, most 3-D medical imaging algorithms are computationally intensive, with matrix transformation as the most fundamental operation involved in the transform-based methods. Therefore, there is a real need for high-performance systems, whilst keeping architectures flexible to allow
for quick upgradeability with real-time applications. Moreover, in order to obtain
efficient solutions for large medical data volumes, an efficient implementation of
these operations is of significant importance. Reconfigurable hardware, in the form of field programmable gate arrays (FPGAs), has been proposed as a viable system
building block in the construction of high-performance systems at an economical price.
Consequently, FPGAs seem an ideal candidate to harness and exploit their inherent
advantages such as massive parallelism capabilities, multimillion gate counts, and
special low-power packages. The key achievements of the work presented in this thesis are summarised as follows. Two architectures for 3-D Haar wavelet transform (HWT) have been proposed based on transpose-based computation and partial reconfiguration suitable for 3-D medical imaging applications. These applications require continuous hardware servicing, and as a result dynamic partial reconfiguration (DPR) has been introduced. Comparative study for both non-partial and partial reconfiguration implementation has shown that DPR offers many advantages and leads to a compelling solution for implementing computationally intensive applications such as 3-D medical image compression. Using DPR, several large systems are mapped to small hardware resources, and the area, power consumption as well as maximum frequency are
optimised and improved. Moreover, an FPGA-based architecture of the finite Radon transform (FRAT) with three design strategies has been proposed: direct implementation of pseudo-code with a sequential or pipelined description, and a block random access memory (BRAM)-based method. An analysis with various medical imaging modalities has been carried out. The image de-noising implementation using FRAT shows
promising results in reducing Gaussian white noise in medical images. In terms of
hardware implementation, promising trade-offs on maximum frequency, throughput
and area are also achieved. Furthermore, a novel hardware implementation of 3-D medical image compression system with context-based adaptive variable length coding (CAVLC)
has been proposed. An evaluation of the 3-D integer transform (IT) and the discrete
wavelet transform (DWT) with lifting scheme (LS) for transform blocks reveals that
3-D IT demonstrates better computational complexity than the 3-D DWT, whilst
the 3-D DWT with LS exhibits a lossless compression that is significantly useful for
medical image compression. Additionally, an architecture of CAVLC that is capable
of compressing high-definition (HD) images in real-time without any buffer between
the quantiser and the entropy coder is proposed. Through a judicious parallelisation, promising results have been obtained with limited resources. In summary, this research tackles the issues of massive 3-D medical data volumes that require compression, as well as hardware implementation to accelerate the
slowest operations in the system. The results obtained also reveal a significant achievement in terms of architecture efficiency and application performance.
Ministry of Higher Education Malaysia (MOHE), Universiti Tun Hussein Onn Malaysia (UTHM) and the British Council
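The lossless property attributed to the lifting scheme in this abstract can be illustrated in software with an integer Haar transform written as lifting steps. This is a minimal 1-D sketch under stated assumptions: the abstract does not say which wavelet filter the thesis uses, so Haar is chosen here for brevity, the function names are hypothetical, and the input length is assumed even. A 3-D version would apply the same step along each axis in turn.

```python
def haar_lift_forward(x):
    """One level of integer Haar lifting on an even-length list of ints."""
    evens, odds = x[0::2], x[1::2]
    # Predict step: detail is the difference within each pair.
    detail = [o - e for e, o in zip(evens, odds)]
    # Update step: approximation is the floored pair average.
    approx = [e + d // 2 for e, d in zip(evens, detail)]
    return approx, detail


def haar_lift_inverse(approx, detail):
    """Undo haar_lift_forward exactly by reversing the lifting steps."""
    evens = [s - d // 2 for s, d in zip(approx, detail)]
    odds = [d + e for e, d in zip(evens, detail)]
    x = [0] * (2 * len(evens))
    x[0::2], x[1::2] = evens, odds
    return x


if __name__ == "__main__":
    samples = [12, 15, 200, 7, 0, 255, 33, 33]  # made-up 8-bit values
    approx, detail = haar_lift_forward(samples)
    print(approx, detail)
    print(haar_lift_inverse(approx, detail) == samples)
```

Because every lifting step only adds or subtracts an integer computed from values that are still available at the inverse, the reconstruction is exact, and the arithmetic reduces to additions, subtractions, and shifts.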