75 research outputs found

    Adaptive Sampling in Particle Image Velocimetry

    Get PDF

    A quasiā€cacheā€aware model for optimal domain partitioning in parallel geometric multigrid

    Get PDF
    Stencil computations form the heart of numerical simulations that solve Partial Differential Equations using Finite Difference, Finite Element, and Finite Volume methods. Geometric Multigrid is an optimal, O(N), hierarchical tool employing stencil computations in its chief constituents, namely smoothing, restriction, and interpolation. When Multigrid is parallelized over distributed-shared memory architectures, the domain partitioning traditionally creates cubic partitions of the mesh to minimize overall communication. Thus, the orthodox approach considers only load-balancing and communication minimization for completely determining the domain partitioning. In this article, we show that these two factors are not sufficient to obtain optimal partitions for Parallel Geometric Multigrid. To this effect, we develop and validate a high-level analytical model to show that "close to 2-D" partitions for Geometric Multigrid can give higher performance than the partitions returned by the MPI_Dims_create() function, which minimizes the communication volume by default. We quantify sub-domain level cache-misses in Parallel Geometric Multigrid and obtain families of optimal domain partitions. We conclude that the sub-domain level cache-misses for the application-specific stencil computational kernel and communicated planes should be taken into account, in addition to communication minimization and load-balance, to obtain optimal partitions for Parallel Geometric Multigrid.
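    As a concrete illustration of the contrast the article draws, the following minimal MPI sketch (hypothetical, not taken from the article) compares the balanced factorization returned by MPI_Dims_create() with a hand-picked "close to 2-D" slab partition; the process counts and dimension choices are assumptions for illustration only.

```c
/* Sketch: contrast the default MPI_Dims_create() factorization with a
 * hand-picked "close to 2-D" partition. Illustrative only; the article's
 * cache-miss model is not reproduced here. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Default: MPI balances the factors, e.g. 64 -> 4 x 4 x 4, which
     * minimizes total communication volume. */
    int dims_default[3] = {0, 0, 0};
    MPI_Dims_create(nprocs, 3, dims_default);

    /* "Close to 2-D": keep one dimension thin so each sub-domain is
     * slab-like, e.g. 64 -> 8 x 8 x 1. The article argues such shapes
     * can incur fewer sub-domain cache misses for stencil sweeps. */
    int dims_flat[3] = {0, 0, 1};          /* fix z-extent to 1 rank   */
    MPI_Dims_create(nprocs, 2, dims_flat); /* factor the rest in 2-D   */

    if (rank == 0)
        printf("default %dx%dx%d vs flat %dx%dx%d\n",
               dims_default[0], dims_default[1], dims_default[2],
               dims_flat[0], dims_flat[1], dims_flat[2]);

    int periods[3] = {0, 0, 0};
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims_flat, periods, 0, &cart);
    /* ... halo exchange and the multigrid cycle would use 'cart' ... */
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```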

    High-performance and hardware-aware computing: proceedings of the first International Workshop on New Frontiers in High-performance and Hardware-aware Computing (HipHaC'08)

    Get PDF
    The HipHaC workshop aims at combining new aspects of parallel, heterogeneous, and reconfigurable microprocessor technologies with concepts of high-performance computing and, particularly, numerical solution methods. Compute- and memory-intensive applications can only benefit from the full hardware potential if all features on all levels are taken into account in a holistic approach.

    Precision analysis for hardware acceleration of numerical algorithms

    No full text
    The precision used in an algorithm affects the error and performance of individual computations, the memory usage, and the potential parallelism for a fixed hardware budget. However, when migrating an algorithm onto hardware, the potential improvements that can be obtained by tuning the precision throughout an algorithm to meet a range or error specification are often overlooked; the major reason is that it is hard to choose a number system which can guarantee that any such specification will be met. Instead, the problem is mitigated by opting to use IEEE standard double precision arithmetic so as to be 'no worse' than a software implementation. However, flexibility in the number representation is one of the key factors that can be exploited on reconfigurable hardware such as FPGAs, and ignoring this potential significantly limits the achievable performance. In order to optimise the performance of hardware reliably, we require a method that can tractably calculate tight bounds for the error or range of any variable within an algorithm. Currently, only a handful of methods to calculate such bounds exist, and these sacrifice either tightness or tractability, whilst simulation-based methods cannot guarantee the given error estimate. This thesis presents a new method to calculate these bounds, taking into account both input ranges and finite precision effects, which we show to be, in general, tighter than existing methods; this in turn can be used to tune the hardware to the algorithm specifications. We demonstrate the use of this software to optimise hardware for various algorithms that accelerate the solution of a system of linear equations, which forms the basis of many problems in engineering and science, and show that significant performance gains can be obtained by using this new approach in conjunction with more traditional hardware optimisations.
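    For background, bounds of the kind the thesis tightens are classically computed with interval arithmetic; the C sketch below is a naive baseline of that idea (an illustrative assumption, not the thesis's method), propagating assumed input ranges and a one-epsilon widening per operation through y = a*x + b.

```c
/* Naive interval-arithmetic range bound for y = a*x + b -- the kind of
 * over-wide baseline that tighter precision-analysis methods improve on.
 * Hypothetical sketch; ranges below are assumptions. */
#include <stdio.h>
#include <float.h>
#include <math.h>

typedef struct { double lo, hi; } interval;

/* Widen an interval outward by one unit of relative rounding error,
 * modelling a single finite-precision operation. */
static interval round_out(interval v)
{
    v.lo -= fabs(v.lo) * DBL_EPSILON;
    v.hi += fabs(v.hi) * DBL_EPSILON;
    return v;
}

static interval iv_add(interval a, interval b)
{
    interval r = { a.lo + b.lo, a.hi + b.hi };
    return round_out(r);
}

static interval iv_mul(interval a, interval b)
{
    double p[4] = { a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi };
    interval r = { p[0], p[0] };
    for (int i = 1; i < 4; i++) {
        if (p[i] < r.lo) r.lo = p[i];
        if (p[i] > r.hi) r.hi = p[i];
    }
    return round_out(r);
}

int main(void)
{
    interval a = { 0.5, 2.0 };   /* assumed input ranges */
    interval x = { -1.0, 1.0 };
    interval b = { 0.0, 0.25 };
    interval y = iv_add(iv_mul(a, x), b);
    /* The bound's width dictates how few mantissa bits suffice; naive
     * intervals over-widen, which is the "tightness" problem at issue. */
    printf("y in [%g, %g]\n", y.lo, y.hi);
    return 0;
}
```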

    Hardware counter based performance analysis, modelling, and improvement through thread migration in NUMA systems

    Get PDF
    [EN] Recent years have seen an important evolution in the computational resources available in science and engineering. Currently, most high performance systems include several multicore processors and use a NUMA (Non Uniform Memory Access) memory architecture. In this context, data locality becomes a highly important issue for the performance of parallel codes. It is foreseeable that the complexity of SMP (Symmetric Multiprocessing) NUMA systems will increase over the next years, in both the number of cores and the complexity of the memory hierarchy, including its various cache levels. Memory access latency will therefore depend, increasingly, on the proximity or affinity of the different threads to the memory modules where their data reside. Improving the performance and scalability of parallel codes on multicore architectures may be quite complex, so memory management in parallel codes will become more complicated, especially from the point of view of a programmer who wishes to obtain the best performance. The problem worsens in the usual case of several processes executing simultaneously. Automatically migrating executing threads among the cores and processors, depending on their behaviour, may improve the performance of parallel programs; furthermore, it may simplify their development, since the programmer avoids explicitly managing locality. Modern microprocessors include registers that give useful information at a low cost, usually known as hardware counters (HCs). HCs are not commonly used due to a lack of tools to easily obtain their data. In modern processors, these HCs make it possible to obtain the memory access latency incurred while resolving cache misses, and even the memory address that caused the event. This opens the door to the development of new techniques for performance improvement based on this information. This thesis presents a procedure to easily and automatically obtain data about the execution of a shared-memory parallel code on SMP multicore and NUMA systems, and to model it using the hardware counters of modern processors alongside additional information, such as the memory access latencies observed by different threads. This procedure is applied during a parallel program's execution, at runtime, to model its performance, and the resulting information is used to improve the efficiency of the execution of said parallel codes automatically and transparently to the user.
    [GL] Nowadays, most computing systems are multicore and even multiprocessor. In these systems, the memory access behaviour of each thread with respect to the different memory nodes is one of the aspects that most significantly affects the performance of any code. This fact becomes ever more relevant as the so-called "memory wall" grows. In this work, this issue has been addressed from two points of view. From the point of view of a parallel application programmer, tools and models have been developed to characterise the behaviour of codes and to assist in their implementation. From the point of view of a parallel application user, a migration tool has been developed to select and adapt, automatically during execution, the placement of threads in the system to improve performance. All these tools make use of runtime performance data obtained from the Hardware Counters (HC) present in Intel processors.
    Compared with software profilers, HCs provide, with low overhead, detailed and rich performance information concerning the functional units, caches, CPU accesses to main memory, and so on. Another advantage of using them is that no modification of the source code is needed. However, the types and meanings of hardware counters vary from one architecture to another due to variations in hardware organisation. Moreover, there may be difficulties in correlating low-level performance metrics with the original source code. The limited number of registers for storing the counters can often force users to perform multiple measurements to collect all the desired performance metrics. Specifically, this work used Precise Event Based Sampling (PEBS) on modern Intel processors and the Event Address Registers (EARs) on Itanium 2 processors. The Itanium 2 processor offers a set of registers, the EARs, that record the instruction and data addresses of cache misses, and the instruction and data addresses of TLB misses [25]. When used to capture cache misses, the EARs allow the detection of latencies greater than 4 cycles. Since floating-point accesses always cause a miss (floating-point data are always stored in the L2D), any access can potentially be detected. The EARs allow statistical sampling by configuring a performance counter to count the occurrences of a given event. PEBS uses an interrupt mechanism with the HCs to store a set of information about the architectural state of the processor. This information reflects the architectural state at the instruction executed after the instruction that caused the event. Along with this information, which includes the state of all the registers, Sandy Bridge processors provide a memory latency measurement mechanism. This is a means to characterise the average load latency for the different levels of the memory hierarchy. The latency is measured from the issue of the instruction until the data are globally observable, that is, when they reach the processor. In addition to the latency, PEBS makes it possible to know the data source and the memory level from which the data were read. Unlike the EARs, PEBS also allows measuring the latency of integer or store operations.
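    To give a flavour of counter-based measurement, the following C sketch uses the PAPI library (a common portable interface to hardware counters; the thesis itself relies on Intel PEBS and Itanium 2 EAR sampling, which additionally report per-access latencies and data addresses) to count L2 cache misses around a strided array walk. The event choice and array size are illustrative assumptions.

```c
/* Minimal hardware-counter read with PAPI. Sketch only: a real thread-
 * migration tool would sample per-thread and per-memory-node. */
#include <papi.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 22)

int main(void)
{
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return 1;
    }

    int evset = PAPI_NULL;
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_L2_TCM);  /* total L2 cache misses */

    double *a = malloc(N * sizeof *a);
    for (long i = 0; i < N; i++) a[i] = (double)i;

    long long misses;
    PAPI_start(evset);
    double sum = 0.0;
    for (long i = 0; i < N; i += 8)      /* strided walk: miss-heavy */
        sum += a[i];
    PAPI_stop(evset, &misses);

    printf("sum=%g  L2 misses=%lld\n", sum, misses);
    free(a);
    return 0;
}
```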

    Efficient Domain Partitioning for Stencil-based Parallel Operators

    Get PDF
    Partial Differential Equations (PDEs) are used ubiquitously in modelling natural phenomena. It is generally not possible to obtain an analytical solution and hence they are commonly discretized using schemes such as the Finite Difference Method (FDM) and the Finite Element Method (FEM), converting the continuous PDE to a discrete system of sparse algebraic equations. The solution of this system can be approximated using iterative methods, which are better suited to many sparse systems than direct methods. In this thesis we use the FDM to discretize linear, second order, Elliptic PDEs and consider parallel implementations of standard iterative solvers. The dominant paradigm in this field is distributed memory parallelism, which requires the FDM grid to be partitioned across the available computational cores. The orthodox approach to domain partitioning aims to minimize only the communication volume and achieve perfect load-balance on each core. In this work, we re-examine and challenge this traditional method of domain partitioning and show that for well load-balanced problems, minimizing only the communication volume is insufficient for obtaining optimal domain partitions. To this effect we create a high-level, quasi-cache-aware mathematical model that quantifies cache-misses at the sub-domain level and minimizes them to obtain families of high performing domain decompositions. To our knowledge this is the first work that optimizes domain partitioning by analyzing cache misses, establishing a relationship between cache-misses and domain partitioning. To place our model in its true context, we identify and qualitatively examine multiple other factors, such as the Least Recently Used policy, Cache Line Utilization, and Vectorization, that influence the choice of optimal sub-domain dimensions. Since the convergence rate of point iterative methods, such as Jacobi, for uniform meshes is not acceptable at a high mesh resolution, we extend the model to Parallel Geometric Multigrid (GMG). GMG is a multilevel, iterative, optimal algorithm for numerically solving Elliptic PDEs. Adaptive Mesh Refinement (AMR) is another multilevel technique that allows local refinement of a global mesh based on parameters such as error estimates or geometric importance. We study a massively parallel, multiphysics, multi-resolution AMR framework called BoxLib, and implement and discuss our model on single level and adaptively refined meshes, respectively. We conclude that "close to 2-D" partitions are optimal for stencil-based codes on structured 3-D domains and that it is necessary to optimize for both minimizing cache-misses and communication. We advise that, in light of the evolving hardware-software ecosystem, there is an imperative need to re-examine conventional domain partitioning strategies.
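    The stencil kernels at the centre of this analysis have a familiar shape; the following generic 7-point Jacobi sweep in C is a sketch of that class of kernels (not code from the thesis), with a comment on why the sub-domain's cross-sectional plane size, rather than its volume alone, drives cache behaviour.

```c
/* One Jacobi sweep with a 7-point stencil over an nx x ny x nz sub-domain
 * for -laplacian(u) = f on a uniform grid with spacing h (h2 = h*h).
 * With x as the unit-stride axis, the sweep's working set is roughly
 * three adjacent ny*nx planes of u; "close to 2-D" sub-domains shrink
 * that plane, keeping more of it cache-resident between iterations. */
#include <stddef.h>

static void jacobi_sweep(int nx, int ny, int nz,
                         const double *u, double *unew, const double *f,
                         double h2)
{
    #define IDX(i, j, k) ((size_t)(k) * ny * nx + (size_t)(j) * nx + (i))
    for (int k = 1; k < nz - 1; k++)
        for (int j = 1; j < ny - 1; j++)
            for (int i = 1; i < nx - 1; i++)   /* unit-stride innermost */
                unew[IDX(i, j, k)] =
                    (u[IDX(i - 1, j, k)] + u[IDX(i + 1, j, k)] +
                     u[IDX(i, j - 1, k)] + u[IDX(i, j + 1, k)] +
                     u[IDX(i, j, k - 1)] + u[IDX(i, j, k + 1)] +
                     h2 * f[IDX(i, j, k)]) / 6.0;
    #undef IDX
}
```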

    Energy Efficiency Models for Scientific Applications on Supercomputers

    Get PDF

    Efficient reconfigurable architectures for 3D medical image compression

    Get PDF
    This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. Recently, the more widespread use of three-dimensional (3-D) imaging modalities, such as magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), and ultrasound (US), has generated a massive amount of volumetric data. These have provided an impetus to the development of other applications, in particular telemedicine and teleradiology. In these fields, medical image compression is important since both efficient storage and transmission of data through high-bandwidth digital communication lines are of crucial importance. Despite their advantages, most 3-D medical imaging algorithms are computationally intensive, with matrix transformation as the most fundamental operation involved in the transform-based methods. Therefore, there is a real need for high-performance systems, whilst keeping architectures flexible to allow for quick upgradeability with real-time applications. Moreover, in order to obtain efficient solutions for large volumes of medical data, an efficient implementation of these operations is of significant importance. Reconfigurable hardware, in the form of field programmable gate arrays (FPGAs), has been proposed as a viable system building block in the construction of high-performance systems at an economical price. Consequently, FPGAs seem an ideal candidate to harness and exploit their inherent advantages, such as massive parallelism capabilities, multimillion gate counts, and special low-power packages. The key achievements of the work presented in this thesis are summarised as follows. Two architectures for the 3-D Haar wavelet transform (HWT) have been proposed, based on transpose-based computation and partial reconfiguration, suitable for 3-D medical imaging applications. These applications require continuous hardware servicing, and as a result dynamic partial reconfiguration (DPR) has been introduced. A comparative study of both non-partial and partial reconfiguration implementations has shown that DPR offers many advantages and leads to a compelling solution for implementing computationally intensive applications such as 3-D medical image compression. Using DPR, several large systems are mapped to small hardware resources, and the area, power consumption, and maximum frequency are optimised and improved. Moreover, an FPGA-based architecture of the finite Radon transform (FRAT) with three design strategies has been proposed: direct implementation of pseudo-code with a sequential or pipelined description, and a block random access memory (BRAM)-based method. An analysis with various medical imaging modalities has been carried out. Results obtained for an image de-noising implementation using FRAT exhibit promising results in reducing Gaussian white noise in medical images. In terms of hardware implementation, promising trade-offs between maximum frequency, throughput, and area are also achieved. Furthermore, a novel hardware implementation of a 3-D medical image compression system with context-based adaptive variable length coding (CAVLC) has been proposed. An evaluation of the 3-D integer transform (IT) and the discrete wavelet transform (DWT) with lifting scheme (LS) for the transform blocks reveals that the 3-D IT demonstrates better computational complexity than the 3-D DWT, whilst the 3-D DWT with LS exhibits lossless compression that is significantly useful for medical image compression.
    Additionally, an architecture of CAVLC that is capable of compressing high-definition (HD) images in real time without any buffer between the quantiser and the entropy coder is proposed. Through judicious parallelisation, promising results have been obtained with limited resources. In summary, this research tackles the issue of massive 3-D medical volume data that require compression, as well as hardware implementation to accelerate the slowest operations in the system. Results obtained also reveal a significant achievement in terms of architecture efficiency and application performance. Funding: Ministry of Higher Education Malaysia (MOHE), Universiti Tun Hussein Onn Malaysia (UTHM) and the British Council.
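    For readers unfamiliar with the transform, a single level of the separable 3-D HWT reduces to the 1-D step below applied along x, then y, then z of the volume; this CPU-side C sketch (an illustrative assumption, unrelated to the FPGA architectures the thesis proposes) uses the averaging variant of the Haar filter.

```c
/* One level of the 1-D Haar wavelet transform on n samples (n even).
 * A separable 3-D HWT applies this step along each axis of the volume. */
#include <string.h>

static void haar_step(double *v, double *tmp, int n)
{
    int half = n / 2;
    for (int i = 0; i < half; i++) {
        tmp[i]        = (v[2 * i] + v[2 * i + 1]) * 0.5;  /* average */
        tmp[half + i] = (v[2 * i] - v[2 * i + 1]) * 0.5;  /* detail  */
    }
    memcpy(v, tmp, (size_t)n * sizeof *v);
}
```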