11 research outputs found

    High performance computing with FPGAs

    Field-programmable gate arrays represent an army of logical units which can be organized in a highly parallel or pipelined fashion to implement an algorithm in hardware. The flexibility of this new medium creates challenges in finding the right processing paradigm, one that takes into account the natural constraints of FPGAs: clock frequency, memory footprint, and communication bandwidth. In this paper, the use of FPGAs as a multiprocessor on a chip is first compared with their use as a highly functional coprocessor, and the programming tools for hardware/software codesign are discussed. Next, a number of techniques are presented to maximize the parallelism and optimize the data locality in nested loops, including unimodular transformations, data-locality-improving loop transformations, and the use of smart buffers. Finally, the use of these techniques is demonstrated on a number of examples. The results in the paper and in the literature show that, with the proper programming tool set, FPGAs can speed up computation kernels significantly with respect to traditional processors.
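    The loop techniques above are described only at a high level; as a rough C++ illustration of the data-locality idea (not the paper's tool set), the sketch below tiles a matrix multiplication so that each small block can stay resident in fast on-chip memory, the same reuse pattern a smart buffer would exploit on an FPGA. The tile size and the matrix-multiply kernel are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Tiled (blocked) matrix multiplication: a classic data-locality-improving
// loop transformation. C must be zero-initialized by the caller; the tile
// size BS is a tunable, illustrative parameter.
void matmul_tiled(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, std::size_t n, std::size_t BS = 64) {
    for (std::size_t ii = 0; ii < n; ii += BS)
        for (std::size_t kk = 0; kk < n; kk += BS)
            for (std::size_t jj = 0; jj < n; jj += BS)
                // One BS x BS block of A, B and C is touched repeatedly here,
                // so it can stay resident in a cache or an on-chip buffer.
                for (std::size_t i = ii; i < std::min(ii + BS, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + BS, n); ++k)
                        for (std::size_t j = jj; j < std::min(jj + BS, n); ++j)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

    Unimodular transformations such as loop interchange reorder the i/j/k nest in the same spirit without changing the set of iterations performed.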

    General Purpose Computation on Graphics Processing Units Using OpenCL

    Computational science has emerged as a third pillar of science along with theory and experiment, where parallelization for scientific computing is provided by different shared- and distributed-memory architectures such as supercomputer systems, grid- and cluster-based systems, and multi-core and multiprocessor systems. In recent years, the use of GPUs (Graphics Processing Units) for general-purpose computing, commonly known as GPGPU, has made them an exciting addition to high-performance computing (HPC) systems with respect to the price/performance ratio. Current GPUs consist of several hundred computing cores arranged in streaming multiprocessors, so the degree of parallelism is promising. Moreover, the development of new and easy-to-use interfacing tools and programming languages such as OpenCL and CUDA has made GPUs suitable for computation-demanding applications such as micromagnetic simulations. In micromagnetic simulations, the study of magnetic behavior at very small time and space scales demands a huge computation time; the calculation of the magnetostatic field, with O(N log N) complexity when the FFT algorithm is used for the discrete convolution, is the main contribution to the whole simulation time, and it is computed many times at each time step. The observation of magnetization behavior at sub-nanosecond time scales is crucial to a number of areas such as magnetic sensors, non-volatile storage devices, and magnetic nanowires. Since micromagnetic codes are in general suitable for parallel programming, as they can easily be divided into independent parts that run in parallel, the current trend is to shift the computationally intensive parts to GPUs. My PhD work mainly focuses on the development of a highly parallel magnetostatic field solver for micromagnetic simulators on GPUs. I use OpenCL for the GPU implementation, considering that it is an open, cross-platform standard for parallel programming of heterogeneous systems. The magnetostatic field calculation is dominated by the computation of multidimensional FFTs (Fast Fourier Transforms). Therefore, I have developed a specialized OpenCL-based 3D FFT library for the magnetostatic field calculation, which makes it possible to fully exploit the zero-padded input data without transposition, as well as the symmetries inherent in the field calculation. Moreover, it provides a common interface for different vendors' GPUs. In order to fully utilize the GPU's parallel architecture, the code needs to handle many hardware-specific technicalities such as coalesced memory access, data transfer overhead between GPU and CPU, GPU global memory utilization, arithmetic computation, and batch execution. In a second step, to further increase the level of parallelism and performance, I have developed a parallel magnetostatic field solver on multiple GPUs. Utilizing multiple GPUs avoids many of the limitations of a single GPU (e.g., on-chip memory resources) by exploiting the combined resources of multiple on-board GPUs. The GPU implementation has shown an impressive speedup against an equivalent OpenMP-based parallel implementation on the CPU, which means that micromagnetic simulations that require weeks of computation on a CPU can now be performed in hours or even minutes on GPUs. In parallel, I also worked on ordered queue management on GPUs. Ordered queue management is used in many applications, including real-time systems, operating systems, and discrete event simulations. In most cases, the efficiency of an application itself depends on the sorting algorithm used for priority queues. Lately, the use of graphics cards for general-purpose computing has prompted sorting algorithms to be revisited. In this work I have presented an analysis of different sorting algorithms with respect to sorting time, sorting rate, and speedup on different GPU and CPU architectures, and provided a new sorting technique on GPUs.
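    The core of the field solver described above is a discrete convolution evaluated through zero-padded FFTs. The thesis library is OpenCL-based and three-dimensional; purely as a small CPU-side illustration of the zero-padding idea (not the thesis code), the C++ sketch below computes a 1-D linear convolution with a textbook radix-2 FFT, padding both inputs so that the circular convolution of the padded signals equals the linear convolution of the originals.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Textbook recursive radix-2 FFT (invert = true gives the inverse transform).
// This stands in for the thesis's OpenCL 3D-FFT library; it is not taken from it.
void fft(std::vector<cd>& a, bool invert) {
    const std::size_t n = a.size();
    if (n == 1) return;
    std::vector<cd> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; 2 * i < n; ++i) {
        even[i] = a[2 * i];
        odd[i] = a[2 * i + 1];
    }
    fft(even, invert);
    fft(odd, invert);
    const double pi = std::acos(-1.0);
    const double ang = 2.0 * pi / static_cast<double>(n) * (invert ? -1.0 : 1.0);
    cd w(1.0), wn(std::cos(ang), std::sin(ang));
    for (std::size_t i = 0; 2 * i < n; ++i) {
        a[i] = even[i] + w * odd[i];
        a[i + n / 2] = even[i] - w * odd[i];
        if (invert) {                 // scale by 1/2 per level -> 1/n overall
            a[i] /= 2.0;
            a[i + n / 2] /= 2.0;
        }
        w *= wn;
    }
}

// Linear convolution via zero-padded FFTs (assumes non-empty inputs): pad both
// signals to a power of two at least as long as the full result, transform,
// multiply pointwise, and transform back.
std::vector<double> convolve(const std::vector<double>& x, const std::vector<double>& y) {
    std::vector<cd> fx(x.begin(), x.end()), fy(y.begin(), y.end());
    std::size_t n = 1;
    while (n < x.size() + y.size() - 1) n <<= 1;
    fx.resize(n);                     // zero padding
    fy.resize(n);
    fft(fx, false);
    fft(fy, false);
    for (std::size_t i = 0; i < n; ++i) fx[i] *= fy[i];   // pointwise product
    fft(fx, true);                    // inverse transform
    std::vector<double> result(x.size() + y.size() - 1);
    for (std::size_t i = 0; i < result.size(); ++i) result[i] = fx[i].real();
    return result;
}
```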

    Studies on the Impact of Cache Configuration on Multicore Processor

    The demand for a powerful memory subsystem increases with the number of cores in a multicore processor. The techniques adopted to meet this demand are increasing the cache size, increasing the number of cache levels, and employing a powerful interconnection network. Caches feed the processing elements at a faster rate and also provide high-bandwidth local memory to work with. In this research, an attempt has been made to analyze the impact of cache size on the performance of multicore processors by varying the L1 and L2 cache sizes on a multicore processor with internal network (MPIN), also referenced from the NIAGARA architecture. As the number of cores increases, traditional on-chip interconnects like the bus and crossbar prove to be less efficient and suffer from poor scalability. In order to overcome the scalability and efficiency issues of these conventional interconnects, a ring-based design has been proposed. The effect of the interconnect on the performance of multicore processors has been analyzed, and a novel scalable on-chip interconnection mechanism (INoC) for multicore processors has been proposed. The benchmark results are presented using a full-system simulator. Results show that, using the proposed INoC, execution time can be significantly reduced compared with the MPIN. Cache size and set-associativity are the features on which cache performance depends. If the cache size is doubled, cache performance can increase, but at the cost of additional hardware, larger area, and higher power consumption. Moreover, considering the small form factor of mobile processors, an increase in cache size affects the device size and battery running time. Reorganization and reanalysis of the cache configuration of mobile processors are therefore required to achieve better cache performance, lower power consumption, and smaller chip area. With an identical cache size, a performance gain can be obtained from a novel cache mechanism. For simulation, we used the SPLASH-2 benchmark suite.
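    The kind of experiment described above can be pictured with a toy trace-driven cache model: replaying the same memory-address trace through configurations that differ only in size or associativity isolates the effect of that parameter on the hit rate. The C++ sketch below is an illustrative stand-in, not the full-system simulator used in the study, and its geometry parameters are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <vector>

// Toy set-associative cache with LRU replacement. Feed it a memory-address
// trace and compare hit rates across configurations; the geometry chosen by
// the caller (e.g., 32 KiB, 4-way, 64-byte lines) is an illustrative assumption.
class CacheModel {
public:
    CacheModel(std::size_t size_bytes, std::size_t ways, std::size_t line_bytes)
        : ways_(ways), line_bytes_(line_bytes),
          sets_(size_bytes / (ways * line_bytes)), lru_(sets_) {}

    // Returns true on a hit; on a miss the line is filled (evicting the LRU
    // line of the set if it is full) and false is returned.
    bool access(std::uint64_t address) {
        const std::uint64_t line = address / line_bytes_;
        auto& set = lru_[line % sets_];
        for (auto it = set.begin(); it != set.end(); ++it) {
            if (*it == line) {              // hit: move to most-recently-used
                set.erase(it);
                set.push_front(line);
                return true;
            }
        }
        if (set.size() == ways_) set.pop_back();   // evict least-recently-used
        set.push_front(line);                      // fill on miss
        return false;
    }

private:
    std::size_t ways_, line_bytes_, sets_;
    std::vector<std::list<std::uint64_t>> lru_;    // one LRU list per set
};
```

    Doubling size_bytes or ways and replaying the same trace reproduces, at toy scale, the size and associativity comparisons the study performs with a full-system simulator.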

    Investigating Concurrency in the Co-Simulation Orchestration Engine for INTO-CPS

    There is a tendency to expect that taking advantage of multicore systems by using concurrency improves the performance of an application. To investigate whether this is true, a case study was performed in which different concurrency principles were applied to an existing application called the Co-Simulation Orchestration Engine (COE), which did not utilize concurrency. This was explored in the context of Co-Simulation using the Functional Mock-up Interface, as applications executing Co-Simulations should be performant to enable the use of increasingly complex models. Co-Simulation can be useful in the development of Cyber-Physical Systems, as it can be used to simulate coupled technical systems or models and thereby examine the behavior of the systems. The investigation was carried out by refactoring the COE to make it suitable for implementing concurrency: limiting the spawning of threads and the synchronization between threads, and maximizing the workload of each thread. Three different concurrency features were used in three different implementations: parallel collections, futures, and actors, which were evaluated based on selected quality attributes. These implementations were tested against the non-refactored sequential COE and against each other by performing different simulations using different models. The case study showed that concurrency can be used to increase the performance of the COE in some cases. Based on the analysis carried out in this thesis project, a set of guidelines was created to generalize the process of applying concurrency to an existing application.
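    Of the three concurrency features evaluated, futures are the easiest to sketch. The snippet below (in C++ rather than the COE's own implementation language, and with a hypothetical SimUnit type standing in for an FMI simulation unit) launches one asynchronous step per unit and joins them all before the next macro step, mirroring the refactoring goal of few synchronization points and a large workload per thread.

```cpp
#include <functional>
#include <future>
#include <vector>

// Hypothetical stand-in for one co-simulated unit; doStep advances the unit
// from the current time by the given step size.
struct SimUnit {
    std::function<void(double, double)> doStep;
};

// One macro step executed with futures: every unit's step runs asynchronously,
// and the orchestrator waits for all of them before exchanging data and
// starting the next step.
void run_step_parallel(std::vector<SimUnit>& units, double time, double stepSize) {
    std::vector<std::future<void>> pending;
    pending.reserve(units.size());
    for (auto& unit : units)
        pending.push_back(std::async(std::launch::async, unit.doStep, time, stepSize));
    for (auto& f : pending)
        f.get();   // single synchronization point per macro step
}
```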

    Parallel implementation of maximum parsimony search algorithm on multicore CPUs

    Phylogenetics is the study of the evolutionary relationships among species. The term is derived from the ancient Greek words phylon, meaning 'race', and genetikos, meaning 'relative to birth'. An important methodology in phylogenetics is cladistics (parsimony), applied to the study of taxonomic classification. Modern studies include as source data aspects of molecular biology, such as the DNA sequences of homologous (orthologous) genes. The algorithms used attempt to reconstruct evolutionary relationships in the form of phylogenetic trees, based on the available morphological data, behavioral data, and usually DNA sequence data (Fitch, W. M., 1971). The topic of this thesis is the parallel implementation of an existing algorithm called Maximum Parsimony, a search for a guaranteed optimal tree (or trees) based on the fewest number of mutations required for tree construction. The algorithm's cost grows linearly with DNA sequence length and combinatorially with the number of organisms studied (Felsenstein, J., The number of evolutionary trees, 1978); the algorithm may take hours to complete. Current implementations such as PAUP are limited to just one CPU core, even if eight are available; this parallel implementation may use as many cores as are available. The method of research is to replicate the accuracy of existing serial software, parallelize the algorithm across many cores without losing accuracy, optimize by various methods, and then attempt to port to other hardware architectures. Some time is spent on the implementation of the algorithms on GPUs and clusters. The results are that, while this implementation matches the accuracy of the current standard and speeds up in parallel, it does not presently match the speed of PAUP, for reasons yet to be determined.
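    Parsimony scoring itself is typically done with Fitch's algorithm (Fitch, 1971): a bottom-up pass that intersects the children's candidate state sets and counts one mutation whenever the intersection is empty. The C++ sketch below scores a single site on a rooted binary tree; the flat, post-ordered node encoding is an illustrative assumption, not the thesis implementation.

```cpp
#include <cstdint>
#include <vector>

// One node of a rooted binary tree, flattened so that every internal node
// appears after both of its children (post order). States are bitmasks over
// the four DNA bases (A=1, C=2, G=4, T=8); leaves carry the observed state.
struct Node {
    int left = -1, right = -1;   // child indices; -1 marks a leaf
    std::uint8_t states = 0;
};

// Fitch small-parsimony scoring of a single site: intersect the children's
// state sets, and when the intersection is empty take the union and count one
// mutation. The return value is the parsimony length of this tree at this site.
int fitch_score(std::vector<Node>& tree) {
    int mutations = 0;
    for (auto& node : tree) {
        if (node.left < 0) continue;                   // leaf: nothing to do
        const std::uint8_t l = tree[node.left].states;
        const std::uint8_t r = tree[node.right].states;
        const std::uint8_t common = l & r;
        if (common != 0) {
            node.states = common;
        } else {
            node.states = l | r;
            ++mutations;
        }
    }
    return mutations;
}
```

    Because individual sites, and candidate trees in the search, are scored independently, they can be distributed across cores, which is one natural place to introduce the multicore parallelism described above.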

    CPU Scheduling for Power/Energy Management on Heterogeneous Multicore Processors

    Power and energy have become increasingly important concerns in the design and implementation of today's multicore/manycore chips. Many methods have been proposed to reduce a microprocessor's power usage and associated heat dissipation, including scaling a core's operating frequency. However, these techniques do not consider the dynamic performance characteristics of an executing process at runtime, the execution characteristics of the entire task to which the process belongs, the process's priority, the process's cache miss/cache reference ratio, the number of context switches and CPU migrations generated by the process, or the system load. Also, many of the techniques that employ dynamic frequency scaling can lower a core's frequency during the execution of a non-CPU-intensive task, thus lowering performance. In addition, many of these methods require specialized hardware and have not been tested on real hardware that is widely available, including recent AMD or Intel multicore chips. One problem in power/energy management for heterogeneous multicore processors is the following: given a set of processes, each having identical default priorities, in a task to be executed by a heterogeneous multicore/manycore processor system, schedule each process in the task to execute on the CPU(s) of the system such that the global power budget is minimized, yet the performance gain of all processes is maximized and the performance loss of all processes is minimized. Doing so in a scenario where each process has a different (not necessarily unique) static or dynamic priority (not necessarily the default), without adversely affecting the process completion order dictated by process priority, is yet another problem. Finally, utilizing the cache miss/cache reference ratio and the number of context switches and CPU migrations as scheduling criteria are two other problems. This dissertation elaborates upon these four problems and describes our four approaches to solving them.
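    As a purely illustrative sketch of using the criteria listed above (priority and the cache miss/reference ratio) in a placement decision, and not the dissertation's scheduler, the toy C++ policy below honors priority first and then gives the fast cores to CPU-bound processes while leaving memory-bound ones on slower, lower-power cores. The struct fields, the priority convention, and the two-tier core model are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-process runtime metrics of the kind listed above.
struct ProcInfo {
    std::string name;
    int priority;        // smaller value = higher priority (an assumption)
    double miss_ratio;   // cache misses / cache references
};

// Toy placement policy: honor priority first, then give the fast cores to
// CPU-bound processes (low miss ratio) and leave memory-bound processes,
// which stall on memory anyway, on the slower, lower-power cores.
std::vector<std::pair<std::string, const char*>>
assign_cores(std::vector<ProcInfo> procs, std::size_t fast_cores) {
    std::sort(procs.begin(), procs.end(), [](const ProcInfo& a, const ProcInfo& b) {
        if (a.priority != b.priority) return a.priority < b.priority;
        return a.miss_ratio < b.miss_ratio;    // CPU-bound processes first
    });
    std::vector<std::pair<std::string, const char*>> placement;
    for (std::size_t i = 0; i < procs.size(); ++i)
        placement.emplace_back(procs[i].name,
                               i < fast_cores ? "fast core" : "slow core");
    return placement;
}
```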

    Efficient implementation of high-level constructs for concurrent programming

    Thesis (PhD in Computer Science)--Universidad Nacional de Córdoba, Facultad de Matemática, Astronomía y Física, 2011. This doctoral thesis presents automatic methods for improving implementations of conditional critical regions and of monitors with automatic signaling, using SMT-solver theorem provers (CVC) and higher-order logic provers (Isabelle/Isar) within abstract interpretation techniques. In the case of conditional critical regions, the proposal is applied to automatic implementations produced by the Split Binary Semaphores technique developed by E. W. Dijkstra. In the case of monitors, the method improves implementations with explicit signaling. The results provide efficient high-level constructs that allow concurrent programs to be developed simply and correctly. Damián Barsotti
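    For context, an explicit-signaling monitor is the lower-level form such high-level constructs are compiled to. The C++ sketch below shows a standard bounded buffer in that style; with conditional critical regions or automatic signaling, the programmer states only the guards and an implementation like this, including the notify calls, must be derived, which is the kind of implementation the thesis improves automatically. The example is generic, not the thesis's generated code.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

// A standard explicit-signaling monitor: a bounded buffer protected by a mutex,
// with one condition variable per waiting condition and explicit notify calls.
// With conditional critical regions or automatic signaling, the programmer
// would only state the guards (buffer not full / not empty).
class BoundedBuffer {
public:
    explicit BoundedBuffer(std::size_t capacity) : capacity_(capacity) {}

    void put(int value) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [&] { return items_.size() < capacity_; });
        items_.push_back(value);
        not_empty_.notify_one();       // explicit signal
    }

    int take() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [&] { return !items_.empty(); });
        int value = items_.front();
        items_.pop_front();
        not_full_.notify_one();        // explicit signal
        return value;
    }

private:
    std::mutex mutex_;
    std::condition_variable not_full_, not_empty_;
    std::deque<int> items_;
    const std::size_t capacity_;
};
```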

    Predicting software performance in symmetric multi-core and multiprocessor environments

    With today's rise of multi-core processors, concurrency becomes a ubiquitous challenge in software development. Performance prediction methods have to reflect the influence of multiprocessing environments on software performance in order to help software architects to find potential performance problems during early development phases. In this thesis, we address the influence of the operating system scheduler on software performance in symmetric multiprocessing environments.
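    As a self-contained illustration of why the scheduler matters for such predictions, and not the prediction approach developed in this thesis, the coarse C++ model below advances a set of CPU-bound tasks round-robin over a fixed number of cores; changing the core count or the time slice changes every task's predicted completion time, which is exactly the influence a prediction method has to capture.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Coarse round-robin model: `demands` holds each task's total CPU demand (ms),
// all tasks arrive at time zero, and up to `cores` tasks run one time slice
// per round. Returns each task's approximate completion time. This is only an
// illustration of scheduler influence, not the thesis's prediction technique.
std::vector<double> round_robin_completion(std::vector<double> demands,
                                           std::size_t cores, double slice) {
    std::deque<std::size_t> ready;
    for (std::size_t i = 0; i < demands.size(); ++i) ready.push_back(i);
    std::vector<double> finish(demands.size(), 0.0);
    double now = 0.0;
    while (!ready.empty()) {
        std::vector<std::size_t> running;              // tasks picked this round
        for (std::size_t c = 0; c < cores && !ready.empty(); ++c) {
            running.push_back(ready.front());
            ready.pop_front();
        }
        double round_length = 0.0;
        for (std::size_t t : running) {
            const double used = demands[t] < slice ? demands[t] : slice;
            demands[t] -= used;
            if (used > round_length) round_length = used;
            if (demands[t] <= 0.0) finish[t] = now + used;   // task is done
            else ready.push_back(t);                         // preempted, requeued
        }
        now += round_length;
    }
    return finish;
}
```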