
    Breadth First Search Vectorization on the Intel Xeon Phi

    Breadth First Search (BFS) is a building block for graph algorithms and has recently been used for large-scale analysis of information in a variety of applications, including social networks, graph databases and web searching. Due to its importance, a number of different parallel programming models and architectures have been exploited to optimize BFS. However, the irregular memory access patterns and the unstructured nature of large graphs make its efficient parallelization a challenge. The Xeon Phi is a massively parallel architecture available as an off-the-shelf accelerator, which includes a powerful 512-bit vector unit with optimized scatter and gather instructions. Given its potential benefits, graph traversal on this architecture is an active area of research. We present a set of experiments in which we explore architectural features of the Xeon Phi and how best to exploit them in a top-down BFS algorithm; the techniques can also be applied to the current state-of-the-art hybrid (top-down plus bottom-up) algorithms. We focus on exploiting the vector unit by developing an improved, highly vectorized OpenMP-parallel algorithm using vector intrinsics, and on understanding the use of data alignment and prefetching. In addition, we investigate the impact of hyperthreading and thread affinity on performance, a topic that appears under-researched in the literature. As a result, we achieve what we believe is the fastest published top-down BFS algorithm on the version of the Xeon Phi used in our experiments. The vectorized top-down BFS source code presented in this paper is available on request as free-to-use software.
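
    As a concrete illustration of the gather/scatter approach described above, the following is a minimal, hypothetical sketch of one vectorized top-down BFS level expansion over a CSR graph, written with AVX-512F intrinsics and OpenMP. It is not the paper's code: the CSR layout, the function and variable names, and the handling of duplicate frontier entries are all assumptions made for illustration.

        // Hypothetical sketch (not the paper's implementation): one top-down BFS
        // level expansion over a CSR graph using AVX-512F gather/scatter.
        // Compile with AVX-512F and OpenMP support, e.g. -mavx512f -fopenmp.
        #include <immintrin.h>
        #include <cstddef>
        #include <cstdint>
        #include <vector>

        void bfs_level(const int32_t* row_ptr, const int32_t* col_idx,
                       int32_t* dist, const std::vector<int32_t>& frontier,
                       std::vector<int32_t>& next, int32_t level) {
            #pragma omp parallel
            {
                std::vector<int32_t> local_next;   // per-thread chunk of the next frontier
                #pragma omp for schedule(dynamic, 64)
                for (size_t f = 0; f < frontier.size(); ++f) {
                    int32_t u = frontier[f];
                    int32_t e = row_ptr[u], end = row_ptr[u + 1];
                    for (; e + 16 <= end; e += 16) {
                        // Gather 16 neighbor ids, then gather their current distances.
                        __m512i v = _mm512_loadu_si512(col_idx + e);
                        __m512i d = _mm512_i32gather_epi32(v, dist, 4);
                        // Lanes whose neighbor is still unvisited (dist == -1).
                        __mmask16 m = _mm512_cmpeq_epi32_mask(d, _mm512_set1_epi32(-1));
                        // Scatter the new level to those neighbors. Threads may race
                        // here, but all writers store the same value, so the worst
                        // case is a duplicate entry in the next frontier.
                        _mm512_mask_i32scatter_epi32(dist, m, v,
                                                     _mm512_set1_epi32(level + 1), 4);
                        int32_t tmp[16];
                        _mm512_storeu_si512(tmp, v);
                        for (int k = 0; k < 16; ++k)
                            if (m & (1u << k)) local_next.push_back(tmp[k]);
                    }
                    for (; e < end; ++e) {         // scalar remainder loop
                        int32_t w = col_idx[e];
                        if (dist[w] == -1) { dist[w] = level + 1; local_next.push_back(w); }
                    }
                }
                #pragma omp critical
                next.insert(next.end(), local_next.begin(), local_next.end());
            }
        }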

    Evaluation of vectorization potential of Graph500 on Intel's Xeon Phi

    Graph500 is a data-intensive application for high-performance computing, and it is an increasingly important workload because graphs are a core part of most analytics applications. So far, no work has examined whether Graph500 is suitable for vectorization, mostly due to a lack of vector memory instructions for irregular memory accesses. The Xeon Phi is a massively parallel processor recently released by Intel, with new features such as a wide 512-bit vector unit and vector scatter/gather instructions. The Xeon Phi thus allows for a more efficient parallelization of Graph500 combined with vectorization. In this paper we vectorize Graph500 and analyze the impact of vectorization and prefetching on the Xeon Phi. We also show that the combination of parallelization, vectorization and prefetching yields a speedup of 27% over a parallel version with prefetching that does not leverage the vector capabilities of the Xeon Phi. The research leading to these results has received funding from the European Research Council under the European Union's 7th FP (FP/2007-2013) / ERC GA n. 321253. It has been partially funded by the Spanish Government (TIN2012-34557).
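
    The prefetching side of this result is easy to picture: the distance (or visited) array is accessed through an index loaded from the edge list, so the hardware prefetcher cannot predict it, and an explicit software prefetch a fixed distance ahead can hide much of the latency. Below is a minimal, hypothetical sketch of the idea; the function, the names and the prefetch distance are assumptions, not the paper's code.

        // Hypothetical sketch: software prefetching in an irregular edge loop.
        #include <xmmintrin.h>   // _mm_prefetch
        #include <cstdint>

        void relax_edges(const int32_t* col_idx, int32_t* dist,
                         int32_t beg, int32_t end, int32_t new_level) {
            const int32_t PDIST = 16;   // assumed prefetch distance, tuned per machine
            for (int32_t e = beg; e < end; ++e) {
                // Prefetch the dist[] entry we will touch PDIST iterations from now.
                if (e + PDIST < end)
                    _mm_prefetch((const char*)&dist[col_idx[e + PDIST]], _MM_HINT_T0);
                int32_t v = col_idx[e];
                if (dist[v] == -1) dist[v] = new_level;
            }
        }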

    Performance Characterization of Multi-threaded Graph Processing Applications on Intel Many-Integrated-Core Architecture

    Intel Xeon Phi many-integrated-core (MIC) architectures usher in a new era of terascale integration. Among emerging killer applications, parallel graph processing has been a critical technique for analyzing connected data. In this paper, we empirically evaluate various computing platforms, including an Intel Xeon E5 CPU, an Nvidia GeForce GTX 1070 GPU and a Xeon Phi 7210 processor codenamed Knights Landing (KNL), in the domain of parallel graph processing. We show that the KNL achieves encouraging performance when processing graphs, so that it can become a promising solution for accelerating multi-threaded graph applications. We further characterize the impact of KNL architectural enhancements on the performance of a state-of-the-art graph framework. We have four key observations: (1) different graph applications require distinct numbers of threads to reach peak performance, and for the same application, different datasets need different numbers of threads to achieve the best performance; (2) only a few graph applications benefit from the high-bandwidth MCDRAM, while others favor the low-latency DDR4 DRAM; (3) the vector processing units executing AVX512 SIMD instructions on the KNL are underutilized when running the state-of-the-art graph framework; (4) the sub-NUMA cache clustering mode, which offers the lowest local memory access latency, hurts the performance of graph benchmarks that lack NUMA awareness. Finally, we suggest future work, including system auto-tuning tools and graph framework optimizations, to fully exploit the potential of the KNL for parallel graph processing. Comment: published as L. Jiang, L. Chen and J. Qiu, "Performance Characterization of Multi-threaded Graph Processing Applications on Many-Integrated-Core Architecture," 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Belfast, United Kingdom, 2018, pp. 199-20
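
    Observation (2) concerns where the graph's data structures are placed. On KNL, the 16 GB of MCDRAM can be requested explicitly through the memkind library's hbwmalloc interface; the sketch below shows the general pattern of opting a hot array into MCDRAM with a DDR4 fallback. hbwmalloc is a real API, but the allocation strategy and names here are illustrative assumptions, not taken from the paper.

        // Hypothetical sketch: placing a hot vertex array in KNL MCDRAM via
        // memkind's hbwmalloc interface, falling back to DDR4 when absent.
        #include <hbwmalloc.h>
        #include <cstddef>
        #include <cstdint>
        #include <cstdio>

        int32_t* alloc_vertex_data(std::size_t n) {
            // Caller must release with hbw_free() or delete[], matching the branch taken.
            if (hbw_check_available() == 0)   // 0 means high-bandwidth memory exists
                return static_cast<int32_t*>(hbw_malloc(n * sizeof(int32_t)));
            std::fprintf(stderr, "MCDRAM unavailable, falling back to DDR4\n");
            return new int32_t[n];
        }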

    Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi

    Intel Xeon Phi is a recently released high-performance coprocessor featuring 61 cores, each supporting 4 hardware threads, with 512-bit wide SIMD registers, for a theoretical peak performance of 1 Tflop/s in double precision. Many scientific applications involve operations on large sparse matrices, such as linear solvers, eigensolvers, and graph mining algorithms. The core of most of these applications involves the multiplication of a large sparse matrix with a dense vector (SpMV). In this paper, we investigate the performance of the Xeon Phi coprocessor for SpMV. We first provide a comprehensive introduction to this new architecture and analyze its peak performance with a number of microbenchmarks. Although the design of a Xeon Phi core is not much different from those of the cores in modern processors, its large number of cores and its hyperthreading capability allow many applications to saturate the available memory bandwidth, which is not the case for many cutting-edge processors. Yet, our performance studies show that it is memory latency, not bandwidth, that creates a bottleneck for SpMV on this architecture. Finally, our experiments show that the Xeon Phi's sparse kernel performance is very promising, and even better than that of cutting-edge general-purpose processors and GPUs.
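
    For reference, the kernel under study is small enough to state in full. A minimal, hypothetical CSR SpMV in OpenMP follows (names and layout are assumptions); the indirect load of x[col_idx[e]] in the inner loop is exactly the irregular, latency-bound access the paper identifies as the bottleneck.

        // Hypothetical sketch: y = A*x for an n-row sparse matrix in CSR format.
        #include <cstdint>

        void spmv_csr(int n, const int32_t* row_ptr, const int32_t* col_idx,
                      const double* vals, const double* x, double* y) {
            #pragma omp parallel for schedule(dynamic, 256)
            for (int i = 0; i < n; ++i) {
                double sum = 0.0;
                for (int32_t e = row_ptr[i]; e < row_ptr[i + 1]; ++e)
                    sum += vals[e] * x[col_idx[e]];   // irregular, latency-bound read
                y[i] = sum;
            }
        }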

    One machine, one minute, three billion tetrahedra

    This paper presents a new scalable parallelization scheme to generate the 3D Delaunay triangulation of a given set of points. Our first contribution is an efficient serial implementation of the incremental Delaunay insertion algorithm. A simple dedicated data structure, an efficient sorting of the points and the optimization of the insertion algorithm have made it possible to accelerate reference implementations by a factor of three. Our second contribution is a multi-threaded version of the Delaunay kernel that is able to insert vertices concurrently. Moore curve coordinates are used to partition the point set, avoiding heavy synchronization overheads. Conflicts are managed by modifying the partitions with a simple rescaling of the space-filling curve. The performance of our implementation has been measured on three different processors, an Intel Core i7, an Intel Xeon Phi and an AMD EPYC, on which we have been able to compute 3 billion tetrahedra in 53 seconds. This corresponds to a generation rate of over 55 million tetrahedra per second. We finally show how this very efficient parallel Delaunay triangulation can be integrated in a Delaunay refinement mesh generator which takes as input the triangulated surface boundary of the volume to mesh.
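
    The partitioning idea rests on sorting the points along a space-filling curve so that nearby points land in the same partition. As a rough illustration, the sketch below sorts points by a 3D Morton (Z-order) key, a simpler relative of the Moore curve the paper actually uses; all names are hypothetical and coordinates are assumed normalized to [0,1).

        // Hypothetical sketch: spatial sort along a space-filling curve, using a
        // Morton (Z-order) key as a stand-in for the paper's Moore curve.
        #include <algorithm>
        #include <cstdint>
        #include <vector>

        struct Point { double x, y, z; };

        // Spread the low 21 bits of v so consecutive bits land 3 positions apart.
        static uint64_t expand_bits(uint64_t v) {
            v &= 0x1fffff;
            v = (v | v << 32) & 0x001f00000000ffffULL;
            v = (v | v << 16) & 0x001f0000ff0000ffULL;
            v = (v | v << 8)  & 0x100f00f00f00f00fULL;
            v = (v | v << 4)  & 0x10c30c30c30c30c3ULL;
            v = (v | v << 2)  & 0x1249249249249249ULL;
            return v;
        }

        // 63-bit Morton key from coordinates in [0,1).
        static uint64_t morton_key(const Point& p) {
            auto q = [](double c) { return (uint64_t)(c * (double)(1 << 21)); };
            return expand_bits(q(p.x)) | (expand_bits(q(p.y)) << 1)
                                       | (expand_bits(q(p.z)) << 2);
        }

        void sort_along_curve(std::vector<Point>& pts) {
            std::sort(pts.begin(), pts.end(), [](const Point& a, const Point& b) {
                return morton_key(a) < morton_key(b);
            });
        }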

    Accelerating Pattern Matching in Neuromorphic Text Recognition System Using Intel Xeon Phi Coprocessor

    Neuromorphic computing systems refer to computing architectures inspired by the working mechanisms of human brains. The rapidly falling cost and increasing performance of state-of-the-art computing hardware allow large-scale implementation of machine intelligence models with neuromorphic architectures and open the opportunity for new applications. One such computing hardware is the Intel Xeon Phi coprocessor, which delivers over a TeraFLOP of computing power with 61 integrated processing cores. How to efficiently harness such computing power to achieve real-time decision making and cognition is one of the key design considerations. This work presents an optimized implementation of the Brain-State-in-a-Box (BSB) neural network model on the Xeon Phi coprocessor for pattern matching in the context of intelligent text recognition of noisy document images. From a scalability standpoint on a High Performance Computing (HPC) platform, we show that efficient workload partitioning and resource management can double the performance of this many-core architecture for neuromorphic applications.
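
    For context, the Brain-State-in-a-Box recall step that dominates this workload is a dense matrix-vector product followed by a piecewise-linear saturation. A minimal, hypothetical sketch of one iteration is shown below; the parameter names and the exact update form are assumptions (BSB variants differ in details), not the paper's tuned Xeon Phi implementation.

        // Hypothetical sketch: one BSB recall iteration,
        //   x(t+1) = S(alpha * A * x(t) + lambda * x(t)),
        // where S clamps each component to the "box" [-1, 1]. Requires C++17.
        #include <algorithm>
        #include <cstddef>
        #include <utility>
        #include <vector>

        void bsb_step(const std::vector<std::vector<double>>& A,
                      std::vector<double>& x, double alpha, double lambda) {
            const std::size_t n = x.size();
            std::vector<double> next(n);
            for (std::size_t i = 0; i < n; ++i) {
                double acc = lambda * x[i];
                for (std::size_t j = 0; j < n; ++j)
                    acc += alpha * A[i][j] * x[j];     // dense mat-vec: the hot loop
                next[i] = std::clamp(acc, -1.0, 1.0);  // saturation S(.)
            }
            x = std::move(next);
        }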

    Tools for improving performance portability in heterogeneous environments

    Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01. Parallel computing is currently partially dominated by the availability of heterogeneous devices. These devices differ from each other in aspects such as the instruction set they execute, the number and type of computing units they offer, or the structure of their memory systems. In recent years, languages, libraries and extensions have appeared that allow a parallel code to be written once and run on a wide variety of devices, OpenCL being the most widespread solution of this kind. However, functional portability does not imply performance portability. Thus, one of the problems that remains open in this field is achieving automatic performance portability, that is, the ability to automatically tune a given code for any device where it will be executed so that it obtains good performance. This thesis develops three different solutions to tackle this problem, all three based on typical source-to-source optimizations for heterogeneous devices. Both the set of optimizations to apply and the way they are applied depend on different optimization parameters, whose values have to be tuned for each specific device. The first solution is OCLoptimizer, a source-to-source optimizer that can optimize annotated OpenCL kernels with the help of configuration files that guide the optimization process. The tool optimizes kernels for a specific device, and it is also able to automate the generation of functional host codes when only a single kernel is optimized. The two remaining solutions are built on top of the Heterogeneous Programming Library (HPL), a C++ framework that provides an easy and portable way to exploit heterogeneous computing systems. The first of these solutions uses the run-time code generation capabilities of HPL to generate a self-optimizing version of a matrix multiplication that can optimize itself at run time for a specific device. The last solution is the development of a built-in just-in-time optimizer for HPL that can optimize, at run time, an HPL code for a specific device. While the first two solutions use search processes to find the best values for the optimization parameters, this last alternative relies on heuristics based on general optimization strategies.
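
    The run-time self-optimization described above can be pictured as a small search over tunable parameters, timing each candidate on the actual device and keeping the best. The sketch below illustrates the idea with a plain C++ tile-size search for a tiled matrix multiplication; it deliberately does not use the real HPL API, and all names are hypothetical.

        // Hypothetical sketch: run-time autotuning of a tile size for matmul,
        // in the spirit of the self-optimizing HPL matrix product above.
        #include <algorithm>
        #include <chrono>
        #include <vector>

        static void matmul_tiled(const std::vector<double>& A,
                                 const std::vector<double>& B,
                                 std::vector<double>& C, int n, int tile) {
            for (int ii = 0; ii < n; ii += tile)
                for (int kk = 0; kk < n; kk += tile)
                    for (int jj = 0; jj < n; jj += tile)
                        for (int i = ii; i < std::min(ii + tile, n); ++i)
                            for (int k = kk; k < std::min(kk + tile, n); ++k)
                                for (int j = jj; j < std::min(jj + tile, n); ++j)
                                    C[i * n + j] += A[i * n + k] * B[k * n + j];
        }

        // Time each candidate tile size once on the target machine; keep the best.
        int autotune_tile(int n) {
            std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n);
            int best = 8;
            double best_t = 1e300;
            for (int tile : {8, 16, 32, 64, 128}) {
                std::fill(C.begin(), C.end(), 0.0);
                auto t0 = std::chrono::steady_clock::now();
                matmul_tiled(A, B, C, n, tile);
                double t = std::chrono::duration<double>(
                               std::chrono::steady_clock::now() - t0).count();
                if (t < best_t) { best_t = t; best = tile; }
            }
            return best;
        }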

    Research and Application of Parallel Computing Algorithms for Statistical Phylogenetic Inference

    Estimating the evolutionary history of organisms, phylogenetic inference, is a critical step in many analyses involving biological sequence data such as DNA. The likelihood calculations at the heart of the most effective methods for statistical phylogenetic analyses are extremely computationally intensive, and hence these analyses become a bottleneck in many studies. Recent progress in computer hardware, specifically the increasing pervasiveness of highly parallel, many-core processors, has created opportunities for new approaches to computationally intensive methods, such as those in phylogenetic inference. We have developed an open source library, BEAGLE, which uses parallel computing methods to greatly accelerate statistical phylogenetic inference, for both maximum likelihood and Bayesian approaches. BEAGLE defines a uniform application programming interface and includes a collection of efficient implementations that use NVIDIA CUDA, OpenCL, and C++ threading frameworks for evaluating likelihoods under a wide variety of evolutionary models, on GPUs as well as on multi-core CPUs. BEAGLE employs a number of different parallelization techniques for phylogenetic inference, at different granularity levels and for distinct processor architectures. On CUDA and OpenCL devices, the library enables concurrent computation of site likelihoods, data subsets, and independent subtrees. The general design features of the library also provide a model for software development using parallel computing frameworks that is applicable to other domains. BEAGLE has been integrated with some of the leading programs in the field, such as MrBayes and BEAST, and is used in a diverse range of evolutionary studies, including those of disease-causing viruses. The library can provide significant performance gains, with the exact increase in performance depending on the specific properties of the data set, evolutionary model, and hardware. In general, nucleotide analyses are accelerated on the order of 10-fold and codon analyses on the order of 100-fold.
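
    The computational core being accelerated here is the "partials" update of Felsenstein's pruning algorithm: for every alignment site and parent state, the likelihood vectors of two child nodes are combined through their branch transition matrices. A minimal, hypothetical sketch of that update is shown below (names and memory layout are illustrative assumptions, not BEAGLE's API); because every site is independent, the outer loop maps naturally onto GPU threads and SIMD lanes.

        // Hypothetical sketch: Felsenstein pruning partial-likelihood update
        // for a parent node with two children, nucleotide model (4 states).
        #include <vector>

        constexpr int S = 4;   // states A, C, G, T

        // child1/child2/parent: n_sites x S partials, flattened row-major.
        // P1/P2: S x S transition matrices for the two child branches.
        void update_partials(const std::vector<double>& child1,
                             const std::vector<double>& child2,
                             const double (&P1)[S][S], const double (&P2)[S][S],
                             std::vector<double>& parent, int n_sites) {
            #pragma omp parallel for         // sites are independent
            for (int s = 0; s < n_sites; ++s) {
                for (int i = 0; i < S; ++i) {
                    double left = 0.0, right = 0.0;
                    for (int j = 0; j < S; ++j) {
                        left  += P1[i][j] * child1[s * S + j];
                        right += P2[i][j] * child2[s * S + j];
                    }
                    parent[s * S + i] = left * right;
                }
            }
        }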

    Algorithmic and Code Optimizations of Molecular Dynamics Simulations for Process Engineering

    The focus of this work lies on implementation improvements and, in particular, node-level performance optimization of the simulation software ls1-mardyn. Through data structure improvements, SIMD vectorization and, especially, OpenMP parallelization, the world's first simulation of 2×10^13 molecules at over 1 PFLOP/s was enabled. To allow for long-range interactions, the Fast Multipole Method was introduced into ls1-mardyn. The algorithm was optimized for sequential, shared-memory, and distributed-memory execution on up to 32,768 MPI processes.
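
    A minimal, hypothetical sketch of the kind of node-level kernel such optimizations target is given below: an OpenMP-parallel, SIMD-annotated Lennard-Jones energy loop. It is a toy all-pairs version with assumed names, not ls1-mardyn code; the real software uses linked cells rather than an O(N^2) loop.

        // Hypothetical sketch: OpenMP + SIMD Lennard-Jones energy over all pairs
        // within cutoff rc (rc2 = rc*rc, sig2 = sigma*sigma).
        #include <cstddef>
        #include <vector>

        double lj_energy(const std::vector<double>& x, const std::vector<double>& y,
                         const std::vector<double>& z, double rc2,
                         double eps, double sig2) {
            const std::ptrdiff_t n = (std::ptrdiff_t)x.size();
            double U = 0.0;
            #pragma omp parallel for reduction(+:U) schedule(dynamic, 32)
            for (std::ptrdiff_t i = 0; i < n; ++i) {
                #pragma omp simd reduction(+:U)
                for (std::ptrdiff_t j = i + 1; j < n; ++j) {
                    double dx = x[i] - x[j], dy = y[i] - y[j], dz = z[i] - z[j];
                    double r2 = dx * dx + dy * dy + dz * dz;
                    if (r2 < rc2) {
                        double s2 = sig2 / r2, s6 = s2 * s2 * s2;
                        U += 4.0 * eps * (s6 * s6 - s6);  // 4eps[(s/r)^12 - (s/r)^6]
                    }
                }
            }
            return U;
        }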