4,398 research outputs found

    Breadth First Search Vectorization on the Intel Xeon Phi

    Full text link
    Breadth First Search (BFS) is a building block for graph algorithms and has recently been used for large scale analysis of information in a variety of applications including social networks, graph databases and web searching. Due to its importance, a number of different parallel programming models and architectures have been exploited to optimize the BFS. However, due to the irregular memory access patterns and the unstructured nature of the large graphs, its efficient parallelization is a challenge. The Xeon Phi is a massively parallel architecture available as an off-the-shelf accelerator, which includes a powerful 512 bit vector unit with optimized scatter and gather functions. Given its potential benefits, work related to graph traversing on this architecture is an active area of research. We present a set of experiments in which we explore architectural features of the Xeon Phi and how best to exploit them in a top-down BFS algorithm but the techniques can be applied to the current state-of-the-art hybrid, top-down plus bottom-up, algorithms. We focus on the exploitation of the vector unit by developing an improved highly vectorized OpenMP parallel algorithm, using vector intrinsics, and understanding the use of data alignment and prefetching. In addition, we investigate the impact of hyperthreading and thread affinity on performance, a topic that appears under researched in the literature. As a result, we achieve what we believe is the fastest published top-down BFS algorithm on the version of Xeon Phi used in our experiments. The vectorized BFS top-down source code presented in this paper can be available on request as free-to-use software

    Sympiler: Transforming Sparse Matrix Codes by Decoupling Symbolic Analysis

    Full text link
    Sympiler is a domain-specific code generator that optimizes sparse matrix computations by decoupling the symbolic analysis phase from the numerical manipulation stage in sparse codes. The computation patterns in sparse numerical methods are guided by the input sparsity structure and the sparse algorithm itself. In many real-world simulations, the sparsity pattern changes little or not at all. Sympiler takes advantage of these properties to symbolically analyze sparse codes at compile-time and to apply inspector-guided transformations that enable applying low-level transformations to sparse codes. As a result, the Sympiler-generated code outperforms highly-optimized matrix factorization codes from commonly-used specialized libraries, obtaining average speedups over Eigen and CHOLMOD of 3.8X and 1.5X respectively.Comment: 12 page

    Graphic Symbol Recognition using Graph Based Signature and Bayesian Network Classifier

    Full text link
    We present a new approach for recognition of complex graphic symbols in technical documents. Graphic symbol recognition is a well known challenge in the field of document image analysis and is at heart of most graphic recognition systems. Our method uses structural approach for symbol representation and statistical classifier for symbol recognition. In our system we represent symbols by their graph based signatures: a graphic symbol is vectorized and is converted to an attributed relational graph, which is used for computing a feature vector for the symbol. This signature corresponds to geometry and topology of the symbol. We learn a Bayesian network to encode joint probability distribution of symbol signatures and use it in a supervised learning scenario for graphic symbol recognition. We have evaluated our method on synthetically deformed and degraded images of pre-segmented 2D architectural and electronic symbols from GREC databases and have obtained encouraging recognition rates.Comment: 5 pages, 8 figures, Tenth International Conference on Document Analysis and Recognition (ICDAR), IEEE Computer Society, 2009, volume 10, 1325-132

    Evaluation of vectorization potential of Graph500 on Intel's Xeon Phi

    Get PDF
    Graph500 is a data intensive application for high performance computing and it is an increasingly important workload because graphs are a core part of most analytic applications. So far there is no work that examines if Graph500 is suitable for vectorization mostly due a lack of vector memory instructions for irregular memory accesses. The Xeon Phi is a massively parallel processor recently released by Intel with new features such as a wide 512-bit vector unit and vector scatter/gather instructions. Thus, the Xeon Phi allows for more efficient parallelization of Graph500 that is combined with vectorization. In this paper we vectorize Graph500 and analyze the impact of vectorization and prefetching on the Xeon Phi. We also show that the combination of parallelization, vectorization and prefetching yields a speedup of 27% over a parallel version with prefetching that does not leverage the vector capabilities of the Xeon Phi.The research leading to these results has received funding from the European Research Council under the European Unions 7th FP (FP/2007- 2013) / ERC GA n. 321253. It has been partially funded by the Spanish Government (TIN2012-34557)Peer ReviewedPostprint (published version

    A Tuned and Scalable Fast Multipole Method as a Preeminent Algorithm for Exascale Systems

    Full text link
    Among the algorithms that are likely to play a major role in future exascale computing, the fast multipole method (FMM) appears as a rising star. Our previous recent work showed scaling of an FMM on GPU clusters, with problem sizes in the order of billions of unknowns. That work led to an extremely parallel FMM, scaling to thousands of GPUs or tens of thousands of CPUs. This paper reports on a a campaign of performance tuning and scalability studies using multi-core CPUs, on the Kraken supercomputer. All kernels in the FMM were parallelized using OpenMP, and a test using 10^7 particles randomly distributed in a cube showed 78% efficiency on 8 threads. Tuning of the particle-to-particle kernel using SIMD instructions resulted in 4x speed-up of the overall algorithm on single-core tests with 10^3 - 10^7 particles. Parallel scalability was studied in both strong and weak scaling. The strong scaling test used 10^8 particles and resulted in 93% parallel efficiency on 2048 processes for the non-SIMD code and 54% for the SIMD-optimized code (which was still 2x faster). The weak scaling test used 10^6 particles per process, and resulted in 72% efficiency on 32,768 processes, with the largest calculation taking about 40 seconds to evaluate more than 32 billion unknowns. This work builds up evidence for our view that FMM is poised to play a leading role in exascale computing, and we end the paper with a discussion of the features that make it a particularly favorable algorithm for the emerging heterogeneous and massively parallel architectural landscape
    • …
    corecore