561 research outputs found

Multi-Softcore Architecture on FPGA

Author: Mohamed Abid
Mouna Baklouti
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2014

To meet the high performance demands of embedded multimedia applications, embedded systems are integrating multiple processing units. However, they are mostly based on a custom-logic design methodology. Designing parallel multicore systems from available standard intellectual property (IP) cores while maintaining high performance is also challenging. Softcore processors and field-programmable gate arrays (FPGAs) are a cheap and fast option for developing and testing such systems. This paper describes an FPGA-based design methodology for rapidly prototyping parametric multicore systems. A study of the viability of building the SoC with the NIOS II soft-processor core from Altera is also presented. The NIOS II features a general-purpose RISC CPU architecture designed to address a wide range of applications. The performance of the implemented architecture is discussed, and several parallel applications are used to test the speedup and efficiency of the system. Experimental results show that the proposed multicore system achieves better speedup than the GPU (29.5% faster for the FIR filter and 23.6% faster for matrix-matrix multiplication).
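
The speedup and efficiency mentioned in this abstract are usually computed from serial and parallel run times. The following is a minimal sketch of those standard definitions; the function names and the timings in the usage lines are illustrative assumptions, not values from the paper.

```python
def speedup(t_serial, t_parallel):
    """Speedup: how many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_cores):
    """Parallel efficiency: speedup divided by the number of cores used."""
    return speedup(t_serial, t_parallel) / n_cores


# Hypothetical timings (seconds) for a run on 8 cores, for illustration only.
s = speedup(8.0, 2.0)
e = efficiency(8.0, 2.0, 8)
print(f"speedup={s:.2f}, efficiency={e:.2f}")
```

An efficiency close to 1 would indicate that added cores are being used almost fully; lower values point to communication or synchronization overhead.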

Crossref

Directory of Open Access Journals

An autotuning framework for Intel Xeon Phi platforms

Author: Christoforidis Eleftherios - Iordanis
Χριστοφορίδης Ελευθέριος - Ιορδάνης
Publication date: 15/09/2016

DSpace at NTUA

Scalable Breadth-First Search on a GPU Cluster

Author: Owens John D.
Pan Yuechao
Pearce Roger
Publication date: 13/03/2018

On a GPU cluster, the ratio of high computing power to communication bandwidth makes scaling breadth-first search (BFS) on a scale-free graph extremely challenging. By separating high and low out-degree vertices, we present an implementation with scalable computation and a model for scalable communication for BFS and direction-optimized BFS. Our communication model uses global reduction for high-degree vertices, and point-to-point transmission for low-degree vertices. Leveraging the characteristics of degree separation, we reduce the graph size to one third of the conventional edge list representation. With several other optimizations, we observe linear weak scaling as we increase the number of GPUs, and achieve 259.8 GTEPS on a scale-33 Graph500 RMAT graph with 124 GPUs on the latest CORAL early access system.
Comment: 12 pages, 13 figures. To appear at IPDPS 201
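
The degree-separation idea in this abstract can be illustrated with a small single-process sketch. It only shows how vertices might be split by out-degree and how a frontier could be routed to the two communication patterns the abstract names. The threshold value, function names, and the toy graph are assumptions for illustration; the paper's actual multi-GPU implementation is not reproduced here.

```python
from collections import defaultdict


def separate_by_out_degree(edges, threshold):
    """Split vertices into high and low out-degree sets (threshold is hypothetical)."""
    out_degree = defaultdict(int)
    vertices = set()
    for src, dst in edges:
        out_degree[src] += 1
        vertices.update((src, dst))
    high = {v for v in vertices if out_degree[v] > threshold}
    low = vertices - high
    return high, low


def plan_communication(frontier, high, low):
    """Route frontier vertices: high-degree ones to a global reduction,
    low-degree ones to point-to-point transmission."""
    return {
        "global_reduction": sorted(v for v in frontier if v in high),
        "point_to_point": sorted(v for v in frontier if v in low),
    }


# Toy directed graph: vertex 0 has out-degree 3, all others at most 1.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (4, 0)]
high, low = separate_by_out_degree(edges, threshold=2)
plan = plan_communication([0, 1, 4], high, low)
print(high, low, plan)
```

In a scale-free graph only a few vertices would land in the high-degree set, which is why handling them collectively can be attractive.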

arXiv.org e-Print Archive

Crossref

eScholarship - University of California

Performance Comparison Of Two Data Mining Algorithms On Big Data Platforms

Author: Raju Md Rajiur Rahman
Publication venue: DigitalCommons@WayneState
Publication date: 01/01/2015

In this big data era, the need for performing large-scale computations is evident. A better understanding of the most suitable platforms which can efficiently run these computations is needed. In this thesis, we attempt to compare four such big data platforms, namely Hadoop, Spark, GPU, and Multicore CPU. We compare these platforms using two prominent data mining algorithms, namely K-means clustering and K-nearest neighbour classification, and discuss specific implementation-level details. We provide several insights into the best possible implementations of these algorithms and systematically compare the benefits and drawbacks of each of these platforms. We conduct experiments by varying data size and parameters to obtain the runtime and scalability performance of these platforms. Our experiments show that GPU and Multicore CPU are faster but have certain limitations. On the other hand, Hadoop and Spark are able to handle large-scale datasets. We also observe that Spark performs better than Hadoop for both iterative and non-iterative jobs. In summary, we have examined different characteristics of four big data platforms and provided a comparative analysis for the cases of two algorithms. Since many other data mining algorithms use these two methods either during pre-processing or as an integral component, we hope that our analysis will have an impact on many other applications and algorithms beyond the ones reported in this thesis.
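
For readers unfamiliar with one of the two algorithms compared, here is a minimal single-threaded sketch of K-nearest neighbour classification by brute-force linear scan. It is not the thesis's Hadoop, Spark, GPU, or multicore implementation; the function name and the toy data are assumptions for illustration.

```python
import math
from collections import Counter


def knn_classify(train, query, k):
    """Label `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-D dataset with two well-separated classes (illustrative only).
train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
label_near_origin = knn_classify(train, (0.2, 0.1), k=3)
label_near_cluster_b = knn_classify(train, (5.5, 5.5), k=3)
print(label_near_origin, label_near_cluster_b)
```

The distance computations for different queries are independent of each other, which is one reason the algorithm maps naturally onto parallel platforms.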

Digital Commons@Wayne State University

A domain-specific language and matrix-free stencil code for investigating electronic properties of Dirac and topological materials

Author: Fehske Holger
Hager Georg
Pieper Andreas
Publication venue: 'SAGE Publications'
Publication date: 29/09/2020

We introduce PVSC-DTM (Parallel Vectorized Stencil Code for Dirac and Topological Materials), a library and code generator based on a domain-specific language tailored to implement the specific stencil-like algorithms that can describe Dirac and topological materials such as graphene and topological insulators in a matrix-free way. The generated hybrid-parallel (MPI+OpenMP) code is fully vectorized using Single Instruction Multiple Data (SIMD) extensions. It is significantly faster than matrix-based approaches on the node level and performs in accordance with the roofline model. We demonstrate the chip-level performance and distributed-memory scalability of basic building blocks such as sparse matrix-(multiple-) vector multiplication on modern multicore CPUs. As an application example, we use the PVSC-DTM scheme to (i) explore the scattering of a Dirac wave on an array of gate-defined quantum dots, to (ii) calculate a bunch of interior eigenvalues for strong topological insulators, and to (iii) discuss the photoemission spectra of a disordered Weyl semimetal.
Comment: 16 pages, 2 tables, 11 figure
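
The roofline model referenced in this abstract bounds attainable performance by the smaller of peak compute throughput and memory bandwidth times arithmetic intensity. The sketch below shows that bound only; the function name and the machine numbers are hypothetical and are not taken from the paper.

```python
def roofline_gflops(peak_gflops, bandwidth_gbytes_per_s, intensity_flops_per_byte):
    """Attainable performance under the basic roofline model:
    min(peak compute, memory bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbytes_per_s * intensity_flops_per_byte)


# Hypothetical machine: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
memory_bound = roofline_gflops(1000.0, 100.0, 2.0)    # low-intensity kernel
compute_bound = roofline_gflops(1000.0, 100.0, 20.0)  # high-intensity kernel
print(memory_bound, compute_bound)
```

Stencil-like kernels often have low arithmetic intensity, so under this model their performance would typically be limited by memory bandwidth rather than peak compute.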

arXiv.org e-Print Archive

Crossref

Publication Server of Greifswald University

Multi-Softcore Architecture on FPGA

Author: Mohamed Abid
Mouna Baklouti
Publication date: 24/04/2020

CiteSeerX

Auto-tuning similarity search algorithms on multi-core architectures

Author: Gedik B.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

Cataloged from PDF version of article. In recent times, large high-dimensional datasets have become ubiquitous. Video and image repositories, financial data, and sensor data are just a few examples of such datasets in practice. Many applications that use such datasets require the retrieval of data items similar to a given query item, or the nearest neighbors (NN or k-NN) of a given item. Another common query is the retrieval of multiple sets of nearest neighbors, i.e., multi k-NN, for different query items on the same data. With commodity multi-core CPUs becoming more and more widespread at lower costs, developing parallel algorithms for these search problems has become increasingly important. While the core nearest neighbor search problem is relatively easy to parallelize, it is challenging to tune it for optimality. This is because the various performance-specific algorithmic parameters, or “tuning knobs”, are inter-related and also depend on the data and query workloads. In this paper, we present (1) a detailed study of the various tuning knobs and their contributions to increasing the query throughput for parallelized versions of the two most common classes of high-dimensional multi-NN search algorithms: linear scan and tree traversal, and (2) an offline auto-tuner for setting these knobs by iteratively measuring actual query execution times for a given workload and dataset. We show experimentally that our auto-tuner reaches near-optimal performance and significantly outperforms un-tuned versions of parallel multi-NN algorithms for real video repository data on a variety of multi-core platforms. © Springer Science+Business Media New York 201
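
As a rough illustration of offline auto-tuning by measurement, the sketch below tries every combination of knob values and keeps the cheapest one. Exhaustive grid search is a stand-in chosen for simplicity, not the search strategy of the paper, and the knob names (`threads`, `block_size`), the helper names, and the synthetic cost function are all assumptions.

```python
import itertools
import time


def measure_seconds(run_query_batch, config, repeats=3):
    """Best-of-N wall-clock time of a workload under one knob setting."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run_query_batch(**config)
        best = min(best, time.perf_counter() - start)
    return best


def autotune(knobs, cost):
    """Evaluate every knob combination and return (best_config, best_cost)."""
    names = sorted(knobs)
    best_config, best_cost = None, float("inf")
    for values in itertools.product(*(knobs[name] for name in names)):
        config = dict(zip(names, values))
        current = cost(config)
        if current < best_cost:
            best_config, best_cost = config, current
    return best_config, best_cost


# Synthetic cost standing in for measured query time (illustrative only).
knobs = {"threads": [1, 2, 4, 8], "block_size": [16, 64, 256]}
synthetic_cost = lambda c: abs(c["threads"] - 4) + abs(c["block_size"] - 64) / 64
best_config, best_cost = autotune(knobs, synthetic_cost)
print(best_config, best_cost)
```

With a real workload, the cost function could wrap a timing helper, for example `autotune(knobs, lambda cfg: measure_seconds(my_batch, cfg))`, where `my_batch` is a hypothetical callable that accepts the knobs as keyword arguments. Because the knobs are inter-related, tuning them jointly rather than one at a time is the safer default.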

Bilkent University Institutional Repository

An Interactive Learning System for Large-Scale Multimedia Analytics

Author: Choi Jaeyoung
Datar Mayur
Gornishka Iva
Larson Martha
Lv Qin
Radim
Szegedy C.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/06/2020

Crossref

The IT University of Copenhagen's Repository

A High-Performance Domain-Specific Language and Code Generator for General N-body Problems

Author: Aghababaie Beni Laleh
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

General N-body problems are a set of problems in which an update to a single element in the system depends on every other element. N-body problems are ubiquitous, with applications in various domains ranging from scientific computing simulations in molecular dynamics, astrophysics, acoustics, and fluid dynamics all the way to computer vision, data mining, and machine learning problems. Different N-body algorithms have been designed and implemented in these various fields. However, there is a big gap between the algorithm one designs on paper and the code that runs efficiently on a parallel system. It is time-consuming to write fast, parallel, and scalable code for these problems. On the other hand, the sheer scale and growth of modern scientific datasets necessitate exploiting the power of both parallel and approximation algorithms where there is a potential to trade off accuracy for performance. The main problem that we are tackling in this thesis is how to automatically generate asymptotically optimal N-body algorithms from the high-level specification of the problem. We combine the body of work in performance optimizations, compilers, and the domain of N-body problems to build a unified system where domain scientists can write programs at the high level while attaining the performance of code written by an expert at the low level. To generate high-performance, scalable code for this group of problems, we take the following steps in this thesis. First, we propose a unified algorithmic framework named PASCAL in order to address the challenge of designing a general algorithmic template to represent the class of N-body problems. PASCAL utilizes space-partitioning trees and user-controlled pruning/approximations to reduce the asymptotic runtime complexity from linear to logarithmic in the number of data points.
In PASCAL, we design an algorithm that automatically generates conditions for pruning or approximation of an N-body problem considering the problem's definition. In order to evaluate PASCAL, we developed tree-based algorithms for six well-known problems: k-nearest neighbors, range search, minimum spanning tree, kernel density estimation, expectation maximization, and Hausdorff distance. We show that applying domain-specific optimizations and parallelization to the algorithms written in PASCAL achieves 10x to 230x speedup compared to state-of-the-art libraries on a dual-socket Intel Xeon processor with 16 cores on real-world datasets. Second, we extend the PASCAL framework to build PASCAL-X, which adds support for NUMA-aware parallelization. PASCAL-X also presents insights on the influence of tuning parameters. Tuning parameters of the space-partitioning trees, such as leaf size (which influences the shape of the tree) and cut-off level (which controls the granularity of tasks), result in performance improvements of up to 4.6x. A key goal is to generate scalable and high-performance code automatically without sacrificing productivity. That implies minimizing the effort the users have to put in to generate the desired high-performance code. Another critical factor is adaptivity, which indicates the amount of effort required to extend the high-performance code generation to new N-body problems. Finally, we consider these factors and develop a domain-specific language and code generator named Portal, which is built on top of PASCAL-X. Portal's language design is inspired by the mathematical representation of N-body problems, resulting in an intuitive language for rapid implementation of a variety of problems. Portal's back-end is designed and implemented to generate optimized, parallel, and scalable implementations for multi-core systems.
We demonstrate that the performance achieved by using Portal is comparable to that of expert hand-optimized code while providing productivity for domain scientists. For instance, using Portal for the k-nearest neighbors problem achieves performance similar to that of the hand-optimized code, while reducing the lines of code by 68x. To the best of our knowledge, there are no known libraries or frameworks that implement parallel asymptotically optimal algorithms for the class of general N-body problems, and this thesis primarily aims to fill this gap. Finally, we present a case study of Portal for the real-world problem of face clustering. In this case study, we show that Portal not only provides a fast solution for the face clustering problem with accuracy similar to that of the state-of-the-art algorithm, but also provides productivity by implementing the face clustering algorithm in only 14 lines of Portal code.

eScholarship - University of California