Evaluation of vectorization potential of Graph500 on Intel's Xeon Phi

Author: Cristal Kestelman Adrián
Duric Milovan
Palomar Pérez Óscar
Ratkovic Ivan
Stanic Milan
Unsal Osman Sabri
Valero Cortés Mateo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2014

Graph500 is a data-intensive application for high-performance computing, and it is an increasingly important workload because graphs are a core part of most analytic applications. So far, no work has examined whether Graph500 is suitable for vectorization, mostly due to a lack of vector memory instructions for irregular memory accesses. The Xeon Phi is a massively parallel processor recently released by Intel with new features such as a wide 512-bit vector unit and vector scatter/gather instructions. Thus, the Xeon Phi allows a more efficient parallelization of Graph500 combined with vectorization. In this paper we vectorize Graph500 and analyze the impact of vectorization and prefetching on the Xeon Phi. We also show that the combination of parallelization, vectorization and prefetching yields a speedup of 27% over a parallel version with prefetching that does not leverage the vector capabilities of the Xeon Phi. The research leading to these results has received funding from the European Research Council under the European Union's 7th FP (FP/2007-2013) / ERC GA n. 321253. It has been partially funded by the Spanish Government (TIN2012-34557). Peer Reviewed. Postprint (published version)

Crossref

UPCommons. Portal del coneixement obert de la UPC

Importance of Explicit Vectorization for CPU and GPU Software Performance

Author: Allen
Anderson
Berg
Eichenberger
Firas Hamze
Hamze
Kamran Karimi
Karimi
Kirk
Knuth
Marsaglia
Matsumoto
Metropolis
Neil G. Dickson
Owens
Preis
Samant
Scott
Suzuki
Tomov
Publication venue: 'Elsevier BV'
Publication date: 31/03/2010

Much of the current focus in high-performance computing is on multi-threading, multi-computing, and graphics processing unit (GPU) computing. However, vectorization and non-parallel optimization techniques, which can often be employed additionally, are less frequently discussed. In this paper, we present an analysis of several optimizations applied to both central processing unit (CPU) and GPU implementations of a particular computationally intensive Metropolis Monte Carlo algorithm. Explicit vectorization on the CPU and its GPU equivalent, explicit memory coalescing, are found to be critical to achieving good performance of this algorithm in both environments. The fully optimized CPU version achieves a 9x to 12x speedup over the original CPU version, in addition to the speedup from multi-threading. This is 2x faster than the fully optimized GPU version. Comment: 17 pages, 17 figures

arXiv.org e-Print Archive

Crossref

A binary self-organizing map and its FPGA implementation

Author: Appiah Kofi
Hobden Mervyn
Hobden Peter
Hunter Andrew
Meng Hongying
Pettit Cy
Priestley Nigel
Yue Shigang
Publication date: 14/06/2009

A binary Self-Organizing Map (SOM) has been designed and implemented on a Field Programmable Gate Array (FPGA) chip. A novel learning algorithm that takes binary inputs and maintains tri-state weights is presented. The binary SOM has the capability of recognizing binary input sequences after training. A novel tri-state rule is used in updating the network weights during the training phase. The rule implementation is highly suited to the FPGA architecture and allows extremely rapid training. This architecture may be used in real time for fast pattern clustering and classification of the binary features.

University of Lincoln Institutional Repository

Crossref

Nottingham Trent Institutional Repository (IRep)