Search CORE

579 research outputs found

Multi-threaded Geant4 on the Xeon-Phi with Complex High-Energy Physics Geometry

Author: Asai Makoto
Calafiura Paolo
Dotti Andrea
Farrell Steven
Monnard Romain
Publication date: 01/10/2015

To study the performance of multi-threaded Geant4 for high-energy physics experiments, an application has been developed which generalizes and extends previous work. A highly complex detector geometry is used for benchmarking on an Intel Xeon Phi coprocessor. In addition, an implementation of parallel I/O based on Intel SCIF and ROOT technologies is incorporated and studied.

arXiv.org e-Print Archive

eScholarship - University of California

Optimizing the MapReduce Framework on Intel Xeon Phi Coprocessor

Author: Goh Rick Siow Mong
He Bingsheng
Huynh Huynh Phung
Huynh Richard
Liang Yun
Lu Mian
Ong Zhongliang
Zhang Lei
Publication date: 01/01/2013

With its ease of programming, flexibility, and efficiency, MapReduce has become one of the most popular frameworks for building big-data applications. MapReduce was originally designed for distributed computing, and has been extended to various architectures, e.g., multi-core CPUs, GPUs and FPGAs. In this work, we focus on optimizing the MapReduce framework on Xeon Phi, which is the latest product released by Intel based on the Many Integrated Core Architecture. To the best of our knowledge, this is the first work to optimize the MapReduce framework on the Xeon Phi. In our work, we utilize advanced features of the Xeon Phi to achieve high performance. In order to take advantage of the SIMD vector processing units, we propose a vectorization-friendly technique for the map phase to assist auto-vectorization and develop SIMD hash computation algorithms. Furthermore, we utilize MIMD hyper-threading to pipeline the map and reduce phases to improve resource utilization. We also eliminate multiple local arrays and instead use low-cost atomic operations on the global array for some applications, which can improve thread scalability and data locality due to the coherent L2 caches. Finally, for a given application, our framework can either automatically detect suitable techniques to apply or provide guidelines for users at compilation time. We conduct comprehensive experiments to benchmark the Xeon Phi and compare our optimized MapReduce framework with a state-of-the-art multi-core based MapReduce framework (Phoenix++). By evaluating six real-world applications, the experimental results show that our optimized framework is 1.2X to 38X faster than Phoenix++ for various applications on the Xeon Phi.

arXiv.org e-Print Archive

Crossref

Effective Barrier Synchronization on Intel Xeon Phi Coprocessor

Author: Lujan Mikel
Nisbet Andy
Pop Antoniu
Rodchenko Andrey
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

Crossref

The University of Manchester - Institutional Repository

Partition Around Medoids Clustering on the Intel Xeon Phi Many-Core Coprocessor

Author: Rechkalov T. V.
Publication venue: Ural Federal University
Publication date: 01/01/2015

The paper addresses the problem of implementing the Partition Around Medoids (PAM) clustering algorithm for the Intel Many Integrated Core architecture. PAM is a form of the well-known k-Medoids clustering algorithm and is applied in various subject domains, e.g. bioinformatics, text analysis, intelligent transportation systems, etc. An optimized version of PAM for the Intel Xeon Phi coprocessor is introduced, in which OpenMP parallelization, loop vectorization, tiling, and efficient distance matrix computation for the Euclidean metric are used. Experimental results for different data sets confirm the efficiency of the proposed algorithm.
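As a rough illustration of the algorithm this abstract names, the following is a minimal serial sketch of PAM with a precomputed Euclidean distance matrix. It is not the paper's optimized implementation: there is no OpenMP, vectorization, or tiling, and the function names (`distance_matrix`, `total_cost`, `pam`), the naive initialization, and the sample points are all assumptions made for this example.

```python
import math

def distance_matrix(points):
    """Pairwise Euclidean distances, computed once and reused by every swap test."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            dist[i][j] = dist[j][i] = d
    return dist

def total_cost(dist, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))

def pam(points, k):
    """Swap-based k-medoids search; returns (sorted medoid indices, cost)."""
    dist = distance_matrix(points)
    n = len(points)
    # Assumption: start from the first k points instead of PAM's BUILD phase.
    medoids = list(range(k))
    best = total_cost(dist, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                # Try replacing one medoid with a non-medoid point.
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < best:
                    medoids, best = trial, cost
                    improved = True
    return sorted(medoids), best

# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, cost = pam(points, 2)
print(medoids, cost)
```

The distance matrix is the natural target for the optimizations the abstract lists, since every swap evaluation reads from it repeatedly.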

Institutional repository of Ural Federal University named after the first President of Russia B.N.Yeltsin

Preliminary Experiments with XKaapi on Intel Xeon Phi Coprocessor

Author: Broquedis Francois
Ferreira Lima Joao Vicente
Gautier Thierry
Raffin Bruno
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/10/2013

This paper presents preliminary performance comparisons of parallel applications developed natively for the Intel Xeon Phi accelerator using three different parallel programming environments and their associated runtime systems. We compare Intel OpenMP, Intel CilkPlus and XKaapi together on the same benchmark suite and we provide comparisons between an Intel Xeon Phi coprocessor and a Sandy Bridge Xeon-based machine. Our benchmark suite is composed of three computing kernels: a Fibonacci computation that makes it possible to study the overhead and the scalability of the runtime system, an NQueens application generating irregular and dynamic tasks, and a Cholesky factorization algorithm. We also compare the Cholesky factorization with the parallel algorithm provided by the Intel MKL library for Intel Xeon Phi. Performance evaluation shows our XKaapi data-flow parallel programming environment exposes the lowest overhead of all and is highly competitive with native OpenMP and CilkPlus environments on Xeon Phi. Moreover, the efficient handling of data-flow dependencies between tasks makes our XKaapi environment exhibit more parallelism for some applications such as the Cholesky factorization. In that case, we observe substantial gains with up to 180 hardware threads over the state-of-the-art MKL, with a 47% performance increase for 60 hardware threads.

Crossref

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

Parallel Algorithm for Frequent Itemset Mining on Intel Many-core Systems

Author: Zymbler Mikhail
Publication date: 01/01/2018

Frequent itemset mining leads to the discovery of associations and correlations among items in large transactional databases. Apriori is a classical frequent itemset mining algorithm, which employs iterative passes over the database combined with generation of candidate itemsets based on frequent itemsets found at the previous iteration, and pruning of clearly infrequent itemsets. The Dynamic Itemset Counting (DIC) algorithm is a variation of Apriori, which tries to reduce the number of passes made over a transactional database while keeping the number of itemsets counted in a pass relatively low. In this paper, we address the problem of accelerating DIC on the Intel Xeon Phi many-core system for the case when the transactional database fits in main memory. Intel Xeon Phi provides a large number of small compute cores with vector processing units. The paper presents a parallel implementation of DIC based on OpenMP technology and thread-level parallelism. We exploit the bit-based internal layout for transactions and itemsets. This technique reduces the memory space for storing the transactional database, simplifies the support count via logical bitwise operations, and allows for vectorization of such a step. Experimental evaluation on the platforms of the Intel Xeon CPU and the Intel Xeon Phi coprocessor with large synthetic and real databases showed good performance and scalability of the proposed algorithm. Comment: Accepted for publication in Journal of Computing and Information Technology (http://cit.fer.hr
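The bit-based layout mentioned in this abstract can be illustrated with a small sketch: each transaction and each itemset becomes an integer bitmask, and an itemset is contained in a transaction when the bitwise AND of the two equals the itemset mask. This shows only the support-counting idea, not DIC itself or the paper's parallel implementation; the names `encode`, `support`, `item_index`, and the sample transactions are assumptions made for this example.

```python
def encode(items, item_index):
    """Pack a set of items into an integer with one bit per item."""
    mask = 0
    for item in items:
        mask |= 1 << item_index[item]
    return mask

def support(itemset_mask, transaction_masks):
    """Count transactions that contain every item of the itemset."""
    return sum(1 for t in transaction_masks if (t & itemset_mask) == itemset_mask)

# Hypothetical item numbering and transactional database.
item_index = {"bread": 0, "milk": 1, "butter": 2}
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
transaction_masks = [encode(t, item_index) for t in transactions]

bread_milk = encode({"bread", "milk"}, item_index)
print(support(bread_milk, transaction_masks))
```

Because the containment test is a single AND and a comparison per transaction, the loop has the regular shape that a compiler can vectorize, which is the property the abstract points to.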

arXiv.org e-Print Archive

Directory of Open Access Journals

HRČAK - Portal of Croatian Scientific and Professional Journals

Hrčak - Portal of scientific journals of Croatia