849 research outputs found
The Potential of the Intel Xeon Phi for Supervised Deep Learning
Supervised learning of Convolutional Neural Networks (CNNs), also known as
supervised Deep Learning, is a computationally demanding process. To find the
most suitable parameters of a network for a given application, numerous
training sessions are required. Therefore, reducing the training time per
session is essential to fully utilize CNNs in practice. While numerous research
groups have addressed the training of CNNs using GPUs, so far not much
attention has been paid to the Intel Xeon Phi coprocessor. In this paper we
investigate empirically and theoretically the potential of the Intel Xeon Phi
for supervised learning of CNNs. We design and implement a parallelization
scheme named CHAOS that exploits both the thread- and SIMD-parallelism of the
coprocessor. Our approach is evaluated on the Intel Xeon Phi 7120P using the
MNIST dataset of handwritten digits for various thread counts and CNN
architectures. Results show a 103.5x speed up when training our large network
for 15 epochs using 244 threads, compared to one thread on the coprocessor.
Moreover, we develop a performance model and use it to assess our
implementation and answer what-if questions.Comment: The 17th IEEE International Conference on High Performance Computing
and Communications (HPCC 2015), Aug. 24 - 26, 2015, New York, US
Explorations of the viability of ARM and Xeon Phi for physics processing
We report on our investigations into the viability of the ARM processor and
the Intel Xeon Phi co-processor for scientific computing. We describe our
experience porting software to these processors and running benchmarks using
real physics applications to explore the potential of these processors for
production physics processing.Comment: Submitted to proceedings of the 20th International Conference on
Computing in High Energy and Nuclear Physics (CHEP13), Amsterda
An efficient MPI/OpenMP parallelization of the Hartree-Fock method for the second generation of Intel Xeon Phi processor
Modern OpenMP threading techniques are used to convert the MPI-only
Hartree-Fock code in the GAMESS program to a hybrid MPI/OpenMP algorithm. Two
separate implementations that differ by the sharing or replication of key data
structures among threads are considered, density and Fock matrices. All
implementations are benchmarked on a super-computer of 3,000 Intel Xeon Phi
processors. With 64 cores per processor, scaling numbers are reported on up to
192,000 cores. The hybrid MPI/OpenMP implementation reduces the memory
footprint by approximately 200 times compared to the legacy code. The
MPI/OpenMP code was shown to run up to six times faster than the original for a
range of molecular system sizes.Comment: SC17 conference paper, 12 pages, 7 figure
GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems
While many of the architectural details of future exascale-class high
performance computer systems are still a matter of intense research, there
appears to be a general consensus that they will be strongly heterogeneous,
featuring "standard" as well as "accelerated" resources. Today, such resources
are available as multicore processors, graphics processing units (GPUs), and
other accelerators such as the Intel Xeon Phi. Any software infrastructure that
claims usefulness for such environments must be able to meet their inherent
challenges: massive multi-level parallelism, topology, asynchronicity, and
abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a
collection of building blocks that targets algorithms dealing with sparse
matrix representations on current and future large-scale systems. It implements
the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel
numerical kernels, intelligent resource management, and truly heterogeneous
parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We
describe the details of its design with respect to the challenges posed by
modern heterogeneous supercomputers and recent algorithmic developments.
Implementation details which are indispensable for achieving high efficiency
are pointed out and their necessity is justified by performance measurements or
predictions based on performance models. The library code and several
applications are available as open source. We also provide instructions on how
to make use of GHOST in existing software packages, together with a case study
which demonstrates the applicability and performance of GHOST as a component
within a larger software stack.Comment: 32 pages, 11 figure
Splotch: porting and optimizing for the Xeon Phi
With the increasing size and complexity of data produced by large scale numerical simulations, it is of primary importance for scientists to be able to exploit all available hardware in heterogenous High Performance Computing environments for increased throughput and efficiency. We focus on the porting and optimization of Splotch, a scalable visualization algorithm, to utilize the Xeon Phi, Intel's coprocessor based upon the new Many Integrated Core architecture. We discuss steps taken to offload data to the coprocessor and algorithmic modifications to aid faster processing on the many-core architecture and make use of the uniquely wide vector capabilities of the device, with accompanying performance results using multiple Xeon Phi. Finally performance is compared against results achieved with the GPU implementation of Splotch
- …