Ultrafast processing of pixel detector data with machine learning frameworks
Modern photon science performed at high repetition rate free-electron laser
(FEL) facilities and beyond relies on 2D pixel detectors operating at
increasing frequencies (towards 100 kHz at LCLS-II) and producing rapidly
increasing amounts of data (towards TB/s). This data must be rapidly stored for
offline analysis and summarized in real time. While at LCLS all raw data has
been stored, at LCLS-II this would lead to a prohibitive cost; instead,
enabling real time processing of pixel detector raw data allows reducing the
size and cost of online processing, offline processing and storage by orders of
magnitude while preserving full photon information, by taking advantage of the
compressibility of sparse data typical for LCLS-II applications. We
investigated whether recent developments in machine learning are useful in data
processing for high speed pixel detectors and found that typical deep learning
models and autoencoder architectures failed to yield useful noise reduction
while preserving full photon information, presumably because of the very
different statistics and feature sets between computer vision and radiation
imaging. However, we redesigned in TensorFlow mathematically equivalent
versions of the state-of-the-art "classical" algorithms used at LCLS. The
novel TensorFlow models resulted in elegant, compact and hardware-agnostic
code, gaining 1 to 2 orders of magnitude faster processing on an inexpensive
consumer GPU, reducing by 3 orders of magnitude the projected cost of online
analysis at LCLS-II. Computer vision a decade ago was dominated by hand-crafted
filters; their structure inspired the deep learning revolution resulting in
modern deep convolutional networks; similarly, our novel TensorFlow filters
provide inspiration for designing future deep learning architectures for
ultrafast and efficient processing and classification of pixel detector images
at FEL facilities. Comment: 9 pages, 9 figures
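The compressibility of sparse frames mentioned in this abstract can be illustrated with a small sketch. It is not the authors' TensorFlow code: it assumes a simple per-pixel pedestal (dark level) subtraction followed by a fixed threshold, and the names `sparsify`, `densify`, `pedestal` and `threshold` are hypothetical. Pixels below the threshold are dropped, so whether photon information is fully preserved depends on how the threshold is chosen relative to the noise.

```python
# Illustrative only: hypothetical helpers, pure Python, no detector-specific calibration.

def sparsify(frame, pedestal, threshold):
    """Keep only pixels whose pedestal-subtracted signal exceeds the threshold."""
    hits = []
    for r, (raw_row, ped_row) in enumerate(zip(frame, pedestal)):
        for c, (raw, ped) in enumerate(zip(raw_row, ped_row)):
            signal = raw - ped
            if signal > threshold:
                hits.append((r, c, signal))  # (row, column, signal above pedestal)
    return hits


def densify(hits, num_rows, num_cols):
    """Rebuild a dense frame from the sparse hit list, with zeros elsewhere."""
    frame = [[0] * num_cols for _ in range(num_rows)]
    for r, c, signal in hits:
        frame[r][c] = signal
    return frame


# Assumed toy data: a 3x3 frame with one bright pixel on a dark level of 10.
frame = [[10, 11, 10], [10, 60, 9], [12, 10, 10]]
pedestal = [[10] * 3 for _ in range(3)]
hits = sparsify(frame, pedestal, threshold=5)
print(hits)
```

Here nine raw values reduce to a single `(row, column, signal)` triple; the saving grows with frame size as long as most pixels carry no photon signal.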
Optimizing the MapReduce Framework on Intel Xeon Phi Coprocessor
With its ease of programming, flexibility and efficiency, MapReduce has
become one of the most popular frameworks for building big-data applications.
MapReduce was originally designed for distributed computing, and has been
extended to various architectures, e.g., multi-core CPUs, GPUs and FPGAs. In
this work, we focus on optimizing the MapReduce framework on Xeon Phi, which is
the latest product released by Intel based on the Many Integrated Core
Architecture. To the best of our knowledge, this is the first work to optimize
the MapReduce framework on the Xeon Phi.
In our work, we utilize advanced features of the Xeon Phi to achieve high
performance. In order to take advantage of the SIMD vector processing units, we
propose a vectorization-friendly technique for the map phase to assist
auto-vectorization, and we develop SIMD hash computation algorithms.
Furthermore, we utilize MIMD hyper-threading to pipeline the map and reduce
phases and improve resource utilization. For some applications, we also replace
multiple local arrays with low-cost atomic operations on a single global array,
which can improve thread scalability and data locality thanks to the coherent L2
caches. Finally, for a given application, our framework can either
automatically detect suitable techniques to apply or provide guidelines for
users at compilation time. We conduct comprehensive experiments to benchmark
the Xeon Phi and compare our optimized MapReduce framework with a
state-of-the-art multi-core based MapReduce framework (Phoenix++). By
evaluating six real-world applications, the experimental results show that our
optimized framework is 1.2X to 38X faster than Phoenix++ for various
applications on the Xeon Phi.
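The trade-off between per-thread local arrays and a single global array with atomic updates can be sketched as follows. This is only an illustration under stated assumptions, not the framework's implementation: Python has no hardware atomic add, so a `threading.Lock` stands in for it, and the function names `histogram_local` and `histogram_global` are hypothetical.

```python
import threading


def histogram_local(chunks, num_bins):
    """Each worker counts into its own local array; arrays are merged afterwards."""
    local_counts = [[0] * num_bins for _ in chunks]

    def worker(index, chunk):
        for key in chunk:
            local_counts[index][key] += 1  # no sharing, so no synchronization

    threads = [threading.Thread(target=worker, args=(i, c)) for i, c in enumerate(chunks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    merged = [0] * num_bins  # extra merge pass and one array per worker
    for counts in local_counts:
        for b, n in enumerate(counts):
            merged[b] += n
    return merged


def histogram_global(chunks, num_bins):
    """All workers update one shared array; the lock is a stand-in for an atomic add."""
    counts = [0] * num_bins
    lock = threading.Lock()

    def worker(chunk):
        for key in chunk:
            with lock:
                counts[key] += 1

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts


chunks = [[0, 1, 1], [2, 1, 0], [2, 2, 2]]  # assumed toy keys, already hashed to bins
print(histogram_local(chunks, 3), histogram_global(chunks, 3))
```

Both variants produce the same counts. The local version needs memory proportional to the number of workers plus a merge pass, while the global version keeps one array at the cost of synchronized updates.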
A portable platform for accelerated PIC codes and its application to GPUs using OpenACC
We present a portable platform, called PIC_ENGINE, for accelerating
Particle-In-Cell (PIC) codes on heterogeneous many-core architectures such as
Graphics Processing Units (GPUs). The aim of this development is efficient
simulations on future exascale systems by allowing different parallelization
strategies depending on the application problem and the specific architecture.
To this end, this platform contains the basic steps of the PIC algorithm and
has been designed as a test bed for different algorithmic options and data
structures. Among the architectures that this engine can explore, particular
attention is given here to systems equipped with GPUs. The study demonstrates
that our portable PIC implementation based on the OpenACC programming model can
achieve performance closely matching theoretical predictions. Using the Cray
XC30 system, Piz Daint, at the Swiss National Supercomputing Centre (CSCS), we
show that PIC_ENGINE running on an NVIDIA Kepler K20X GPU can outperform the
same code on an 8-core Intel Sandy Bridge CPU by a factor of 3.4.
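For orientation, the basic steps of a PIC cycle referred to in this abstract are commonly described as charge deposition, field solve, field gather and particle push. The sketch below is a minimal 1D periodic illustration with linear weighting, not PIC_ENGINE code: the field solve is omitted (the field is simply passed in), positions are assumed non-negative, and all function and parameter names are hypothetical.

```python
import math


def deposit(positions, charge, num_cells, dx):
    """Scatter each particle's charge to its two nearest grid points (linear weights)."""
    rho = [0.0] * num_cells
    for x in positions:
        s = x / dx
        i = math.floor(s)
        w = s - i  # fractional distance to the left grid point
        rho[i % num_cells] += charge * (1.0 - w)
        rho[(i + 1) % num_cells] += charge * w
    return rho


def gather(field, x, dx):
    """Interpolate the grid field to a particle position with the same weights."""
    n = len(field)
    s = x / dx
    i = math.floor(s)
    w = s - i
    return field[i % n] * (1.0 - w) + field[(i + 1) % n] * w


def push(positions, velocities, field, charge_over_mass, dt, dx):
    """Update velocity from the gathered field, then position, with periodic wrap."""
    length = len(field) * dx
    new_x, new_v = [], []
    for x, v in zip(positions, velocities):
        v = v + charge_over_mass * gather(field, x, dx) * dt
        new_v.append(v)
        new_x.append((x + v * dt) % length)
    return new_x, new_v


# Assumed toy setup: two particles on a 4-cell grid, zero field (no solver here).
rho = deposit([0.5, 1.25], charge=1.0, num_cells=4, dx=1.0)
xs, vs = push([0.5, 1.25], [1.0, -2.0], [0.0] * 4, charge_over_mass=1.0, dt=0.5, dx=1.0)
print(rho, xs, vs)
```

The deposit loop is a scatter with possible write conflicts between particles, while gather and push are independent per particle; this difference is one reason the choice of data structure and parallelization strategy matters on GPUs.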