GeantV: Results from the prototype of concurrent vector particle transport simulation in HEP
Full detector simulation was among the largest CPU consumers in all CERN
experiment software stacks for the first two runs of the Large Hadron Collider
(LHC). In the early 2010s, the projections were that simulation demands would
scale linearly with luminosity increase, compensated only partially by an
increase of computing resources. The extension of fast simulation approaches to
more use cases, covering a larger fraction of the simulation budget, is only
part of the solution due to intrinsic precision limitations. The remainder
corresponds to speeding up the simulation software by several factors, which is
out of reach using simple optimizations on the current code base. In this
context, the GeantV R&D project was launched, aiming to redesign the legacy
particle transport codes in order to make them benefit from fine-grained
parallelism features such as vectorization, but also from increased code and
data locality. This paper presents extensively the results and achievements of
this R&D, as well as the conclusions and lessons learnt from the beta
prototype. Comment: 34 pages, 26 figures, 24 tables
A highly optimized vectorized code for Monte Carlo simulations of SU(3) lattice gauge theories
New methods are introduced for improving the performance of the vectorized Monte Carlo SU(3) lattice gauge theory algorithm on the CDC CYBER 205. Structure, algorithm, and programming considerations are discussed. The performance achieved for a 16^4 lattice on a 2-pipe system may be expressed in terms of the link update time or the overall MFLOPS rate. For 32-bit arithmetic, it is 36.3 microseconds per link for 8 hits per iteration (40.9 microseconds for 10 hits), or 101.5 MFLOPS.
DOLPHIn - Dictionary Learning for Phase Retrieval
We propose a new algorithm to learn a dictionary for reconstructing and
sparsely encoding signals from measurements without phase. Specifically, we
consider the task of estimating a two-dimensional image from squared-magnitude
measurements of a complex-valued linear transformation of the original image.
Several recent phase retrieval algorithms exploit underlying sparsity of the
unknown signal in order to improve recovery performance. In this work, we
consider such a sparse signal prior in the context of phase retrieval, when the
sparsifying dictionary is not known in advance. Our algorithm jointly
reconstructs the unknown signal - possibly corrupted by noise - and learns a
dictionary such that each patch of the estimated image can be sparsely
represented. Numerical experiments demonstrate that our approach can obtain
significantly better reconstructions for phase retrieval problems with noise
than methods that cannot exploit such "hidden" sparsity. Moreover, on the
theoretical side, we provide a convergence result for our method.
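
The measurement model in this abstract can be illustrated with a minimal sketch. It is not the DOLPHIn algorithm: it assumes, purely for illustration, that the linear transformation is a one-dimensional discrete Fourier transform (the abstract does not name the transform and works with two-dimensional images), and the helper names `dft` and `measure` are hypothetical. The point it shows is why phase retrieval is hard: a signal and a globally phase-rotated copy yield the same squared-magnitude measurements.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform (assumed stand-in for the linear map)."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]

def measure(x):
    """Phaseless measurements: squared magnitudes of the transformed signal."""
    return [abs(c) ** 2 for c in dft(x)]

signal = [1.0, 2.0, 0.5, -1.0]
# Same signal multiplied by a global phase factor exp(i * 0.7).
rotated = [cmath.exp(0.7j) * v for v in signal]

y_signal = measure(signal)
y_rotated = measure(rotated)

# The two measurement vectors agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(y_signal, y_rotated))
```

Because the measurements cannot distinguish these two inputs, any recovery method needs extra structure, which is the role the sparsity prior plays in the abstract.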
goSLP: Globally Optimized Superword Level Parallelism Framework
Modern microprocessors are equipped with single instruction multiple data
(SIMD) or vector instruction sets which allow compilers to exploit superword
level parallelism (SLP), a type of fine-grained parallelism. Current SLP
auto-vectorization techniques use heuristics to discover vectorization
opportunities in high-level language code. These heuristics are fragile, local
and typically only present one vectorization strategy that is either accepted
or rejected by a cost model. We present goSLP, a novel SLP auto-vectorization
framework which solves the statement packing problem in a pairwise optimal
manner. Using an integer linear programming (ILP) solver, goSLP searches the
entire space of statement packing opportunities for a whole function at a time,
while limiting total compilation time to a few minutes. Furthermore, goSLP
optimally solves the vector permutation selection problem using dynamic
programming. We implemented goSLP in the LLVM compiler infrastructure,
achieving a geometric mean speedup of 7.58% on SPEC2017fp, 2.42% on SPEC2006fp
and 4.07% on NAS benchmarks compared to LLVM's existing SLP auto-vectorizer. Comment: Published at OOPSLA 201
Runtime Optimizations for Prediction with Tree-Based Models
Tree-based models have proven to be an effective solution for web ranking as
well as other problems in diverse domains. This paper focuses on optimizing the
runtime performance of applying such models to make predictions, given an
already-trained model. Although exceedingly simple conceptually, most
implementations of tree-based models do not efficiently utilize modern
superscalar processor architectures. By laying out data structures in memory in
a more cache-conscious fashion, removing branches from the execution flow using
a technique called predication, and micro-batching predictions using a
technique called vectorization, we are able to better exploit modern processor
architectures and significantly improve the speed of tree-based models over
hard-coded if-else blocks. Our work contributes to the exploration of
architecture-conscious runtime implementations of machine learning algorithms.
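
The three ideas named in this abstract (array-based layout, predication, and micro-batching) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the tree is a hypothetical complete binary tree of depth 2, and the array names `FEATURE`, `THRESHOLD`, and `LEAF_VALUE` are invented for the example. Python will not show the speedup itself; the sketch only shows the shape of the control flow.

```python
# Hypothetical depth-2 tree stored in flat arrays (assumed layout).
# Internal node i has children 2*i + 1 (test false) and 2*i + 2 (test true).
FEATURE = [0, 1, 1]            # feature index tested at each internal node
THRESHOLD = [0.5, 0.3, 0.8]    # threshold at each internal node
LEAF_VALUE = [10.0, 20.0, 30.0, 40.0]
DEPTH = 2

def predict_branchy(x):
    """Baseline: the same tree written as hard-coded if-else blocks."""
    if x[0] > 0.5:
        return 40.0 if x[1] > 0.8 else 30.0
    return 20.0 if x[1] > 0.3 else 10.0

def predict_predicated(x):
    """Predication: the comparison result is used as an index offset,
    so the traversal has no data-dependent branch."""
    node = 0
    for _ in range(DEPTH):
        node = 2 * node + 1 + int(x[FEATURE[node]] > THRESHOLD[node])
    return LEAF_VALUE[node - len(FEATURE)]

def predict_batch(batch):
    """Micro-batching: advance every instance one tree level at a time."""
    nodes = [0] * len(batch)
    for _ in range(DEPTH):
        nodes = [
            2 * n + 1 + int(x[FEATURE[n]] > THRESHOLD[n])
            for n, x in zip(nodes, batch)
        ]
    return [LEAF_VALUE[n - len(FEATURE)] for n in nodes]

batch = [[0.9, 0.9], [0.1, 0.4], [0.9, 0.1], [0.1, 0.1]]
batch_result = predict_batch(batch)
```

In a compiled language, the level-by-level loop in `predict_batch` is the part that can overlap memory accesses across instances, which is the benefit the abstract attributes to vectorization.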
ExoCross: a general program for generating spectra from molecular line lists
ExoCross is a Fortran code for generating spectra (emission, absorption) and
thermodynamic properties (partition function, specific heat etc.) from
molecular line lists. Input is taken in several formats, including ExoMol and
HITRAN formats. ExoCross is efficiently parallelized and also shows a high degree
of vectorization. It can work with several line profiles such as Doppler,
Lorentzian and Voigt, and supports several broadening schemes. Voigt profiles are
handled by several methods allowing fast and accurate simulations. Two of these
methods are new. ExoCross is also capable of working with the recently proposed
method of super-lines. It supports calculations of lifetimes, cooling
functions, specific heats and other properties. ExoCross can be used to convert
between different formats, such as HITRAN, ExoMol and Phoenix. It is capable of
simulating non-LTE spectra using a simple two-temperature approach. Different
electronic, vibronic or vibrational bands can be simulated separately using an
efficient filtering scheme based on the quantum numbers.
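
Two of the line profiles named in this abstract have simple closed forms, sketched below. This is not ExoCross code (ExoCross is written in Fortran) and the function names are hypothetical. It assumes the textbook area-normalized forms, with both widths given as half widths at half maximum (HWHM). The Voigt profile, the convolution of the two, is omitted because the abstract does not describe the methods ExoCross uses for it.

```python
import math

def doppler_profile(nu, nu0, alpha):
    """Area-normalized Gaussian (Doppler) profile centered at nu0.

    alpha is the half width at half maximum (assumed convention).
    """
    ln2 = math.log(2.0)
    return math.sqrt(ln2 / math.pi) / alpha * math.exp(-ln2 * ((nu - nu0) / alpha) ** 2)

def lorentz_profile(nu, nu0, gamma):
    """Area-normalized Lorentzian profile centered at nu0.

    gamma is the half width at half maximum (assumed convention).
    """
    return (gamma / math.pi) / ((nu - nu0) ** 2 + gamma ** 2)

# Illustrative values only, in arbitrary wavenumber units.
nu0, alpha, gamma = 1000.0, 0.05, 0.02

doppler_peak = doppler_profile(nu0, nu0, alpha)
doppler_half = doppler_profile(nu0 + alpha, nu0, alpha)
lorentz_peak = lorentz_profile(nu0, nu0, gamma)
lorentz_half = lorentz_profile(nu0 + gamma, nu0, gamma)

# Riemann-sum check that the Doppler profile integrates to about 1.
step = alpha / 100.0
doppler_area = step * sum(
    doppler_profile(nu0 + i * step, nu0, alpha) for i in range(-1000, 1001)
)
```

Each profile drops to half its peak value one HWHM from the line center, which gives a quick sanity check on the width convention.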