GLoP: Enabling Massively Parallel Incident Response Through GPU Log Processing
Large industrial systems that combine services and applications have become
targets for cyber criminals and are challenging from the security, monitoring
and auditing perspectives. Security log analysis is a key step for uncovering
anomalies, detecting intrusion, and enabling incident response. The constant
increase in link speeds, threats and users produces large volumes of log data
that are increasingly difficult to analyse on a Central Processing Unit
(CPU). This paper presents a massively parallel Graphics Processing Unit (GPU)
LOg Processing (GLoP) library, which can also be used for Deep Packet Inspection
(DPI). It uses a prefix matching technique and harvests the full power of
off-the-shelf technologies. GLoP implements two different algorithms using
different GPU memory and is compared against CPU counterpart implementations.
The library can be used for processing nodes with single or multiple GPUs as
well as GPU cloud farms. The results show throughput of 20Gbps and demonstrate
that modern GPUs can be utilised to increase the operational speed of large
scale log processing scenarios, saving precious time before and after an
intrusion has occurred.
Comment: Published in The 7th International Conference of Security of
Information and Networks, SIN 2014, Glasgow, UK, September 2014
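The abstract names prefix matching as the core technique but gives no detail. The sketch below is one common reading of it, assuming a prefix tree (trie) of signature strings that is walked from every offset of a log line. It is a CPU-side Python illustration, not GLoP's GPU kernels; `build_trie` and `match_line` are hypothetical names. Each line is matched independently of the others, which is the property that makes this kind of workload data-parallel.

```python
def build_trie(patterns):
    """Build a character trie; the key None marks the end of a full pattern."""
    root = {}
    for pattern in patterns:
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node[None] = pattern
    return root


def match_line(trie, line):
    """Return the set of patterns that occur anywhere in one log line."""
    found = set()
    for start in range(len(line)):
        node = trie
        # Walk the trie for as long as the text keeps matching a prefix.
        for ch in line[start:]:
            node = node.get(ch)
            if node is None:
                break
            if None in node:
                found.add(node[None])
    return found


signatures = ["Failed password", "Invalid user"]
trie = build_trie(signatures)
hits = match_line(trie, "sshd[42]: Failed password for invalid user root")
```

Matching is case-sensitive here, so only the first signature is reported for the sample line.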
ODYS: A Massively-Parallel Search Engine Using a DB-IR Tightly-Integrated Parallel DBMS
Recently, parallel search engines have been implemented based on scalable
distributed file systems such as Google File System. However, we claim that
building a massively-parallel search engine using a parallel DBMS can be an
attractive alternative since it supports a higher-level (i.e., SQL-level)
interface than that of a distributed file system for easy and less error-prone
application development while providing scalability. In this paper, we propose
a new approach of building a massively-parallel search engine using a DB-IR
tightly-integrated parallel DBMS and demonstrate its commercial-level
scalability and performance. In addition, we present a hybrid (i.e., analytic
and experimental) performance model for the parallel search engine. We have
built a five-node parallel search engine according to the proposed architecture
using a DB-IR tightly-integrated DBMS. Through extensive experiments, we show
the correctness of the model by comparing the projected output with the
experimental results of the five-node engine. Our model demonstrates that ODYS
is capable of handling 1 billion queries per day (81 queries/sec) for 30
billion web pages by using only 43,472 nodes with an average query response
time of 211 ms, which is equivalent to or better than those of commercial
search engines. We also show that, by using twice as many (86,944) nodes, ODYS
can provide an average query response time of 162 ms, which is significantly
lower than those of commercial search engines.
Comment: 34 pages, 13 figures
PRINS: Resistive CAM Processing in Storage
Near-data in-storage processing research has been gaining momentum in recent
years. Typical processing-in-storage architecture places a single or several
processing cores inside the storage and allows data processing without
transferring it to the host CPU. Since this approach replicates von Neumann
architecture inside storage, it is exposed to the problems faced by von Neumann
architecture, especially the bandwidth wall. We present PRINS, a novel in-data
processing-in-storage architecture based on Resistive Content Addressable
Memory (RCAM). PRINS functions simultaneously as a storage and a massively
parallel associative processor. PRINS alleviates the bandwidth wall faced by
conventional processing-in-storage architectures by keeping the computing
inside the storage arrays, thus implementing in-data, rather than near-data,
processing. We show that PRINS may outperform a reference computer architecture
with a bandwidth-limited external storage. The performance of PRINS Euclidean
distance, dot product and histogram implementation exceeds the attainable
performance of a reference architecture by up to four orders of magnitude,
depending on the dataset size. The performance of PRINS SpMV may exceed the
attainable performance of such reference architecture by more than two orders
of magnitude.
Optimization of Lattice Boltzmann Simulations on Heterogeneous Computers
High-performance computing systems are more and more often based on
accelerators. Computing applications targeting those systems often follow a
host-driven approach in which hosts offload almost all compute-intensive
sections of the code onto accelerators; this approach only marginally exploits
the computational resources available on the host CPUs, limiting performance
and energy efficiency. The obvious step forward is to run compute-intensive
kernels in a concurrent and balanced way on both hosts and accelerators. In
this paper we consider exactly this problem for a class of applications based
on Lattice Boltzmann Methods, widely used in computational fluid-dynamics. Our
goal is to develop just one program, portable and able to run efficiently on
several different combinations of hosts and accelerators. To reach this goal,
we define common data layouts enabling the code to exploit efficiently the
different parallel and vector options of the various accelerators, and matching
the possibly different requirements of the compute-bound and memory-bound
kernels of the application. We also define models and metrics that predict the
best partitioning of workloads among host and accelerator, and the optimally
achievable overall performance level. We test the performance of our codes and
their scaling properties using as testbeds HPC clusters incorporating different
accelerators: Intel Xeon-Phi many-core processors, NVIDIA GPUs and AMD GPUs.
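The idea of partitioning work between host and accelerator can be illustrated with the simplest possible model. The sketch below assumes each device processes lattice sites at a fixed, measured rate and ignores communication cost and the difference between compute-bound and memory-bound kernels. It is not the model defined in the paper, and `balanced_split` and its parameters are hypothetical.

```python
def balanced_split(sites, host_rate, accel_rate):
    """Split lattice sites so host and accelerator finish at the same time.

    Rates are in sites per unit time. Returns (host_sites, accel_sites,
    step_time), where step_time is the time of the slower of the two devices.
    """
    if host_rate <= 0 or accel_rate <= 0:
        raise ValueError("rates must be positive")
    # Give each device a share proportional to its throughput.
    host_sites = round(sites * host_rate / (host_rate + accel_rate))
    accel_sites = sites - host_sites
    step_time = max(host_sites / host_rate, accel_sites / accel_rate)
    return host_sites, accel_sites, step_time


# An accelerator three times faster than the host gets three quarters of the work.
host_sites, accel_sites, step_time = balanced_split(1000, 1.0, 3.0)
```

Under this model the combined throughput is the sum of the two rates, so an accelerator-only run of the same example would take 1000 / 3.0 time units instead of 250.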
Sailfish: a flexible multi-GPU implementation of the lattice Boltzmann method
We present Sailfish, an open source fluid simulation package implementing the
lattice Boltzmann method (LBM) on modern Graphics Processing Units (GPUs) using
CUDA/OpenCL. We take a novel approach to GPU code implementation and use
run-time code generation techniques and a high level programming language
(Python) to achieve state of the art performance, while allowing easy
experimentation with different LBM models and tuning for various types of
hardware. We discuss the general design principles of the code, scaling to
multiple GPUs in a distributed environment, as well as the GPU implementation
and optimization of many different LBM models, both single component (BGK, MRT,
ELBM) and multicomponent (Shan-Chen, free energy). The paper also presents
results of performance benchmarks spanning the last three NVIDIA GPU
generations (Tesla, Fermi, Kepler), which we hope will be useful for
researchers working with this type of hardware and similar codes.
Comment: 36 pages, 15 figures
Fault tolerant Quantum Information Processing with Holographic control
We present a fault-tolerant semi-global control strategy for universal
quantum computers. We show that an N-dimensional array of qubits where only
(N-1)-dimensional addressing resolution is available is compatible with
fault-tolerant universal quantum computation. What is more, we show that
measurements and individual control of qubits are required only at the
boundaries of the fault-tolerant computer, i.e. holographic fault-tolerant
quantum computation. Our model alleviates the heavy physical conditions on
current qubit candidates imposed by addressability requirements and represents
an option to improve their scalability.
Comment: 20 pages. Comments are welcome
Scalable GW software for quasiparticle properties using OpenAtom
The GW method, which can accurately describe electronic excitations, is one
of the most widely used ab initio electronic structure techniques and allows the
physics of both molecular and condensed phase materials to be studied. However,
the applications of the GW method to large systems require supercomputers and
highly parallelized software to overcome the high computational complexity of
the method scaling as . Here, we develop efficient massively-parallel
GW software for the plane-wave basis set by revisiting the standard GW formulae
in order to discern the optimal approaches for each phase of the GW calculation
for massively parallel computation. These best numerical practices are
implemented in the OpenAtom software, which is written on top of the Charm++
parallel framework. We then evaluate the performance of our new software using
a range of system sizes. Our GW software shows significantly improved parallel
scaling compared to publicly available GW software on the Mira and Blue
Waters supercomputers, two of the largest and most powerful platforms in the world.
Comment: 48 pages, 10 figures
A GPU-based Large-scale Monte Carlo Simulation Method for Systems with Long-range Interactions
In this work we present an efficient implementation of Canonical Monte Carlo
simulation for Coulomb many body systems on graphics processing units (GPU).
Our method takes advantage of the GPU Single Instruction, Multiple Data (SIMD)
architectures. It adopts the sequential updating scheme of Metropolis
algorithm, and makes no approximation in the computation of energy. It reaches
a remarkable 440-fold speedup, compared with the serial implementation on CPU.
We use this method to simulate primitive model electrolytes. We measure very
precisely all ion-ion pair correlation functions at high concentrations, and
extract renormalized Debye length, renormalized valences of constituent ions,
and renormalized dielectric constants. These results demonstrate unequivocally
physics beyond the classical Poisson-Boltzmann theory.
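The sequential Metropolis update with an exact energy sum can be sketched in a few lines. This is a serial CPU illustration under stated assumptions, not the paper's GPU implementation: reduced units with the Coulomb prefactor folded into `beta`, hard walls instead of periodic boundaries, and hypothetical names throughout. The per-move sum over all other ions in `particle_energy` is the part whose cost grows with system size.

```python
import math
import random


def particle_energy(i, trial_pos, charges, positions, diameter):
    """Exact Coulomb energy of ion i at trial_pos with every other ion.

    Overlapping hard spheres give infinite energy, as in the primitive model.
    """
    energy = 0.0
    for j, pos_j in enumerate(positions):
        if j == i:
            continue
        distance = math.dist(trial_pos, pos_j)
        if distance < diameter:
            return math.inf
        energy += charges[i] * charges[j] / distance
    return energy


def metropolis_sweep(charges, positions, box, beta, step, diameter, rng):
    """One sequential sweep: try to move each ion once, in index order."""
    accepted = 0
    for i in range(len(positions)):
        old = positions[i]
        new = tuple(c + rng.uniform(-step, step) for c in old)
        if any(c < 0.0 or c > box for c in new):
            continue  # hard walls (assumption): reject moves leaving the box
        delta = (particle_energy(i, new, charges, positions, diameter)
                 - particle_energy(i, old, charges, positions, diameter))
        # Metropolis rule: always accept downhill, else with prob. exp(-beta*dE).
        if delta <= 0.0 or rng.random() < math.exp(-beta * delta):
            positions[i] = new
            accepted += 1
    return accepted


def lattice_electrolyte(cells, spacing):
    """Alternating +1/-1 ions on a simple cubic lattice (hypothetical start)."""
    charges, positions = [], []
    for x in range(cells):
        for y in range(cells):
            for z in range(cells):
                charges.append(1.0 if (x + y + z) % 2 == 0 else -1.0)
                positions.append(((x + 0.5) * spacing,
                                  (y + 0.5) * spacing,
                                  (z + 0.5) * spacing))
    return charges, positions


rng = random.Random(0)
charges, positions = lattice_electrolyte(cells=4, spacing=2.0)
for _ in range(10):
    metropolis_sweep(charges, positions, box=8.0, beta=1.0,
                     step=0.3, diameter=1.0, rng=rng)
```

Because each ion is updated against the current positions of all others, the updates stay sequential; only the energy sum inside a single trial move is independent work.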
A massively parallel algorithm for constructing the BWT of large string sets
We present a new scalable, lightweight algorithm to incrementally construct
the BWT and FM-index of large string sets such as those produced by Next
Generation Sequencing. The algorithm is designed for massive parallelism and
can effectively exploit the combination of low-capacity, high-bandwidth memory
and slower external system memory typical of GPU accelerated systems.
Particularly, for a string set of n characters from an alphabet with \sigma
symbols, it uses a constant amount of high-bandwidth memory and at most 3n
log(\sigma) bits of system memory. Given that deep memory hierarchies are
becoming a pervasive trait of high performance computing architectures, we
believe this to be a relevant feature. The implementation can handle reads of
arbitrary length and is up to 2 and 6.5 times faster than the state of the art
for short and long genomic reads, respectively.
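For readers unfamiliar with the BWT of a string set, the sketch below computes it directly from the definition: append a distinct end marker to each string, sort all suffixes, and emit the character preceding each one. This naive version sorts full suffixes and is only practical for tiny inputs; it is not the incremental, memory-bounded algorithm the abstract describes. The function name and the convention of printing every end marker as `$` are assumptions.

```python
def bwt_of_set(strings):
    """Naive BWT of a string collection, built by sorting all suffixes.

    String i gets its own end marker that sorts before every real character;
    markers are ordered by string index. Every marker is printed as '$'.
    """
    entries = []
    for index, text in enumerate(strings):
        for pos in range(len(text) + 1):
            # Sort key: real characters as (1, ch), the end marker as (0, index).
            key = tuple((1, ch) for ch in text[pos:]) + ((0, index),)
            # The BWT symbol is the character just before this suffix.
            preceding = text[pos - 1] if pos > 0 else "$"
            entries.append((key, preceding))
    entries.sort(key=lambda entry: entry[0])
    return "".join(preceding for _, preceding in entries)


single = bwt_of_set(["banana"])   # "annb$aa"
pair = bwt_of_set(["ab", "ab"])   # "bb$$aa"
```

For a single string this reduces to the familiar BWT of the text with one terminator appended.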
Making the case of GPUs in courses on computational physics
Most relatively modern desktop or even laptop computers contain a graphics
card useful for more than showing colors on a screen. In this paper, we make a
case for why you should learn enough about GPU (graphics processing unit)
computing to use it as an accelerator for, or even a replacement for, your CPU
code. We include an example of our own as a case study to show what can be
realistically expected.
Comment: 11 pages, 2 figures