11,285 research outputs found

    GLoP: Enabling Massively Parallel Incident Response Through GPU Log Processing

    Large industrial systems that combine services and applications have become targets for cyber criminals and are challenging from the security, monitoring and auditing perspectives. Security log analysis is a key step for uncovering anomalies, detecting intrusions, and enabling incident response. The constant increase in link speeds, threats and users produces large volumes of log data that are increasingly difficult to analyse on a Central Processing Unit (CPU). This paper presents GLoP, a massively parallel Graphics Processing Unit (GPU) LOg Processing library that uses a prefix matching technique and harnesses the full power of off-the-shelf technologies; it can also be used for Deep Packet Inspection (DPI). GLoP implements two different algorithms using different GPU memory and is compared against CPU counterpart implementations. The library can be used for processing nodes with single or multiple GPUs as well as GPU cloud farms. The results show a throughput of 20 Gbps and demonstrate that modern GPUs can be utilised to increase the operational speed of large-scale log processing scenarios, saving precious time before and after an intrusion has occurred. Comment: Published in The 7th International Conference of Security of Information and Networks, SIN 2014, Glasgow, UK, September 2014
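
    To illustrate what prefix matching over log lines can look like, here is a minimal CPU-side sketch in Python. It is not GLoP's GPU implementation: the trie layout, the function names `build_trie` and `match_prefixes`, and the sample patterns are all assumptions made for illustration.

```python
def build_trie(patterns):
    """Build a character trie; the key None marks the end of a pattern."""
    root = {}
    for pattern in patterns:
        if not pattern:
            continue  # empty patterns are ignored in this sketch
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node[None] = pattern
    return root


def match_prefixes(trie, line):
    """Return every stored pattern that is a prefix of `line`, shortest first."""
    matches = []
    node = trie
    for ch in line:
        if ch not in node:
            break
        node = node[ch]
        if None in node:
            matches.append(node[None])
    return matches


# Hypothetical signatures and log line, for illustration only.
trie = build_trie(["ERROR", "ERROR: auth", "WARN"])
hits = match_prefixes(trie, "ERROR: auth failure from 10.0.0.1")
```

    Each line is matched independently of the others, which is the property that makes this kind of workload a candidate for massively parallel execution.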

    ODYS: A Massively-Parallel Search Engine Using a DB-IR Tightly-Integrated Parallel DBMS

    Recently, parallel search engines have been implemented based on scalable distributed file systems such as the Google File System. However, we claim that building a massively-parallel search engine using a parallel DBMS can be an attractive alternative since it supports a higher-level (i.e., SQL-level) interface than that of a distributed file system for easy and less error-prone application development while providing scalability. In this paper, we propose a new approach of building a massively-parallel search engine using a DB-IR tightly-integrated parallel DBMS and demonstrate its commercial-level scalability and performance. In addition, we present a hybrid (i.e., analytic and experimental) performance model for the parallel search engine. We have built a five-node parallel search engine according to the proposed architecture using a DB-IR tightly-integrated DBMS. Through extensive experiments, we show the correctness of the model by comparing the projected output with the experimental results of the five-node engine. Our model demonstrates that ODYS is capable of handling 1 billion queries per day (81 queries/sec) for 30 billion web pages by using only 43,472 nodes with an average query response time of 211 ms, which is equivalent to or better than those of commercial search engines. We also show that, by using twice as many (86,944) nodes, ODYS can provide an average query response time of 162 ms, which is significantly lower than those of commercial search engines. Comment: 34 pages, 13 figures

    PRINS: Resistive CAM Processing in Storage

    Near-data in-storage processing research has been gaining momentum in recent years. A typical processing-in-storage architecture places one or several processing cores inside the storage and allows data to be processed without transferring it to the host CPU. Since this approach replicates the von Neumann architecture inside storage, it is exposed to the problems faced by the von Neumann architecture, especially the bandwidth wall. We present PRINS, a novel in-data processing-in-storage architecture based on Resistive Content Addressable Memory (RCAM). PRINS functions simultaneously as storage and as a massively parallel associative processor. PRINS alleviates the bandwidth wall faced by conventional processing-in-storage architectures by keeping the computing inside the storage arrays, thus implementing in-data, rather than near-data, processing. We show that PRINS may outperform a reference computer architecture with a bandwidth-limited external storage. The performance of the PRINS Euclidean distance, dot product and histogram implementations exceeds the attainable performance of a reference architecture by up to four orders of magnitude, depending on the dataset size. The performance of PRINS SpMV may exceed the attainable performance of such a reference architecture by more than two orders of magnitude.

    Optimization of Lattice Boltzmann Simulations on Heterogeneous Computers

    High-performance computing systems are increasingly based on accelerators. Computing applications targeting those systems often follow a host-driven approach in which hosts offload almost all compute-intensive sections of the code onto accelerators; this approach only marginally exploits the computational resources available on the host CPUs, limiting performance and energy efficiency. The obvious step forward is to run compute-intensive kernels in a concurrent and balanced way on both hosts and accelerators. In this paper we consider exactly this problem for a class of applications based on Lattice Boltzmann Methods, widely used in computational fluid dynamics. Our goal is to develop just one program that is portable and able to run efficiently on several different combinations of hosts and accelerators. To reach this goal, we define common data layouts that enable the code to exploit efficiently the different parallel and vector options of the various accelerators and that match the possibly different requirements of the compute-bound and memory-bound kernels of the application. We also define models and metrics that predict the best partitioning of workloads among host and accelerator, and the optimally achievable overall performance level. We test the performance of our codes and their scaling properties using as testbeds HPC clusters incorporating different accelerators: Intel Xeon-Phi many-core processors, NVIDIA GPUs and AMD GPUs.
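
    The idea of balancing work between host and accelerator can be illustrated with a minimal sketch. It assumes a simple throughput-proportional model, which is not the paper's own model or metric: each device processes lattice sites at a constant rate, communication is free, and the split is chosen so that both devices finish at about the same time. The function name `balanced_split` and the example rates are hypothetical.

```python
def balanced_split(host_rate, acc_rate, n_sites):
    """Split n_sites between host and accelerator so both finish together.

    Assumptions (for illustration only): each device has a constant
    throughput in sites per second and there is no communication cost.
    """
    if host_rate <= 0 or acc_rate <= 0:
        raise ValueError("rates must be positive")
    # Give each device a share proportional to its throughput.
    host_fraction = host_rate / (host_rate + acc_rate)
    host_sites = round(n_sites * host_fraction)
    acc_sites = n_sites - host_sites
    # The devices run concurrently, so a step takes as long as the slower one.
    step_time = max(host_sites / host_rate, acc_sites / acc_rate)
    return host_sites, acc_sites, step_time


# Hypothetical rates: the accelerator is three times faster than the host.
host_sites, acc_sites, step_time = balanced_split(1.0, 3.0, 1000)
```

    Under these assumptions, offloading everything to the accelerator would take 1000 / 3.0 time units per step, so even a slow host shortens the step when it takes a proportional share.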

    Sailfish: a flexible multi-GPU implementation of the lattice Boltzmann method

    We present Sailfish, an open source fluid simulation package implementing the lattice Boltzmann method (LBM) on modern Graphics Processing Units (GPUs) using CUDA/OpenCL. We take a novel approach to GPU code implementation and use run-time code generation techniques and a high-level programming language (Python) to achieve state-of-the-art performance, while allowing easy experimentation with different LBM models and tuning for various types of hardware. We discuss the general design principles of the code, scaling to multiple GPUs in a distributed environment, as well as the GPU implementation and optimization of many different LBM models, both single component (BGK, MRT, ELBM) and multicomponent (Shan-Chen, free energy). The paper also presents results of performance benchmarks spanning the last three NVIDIA GPU generations (Tesla, Fermi, Kepler), which we hope will be useful for researchers working with this type of hardware and similar codes. Comment: 36 pages, 15 figures

    Fault tolerant Quantum Information Processing with Holographic control

    We present a fault-tolerant semi-global control strategy for universal quantum computers. We show that an N-dimensional array of qubits where only (N-1)-dimensional addressing resolution is available is compatible with fault-tolerant universal quantum computation. What is more, we show that measurements and individual control of qubits are required only at the boundaries of the fault-tolerant computer, i.e. holographic fault-tolerant quantum computation. Our model alleviates the heavy physical conditions on current qubit candidates imposed by addressability requirements and represents an option to improve their scalability. Comment: 20 pages. Comments are welcome

    Scalable GW software for quasiparticle properties using OpenAtom

    The GW method, which can accurately describe electronic excitations, is one of the most widely used ab initio electronic structure techniques and allows the physics of both molecular and condensed phase materials to be studied. However, applications of the GW method to large systems require supercomputers and highly parallelized software to overcome the high computational complexity of the method, which scales as O(N^4). Here, we develop efficient massively-parallel GW software for the plane-wave basis set by revisiting the standard GW formulae in order to discern the optimal approaches for each phase of the GW calculation for massively parallel computation. These best numerical practices are implemented into the OpenAtom software, which is written on top of the charm++ parallel framework. We then evaluate the performance of our new software using a range of system sizes. Our GW software shows significantly improved parallel scaling compared to publicly available GW software on the Mira and Blue Waters supercomputers, two of the largest and most powerful platforms in the world. Comment: 48 pages, 10 figures

    A GPU-based Large-scale Monte Carlo Simulation Method for Systems with Long-range Interactions

    In this work we present an efficient implementation of canonical Monte Carlo simulation for Coulomb many-body systems on graphics processing units (GPUs). Our method takes advantage of the GPU Single Instruction, Multiple Data (SIMD) architecture. It adopts the sequential updating scheme of the Metropolis algorithm and makes no approximation in the computation of energy. It reaches a remarkable 440-fold speedup compared with the serial implementation on a CPU. We use this method to simulate primitive model electrolytes. We measure very precisely all ion-ion pair correlation functions at high concentrations, and extract the renormalized Debye length, renormalized valences of constituent ions, and renormalized dielectric constants. These results demonstrate unequivocally physics beyond the classical Poisson-Boltzmann theory.
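
    The sequential updating scheme mentioned above can be sketched on a CPU in a few lines of Python. This is a toy illustration, not the paper's GPU code: it uses bare point charges in reduced units in open space and omits the hard cores of the primitive model and any periodic boundary treatment. All function names and the sample configuration are assumptions.

```python
import math
import random


def pair_energy(qi, qj, ri, rj):
    """Bare Coulomb energy of two point charges (reduced units, assumed)."""
    return qi * qj / math.dist(ri, rj)


def energy_of(i, pos_i, charges, positions):
    """Exact energy of particle i, placed at pos_i, with all other particles."""
    return sum(
        pair_energy(charges[i], charges[j], pos_i, positions[j])
        for j in range(len(positions))
        if j != i
    )


def metropolis_sweep(charges, positions, beta, step, rng):
    """Attempt one trial move per particle, in order (sequential updating)."""
    accepted = 0
    for i in range(len(positions)):
        old = positions[i]
        new = tuple(x + rng.uniform(-step, step) for x in old)
        delta = (energy_of(i, new, charges, positions)
                 - energy_of(i, old, charges, positions))
        # Metropolis rule: always accept downhill moves, and accept uphill
        # moves with probability exp(-beta * delta).
        if delta <= 0 or rng.random() < math.exp(-beta * delta):
            positions[i] = new
            accepted += 1
    return accepted


# Hypothetical four-ion configuration, for illustration only.
rng = random.Random(0)
charges = [1, -1, 1, -1]
positions = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 2.0, 0.0)]
n_accepted = metropolis_sweep(charges, positions, beta=1.0, step=0.1, rng=rng)
```

    The energy difference of each trial move is a sum over all other particles with no cutoff. Those sums are independent of one another, so this is the part of such a scheme that a GPU could evaluate in parallel while the moves themselves stay sequential.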

    A massively parallel algorithm for constructing the BWT of large string sets

    We present a new scalable, lightweight algorithm to incrementally construct the BWT and FM-index of large string sets such as those produced by Next Generation Sequencing. The algorithm is designed for massive parallelism and can effectively exploit the combination of low-capacity, high-bandwidth memory and slower external system memory typical of GPU-accelerated systems. In particular, for a string set of n characters from an alphabet with \sigma symbols, it uses a constant amount of high-bandwidth memory and at most 3n log(\sigma) bits of system memory. Given that deep memory hierarchies are becoming a pervasive trait of high performance computing architectures, we believe this to be a relevant feature. The implementation can handle reads of arbitrary length and is up to 2 and 6.5 times faster than the state of the art for short and long genomic reads, respectively.
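
    For readers unfamiliar with the BWT, the transform itself can be shown with a deliberately naive sketch for a single string. This is the textbook sort-all-rotations definition, not the incremental, parallel algorithm described above, and it uses far more memory. The function names and the "$" sentinel, which is assumed to sort before every character of the input, are illustrative choices.

```python
def naive_bwt(text, sentinel="$"):
    """Burrows-Wheeler transform by sorting all rotations (quadratic memory)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def naive_inverse_bwt(bwt, sentinel="$"):
    """Invert the transform by repeatedly prepending the BWT column and sorting."""
    table = [""] * len(bwt)
    for _ in range(len(bwt)):
        table = sorted(bwt[i] + table[i] for i in range(len(bwt)))
    # The original string is the rotation that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


transformed = naive_bwt("banana")
restored = naive_inverse_bwt(transformed)
```

    The transform tends to group equal characters into runs, which is what makes the BWT and the FM-index built on it attractive for compressed indexing of sequencing reads.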

    Making the case of GPUs in courses on computational physics

    Most relatively modern desktop or even laptop computers contain a graphics card useful for more than showing colors on a screen. In this paper, we make a case for why you should learn enough about GPU (graphics processing unit) computing to use it as an accelerator for, or even a replacement for, your CPU code. We include an example of our own as a case study to show what can be realistically expected. Comment: 11 pages, 2 figures