Network Traffic Anomaly-Detection Framework Using GPUs
Network security is crucial for the software industry. Deep packet inspection (DPI) is one of the most widely used approaches to enforcing network security. Due to the high volume of network traffic, it is challenging to achieve high DPI performance in real time. In this thesis, a new DPI framework is presented that accelerates packet header checking and payload inspection on graphics processing units (GPUs). Various optimizations were applied to the GPU-based packet inspection, such as thread-level and block-level packet assignment, warp divergence elimination, and memory transfer optimization using pinned memory and shared memory. The performance of the pattern-matching algorithms used for DPI was analyzed using an assorted set of characteristics such as pipeline stalls, shared memory efficiency, warp efficiency, issue slot utilization, and cache hits. The extensive characterization of the algorithms on the GPU architecture and the performance comparison among parallel pattern-matching algorithms on both the GPU and the CPU are the unique contributions of this thesis. Among the GPU-based algorithms, the Aho-Corasick and Wu-Manber algorithms outperformed the Rabin-Karp algorithm because they scan the input only once for multiple signatures, using tables generated before the searching phase begins. According to my evaluation on an NVIDIA K80 GPU, GPU-accelerated packet processing achieved at least 60 times better performance than CPU-based processing.
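The single-pass, table-driven property that the abstract credits for Aho-Corasick's advantage over Rabin-Karp can be illustrated with a minimal plain-Python sketch. This is not the thesis's GPU code; the function names and structure are my own, and the point is only that the goto/fail/output tables are built once, before searching, so one pass over a payload finds every signature.

```python
from collections import deque

def build_automaton(patterns):
    """Precompute the Aho-Corasick goto/fail/output tables before searching."""
    goto = [{}]            # goto[state][char] -> next state
    output = [set()]       # signatures that end at each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(pat)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())      # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, nxt in goto[s].items():
            queue.append(nxt)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] |= output[fail[nxt]]   # inherit matches via fail link
    return goto, fail, output

def search(text, goto, fail, output):
    """A single pass over the payload reports every signature match."""
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in output[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

A Rabin-Karp matcher, by contrast, would repeat its rolling-hash scan per distinct signature length, which is the per-signature cost the abstract says the table-driven algorithms avoid.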
GPU-Acceleration of In-Memory Data Analytics
Hardware advances strongly influence database system design. The flattening speed of CPU cores makes many-core accelerators, such as GPUs, a vital alternative to explore for processing the ever-increasing amounts of data. GPUs have a significantly higher degree of parallelism than multi-core CPUs, but their cores are simpler. As a result, they do not face the power constraints limiting the parallelism of CPUs. Their trade-off, however, is increased implementation complexity. This thesis adapts and redesigns data analytics operators to better exploit the GPU's special memory and threading model. Due to increasing memory capacity and the user's need for fast interaction with the data, we focus on in-memory analytics.
Our techniques span different steps of the data processing pipeline: (1) data preprocessing, (2) query compilation, and (3) algorithmic optimization of the operators. Our data preprocessing techniques adapt the data layout for numeric and string columns to maximize the achieved GPU memory bandwidth. Our query compilation techniques compute the optimal execution plan for conjunctive filters. We formulate memory divergence for string matching algorithms and suggest how to eliminate it. Finally, we parallelize decompression algorithms in our compression framework Gompresso to fit more data into the limited GPU memory. Gompresso achieves high speed-ups on GPUs over state-of-the-art multi-core CPU libraries and is suitable for any massively parallel processor.
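The "optimal execution plan for conjunctive filters" problem has a classical greedy solution: when later predicates run only on rows that survived earlier ones, ordering predicates by cost divided by rejection rate minimizes expected per-row cost. The sketch below shows that rank ordering under an assumed (name, cost, selectivity) input format; it is an illustration of the underlying principle, not the thesis's actual planner.

```python
def order_conjunctive_filters(filters):
    """Greedy rank ordering for short-circuit conjunctions.

    filters: list of (name, cost, selectivity), where selectivity is the
    fraction of rows that *pass* the predicate (hypothetical format).
    Sorting by cost / (1 - selectivity) ascending minimizes expected cost.
    """
    return sorted(
        filters,
        key=lambda f: f[1] / (1.0 - f[2]) if f[2] < 1.0 else float("inf"),
    )

def expected_cost(ordered):
    """Expected per-row cost when each filter sees only surviving rows."""
    cost, survive = 0.0, 1.0
    for _name, c, s in ordered:
        cost += survive * c
        survive *= s
    return cost
```

For example, a cheap half-rejecting filter ("a", cost 1.0, selectivity 0.5) is scheduled before a pricier, more selective one ("b", cost 2.0, selectivity 0.1), because the expected cost 1.0 + 0.5·2.0 = 2.0 beats the reverse order's 2.0 + 0.1·1.0 = 2.1.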
Lightweight speculative support for aggressive auto-parallelisation tools
With the recent move to multi-core architectures, it has become important to create the means to exploit the performance made available by these architectures. Unfortunately, parallel programming is often a difficult and time-intensive process, even for expert programmers. Auto-parallelisation tools have aimed to fill the performance gap this has created, but the static analysis commonly employed by such tools is unable to provide the required performance improvements due to a lack of information at compile time. More recent aggressive parallelisation tools use profiled execution to discover new parallel opportunities, but these tools are inherently unsafe. They require either manual confirmation that their changes are safe, completely ruling out auto-parallelisation, or they rely upon speculative execution such as software thread-level speculation (SW-TLS) to confirm safe execution at runtime.
SW-TLS schemes are currently very heavyweight and often fail to provide speedups for a program. Performance gains depend upon suitable parallel opportunities, correct selection and configuration, and appropriate execution platforms. Little research has been completed into the automated implementation of SW-TLS programs.
This thesis presents an automated, machine-learning-based technique to select and configure suitable speculation schemes when appropriate. This is performed by extracting metrics from potential parallel opportunities and using them to determine whether a loop is suitable for speculative execution and, if so, which speculation policy should be used. An extensive evaluation of this technique is presented, verifying that SW-TLS configuration can indeed be automated and provide reliable performance gains. This work has shown that on an 8-core machine, speedups of up to 7.75X, with a geometric mean of 1.64X, can be obtained through automatic configuration, providing on average 74% of the speedup obtainable through manual configuration.
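The shape of "extract metrics, then decide whether and how to speculate" can be sketched as a decision function. The thresholds and feature names below are entirely hypothetical stand-ins for the thesis's trained model, which would learn such boundaries from profiling data rather than hard-code them.

```python
def select_speculation_policy(metrics):
    """Decide whether a loop should run speculatively and under which policy.

    metrics: dict of profiled loop features. All feature names and cutoffs
    here are illustrative assumptions, not the thesis's learned model.
    Returns a policy name, or None to fall back to sequential execution.
    """
    if metrics["dependence_probability"] > 0.05:
        return None                   # too likely to violate: run sequentially
    if metrics["iteration_count"] < 64:
        return None                   # too little work to amortise overheads
    if metrics["write_set_size"] < 1024:
        return "lazy-versioning"      # small write sets favour buffering
    return "eager-conflict-check"     # large write sets favour early detection
```

In the thesis's setting this rule would be replaced by a classifier trained on loops labelled with their best-performing scheme, but the interface, features in and policy out, is the same.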
Beyond automated configuration, this thesis explores the idea that many SW-TLS schemes focus too heavily on recovery after detecting a dependence violation. Doing so often results in worse-than-sequential performance for many real-world applications. This work therefore hypothesises that many highly likely parallel candidates, discovered through aggressive parallelisation techniques, would benefit from a simple dependence check without the ability to roll back. Dependence violations become extremely expensive in this scenario, but they should be incredibly rare. With a thorough evaluation of the technique, this thesis confirms the hypothesis whilst achieving speedups of up to 22.53X and a geometric mean of 2.16X on a 32-core machine. In a competitive scheduling scenario, performance loss can be restricted to no worse than sequential speed, even when a dependence has been detected.
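A check-only scheme of this kind needs nothing more than shadow records of which iteration touched which address; it reports whether the speculative run was safe, with no undo log to maintain. The sketch below is my simplified rendering of that idea over hypothetical per-iteration access traces, not the thesis's implementation.

```python
def check_dependences(iterations, reads, writes):
    """Check-only speculation: detect cross-iteration conflicts, no rollback.

    reads/writes: dicts mapping each iteration to the set of addresses it
    read/wrote (a hypothetical trace format). Returns True if the parallel
    run was dependence-free; False signals a violation, which in a
    competitive-scheduling setup means discarding the speculative result
    and keeping the concurrently running sequential version's output.
    """
    last_writer = {}
    for i in iterations:
        for addr in writes[i]:
            if addr in last_writer and last_writer[addr] != i:
                return False          # write-write conflict across iterations
            last_writer[addr] = i
    for i in iterations:
        for addr in reads[i]:
            if addr in last_writer and last_writer[addr] != i:
                return False          # read of a location another iteration wrote
    return True
```

Because the check only ever appends to shadow maps and compares, its cost is a small constant per access, which is what makes the no-rollback bet attractive when violations are rare.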
To lower costs further, this thesis explores other platforms to aid in the execution of speculative error checking. It introduces the use of a GPU to offload some of the checking costs during execution, confirming that an auxiliary device is a legitimate means to obtain further speedup. Evaluation demonstrates that doing so can achieve up to 14.74X and a geometric mean of 1.99X speedup on a 12-core hyperthreaded machine. Compared to standard CPU-only techniques this performs slightly slower, with a geometric mean of 0.96X speedup; however, this is likely to improve with upcoming GPU designs.
With the knowledge that GPUs can be used to reduce speculation costs, this thesis also investigates their use to speculatively improve execution times. Presented is a novel SW-TLS scheme that targets GPU-based execution for use with aggressive auto-parallelisers. This scheme is executed using a competitive scheduling model, ensuring performance is no lower than sequential execution, whilst being able to provide speedups of up to 99X and on average 3.2X over sequential. On average this technique outperformed static analysis alone by a factor of 7X, achieved approximately 99% of the speedup obtained from manual parallel implementations, and outperformed the state-of-the-art in GPU SW-TLS by a factor of 1.45.
Synthesis of Embedded Software using Dataflow Schedule Graphs
In the design and implementation of digital signal processing (DSP) systems,
dataflow is recognized as a natural model for specifying applications, and
dataflow enables useful model-based methodologies for analysis, synthesis, and
optimization of implementations. A wide range of embedded signal processing
applications can be designed efficiently using the high level abstractions that
are provided by dataflow programming models. In addition to their use in
parallelizing computations for faster execution, dataflow graphs have
additional advantages that stem from their modularity and formal foundation.
An important problem in the development of dataflow-based design tools is the
automated synthesis of software from dataflow representations.
In this thesis, we develop new software synthesis techniques for dataflow based
design and implementation of signal processing systems. An important task in
software synthesis from dataflow graphs is that of scheduling. Scheduling
refers to the assignment of actors to processing resources and the ordering of
actors that share the same resource. Scheduling typically involves very complex
design spaces, and has a significant impact on most relevant implementation
metrics, including latency, throughput, energy consumption, and memory
requirements. In this thesis, we integrate a model-based representation,
called the dataflow schedule graph (DSG), into the software
synthesis process. The DSG approach allows designers to model a schedule for a
dataflow graph as a separate dataflow graph, thereby providing a formal,
abstract (platform- and language-independent) representation for the schedule.
While we demonstrate this DSG-integrated software synthesis capability by
translating DSGs into OpenCL implementations, the use of a model-based schedule
representation makes the approach readily retargetable to other implementation
languages. We also investigate a number of optimization techniques to improve
the efficiency of software that is synthesized from DSGs.
Through experimental evaluation of the generated software, we demonstrate the
correctness and efficiency of our new techniques for dataflow-based
software synthesis and optimization.
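The core DSG idea, a schedule represented as data that a generic engine interprets, rather than control flow baked into generated code, can be shown with a drastically simplified looped-schedule interpreter. This sketch omits everything that makes real DSGs interesting (reference actors, inter-processor synchronization, the OpenCL backend) and only illustrates the separation of schedule from actors.

```python
def execute_schedule(schedule, actors):
    """Fire actors by interpreting a schedule given as data.

    schedule: list of (repeat_count, actor_name) pairs, e.g. a looped
    schedule [(2, "A"), (3, "B")] — a toy stand-in for traversing a
    dataflow schedule graph. actors: dict of actor name -> firing function.
    Returns the firing trace for inspection.
    """
    trace = []
    for count, name in schedule:
        for _ in range(count):
            actors[name]()        # fire the actor once
            trace.append(name)
    return trace
```

Because the schedule is an ordinary data structure, retargeting means swapping the interpreter (here Python; in the thesis, generated OpenCL) while the schedule representation stays fixed, which is the retargetability claim in the abstract.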
Doctor of Philosophy dissertation
High Performance Computing (HPC) on-node parallelism is of extreme importance to guarantee and maintain scalability across large clusters of hundreds of thousands of multicore nodes. HPC programming is dominated by the hybrid model "MPI + X", with MPI exploiting parallelism across nodes and "X" as some shared-memory parallel programming model accomplishing multicore parallelism across CPUs or GPUs. OpenMP has become the de facto "X" standard in HPC for exploiting the multicore architectures of modern CPUs. Data races are among the most common and insidious concurrency errors in shared-memory programming models, and OpenMP programs are not immune to them. The ease of parallelizing programs that OpenMP provides can make programs prone to data races, which become hard to find in large applications with thousands of lines of code. Unfortunately, prior tools are unable to impact practice owing to their poor coverage or poor scalability. In this work, we develop several new approaches for low-overhead data race detection. Our approaches aim to guarantee high precision and accuracy of race checking while maintaining low runtime and memory overhead. We present two race checkers for C/C++ OpenMP programs that target two different classes of programs. The first, ARCHER, is fast but requires a large amount of memory, so it ideally targets applications that require only a small portion of the available on-node memory. SWORD, on the other hand, combines fast, zero-memory-overhead data collection with an offline analysis that can take a long time, though it often reports most races quickly. Given that race checking was previously impossible for large OpenMP applications, our contributions are the best available advances in what is known to be a difficult NP-complete problem. We performed an extensive evaluation of the tools on existing OpenMP programs and HPC benchmarks.
Results show that both tools guarantee to identify all the races of a program in a given run without reporting any false alarms. The tools are user-friendly and hence serve as an important instrument in the daily work of programmers, helping them identify data races early during development and production testing. Furthermore, our demonstrated success on real-world applications places these tools among the foremost debugging tools for scientists at large.
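The precision claim, all races in a given run, no false alarms, rests on happens-before reasoning: two accesses race only if they touch the same address, at least one writes, and neither is ordered before the other. A minimal vector-clock formulation of that test is sketched below over a hypothetical access-trace format; it is a conceptual illustration, not ARCHER's or SWORD's actual analysis.

```python
def happens_before(c1, c2):
    """True if vector clock c1 precedes c2: c1 <= c2 componentwise, c1 != c2."""
    keys = set(c1) | set(c2)
    return all(c1.get(k, 0) <= c2.get(k, 0) for k in keys) and c1 != c2

def find_races(accesses):
    """Report unordered conflicting access pairs.

    accesses: list of (thread_id, address, is_write, vector_clock) tuples —
    an assumed trace format used only for this sketch. Same-thread pairs
    are ordered by program order and skipped.
    """
    races = []
    for i in range(len(accesses)):
        for j in range(i + 1, len(accesses)):
            t1, a1, w1, c1 = accesses[i]
            t2, a2, w2, c2 = accesses[j]
            if a1 == a2 and (w1 or w2) and t1 != t2:
                if not happens_before(c1, c2) and not happens_before(c2, c1):
                    races.append((i, j))   # concurrent conflicting accesses
    return races
```

Reporting a pair only when neither clock precedes the other is what rules out false alarms: any pair separated by synchronization (a barrier, a lock hand-off) acquires comparable clocks and is never flagged.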
Analysis and application of Fourier-Motzkin variable elimination to program optimization : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Albany, New Zealand
This thesis examines four of the most influential dependence analysis techniques in use by optimizing compilers:
Fourier-Motzkin Variable Elimination, the Banerjee Bounds Test, the Omega Test, and the I-Test.
Although the performance and effectiveness of these tests have previously been documented empirically,
no in-depth analysis of how these techniques are related from a purely analytical perspective has been done.
The analysis given here clarifies important aspects of the empirical results that were noted but never fully
explained. A tighter bound than previously known is proved on the performance of one of the Omega Test algorithms, and a link is shown between the integer refinement technique used in the Omega Test and the well-known Frobenius Coin Problem. The application of a Fourier-Motzkin-based algorithm to the elimination of redundant bounds checks in Java bytecode is described. A system which incorporated this technique improved performance on the Java Grande Forum Benchmark Suite by up to 10 percent.
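One Fourier-Motzkin elimination step can be stated compactly: normalise every constraint mentioning x_k into an upper bound (x_k + u·x' <= ub) or a lower bound (-x_k + l·x' <= lb), then replace them by the sum of every lower/upper pair, (l+u)·x' <= lb + ub, which is free of x_k. The sketch below implements exactly that textbook step with exact rationals; it is a minimal illustration, not the thesis's bounds-check eliminator.

```python
from fractions import Fraction

def fm_eliminate(constraints, k):
    """Eliminate variable x[k] from a system of linear inequalities.

    constraints: list of (coeffs, bound) pairs meaning
    sum(coeffs[i] * x[i]) <= bound. Returns an equivalent system in which
    every constraint has a zero coefficient for x[k].
    """
    lower, upper, rest = [], [], []
    for a, b in constraints:
        c = Fraction(a[k])
        if c > 0:       # normalise to  x_k + u.x' <= ub  (upper bound on x_k)
            upper.append(([Fraction(x) / c for x in a], Fraction(b) / c))
        elif c < 0:     # normalise to -x_k + l.x' <= lb  (lower bound on x_k)
            lower.append(([Fraction(x) / -c for x in a], Fraction(b) / -c))
        else:
            rest.append(([Fraction(x) for x in a], Fraction(b)))
    # each (lower, upper) pair sums to a constraint with no x_k term
    for la, lb in lower:
        for ua, ub in upper:
            rest.append(([la[i] + ua[i] for i in range(len(la))], lb + ub))
    return rest
```

For a bounds-check example: from 1 <= x0, x0 <= 3, and x0 + x1 <= 5, eliminating x0 derives the implied bound x1 <= 4, which a checker could then compare against an array length. The pairwise products are also why repeated elimination can blow up, the cost behaviour the thesis analyses.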
Investigating performance portability of a highly scalable particle-in-cell simulation code on various multi-core architectures
The alpaka library defines and implements an abstract hierarchical redundant parallelism model. This model exploits parallelism and memory hierarchies on a node at all levels available in current hardware. This makes it possible to achieve portability of performant codes across various types of accelerators by ignoring specific unsupported levels and utilizing only those supported on a specific accelerator. All hardware types (multi- and many-core CPUs, GPUs, and other accelerators) are treated and can be programmed in the same way. The C++ template interface provided allows for straightforward extension of the library to support other accelerators and specialization of its internals for optimization.
Communication reduction techniques in numerical methods and deep neural networks
Inter-node communication has turned out to be one of the determining factors of performance on modern HPC systems. Furthermore, the situation only gets worse with the ever-increasing number of cores involved. Hence, this thesis explores various techniques to reduce communication during the execution of a parallel program. It turns out that there is no one-size-fits-all approach to the challenge; rather, the problems in each field, owing to their unique characteristics, offer distinct opportunities for communication reduction. The thesis first delves into numerical linear algebra and develops an evolution of the Pipelined CG algorithm called IFCG. It eliminates the synchronizations that normally take place towards the end of each iteration in order to increase parallelism. Secondly, the thesis turns its attention to reducing the need to transfer parameters between the CPU host and GPUs during neural network training. It develops two routines, ADT and AWP, to compress and decompress the weights with a reduced data-representation format before and right after the data transfer takes place. The compression rate is adjusted according to the L2 norm of the weights of every layer. In the third contribution, the thesis reduces the communication involved in model-parallelizing a deep neural network. Instead of splitting and distributing the neurons of every layer across the available processes on the system, this is done only for every other layer, with the remaining layers replicated. This yields a 50% reduction in communication while introducing 50% extra local floating-point computation.
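The compress-before-transfer idea, quantize weights to a narrower representation, move them, then reconstruct, with the width chosen per layer from the L2 norm, can be sketched in a few lines. Everything below (function names, the uniform quantizer, the norm threshold) is an illustrative assumption, not the thesis's ADT/AWP code.

```python
import math

def compress_layer(weights, bits):
    """Uniform quantization of one layer's weights to `bits` bits each,
    standing in for a reduced data-representation format before transfer."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [round((w - lo) / scale) for w in weights]   # small ints to transfer
    return q, lo, scale

def decompress_layer(q, lo, scale):
    """Reconstruct approximate weights right after the transfer."""
    return [lo + v * scale for v in q]

def pick_bits(weights, lo_bits=4, hi_bits=16, threshold=1.0):
    """Choose the representation width from the layer's L2 norm — the
    per-layer adjustment the abstract describes, with made-up thresholds:
    layers with larger norms keep more precision."""
    l2 = math.sqrt(sum(w * w for w in weights))
    return hi_bits if l2 > threshold else lo_bits
```

The payoff is bandwidth: at 4 bits per weight instead of 32, the host-to-GPU transfer shrinks eightfold, at the cost of a bounded reconstruction error of at most half a quantization step per weight.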