
    Multi-pattern Matching Technology Based on CUDA

    Taking the classic multi-pattern matching algorithm, the Aho-Corasick (AC) algorithm, as an example, this paper analyzes the characteristics of CUDA, proposes a CUDA-based parallel model, and designs an AC matching algorithm suited to CUDA's parallel architecture. Experimental results show that the CUDA-based AC matching algorithm achieves a 22x speedup over its CPU counterpart, effectively improving the performance of intrusion detection systems.
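
    The paper does not reproduce its source code; as a rough illustration of the packet-per-thread mapping such designs typically use, the sketch below walks each packet through a precomputed AC state-transition table held in GPU global memory. All names, the table layout, and the one-packet-per-thread assignment are assumptions for illustration, not the paper's implementation.

```cuda
// Minimal sketch of packet-per-thread Aho-Corasick matching on CUDA.
// The DFA (transition + output tables) is assumed to be built on the
// host beforehand; names and layout are illustrative.
#include <cuda_runtime.h>
#include <cstdint>

__global__ void ac_match_kernel(const uint8_t *payloads,   // packed packet payloads
                                const int *offsets,        // start offset of each packet
                                const int *lengths,        // length of each packet
                                const int *transition,     // [num_states * 256] DFA table
                                const uint8_t *is_output,  // 1 if state emits a match
                                int *match_counts,         // per-packet match counter
                                int num_packets)
{
    int pkt = blockIdx.x * blockDim.x + threadIdx.x;
    if (pkt >= num_packets) return;

    const uint8_t *p = payloads + offsets[pkt];
    int state = 0, matches = 0;

    for (int i = 0; i < lengths[pkt]; ++i) {
        state = transition[state * 256 + p[i]];  // one DFA step per input byte
        matches += is_output[state];
    }
    match_counts[pkt] = matches;
}
```

    A design caveat with this mapping: neighboring threads read different packets, so payload accesses are not coalesced; GPU AC implementations therefore often stage payload chunks in shared memory or place the transition table in texture/constant memory.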

    Traffic Classification over Gbit Speed with Commodity Hardware

    This paper discusses the necessary components of a GPU-assisted traffic classification method that is capable of multi-Gbps speeds on commodity hardware. The majority of the traffic classification is pushed to the GPU to offload the CPU, which may then serve other processing-intensive tasks, e.g., traffic capture. The paper presents two massively parallelizable algorithms suitable for GPUs. The first performs signature search using a modification of Zobrist hashing. The second supports connection-pattern-based analysis and aggregation of matches using a parallel-prefix-sum algorithm adapted to the GPU. Performance tests of the proposed methods showed that traffic classification is possible up to approximately 6 Gbps with a commodity PC.
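
    The paper's prefix-sum adaptation is not reproduced here; as a textbook stand-in for the aggregation step it describes, the sketch below is a standard single-block Hillis-Steele inclusive scan over per-element match flags in shared memory. The kernel name and launch convention are my own, not the paper's.

```cuda
// Single-block inclusive prefix sum (Hillis-Steele) over match flags,
// one standard way to aggregate and compact match results on a GPU.
// Assumes blockDim.x is a power of two and n <= blockDim.x.
#include <cuda_runtime.h>

__global__ void inclusive_scan(const int *flags, int *prefix, int n)
{
    extern __shared__ int temp[];
    int tid = threadIdx.x;

    temp[tid] = (tid < n) ? flags[tid] : 0;
    __syncthreads();

    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        int val = (tid >= stride) ? temp[tid - stride] : 0;  // read phase
        __syncthreads();
        temp[tid] += val;                                    // write phase
        __syncthreads();
    }
    if (tid < n) prefix[tid] = temp[tid];
}
```

    Launched as inclusive_scan<<<1, threads, threads * sizeof(int)>>>(flags, prefix, n), the result gives each flagged element its output slot (prefix[i] - 1), so sparse matches can be compacted into a dense buffer; real multi-block scans additionally chain per-block sums.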

    Network Traffic Anomaly-Detection Framework Using GPUs

    Network security is crucial for the software industry. Deep packet inspection (DPI) is one of the most widely used approaches to enforcing network security. Due to the high volume of network traffic, it is challenging to achieve high performance for DPI in real time. In this thesis, a new DPI framework is presented that accelerates packet header checking and payload inspection on graphics processing units (GPUs). Various optimizations were applied to the GPU-version packet inspection, such as thread-level and block-level packet assignment, warp-divergence elimination, and memory-transfer optimization using pinned memory and shared memory. The performance of the pattern-matching algorithms used for DPI was analyzed using an assorted set of characteristics such as pipeline stalls, shared-memory efficiency, warp efficiency, issue-slot utilization, and cache hits. The extensive characterization of the algorithms on the GPU architecture and the performance comparison among parallel pattern-matching algorithms on both the GPU and the CPU are the unique contributions of this thesis. Among the GPU-version algorithms, the Aho-Corasick and Wu-Manber algorithms outperformed the Rabin-Karp algorithm because they are executed only once for multiple signatures, using tables generated before the searching phase begins. According to my evaluation on an NVIDIA K80 GPU, the GPU-accelerated packet processing achieved at least 60 times better performance than the CPU-version processing.
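
    Of the optimizations listed above, the pinned-memory transfer is the easiest to sketch in isolation: page-locked host buffers allow truly asynchronous copies that can overlap with inspection kernels. The sketch below uses only standard CUDA runtime calls; inspect_kernel and the buffer handling are placeholders, not the thesis's code.

```cuda
// Host-side sketch of the pinned-memory transfer optimization: page-locked
// buffers plus a stream let packet copies overlap with inspection kernels.
#include <cuda_runtime.h>
#include <cstdint>
#include <cstring>

void run_inspection(const uint8_t *packets, size_t bytes)
{
    uint8_t *h_pinned, *d_buf;
    cudaHostAlloc((void **)&h_pinned, bytes, cudaHostAllocDefault);  // pinned host buffer
    cudaMalloc((void **)&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    memcpy(h_pinned, packets, bytes);
    // cudaMemcpyAsync is only truly asynchronous from pinned memory.
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
    // inspect_kernel<<<grid, block, 0, stream>>>(d_buf, ...);  // placeholder launch
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
}
```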

    Requirements for Energy-Harvesting-Driven Edge Devices Using Task-Offloading Approaches

    Energy limitations remain a key concern in the development of Internet of Medical Things (IoMT) devices, since most of them have limited energy sources, mainly batteries. Providing a sustainable and autonomous power supply is therefore essential, as it allows continuous energy sensing, flexible positioning, less human intervention, and easy maintenance. In the last few years, extensive investigations have been conducted to develop energy-autonomous systems for the IoMT by implementing energy-harvesting (EH) technologies as a feasible and economically practical alternative to batteries. To this end, various EH solutions have been developed for wearables to enhance power-extraction efficiency, such as integrating resonant energy-extraction circuits (SSHI, S-SSHI, and P-SSHI) connected to common energy-storage units to maintain a stable output for charging loads. These circuits can increase the harvested power by 174% compared to the SEH circuit. Although IoMT devices are becoming increasingly powerful and more affordable, some tasks, such as machine-learning algorithms, still require intensive computational resources, leading to higher energy consumption. Offloading computation-intensive tasks from resource-limited user devices to resource-rich fog or cloud layers can effectively address these issues and manage energy consumption. Reinforcement learning, in particular the Q-learning algorithm, is an efficient technique both for hardware implementation and for offloading tasks from wearables to edge devices; the lowest reported power consumption using FPGA technology, for example, is 37 mW. Furthermore, the communication cost from wearables to fog devices should not offset the energy savings gained from task migration. This paper provides a comprehensive review of joint energy-harvesting technologies and computation-offloading strategies for the IoMT, covering power-supply strategies for wearables, energy-storage techniques, and hardware implementations of task migration.
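
    The review highlights tabular Q-learning as hardware-friendly for offload decisions. As a generic, minimal sketch of that idea (host-side C++ code, compilable as CUDA host code to keep one language across this listing), the snippet below shows the Q-update for a two-action local-vs-offload policy. The state quantization, reward shape, and hyperparameters are illustrative assumptions, not taken from any surveyed system.

```cuda
// Host-side sketch of the tabular Q-update behind a local-vs-offload
// decision. States, reward, and hyperparameters are illustrative only.
const int NUM_STATES  = 16;  // e.g. quantized (battery level, queue length)
const int NUM_ACTIONS = 2;   // 0 = execute locally, 1 = offload to edge/fog
float Q[NUM_STATES][NUM_ACTIONS] = {};

int best_action(int s) { return Q[s][1] > Q[s][0] ? 1 : 0; }

// One Q-learning step; the reward would typically combine the energy
// spent and the latency observed for the chosen action.
void q_update(int s, int a, float reward, int s_next,
              float alpha = 0.1f, float gamma = 0.9f)
{
    float max_next = Q[s_next][best_action(s_next)];
    Q[s][a] += alpha * (reward + gamma * max_next - Q[s][a]);
}
```

    A fixed-size table of this kind maps naturally to small memories, which is one reason the review cites Q-learning as amenable to low-power FPGA implementation.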

    Accelerating Malware Detection via a Graphics Processing Unit

    Real-time malware analysis requires processing large amounts of data storage to look for suspicious files. This is a time-consuming process that requires a large amount of processing power and often affects other applications running on a personal computer. This research investigates the viability of using Graphics Processing Units (GPUs), present in many personal computers, to distribute the workload normally processed by the standard Central Processing Unit (CPU). Three experiments are conducted using an industry-standard GPU, the NVIDIA GeForce 9500 GT. The goal of the first experiment is to find the optimal number of threads per block for calculating MD5 file hashes. The goal of the second experiment is to find the optimal number of threads per block for searching an MD5 hash database for matches. In the third experiment, the size of the executable, the executable type (benign or malicious), and the processing hardware are varied in a full-factorial experimental design. The experiment records whether the file is benign or malicious and measures the time required to identify the executable. This information can be used to compare the performance of GPU hardware against CPU hardware. Experimental results show that a GPU can calculate an MD5 signature hash and scan a database of malicious signatures 82% faster than a CPU for files between 0 and 96 kB; if the file size is increased to 97 - 192 kB, the GPU is 85% faster than the CPU. This demonstrates that a GPU can provide a significant performance increase over a CPU. These results could help achieve faster anti-malware products, faster network intrusion detection system response times, and faster firewall applications.
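
    The first two experiments sweep the threads-per-block parameter; a common way to run such a sweep is to time each configuration with CUDA events, as sketched below. The md5_kernel here is a toy stand-in (a per-thread byte sum, clearly not real MD5) kept only so the sweep compiles and runs; the block sizes and names are assumptions, not the thesis's setup.

```cuda
// Sketch of a threads-per-block sweep: time a (placeholder) hashing
// kernel at several block sizes using CUDA events.
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

__global__ void md5_kernel(const uint8_t *data, size_t n, uint32_t *out)
{
    // Stand-in body: a toy per-thread byte sum, NOT real MD5, kept only
    // so the timing sweep below compiles and runs.
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, (uint32_t)data[i]);
}

void sweep_block_sizes(const uint8_t *d_data, size_t n, uint32_t *d_digest)
{
    for (int tpb = 32; tpb <= 512; tpb *= 2) {
        int blocks = (int)((n + tpb - 1) / tpb);
        cudaMemset(d_digest, 0, sizeof(uint32_t));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        md5_kernel<<<blocks, tpb>>>(d_data, n, d_digest);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%3d threads/block: %.3f ms\n", tpb, ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
}
```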

    Hardware-Aware Algorithm Designs for Efficient Parallel and Distributed Processing

    The introduction and widespread adoption of the Internet of Things, together with emerging new industrial applications, bring new requirements in data processing. Specifically, the need for timely processing of data that arrives at high rates creates a challenge for the traditional cloud computing paradigm, where data collected at various sources is sent to the cloud for processing. As an approach to this challenge, processing algorithms and infrastructure are distributed from the cloud to multiple tiers of computing, closer to the sources of data. This creates a wide range of devices for algorithms to be deployed on and software designs to adapt to.

    In this thesis, we investigate how hardware-aware algorithm designs on a variety of platforms lead to algorithm implementations that efficiently utilize the underlying resources. We design, implement, and evaluate new techniques for representative applications that involve the whole spectrum of devices, from resource-constrained sensors in the field to highly parallel servers. At each tier of processing capability, we identify key architectural features that are relevant for applications and propose designs that use these features to achieve high-rate, timely, and energy-efficient processing.

    In the first part of the thesis, we focus on high-end servers and utilize two main approaches to achieve high-throughput processing: vectorization and thread parallelism. We employ vectorization for pattern matching algorithms used in security applications, and show that re-thinking the design of algorithms to better utilize the resources available on the platforms they are deployed on, such as vector processing units, can bring significant speedups in processing throughput. We then show how thread-aware data distribution and proper inter-thread synchronization allow scalability, especially for the problem of high-rate network traffic monitoring. We design a parallelization scheme for sketch-based algorithms that summarize traffic information, allowing them to handle incoming data at high rates and answer queries on that data efficiently, without overheads.

    In the second part of the thesis, we target the intermediate tier of computing devices and focus on typical examples of the hardware found there. We show how single-board computers with embedded accelerators can be used to handle the computationally heavy part of applications, and showcase this specifically for pattern matching in security-related processing. We further identify key hardware features that affect the performance of pattern matching algorithms on such devices, present a co-evaluation framework to compare algorithms, and design a new algorithm that efficiently utilizes the hardware features.

    In the last part of the thesis, we shift the focus to the low-power, resource-constrained tier of processing devices. We target wireless sensor networks and study distributed data processing algorithms where the processing happens on the same devices that generate the data. Specifically, we focus on a continuous monitoring algorithm (geometric monitoring) that aims to minimize communication between nodes. By deploying that algorithm under realistic environments, we demonstrate that the interplay between the network protocol and the application plays an important role in this layer of devices. Based on that observation, we co-design a continuous monitoring application with a modern network stack and augment it further with an in-network aggregation technique. In this way, we show that awareness of the underlying network stack is important to realize the full potential of the continuous monitoring algorithm.

    The techniques and solutions presented in this thesis contribute to better utilization of hardware characteristics across a wide spectrum of platforms. We employ these techniques on problems that are representative examples of current and upcoming applications, and contribute an outlook of emerging possibilities that can build on the results of the thesis.
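
    The thesis parallelizes sketch-based traffic summaries across CPU threads; as a rough GPU analogue of the same idea (named plainly as my substitution, since the thesis targets CPU thread parallelism), the kernel below updates a shared count-min sketch with one packet key per thread, resolving contended counters with atomics. The hash mixing, table dimensions, and names are illustrative.

```cuda
// GPU analogue of a parallel count-min sketch update: one packet key per
// thread, atomics on shared counters. Sizes and hashing are illustrative.
#include <cuda_runtime.h>
#include <cstdint>

const int ROWS = 4;     // independent hash rows
const int COLS = 1024;  // counters per row

__device__ uint32_t mix(uint32_t key, uint32_t seed)
{
    uint32_t h = key ^ seed;          // cheap illustrative hash mix,
    h ^= h >> 16; h *= 0x85ebca6bu;   // not a recommendation from the thesis
    h ^= h >> 13;
    return h;
}

__global__ void cms_update(const uint32_t *keys, int n, uint32_t *sketch)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    for (int r = 0; r < ROWS; ++r) {
        uint32_t col = mix(keys[i], 0x9e3779b9u * (r + 1)) % COLS;
        atomicAdd(&sketch[r * COLS + col], 1u);  // contended counters
    }                                            // resolved with atomics
}
```

    A point query then takes the minimum of the key's counters across the ROWS rows, which bounds the overestimate introduced by hash collisions.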