Improving the Efficiency of OpenCL Kernels through Pipes
In an effort to lower the barrier to the adoption of FPGAs by a broader
community, today major FPGA vendors offer compiler toolchains for OpenCL code.
While using these toolchains allows porting existing code to FPGAs, ensuring
performance portability across devices (i.e., CPUs, GPUs and FPGAs) is not a
trivial task. This is in part due to the different hardware characteristics of
these devices, including the nature of the hardware parallelism and the memory
bandwidth they offer. In particular, global memory accesses are known to be one
of the main performance bottlenecks for OpenCL kernels deployed on FPGA. In
this paper, we investigate the use of pipes to improve memory bandwidth
utilization and performance of OpenCL kernels running on FPGA. This is done by
separating the global memory accesses from the computation, enabling better use
of the load units required to access global memory. We perform experiments on a
set of broadly used benchmark applications with various compute and memory
access patterns. Our experiments, conducted on an Intel Arria GX board, show
that the proposed method is effective in improving the memory bandwidth
utilization of most kernels, particularly those exhibiting irregular memory
access patterns. This, in turn, leads to performance improvements that are in
some cases significant.
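The decoupling the abstract describes, a read kernel streaming global-memory values into a compute kernel through a pipe, can be mimicked in plain Python with two threads and a bounded queue. This is only an analogy to OpenCL pipes, not the paper's FPGA implementation; all names and the toy computation are invented.

```python
import threading
import queue

def producer(data, pipe):
    # "Read kernel": streams values (standing in for global-memory loads)
    # into the pipe, decoupled from any computation.
    for x in data:
        pipe.put(x)
    pipe.put(None)  # end-of-stream marker

def consumer(pipe, out):
    # "Compute kernel": consumes from the pipe only, with no direct
    # memory indexing of its own.
    while (x := pipe.get()) is not None:
        out.append(x * x)

data = list(range(8))
pipe = queue.Queue(maxsize=4)   # bounded, like a hardware FIFO pipe
out = []
t1 = threading.Thread(target=producer, args=(data, pipe))
t2 = threading.Thread(target=consumer, args=(pipe, out))
t1.start(); t2.start(); t1.join(); t2.join()
```

Because a single consumer drains a FIFO, the output preserves input order even though the two stages run concurrently, which is the property that lets the memory-access stage run ahead of the compute stage.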
Memory-Efficient Regular Expression Search Using State Merging
Pattern matching is a crucial task in several critical network services such as intrusion detection and policy management. As the complexity of rule-sets increases, traditional string matching engines are being replaced by more sophisticated regular expression engines. To keep up with line rates, deal with denial of service attacks and provide predictable resource provisioning, the design of such engines must allow examining payload traffic at several gigabits per second and provide worst-case speed guarantees. While regular expression matching using deterministic finite automata (DFA) is a well-studied problem in theory, its implementation either in software or specialized hardware is complicated by prohibitive memory requirements. This is especially true for DFAs representing complex regular expressions present in practical rule-sets. In this paper, we introduce a novel method to drastically reduce the DFA memory requirement while still providing worst-case speed guarantees. Specifically, we merge several “non-equivalent” states in a DFA by introducing labels on their input and output transitions. We then propose a data structure to represent the merged states and the transition labels. We show that, with very few assumptions about the original DFA, such a transformation results in significant compression of the DFA representation. We have implemented a state merging and transition labeling algorithm for DFAs, and show that for Snort and Bro security rule-sets, state merging results in memory reductions of an order of magnitude.
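The core idea, merging two non-equivalent states and disambiguating their transitions with labels, can be sketched with a toy DFA in Python. This is only an illustration of the labeling mechanics, not the paper's data structure; the DFA, state names, and helper functions are invented.

```python
# Toy DFA: state -> {symbol: next_state}
dfa = {
    "A": {"0": "B", "1": "C"},
    "B": {"0": "A", "1": "C"},
    "C": {"0": "B", "1": "A"},
}

def merge_states(dfa, s1, s2, merged):
    """Merge non-equivalent states s1 and s2 into one state `merged`,
    labeling its outgoing transitions by the original state they came from."""
    labeled = {}                          # symbol -> {label: next_state}
    for sym in set(dfa[s1]) | set(dfa[s2]):
        labeled[sym] = {}
        for lab in (s1, s2):
            nxt = dfa[lab][sym]
            # targets inside the merged pair also become (merged, label) pairs
            labeled[sym][lab] = (merged, nxt) if nxt in (s1, s2) else nxt
    new = {merged: labeled}
    for s, trans in dfa.items():
        if s in (s1, s2):
            continue
        # redirect incoming edges, recording which original state is meant
        new[s] = {sym: (merged, nxt) if nxt in (s1, s2) else nxt
                  for sym, nxt in trans.items()}
    return new

def run(dfa, state, text):
    # The matcher tracks (merged_state, label) instead of a plain state id.
    for ch in text:
        if isinstance(state, tuple):      # inside the merged state: use label
            m, label = state
            state = dfa[m][ch][label]
        else:
            state = dfa[state][ch]
    return state

merged = merge_states(dfa, "A", "B", "M")
```

Running the merged automaton with a label reproduces the original automaton's behavior while storing one state record instead of two.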
Accelerating large-scale protein structure alignments with graphics processing units
Background: Large-scale protein structure alignment, an indispensable tool in structural bioinformatics, poses a tremendous challenge to computational resources. To ensure structure alignment accuracy and efficiency, efforts have been made to parallelize traditional alignment algorithms in grid environments. However, these solutions are costly and of limited accessibility. Others trade alignment quality for speedup by using high-level characteristics of structure fragments for structure comparisons. Findings: We present ppsAlign, a parallel protein structure alignment framework designed and optimized to exploit the parallelism of Graphics Processing Units (GPUs). As a general-purpose GPU platform, ppsAlign can take many concurrent methods, such as TM-align and Fr-TM-align, into the parallelized algorithm design. We evaluated ppsAlign on an NVIDIA Tesla C2050 GPU card and compared it with existing software solutions running on an AMD dual-core CPU. We observed a 36-fold speedup over TM-align, a 65-fold speedup over Fr-TM-align, and a 40-fold speedup over MAMMOTH. Conclusions: ppsAlign is a high-performance protein structure alignment tool designed to tackle the computational complexity issues of protein structural data. The solution presented in this paper allows large-scale structure comparisons to be performed using the massively parallel computing power of GPUs.
Fused Breadth-First Probabilistic Traversals on Distributed GPU Systems
Breadth-first probabilistic traversals (BPTs) are used in many network
science and graph machine learning applications. In this paper, we are
motivated by the application of BPTs in stochastic diffusion-based graph
problems such as influence maximization. These applications heavily rely on
BPTs to implement a Monte-Carlo sampling step for their approximations. Given
the large sampling complexity, stochasticity of the diffusion process, and the
inherent irregularity in real-world graph topologies, efficiently parallelizing
these BPTs remains significantly challenging.
In this paper, we present a new algorithm to fuse a massive number of
concurrently executing BPTs with random starts on the input graph. Our
algorithm is designed to fuse BPTs by combining separate traversals into a
unified frontier on distributed multi-GPU systems. To show the general
applicability of the fused BPT technique, we have incorporated it into two
state-of-the-art influence maximization parallel implementations (gIM and
Ripples). Our experiments on up to 4K nodes of the OLCF Frontier supercomputer
(GPUs and CPU cores) show strong scaling behavior, and that
fused BPTs can improve the performance of these implementations by up to
34x (for gIM) and ~360x (for Ripples).
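The fusion idea, carrying many concurrent probabilistic traversals through one shared frontier, can be sketched in Python with a per-vertex bitmask that records which traversals have reached it. This is a single-process toy, not the paper's distributed multi-GPU algorithm; the graph and parameters are invented.

```python
import random

def fused_bpt(adj, sources, p, seed=0):
    """Run len(sources) probabilistic BFS traversals at once.
    visited[v] is a bitmask: bit i is set iff traversal i reached v.
    A unified frontier maps each vertex to the set (mask) of traversals
    currently expanding from it, so all traversals share one sweep."""
    rng = random.Random(seed)
    visited = {v: 0 for v in adj}
    frontier = {}
    for i, s in enumerate(sources):
        visited[s] |= 1 << i
        frontier[s] = frontier.get(s, 0) | 1 << i
    while frontier:
        nxt = {}
        for u, mask in frontier.items():
            for v in adj[u]:
                # each active traversal crosses edge (u, v) with prob. p;
                # only newly reached (v, i) pairs enter the next frontier
                won, m, i = 0, mask, 0
                while m:
                    if m & 1 and rng.random() < p and not visited[v] >> i & 1:
                        won |= 1 << i
                    m >>= 1
                    i += 1
                if won:
                    visited[v] |= won
                    nxt[v] = nxt.get(v, 0) | won
        frontier = nxt
    return visited

# With p = 1.0 every traversal deterministically reaches all downstream
# vertices, so the result is easy to check by hand.
reach = fused_bpt({0: [1], 1: [2], 2: []}, sources=[0, 1], p=1.0)
```

The key design point mirrored here is that the frontier is keyed by vertex, not by traversal, so a vertex touched by many traversals in the same level is expanded once with a combined mask.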
An Improved Algorithm to Accelerate Regular Expression Evaluation
Modern network intrusion detection systems need to perform regular expression matching at line rate in order to detect the occurrence of critical patterns in packet payloads. While deterministic finite automata (DFAs) allow this operation to be performed in linear time, they may exhibit prohibitive memory requirements. In [9], Kumar et al. propose Delayed Input DFAs (D2FAs), which provide a trade-off between the memory requirements of the compressed DFA and the number of states visited for each character processed, which corresponds directly to the memory bandwidth required to evaluate regular expressions. In this paper we introduce a general compression technique that results in at most 2N state traversals when processing a string of length N. In comparison to the D2FA approach, our technique achieves comparable levels of compression, with lower provable bounds on memory bandwidth (or greater compression for a given bandwidth bound). Moreover, our proposed algorithm has lower complexity, is suitable for scenarios where a compressed DFA needs to be dynamically built or updated, and fosters locality in the traversal process. Finally, we also describe a novel alphabet reduction scheme for DFA-based structures that can yield further dramatic reductions in data structure size.
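The default-transition compression that D2FAs build on can be illustrated with a toy table in Python: a state keeps only the transitions that differ from those of its default state, and lookups fall back along default edges. This shows only the basic compression and lookup mechanics, not the paper's construction with its 2N traversal bound; the DFA and defaults are invented.

```python
# Full toy DFA: state -> {symbol: next_state}
full = {
    0: {"a": 1, "b": 0, "c": 0},
    1: {"a": 1, "b": 2, "c": 0},
    2: {"a": 1, "b": 0, "c": 3},
    3: {"a": 1, "b": 0, "c": 0},
}

def compress(dfa, defaults):
    """Keep only transitions that differ from the default state's."""
    small = {}
    for u, trans in dfa.items():
        d = defaults.get(u)
        small[u] = {sym: v for sym, v in trans.items()
                    if d is None or dfa[d][sym] != v}
    return small

def step(small, defaults, u, sym):
    # Follow default edges (consuming no input) until an explicit
    # transition for `sym` is found; each hop costs one state traversal.
    while sym not in small[u]:
        u = defaults[u]
    return small[u][sym]

defaults = {1: 0, 2: 0, 3: 0}   # default edges all point at state 0
small = compress(full, defaults)
```

Here the 12 explicit transitions shrink to 5 (state 0 keeps all three, states 1 and 2 keep one each, state 3 keeps none), while `step` still reproduces the full DFA's behavior; the cost is the extra traversals spent walking default edges, which is exactly the memory-bandwidth trade-off the paragraph describes.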
Dynamic thread assignment on heterogeneous multiprocessor architectures
In a multi-programmed computing environment, threads of execution exhibit different runtime characteristics and hardware resource requirements. Not only do the behaviors of distinct threads differ, but each thread may also present diversity in its performance and resource usage over time. A heterogeneous chip multiprocessor (CMP) architecture consists of processor cores and caches of varying size and complexity. Prior work has shown that heterogeneous CMPs can meet the needs of a multi-programmed computing environment better than a homogeneous CMP system. In fact, the use of a combination of cores with different caches and instruction issue widths better accommodates threads with different computational requirements. A central issue in the design and use of heterogeneous systems is to determine an assignment of tasks to processors which better exploits the hardware resources in order to improve performance. In this paper we argue that the benefits of heterogeneous CMPs are bolstered by the usage of a dynamic assignment policy, i.e., a runtime mechanism which observes the behavior of the running threads and exploits thread migration between the cores. We validate our analysis by means of simulation. Specifically, our model assumes a combination of Alpha EV5 and Alpha EV6 processors and of integer and floating point programs from the SPEC2000 benchmark suite. We show that a dynamic assignment can outperform a static one by 20% to 40% on average and by as much as 80% in extreme cases, depending on the degree of multithreading simulated.
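A dynamic assignment policy of the kind described, observe per-thread behavior at runtime and migrate threads between big and small cores accordingly, can be sketched in a few lines of Python. This is a deliberately simple greedy policy with invented thread names and demand numbers, not the paper's simulated mechanism.

```python
def assign(threads, big_slots):
    """threads: {name: observed compute demand (e.g., sampled IPC)}.
    Greedily place the most demanding threads on the 'big' (EV6-like)
    cores and the rest on 'small' (EV5-like) cores."""
    ranked = sorted(threads, key=threads.get, reverse=True)
    return set(ranked[:big_slots]), set(ranked[big_slots:])

# Sampling interval 1: gcc is the most compute-hungry thread.
big, small = assign({"gcc": 2.1, "mcf": 0.6, "art": 1.2}, big_slots=1)

# Sampling interval 2: behavior shifted over time, so the policy
# migrates art onto the big core and gcc off of it.
big2, small2 = assign({"gcc": 0.9, "mcf": 0.5, "art": 1.8}, big_slots=1)
```

Re-running the assignment each interval is what distinguishes a dynamic policy from a static one: the mapping tracks phase changes in the threads instead of being fixed at launch.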
From Poisson Processes to Self-Similarity: a Survey of Network Traffic Models
The paper provides a survey of network traffic models. It starts from the description of the Poisson model, born in the context of telephony, and highlights the main reasons for its inadequacy to describe data traffic in LANs and WANs. It then details two models which have been conceived to overcome the Poisson model's limitations. In particular, the discussion focuses on the packet train model, validated in a Token Ring LAN, and on the self-similar model, used to capture traffic burstiness at several time scales in both Ethernet LANs and WANs. The discussion closes with some examples of usage of those models in LAN and WAN environments.
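The contrast the survey draws can be sketched numerically in Python: memoryless Poisson-style arrivals versus a superposition of on/off sources with heavy-tailed on-periods, the classical construction behind self-similar traffic. All parameters are illustrative, and the Poisson side is a simple binomial approximation rather than an exact generator.

```python
import random

def poisson_counts(rate, slots, rng):
    # Memoryless arrivals: per-slot counts from i.i.d. micro-slot trials
    # (a binomial approximation to Poisson arrivals).
    return [sum(rng.random() < rate / 10 for _ in range(10))
            for _ in range(slots)]

def onoff_counts(n_sources, slots, alpha, rng):
    # Superpose on/off sources whose ON periods are Pareto-distributed;
    # heavy-tailed bursts are what produce burstiness across time scales.
    counts = [0] * slots
    for _ in range(n_sources):
        t = 0
        while t < slots:
            on = int(rng.paretovariate(alpha))        # heavy-tailed burst
            for u in range(t, min(t + on, slots)):
                counts[u] += 1                        # source is ON
            t += on + int(rng.expovariate(1.0)) + 1   # OFF period
    return counts

rng = random.Random(42)
p = poisson_counts(0.5, 1000, rng)
s = onoff_counts(20, 1000, alpha=1.5, rng=rng)
```

Aggregating the Poisson counts over larger windows smooths them out quickly, whereas the on/off superposition stays bursty at coarse time scales, which is the qualitative behavior the self-similar model captures.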