FPGA ACCELERATION OF A CORTICAL AND A MATCHED FILTER-BASED ALGORITHM
Digital image processing is a widely used and diverse field, applied in a broad array of areas such as tracking and detection, object avoidance, and computer vision. For many image processing tasks, the computations can become time-consuming, so a means of accelerating them would be beneficial. With that as motivation, this thesis examines the acceleration of two distinctly different image processing applications. The first is a recent neocortex-inspired cognitive model geared toward pattern recognition as performed in the visual cortex. For this model, both software and reconfigurable-logic (FPGA) implementations are examined on a Cray XD1. Results indicate that hardware acceleration can provide average throughput gains of 75 times over software-only implementations of the networks examined when utilizing the full resources of the Cray XD1. The second application is matched filter-based position detection, the approach at the heart of the automatic alignment algorithm currently being tested for the National Ignition Facility under construction at the Lawrence Livermore National Laboratory. To reduce the processing time of the matched filtering, a reconfigurable logic architecture was developed; results show that it provides a speedup of approximately 253 times over an optimized software implementation.
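The core of matched filter-based position detection can be illustrated as frequency-domain correlation followed by peak search. The sketch below is a minimal, hypothetical illustration of that general technique, not the thesis's hardware architecture or the NIF alignment algorithm itself; array sizes and names are invented for the example.

```python
import numpy as np

def matched_filter_position(image, template):
    """Locate a template in an image by matched filtering.

    Illustrative sketch only. `image` and `template` are 2-D float
    arrays of the same shape (the template is zero-padded by the
    caller so both arrays match).
    """
    # Matched filtering is correlation with the template; in the
    # frequency domain this is multiplication by the template's
    # conjugate spectrum.
    corr = np.fft.ifft2(np.fft.fft2(image) * np.conj(np.fft.fft2(template))).real
    # The correlation peak marks the most likely template position.
    return np.unravel_index(np.argmax(corr), corr.shape)

# Toy usage: embed a small bright block in a dark frame and recover it.
frame = np.zeros((64, 64))
frame[20:24, 30:34] = 1.0
kernel = np.zeros((64, 64))
kernel[0:4, 0:4] = 1.0          # template placed at the origin
print(matched_filter_position(frame, kernel))   # peak at (20, 30)
```

The FFT-based formulation is what makes hardware pipelining attractive: the per-pixel multiply-accumulate work dominates and maps naturally onto reconfigurable logic.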
Accelerating Pattern Recognition Algorithms On Parallel Computing Architectures
The move to more parallel computing architectures places more responsibility on the programmer to achieve greater performance. The programmer must now have a greater understanding of the underlying architecture and the inherent algorithmic parallelism. Using parallel computing architectures to exploit algorithmic parallelism can be a complex task. This dissertation demonstrates various techniques for doing so. Specifically, three pattern recognition (PR) approaches are examined for acceleration across multiple parallel computing architectures, namely field programmable gate arrays (FPGAs) and general purpose graphics processing units (GPGPUs). Phase-only filter correlation for fingerprint identification was studied as the first PR approach. This approach's sensitivity to angular rotations, scaling, and missing data was surveyed. Additionally, a novel FPGA implementation of this algorithm was created using fixed-point computations, deep pipelining, and four computation phases. Communication and computation were overlapped to efficiently process large fingerprint galleries. The FPGA implementation showed approximately a 47 times speedup over a central processing unit (CPU) implementation with negligible impact on precision. For the second PR approach, a spiking neural network (SNN) algorithm for a character recognition application was examined. A novel FPGA implementation of the approach was developed incorporating a scalable, modular SNN processing element (PE) to efficiently perform neural computations. The modular SNN PE incorporated streaming memory, fixed-point computation, and deep pipelining. This design showed speedups of approximately 3.3 and 8.5 times over CPU implementations for neural networks of 624 and 9,264 neurons, respectively. Results indicate that the PE design could scale easily to process larger networks.
Finally, for the third PR approach, cellular simultaneous recurrent networks (CSRNs) were investigated for GPGPU acceleration. In particular, the applications of maze traversal and face recognition were studied. Novel GPGPU implementations were developed employing varying quantities of task-level, data-level, and instruction-level parallelism to achieve efficient runtime performance. Furthermore, the performance of the face recognition application was examined across a heterogeneous cluster of multi-core and GPGPU architectures. A combination of multi-core processors and GPGPUs achieved roughly a 996 times speedup over a single-core CPU implementation. From examining these PR approaches for acceleration, this dissertation presents useful techniques and insight applicable to other algorithms to improve performance when designing a parallel implementation.
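The phase-only filter correlation named above can be sketched in a few lines: the cross-power spectrum of the two inputs is normalized to unit magnitude, so only phase survives, and the inverse transform yields a sharp peak at their relative displacement. This is a minimal, generic sketch of that technique under assumed toy inputs, not the dissertation's fixed-point, deeply pipelined FPGA design.

```python
import numpy as np

def phase_only_correlation(a, b):
    """Phase-only correlation of two equally sized 2-D arrays."""
    cross = np.fft.fft2(a) * np.conj(np.fft.fft2(b))
    cross /= np.abs(cross) + 1e-12       # discard magnitude, keep phase only
    # The inverse transform peaks at the relative shift between a and b.
    return np.fft.ifft2(cross).real

# Toy usage: a random pattern and a circularly shifted copy of it.
rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
shifted = np.roll(img, shift=(5, 7), axis=(0, 1))
peak = np.unravel_index(np.argmax(phase_only_correlation(shifted, img)),
                        (32, 32))
print(peak)   # (5, 7): the displacement between the two inputs
```

Because the magnitude is discarded, the peak is much sharper than for plain correlation, which is what makes the method attractive for fingerprint matching, at the cost of the rotation and scaling sensitivity the dissertation surveys.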
FPGA implementation of a Restricted Boltzmann Machine for handwriting recognition
Despite the recent success of neural networks in the research field, the number of resulting applications in non-academic settings is very limited. One setback to their popularity is that neural networks are typically implemented as software running on a general-purpose processor, and the time complexity of the software implementation is usually O(n²). As a result, neural networks are inadequate to meet the scalability and performance requirements of commercial or industrial uses. Several research works have dealt with accelerating neural networks on Field-Programmable Gate Arrays (FPGAs), particularly Restricted Boltzmann Machines (RBMs), a very popular and hardware-friendly neural network model. However, when using their implementations for handwriting recognition, there are two major setbacks. First, the implementations assume that the sizes of the neural networks are symmetric, while the size of an RBM model for handwriting recognition is in fact highly asymmetric. Second, these implementations cannot fit a model with a visible layer larger than 512 nodes on a single FPGA. Thus, they are highly inefficient when applied to the handwriting recognition application.
In this thesis, a new framework is proposed for an RBM with asymmetric weights, optimized for handwriting recognition. The framework is tested on an Altera Stratix IV GX (EP4SGX230KF40C2) FPGA running at 100 MHz. The resources support a complete RBM model of 784 by 10 nodes. The experimental results show a computational speed of 4 billion connection updates per second, with a speedup of 134-fold including I/O time and 161-fold excluding I/O time compared with an optimized MATLAB implementation running on a 2.50 GHz Intel processor. Compared with previous works, our implementation achieves a much higher speedup while maintaining comparable resource usage.
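The O(n²)-per-layer computation the thesis accelerates is, at its core, an affine map followed by a logistic activation. The sketch below shows one visible-to-hidden pass with the asymmetric 784-by-10 shape of the handwriting model; it is an illustrative software baseline only, with invented names and random weights, not the proposed FPGA framework.

```python
import numpy as np

def rbm_hidden_probs(v, W, b_h):
    """One visible-to-hidden pass of an RBM.

    Computes sigmoid(v @ W + b_h): the matrix-vector product is the
    O(n_visible * n_hidden) work that dominates the runtime and that
    each "connection update" in the throughput figure corresponds to.
    """
    return 1.0 / (1.0 + np.exp(-(v @ W + b_h)))   # logistic activation

# Toy usage with the asymmetric handwriting-model shape (784 x 10).
rng = np.random.default_rng(1)
W = rng.standard_normal((784, 10)) * 0.01   # asymmetric weight matrix
b_h = np.zeros(10)
v = rng.integers(0, 2, size=784).astype(float)  # binary visible vector
p_h = rbm_hidden_probs(v, W, b_h)
print(p_h.shape)   # (10,) hidden-unit activation probabilities
```

The pronounced asymmetry (784 visible nodes against 10 hidden nodes) is exactly why symmetric-network accelerators waste resources on this model, motivating the asymmetric framework above.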