
    FPGA ACCELERATION OF A CORTICAL AND A MATCHED FILTER-BASED ALGORITHM

    Digital image processing is a widely used and diverse field. It is applied in a broad array of areas such as tracking and detection, object avoidance, computer vision, and numerous other applications. For many image processing tasks, the computations can become time-consuming, so a means of accelerating them would be beneficial. With that as motivation, this thesis examines the acceleration of two distinctly different image processing applications. The first is a recent neocortex-inspired cognitive model geared towards pattern recognition as performed in the visual cortex. For this model, both software and reconfigurable-logic-based FPGA implementations are examined on a Cray XD1. Results indicate that hardware acceleration can provide average throughput gains of 75 times over software-only implementations of the networks examined when utilizing the full resources of the Cray XD1. The second application is matched filter-based position detection. This approach is at the heart of the automatic alignment algorithm currently being tested for the National Ignition Facility, presently under construction at the Lawrence Livermore National Laboratory. To reduce the processing time of the matched filtering, a reconfigurable logic architecture was developed. Results show that this architecture provides a speedup of approximately 253 times over an optimized software implementation.
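As a rough illustration of the matched filtering at the heart of the position-detection approach, the sketch below locates a template by FFT-based cross-correlation. This is plain NumPy, not the thesis's reconfigurable-logic architecture, and all names and test data are illustrative:

```python
import numpy as np

def matched_filter_position(image, template):
    """Locate a template in an image via FFT-based matched filtering.

    Cross-correlating in the frequency domain replaces an exhaustive
    spatial search with three FFTs; the correlation peak marks the
    best-matching position.
    """
    H, W = image.shape
    F_img = np.fft.fft2(image)
    F_tpl = np.fft.fft2(template, s=(H, W))
    # ifft2(F_img * conj(F_tpl)) is the circular cross-correlation.
    corr = np.fft.ifft2(F_img * np.conj(F_tpl)).real
    # The peak index is the offset at which the template best aligns.
    y, x = np.unravel_index(np.argmax(corr), corr.shape)
    return int(y), int(x)

# Hide a bright 8x8 patch in a noisy image and recover its position.
rng = np.random.default_rng(0)
img = rng.normal(0.0, 0.1, (64, 64))
img[20:28, 30:38] += 1.0
tpl = np.ones((8, 8))
print(matched_filter_position(img, tpl))  # -> (20, 30)
```

The FPGA architecture described in the thesis accelerates exactly this kind of correlation workload, pipelining the transform and multiply stages in hardware.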

    Accelerating Pattern Recognition Algorithms On Parallel Computing Architectures

    The move to more parallel computing architectures places more responsibility on the programmer to achieve greater performance. The programmer must now have a greater understanding of the underlying architecture and the inherent algorithmic parallelism. Using parallel computing architectures to exploit algorithmic parallelism can be a complex task. This dissertation demonstrates various techniques for doing so. Specifically, three pattern recognition (PR) approaches are examined for acceleration across multiple parallel computing architectures, namely field-programmable gate arrays (FPGAs) and general-purpose graphics processing units (GPGPUs). Phase-only filter correlation for fingerprint identification was studied as the first PR approach. This approach's sensitivity to angular rotations, scaling, and missing data was surveyed. Additionally, a novel FPGA implementation of this algorithm was created using fixed-point computations, deep pipelining, and four computation phases. Communication and computation were overlapped to efficiently process large fingerprint galleries. The FPGA implementation showed approximately a 47 times speedup over a central processing unit (CPU) implementation with negligible impact on precision. For the second PR approach, a spiking neural network (SNN) algorithm for a character recognition application was examined. A novel FPGA implementation of the approach was developed, incorporating a scalable modular SNN processing element (PE) to efficiently perform neural computations. The modular SNN PE incorporated streaming memory, fixed-point computation, and deep pipelining. This design showed speedups of approximately 3.3 and 8.5 times over CPU implementations for neural networks of size 624 and 9,264, respectively. Results indicate that the PE design could scale easily to larger networks.
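The phase-only filter correlation named as the first PR approach can be sketched in a few lines of NumPy. This is an illustrative software model only (the dissertation's version is a fixed-point, deeply pipelined FPGA design), and the test data are invented:

```python
import numpy as np

def phase_only_correlate(image, template):
    """Correlate an image against a phase-only filter built from template.

    Dropping the filter's magnitude spectrum whitens the response and
    sharpens the correlation peak -- the property that makes phase-only
    filters attractive for fingerprint identification.
    """
    H, W = image.shape
    F_img = np.fft.fft2(image)
    F_tpl = np.fft.fft2(template, s=(H, W))
    # Phase-only filter: keep the conjugate phase, discard the magnitude.
    pof = np.conj(F_tpl) / (np.abs(F_tpl) + 1e-12)
    return np.fft.ifft2(F_img * pof).real

# A scene that is the probe circularly shifted by (5, 9): the
# correlation peak recovers that shift.
rng = np.random.default_rng(1)
probe = rng.normal(size=(32, 32))
scene = np.roll(probe, (5, 9), axis=(0, 1))
corr = phase_only_correlate(scene, probe)
y, x = np.unravel_index(np.argmax(corr), corr.shape)
print(int(y), int(x))  # -> 5 9
```

In a gallery search, this correlation is repeated per enrolled print, which is why overlapping communication with computation matters for throughput.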
Finally, for the third PR approach, cellular simultaneous recurrent networks (CSRNs) were investigated for GPGPU acceleration. In particular, the applications of maze traversal and face recognition were studied. Novel GPGPU implementations were developed employing varying quantities of task-level, data-level, and instruction-level parallelism to achieve efficient runtime performance. Furthermore, the performance of the face recognition application was examined across a heterogeneous cluster of multi-core and GPGPU architectures. A combination of multi-core processors and GPGPUs achieved roughly a 996 times speedup over a single-core CPU implementation. From examining these PR approaches for acceleration, this dissertation presents useful techniques and insight applicable to other algorithms to improve performance when designing a parallel implementation.
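The data-level parallelism exploited for the CSRN maze-traversal application can be illustrated with a toy grid of cell updates: within a step, every cell's update depends only on the previous grid, so all cells can be mapped across workers at once. The sketch below uses CPU threads and an invented update rule purely for illustration; it is not the dissertation's CSRN equations or GPGPU kernel:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def update_cell(args):
    """One CSRN-style cell update: a cell's new value from its 4-neighbourhood."""
    grid, y, x = args
    H, W = grid.shape
    # Gather the 4-neighbourhood, clamped at the borders.
    neigh = (grid[max(y - 1, 0), x] + grid[min(y + 1, H - 1), x] +
             grid[y, max(x - 1, 0)] + grid[y, min(x + 1, W - 1)])
    # Toy recurrent update: squash the cell's weighted input sum.
    return np.tanh(grid[y, x] + 0.25 * neigh)

def parallel_step(grid, pool):
    """Data-level parallelism: all cell updates within a step are independent,
    so the whole grid is mapped across workers (threads here; one GPGPU
    thread per cell in the dissertation's setting)."""
    H, W = grid.shape
    tasks = [(grid, y, x) for y in range(H) for x in range(W)]
    out = list(pool.map(update_cell, tasks))
    return np.array(out).reshape(H, W)

grid = np.zeros((8, 8))
grid[4, 4] = 1.0  # goal cell, as in a maze-traversal value grid
with ThreadPoolExecutor(max_workers=4) as pool:
    for _ in range(3):
        grid = parallel_step(grid, pool)
print(grid.shape)  # -> (8, 8)
```

Task-level parallelism in this setting would correspond to running independent mazes or faces concurrently on top of the per-cell data parallelism.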

    FPGA implementation of a Restricted Boltzmann Machine for handwriting recognition

    Despite the recent success of neural networks in the research field, the number of resulting applications in non-academic settings is very limited. One setback to their popularity is that neural networks are typically implemented as software running on a general-purpose processor, with a time complexity of usually O(n²). As a result, neural networks are inadequate to meet the scalability and performance requirements of commercial or industrial uses. Several research works have dealt with accelerating neural networks on Field-Programmable Gate Arrays (FPGAs), particularly for Restricted Boltzmann Machines (RBMs), a very popular and hardware-friendly neural network model. However, when using their implementations for handwriting recognition, there are two major setbacks. First, the implementations assume that the sizes of the neural network layers are symmetric, while the RBM model for handwriting recognition is in fact highly asymmetric. Second, these implementations cannot fit a model with a visible layer larger than 512 nodes on a single FPGA. Thus, they are highly inefficient when applied to the handwriting recognition application. In this thesis, a new framework was proposed for an RBM with asymmetric weights, optimized for handwriting recognition. The framework is tested on an Altera Stratix IV GX (EP4SGX230KF40C2) FPGA running at 100 MHz. The resources support a complete RBM model of 784 by 10 nodes. The experimental results show a computational speed of 4 billion connection-updates-per-second and a speed-up of 134 fold with I/O time and 161 fold without I/O time, compared with an optimized MATLAB implementation running on a 2.50 GHz Intel processor. Compared with previous works, our implementation achieves a much higher speed-up while maintaining comparable resource usage.
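A minimal software sketch of the asymmetric layer pair this thesis targets (784 visible by 10 hidden nodes) shows where the work lies: one matrix-vector product per layer crossing, which is what the FPGA pipeline accelerates. This is plain NumPy for illustration, not the thesis's framework, and all names are invented:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Minimal RBM with an asymmetric layer pair (e.g. 784 visible, 10 hidden)."""

    def __init__(self, n_visible, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)

    def hidden_probs(self, v):
        # P(h = 1 | v): the O(n_visible * n_hidden) connection updates
        # that dominate the runtime.
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        # P(v = 1 | h): the reverse crossing reuses the same weights,
        # transposed -- asymmetric layer sizes make the two passes
        # very different in shape.
        return sigmoid(h @ self.W.T + self.b_v)

rbm = RBM(n_visible=784, n_hidden=10)
v0 = (np.random.default_rng(1).random(784) > 0.5).astype(float)
h = rbm.hidden_probs(v0)   # up-pass: 784 -> 10
v1 = rbm.visible_probs(h)  # down-pass: 10 -> 784
print(h.shape, v1.shape)   # -> (10,) (784,)
```

The asymmetry is visible in the two passes: a wide-in/narrow-out product upward and its transpose downward, which is why hardware sized for symmetric layers wastes resources here.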