
    Hardware Architecture for Semantic Comparison

    Semantic Routed Networks provide a superior infrastructure for complex search engines. In a Semantic Routed Network (SRN), the routers are the critical component, and semantic comparison is their key computation. As the amount of information available on the Internet grows, the speed and efficiency with which information can be delivered to the user become important. Most current search engines scale to meet the growing demand by deploying large data centers of general-purpose computers that consume many megawatts of power. Reducing the power consumption of these data centers while providing better performance would significantly reduce operating costs. Performing operations in parallel is a key optimization step for better performance on general-purpose CPUs. Current parallelization techniques rely on multi-core architectures with multi-threading capabilities; these coarse-grained approaches carry considerable resource-management overhead and provide only sub-linear speedup. This dissertation proposes techniques towards a highly parallel, power-efficient architecture that performs semantic comparison as its core activity. Hardware-centric parallel algorithms have been developed to populate the required data structures and then compute semantic similarity, and the performance of the proposed design is further enhanced by a pipelined architecture. The proposed algorithms were also implemented on two contemporary platforms, Nvidia CUDA and an FPGA, for performance comparison, and a semantic benchmark was created to validate the designs. A dedicated semantic comparator is shown to deliver significantly better performance than the other platforms: the proposed hardware semantic comparison architecture achieves speedups of up to 10^5 while reducing power consumption by 80% compared to traditional computing platforms. Future research directions, including better power optimization, architecting the complete semantic router, and using the semantic benchmark for SRN research, are also discussed.
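    The dissertation targets custom hardware, but the kernel a semantic router evaluates can be illustrated in plain software. Below is a minimal sketch, assuming documents are represented as dense term-weight vectors and that cosine similarity stands in for the comparator's scoring function; the function name and representation are illustrative, not the dissertation's design.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical software stand-in for a hardware semantic comparator:
// scores two documents, represented as dense term-weight vectors, by
// cosine similarity. The proposed architecture evaluates many such
// comparisons in parallel; this shows only the scalar kernel.
double semantic_similarity(const std::vector<double>& a,
                           const std::vector<double>& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    if (na == 0.0 || nb == 0.0) return 0.0;  // avoid division by zero
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

int main() {
    std::vector<double> query = {0.9, 0.0, 0.4};  // illustrative weights
    std::vector<double> doc   = {0.8, 0.1, 0.5};
    std::printf("similarity = %.3f\n", semantic_similarity(query, doc));
}
```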

    High Performance Information Filtering on Many-core Processors

    The increasing amount of information accessible to a user digitally makes search difficult, time-consuming and unsatisfactory. This has led to the development of active information filtering (recommendation) systems that learn a user's preferences and select the most relevant information using sophisticated machine learning techniques. To be scalable and effective, such systems are currently deployed in cloud infrastructures consisting of general-purpose computers. The emergence of many-core processors as compute nodes in cloud infrastructures necessitates a revisit of the computational model, run-time, memory hierarchy and I/O pipelines to fully exploit the concurrency available within these processors. This research proposes algorithms and architectures to enhance the performance of content-based (CB) and collaborative filtering (CF) on many-core processors. To validate these methods, we use Nvidia's Tesla, Fermi and Kepler GPUs and Intel's experimental single-chip cloud computer (SCC) as the target platforms. We observe speedups of up to ~290x and energy savings of up to 97% over conventional sequential approaches. Finally, we propose and validate a novel reconfigurable SoC architecture which combines the best features of GPUs and the SCC; it has been validated to show ~98K speedup over the SCC and ~15K speedup over the GPU.
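    As a software analogue of the filtering kernels described above, the following sketch ranks items for a user with C++17 parallel algorithms. The dot-product scoring rule and all names are illustrative assumptions, with std::execution::par standing in for the GPU/SCC concurrency the dissertation actually exploits.

```cpp
#include <algorithm>
#include <execution>
#include <iostream>
#include <numeric>
#include <vector>

// Illustrative content-based filtering step: rank items by the inner
// product between a learned user profile and each item's feature vector.
// Each item's score is independent work, so it parallelizes trivially.
int main() {
    std::vector<double> user = {0.2, 0.8, 0.0, 0.5};  // learned preferences
    std::vector<std::vector<double>> items = {
        {0.1, 0.9, 0.0, 0.4},
        {0.7, 0.1, 0.3, 0.0},
        {0.2, 0.6, 0.1, 0.6},
    };
    std::vector<double> scores(items.size());

    // One independent dot product per item, dispatched in parallel.
    std::transform(std::execution::par, items.begin(), items.end(),
                   scores.begin(), [&](const std::vector<double>& item) {
                       return std::inner_product(item.begin(), item.end(),
                                                 user.begin(), 0.0);
                   });

    for (std::size_t i = 0; i < scores.size(); ++i)
        std::cout << "item " << i << " score " << scores[i] << '\n';
}
```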

    Programmer-transparent efficient parallelism with skeletons

    Parallel and heterogeneous systems are ubiquitous. Unfortunately, both require significant complexity at the software level to the detriment of programmer productivity. To produce correct and efficient code, programmers not only have to manage synchronisation and communication but must also be aware of low-level hardware details. It is foreseeable that the problem will worsen as systems become increasingly parallel and heterogeneous. Building on earlier work, this thesis further investigates the contribution that algorithmic skeletons can make towards solving this problem. Skeletons are high-level abstractions for typical parallel computations. They hide low-level hardware details from programmers and, in addition, encode information about the computations they implement, which runtime systems and library developers can use for automatic optimisations. We present two novel case studies in this respect. First, we provide scheduling flexibility on heterogeneous CPU + GPU systems in a programmer-transparent way, similar to the freedom OS schedulers have on CPUs. Thanks to the high-level nature of skeletons, we automatically switch between CPU and GPU implementations of kernels and use semantic information encoded in skeletons to find execution points at which switches can occur. In more detail, kernel iteration spaces are processed in slices and migration is considered on a slice-by-slice basis; we show that slice-size choices that introduce negligible overheads can be learned by predictive models. In a simple deployment scenario, mid-kernel migration achieves speedups of up to 1.30x, and of 1.08x on average, while our mechanism introduces a negligible overhead of 2.34% if a kernel does not actually migrate. Second, we propose skeletons to simplify the programming of parallel hard real-time systems. We combine information encoded in task farms with analysis of the real-time system's user code to automatically choose thread counts and an optimisation parameter related to farm-internal communication. Both parameters are chosen so that real-time deadlines are met with minimum resource usage. We show that our approach achieves a 1.22x speedup over unoptimised code, selects the best parameter settings in 83% of cases, and never chooses parameters that cause deadline misses.
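    To make the skeleton idea concrete, here is a minimal hypothetical map skeleton: the caller supplies only the per-element kernel, while the skeleton owns slicing and thread management. This separation is what gives a runtime the freedom to place each slice, the property the thesis exploits for mid-kernel CPU/GPU migration; the sketch itself just runs slices on host threads.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

// Minimal map skeleton sketch: the user writes only the per-element
// kernel; the skeleton decides how the iteration space is sliced and
// where each slice runs. A smarter runtime could dispatch each slice
// to a CPU or GPU implementation, enabling mid-kernel migration.
template <typename T, typename Kernel>
void map_skeleton(std::vector<T>& data, Kernel kernel,
                  std::size_t slice_size) {
    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < data.size(); begin += slice_size) {
        std::size_t end = std::min(begin + slice_size, data.size());
        // Each slice is an independent unit of work.
        workers.emplace_back([&data, kernel, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                data[i] = kernel(data[i]);
        });
    }
    for (auto& w : workers) w.join();
}

int main() {
    std::vector<int> xs = {1, 2, 3, 4, 5, 6, 7, 8};
    map_skeleton(xs, [](int x) { return x * x; }, /*slice_size=*/3);
    for (int x : xs) std::cout << x << ' ';
    std::cout << '\n';
}
```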

    Using reconfigurable computing technology to accelerate matrix decomposition and applications

    Matrix decomposition plays an increasingly significant role in many scientific and engineering applications. Among numerous techniques, Singular Value Decomposition (SVD) and Eigenvalue Decomposition (EVD) are widely used as factorization tools to perform Principal Component Analysis for dimensionality reduction and pattern recognition in image processing, text mining and wireless communications, while QR Decomposition (QRD) and sparse LU Decomposition (LUD) are employed to solve dense or sparse linear systems of equations in bioinformatics, power systems and computer vision. Matrix decompositions are computationally expensive, and their sequential implementations often fail to meet the requirements of many time-sensitive applications. The emergence of reconfigurable computing has provided a flexible and low-cost opportunity to pursue high-performance parallel designs, and the use of FPGAs has shown promise in accelerating this class of computation. In this research, we have proposed and implemented several highly parallel FPGA-based architectures to accelerate matrix decompositions and their applications in data mining and signal processing. Specifically, this dissertation describes the following contributions:
    • We propose an efficient FPGA-based double-precision floating-point architecture for EVD, which can efficiently analyze large-scale matrices.
    • We implement a floating-point Hestenes-Jacobi architecture for SVD, which is capable of analyzing arbitrarily sized matrices.
    • We introduce a novel deeply pipelined reconfigurable architecture for QRD, which can be dynamically configured to perform either Householder transformation or Givens rotation in a manner that takes advantage of the strengths of each.
    • We design a configurable architecture for sparse LUD that supports both symmetric and asymmetric sparse matrices with arbitrary sparsity patterns.
    • By further extending the proposed hardware solution for SVD, we parallelize a popular text mining tool, Latent Semantic Indexing, with an FPGA-based architecture.
    • We present a configurable architecture to accelerate Homotopy l1-minimization, in which a modification of the proposed FPGA architecture for sparse LUD is used at its core to parallelize both Cholesky decomposition and rank-1 update.
    Our experimental results using an FPGA-based acceleration system demonstrate the efficiency of the proposed architectures, with application- and dimension-dependent speedups over an optimized software implementation ranging from 1.5× to 43.6× in terms of computation time.
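    As a point of reference for the QRD architecture, the following sketch applies a single Givens rotation in plain C++ to zero one matrix entry. The hardware pipelines many such rotations concurrently; the per-rotation arithmetic, shown here on a hypothetical 3×3 matrix, is the same.

```cpp
#include <cmath>
#include <cstdio>

// One Givens rotation step of QR decomposition: choose c, s so that
// rotating rows i and k of A zeroes the entry A[k][j]. An FPGA QRD
// pipeline applies a long sequence of these rotations; this shows
// just the per-rotation arithmetic on a tiny fixed-size matrix.
constexpr int N = 3;

void givens_rotate(double A[N][N], int i, int k, int j) {
    double a = A[i][j], b = A[k][j];
    double r = std::hypot(a, b);         // sqrt(a*a + b*b), overflow-safe
    if (r == 0.0) return;                // entry is already zero
    double c = a / r, s = b / r;
    for (int col = 0; col < N; ++col) {  // apply the rotation to both rows
        double t1 = A[i][col], t2 = A[k][col];
        A[i][col] =  c * t1 + s * t2;
        A[k][col] = -s * t1 + c * t2;
    }
}

int main() {
    double A[N][N] = {{6, 5, 0}, {5, 1, 4}, {0, 4, 3}};
    givens_rotate(A, 0, 1, 0);           // zero out A[1][0]
    for (auto& row : A)
        std::printf("%8.4f %8.4f %8.4f\n", row[0], row[1], row[2]);
}
```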

    Bridging the Scalability Gap by Exploiting Error Tolerance for Emerging Applications

    In recent years, there has been a surge in demand for intelligent applications. These emerging applications are powered by algorithms from domains such as computer vision, image processing, pattern recognition, and machine learning. Across these algorithms, there exist two key computational characteristics. First, the computational demands they place on computing infrastructure are large, with the potential to substantially outstrip existing compute resources. Second, they are necessarily resilient to errors because their inputs and outputs are inherently noisy and imprecise. Despite the staggering computational requirements and resilience of intelligent applications, current infrastructure uses conventional software and hardware methodologies. These systems needlessly consume resources for every bit of precision and arithmetic. To address this inefficiency and help bridge the performance gap caused by intelligent applications, this dissertation investigates exploiting error tolerance across the hardware-software stack. Specifically, we propose (1) statistical machinery to guarantee that accuracy is not compromised when removing work or precision, (2) a GPU optimization framework for work skipping and bottleneck mitigation, and (3) exploration of unconventional numerical representations to steer future hardware designs.
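    The precision side of this trade-off can be illustrated with a tiny experiment: uniformly quantize values to a given bit width and measure the error an application would need to tolerate. This is only a hedged illustration of the idea, not the dissertation's statistical machinery; the value range and bit widths are arbitrary assumptions.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Uniformly quantize a value in [-1, 1] to a symmetric signed grid
// with `bits` bits. Error-tolerant applications accept the resulting
// error in exchange for cheaper arithmetic and storage; choosing the
// bit width safely is what a statistical accuracy guarantee is for.
double quantize(double x, int bits) {
    double levels = (1 << (bits - 1)) - 1;
    return std::round(x * levels) / levels;
}

int main() {
    std::vector<double> xs = {0.137, -0.55, 0.9031, 0.0042, -0.3333};
    for (int bits : {4, 8, 12}) {
        double err = 0.0;
        for (double x : xs) err += std::fabs(x - quantize(x, bits));
        std::printf("%2d bits: mean abs error %.6f\n", bits, err / xs.size());
    }
}
```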

    A Virtual Prototype of Scalable Network-on-Chip Design

    A Virtual Prototype of a Network-on-Chip (NoC) that interconnects IPs in a System-on-Chip is presented in this thesis. A Virtual Prototype is a software model describing the various components of a NoC, put together for simulation of and experiments on large SoCs (Systems-on-Chip). It is a practical way to validate, in a scalable manner, the interconnection and operation of SoCs with a large number of components. In spite of extensive studies on NoC design, a virtual prototype of a NoC has been unavailable to the academic community; the proposed cycle-accurate model is perhaps the first academic virtual prototype of a NoC (VPNoC). The VPNoC provides functionality similar to the NoCs in existing simulators. Furthermore, since it is implemented on Carbon SoC Designer, an ARM-based SoC development tool, it can be applied directly to current and future SoC designs. The proposed VPNoC has been used to demonstrate the design of two SoC applications. In this study, we have achieved: 1) designs and implementations of the NoC components and the VPNoC, 2) measurement of throughput and latency for the VPNoC, and 3) two data-intensive applications and their performance analysis.
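    A cycle-accurate prototype like the VPNoC is far larger than can be shown here, but the shape of a cycle-driven software router model can be sketched. Everything below is a hypothetical simplification: flits carry only a destination and a payload, routers sit on a 1-D line, and arbitration, virtual channels and backpressure are omitted.

```cpp
#include <cstdio>
#include <deque>
#include <vector>

// Toy cycle-driven NoC model: routers on a 1-D line forward one flit
// per cycle toward its destination. A real virtual prototype models
// channels, arbitration and backpressure; this sketch shows only the
// cycle-by-cycle structure such a model has.
struct Flit { int dest; int payload; };

int main() {
    const int num_routers = 4;
    std::vector<std::deque<Flit>> in(num_routers);  // one input FIFO each

    in[0].push_back({3, 42});                       // inject: router 0 -> 3
    in[1].push_back({0, 7});                        // inject: router 1 -> 0

    for (int cycle = 0; cycle < 8; ++cycle) {
        std::vector<std::deque<Flit>> next(num_routers);
        for (int r = 0; r < num_routers; ++r) {
            if (in[r].empty()) continue;            // nothing to route
            Flit f = in[r].front(); in[r].pop_front();
            if (f.dest == r) {
                std::printf("cycle %d: router %d delivered payload %d\n",
                            cycle, r, f.payload);
            } else {                                // hop one step along line
                int hop = (f.dest > r) ? r + 1 : r - 1;
                next[hop].push_back(f);
            }
        }
        for (int r = 0; r < num_routers; ++r)       // commit this cycle's moves
            for (const Flit& f : next[r]) in[r].push_back(f);
    }
}
```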

    Data Parallel C++

    Learn how to accelerate C++ programs using data parallelism. This open access book enables C++ programmers to be at the forefront of this exciting and important new development that is helping to push computing to new levels. It is full of practical advice, detailed explanations, and code examples to illustrate key topics. Data parallelism in C++ enables access to parallel resources in a modern heterogeneous system, freeing you from being locked into any particular computing device. Now a single C++ application can use any combination of devices (GPUs, CPUs, FPGAs and AI ASICs) that are suitable to the problems at hand. This book begins by introducing data parallelism and foundational topics for effective use of the SYCL standard from the Khronos Group and Data Parallel C++ (DPC++), the open source compiler used in this book. Later chapters cover advanced topics including error handling, hardware-specific programming, communication and synchronization, and memory model considerations. Data Parallel C++ provides you with everything needed to use SYCL for programming heterogeneous systems.
    What You'll Learn:
    • Accelerate C++ programs using data-parallel programming
    • Target multiple device types (e.g. CPU, GPU, FPGA)
    • Use SYCL and SYCL compilers
    • Connect with computing's heterogeneous future via Intel's oneAPI initiative
    Who This Book Is For: Those new to data-parallel programming, and computer programmers interested in data-parallel programming using C++.
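    A minimal sketch of the book's core idea: one C++ source that offloads data-parallel work to whichever device the runtime selects. This uses the SYCL 2020 unified-shared-memory API; the vector-add kernel itself is just a placeholder workload.

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

// Vector addition offloaded via SYCL 2020 unified shared memory. The
// same source runs on a CPU, GPU or FPGA backend; the default queue
// constructor picks whatever device is present.
int main() {
    constexpr size_t n = 1024;
    sycl::queue q;  // default device selection

    float* a = sycl::malloc_shared<float>(n, q);
    float* b = sycl::malloc_shared<float>(n, q);
    float* c = sycl::malloc_shared<float>(n, q);
    for (size_t i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = float(i); }

    // One work-item per element; the runtime maps them onto the device.
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        c[i] = a[i] + b[i];
    }).wait();

    std::cout << "device: "
              << q.get_device().get_info<sycl::info::device::name>() << '\n'
              << "c[10] = " << c[10] << '\n';

    sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
}
```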

    Efficient machine learning: models and accelerations

    One of the key enablers of the recent unprecedented success of machine learning is the adoption of very large models. Modern machine learning models typically consist of multiple cascaded layers, such as deep neural networks, with at least millions to hundreds of millions of parameters (i.e., weights) in the entire model. Larger-scale models tend to enable the extraction of more complex high-level features and therefore lead to a significant improvement in overall accuracy. On the other hand, the layered deep structure and large model size also increase computational and memory requirements. In order to achieve higher scalability, performance, and energy efficiency for deep learning systems, two orthogonal research and development trends have attracted enormous interest: acceleration and model compression. The underlying goal of both is high-quality models that provide accurate predictions. In this thesis, we address these two problems and utilize different computing paradigms to solve real-life deep learning problems.
    To explore these two domains, this thesis first presents the cogent confabulation network for the sentence completion problem, using the Chinese language as a case study in cogent-confabulation-based text recognition models. These models were explored and optimized through various comparisons, and the optimized network offered better sentence-completion accuracy. To accelerate sentence completion on a multi-processing system, we propose a parallel framework for the confabulation recall algorithm. The parallel implementation reduces runtime, improves recall accuracy by breaking the fixed evaluation order and introducing more generalization, and maintains balanced progress in status updates among all neurons. A lexicon scheduling algorithm is presented to further improve model performance.
    As deep neural networks have proven effective in many real-life applications and are deployed on low-power devices, we then investigated accelerating neural network inference using a hardware-friendly computing paradigm, stochastic computing. This is an approximate computing paradigm that requires a small hardware footprint and achieves high energy efficiency. Applying stochastic computing to deep convolutional neural networks, we design the functional hardware blocks and optimize them jointly to minimize the accuracy loss due to approximation. Synthesis results show that the proposed design achieves remarkably low hardware cost and power/energy consumption.
    Modern neural networks usually contain a huge number of parameters, which cannot fit into embedded devices, so compressing deep learning models together with accelerating them attracts our attention. We introduce structured-matrix-based neural networks to address this problem. The circulant matrix is one such structured matrix: the whole matrix can be represented by a single vector, so the matrix is compressed. We further investigate a more flexible structure based on the circulant matrix, the block-circulant matrix, which partitions a matrix into several smaller blocks and makes each submatrix circulant; the compression ratio is thus controllable. With the help of an equivalent Fourier-transform-based computation, deep neural network inference can be accelerated energy-efficiently on FPGAs. We also optimize the training algorithm for block-circulant-matrix-based neural networks to obtain high accuracy after compression.
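    The compression at the heart of the approach is easy to state in code: an n×n circulant block is fully determined by its first row, so a matrix-vector product needs only that vector. The sketch below, a naive O(n²) version with illustrative values, shows the indexing; the Fourier-transform formulation mentioned above reduces the product to O(n log n).

```cpp
#include <cstdio>
#include <vector>

// A circulant matrix is fully determined by its first row r: entry
// (i, j) equals r[(j - i) mod n]. Storing r alone compresses an n*n
// block down to n values. This naive O(n^2) product shows the
// indexing; an FFT-based equivalent computation runs in O(n log n).
std::vector<double> circulant_matvec(const std::vector<double>& row,
                                     const std::vector<double>& x) {
    const std::size_t n = row.size();
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            y[i] += row[(j + n - i) % n] * x[j];  // (j - i) mod n
    return y;
}

int main() {
    // A 4x4 weight block represented by its first row: 4 values, not 16.
    std::vector<double> row = {1.0, 0.5, 0.25, 0.125};
    std::vector<double> x   = {1.0, 2.0, 3.0, 4.0};
    for (double v : circulant_matvec(row, x)) std::printf("%.3f ", v);
    std::printf("\n");
}
```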