48 research outputs found

    Query Profiler Versus Cache for Skyline Computation

    Get PDF
    A skyline query is multi preference user query which generates the best objects from a multi attributed dataset. Skyline computation in an optimum time becomes a real challenge when the number of user preference are large and size of the dataset is also huge. When such a big data gets queried at large, response time optimization is possible through maintenance of the metadata about the pre-executed skyline queries. We have earlier proposed, a novel structure namely �Query Profiler� which preserves such metadata about the historical queries, raised against a dataset. Also as the dataset gets queried at large, the dimensions of user queries often overlap and queries get correlated. Such correlations in user queries and the availability of metadata about the earlier queries, combined together speed up the computation time and the optimization of the response time of the further skyline computation becomes possible. In this paper, we assert the efficacy of the Query Profiler by comparing its performance with the parallel techniques which utilize cache mechanism for optimization of the response time. We also present the experimental results which assert the efficacy of the proposed technique

    Maximizing resource utilization by slicing of superscalar architecture

    Full text link
    Superscalar architectural techniques increase instruction throughput from one instruction per cycle to more than one instruction per cycle. Modern processors make use of several processing resources to achieve this kind of throughput. Control units perform various functions to minimize stalls and to ensure a continuous feed of instructions to execution units. It is vital to ensure that instructions ready for execution do not encounter a bottleneck in the execution stage; This thesis work proposes a dynamic scheme to increase efficiency of execution stage by a methodology called block slicing. Implementing this concept in a wide, superscalar pipelined architecture introduces minimal additional hardware and delay in the pipeline. The hardware required for the implementation of the proposed scheme is designed and assessed in terms of cost and delay. Performance measures of speed-up, throughput and efficiency have been evaluated for the resulting pipeline and analyzed

    ARITHMETIC LOGIC UNIT ARCHITECTURES WITH DYNAMICALLY DEFINED PRECISION

    Get PDF
    Modern central processing units (CPUs) employ arithmetic logic units (ALUs) that support statically defined precisions, often adhering to industry standards. Although CPU manufacturers highly optimize their ALUs, industry standard precisions embody accuracy and performance compromises for general purpose deployment. Hence, optimizing ALU precision holds great potential for improving speed and energy efficiency. Previous research on multiple precision ALUs focused on predefined, static precisions. Little previous work addressed ALU architectures with customized, dynamically defined precision. This dissertation presents approaches for developing dynamic precision ALU architectures for both fixed-point and floating-point to enable better performance, energy efficiency, and numeric accuracy. These new architectures enable dynamically defined precision, including support for vectorization. The new architectures also prevent performance and energy loss due to applying unnecessarily high precision on computations, which often happens with statically defined standard precisions. The new ALU architectures support different precisions through the use of configurable sub-blocks, with this dissertation including demonstration implementations for floating point adder, multiply, and fused multiply-add (FMA) circuits with 4-bit sub-blocks. For these circuits, the dynamic precision ALU speed is nearly the same as traditional ALU approaches, although the dynamic precision ALU is nearly twice as large

    Design and synthesis of a high-performance, hyper-programmable DSP on an FPGA

    Get PDF
    In the field of high performance digital signal processing, DSPs and FPGAs provide the most flexibility. Due to the extensive customization available on FPGAs, DSP algorithm implementation on an FPGA exhibits an increased development time over programming a processor. Because of this, traditional DSPs typically yield a faster time to market than an FPGA design. However, it is often desirable to have the ASIC-like performance that is attainable through the additional customization and parallel computation available through an FPGA. This can be achieved through the class of processors known as hyper-programmable DSPs. A hyper-programmable DSP is a DSP in which multiple aspects of the architecture are programmable. This thesis contributes such a DSP, targeted for high-performance and realized in hardware using an FPGA. The design consists of both a scalar datapath and a vector datapath capable of parallel operations, both of which are extensively customizable. To aid in the design of the datapaths, graphical tools are introduced as an efficient way to modify the design. A tool was also created to supply a graphical interface to help write instructions for the vector datapath. Additionally, an adaptive assembler was created to convert assembly programs to machine code for any datapath design. The resulting design was synthesized for a Cyclone III FPGA. The synthesis resulted in a design capable of running at 135MHz with 61% of the logic used by processing elements. Benchmarks were run on the design to evaluate its performance. The benchmarks showed similar performance between the proposed design and commercial DSPs for the simple benchmarks but significant improvement for the more complex ones

    Acceleration of MCMC-based algorithms using reconfigurable logic

    Get PDF
    Monte Carlo (MC) methods such as Markov chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC) have emerged as popular tools to sample from high dimensional probability distributions. Because these algorithms can draw samples effectively from arbitrary distributions in Bayesian inference problems, they have been widely used in a range of statistical applications. However, they are often too time consuming due to the prohibitive costly likelihood evaluations, thus they cannot be practically applied to complex models with large-scale datasets. Currently, the lack of sufficiently fast MCMC methods limits their applicability in many modern applications such as genetics and machine learning, and this situation is bound to get worse given the increasing adoption of big data in many fields. The objective of this dissertation is to develop, design and build efficient hardware architectures for MCMC-based algorithms on Field Programmable Gate Arrays (FPGAs), and thereby bring them closer to practical applications. The contributions of this work include: 1) Novel parallel FPGA architectures of the state-of-the-art resampling algorithms for SMC methods. The proposed architectures allow for parallel implementations and thus improve the processing speed. 2) A novel mixed precision MCMC algorithm, along with a tailored FPGA architecture. The proposed design allows for more parallelism and achieves low latency for a given set of hardware resources, while still guaranteeing unbiased estimates. 3) A new variant of subsampling MCMC method based on unequal probability sampling, along with a highly optimized FPGA architecture. The proposed method significantly reduces off-chip memory access and achieves high accuracy in estimates for a given time budget. This work has resulted in the development of hardware accelerators of MCMC and SMC for very large-scale Bayesian tasks by applying the above techniques. Notable speed improvements compared to the respective state-of-the-art CPU and GPU implementations have been achieved in this work.Open Acces

    Accelerated Financial Applications through Specialized Hardware, FPGA

    Get PDF
    This project will investigate Field Programmable Gate Array (FPGA) technology in financial applications. FPGA implementation in high performance computing is still in its infancy. Certain companies like XtremeData inc. advertized speed improvements of 50 to 1000 times for DNA sequencing using FPGAs, while using an FPGA as a coprocessor to handle specific tasks provides two to three times more processing power. FPGA technology increases performance by parallelizing calculations. This project will specifically address speed and accuracy improvements of both fundamental and transcendental functions when implemented using FPGA technology. The results of this project will lead to a series of recommendations for effective utilization of FPGA technology in financial applications

    Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks

    Full text link
    Fully realizing the potential of acceleration for Deep Neural Networks (DNNs) requires understanding and leveraging algorithmic properties. This paper builds upon the algorithmic insight that bitwidth of operations in DNNs can be reduced without compromising their classification accuracy. However, to prevent accuracy loss, the bitwidth varies significantly across DNNs and it may even be adjusted for each layer. Thus, a fixed-bitwidth accelerator would either offer limited benefits to accommodate the worst-case bitwidth requirements, or lead to a degradation in final accuracy. To alleviate these deficiencies, this work introduces dynamic bit-level fusion/decomposition as a new dimension in the design of DNN accelerators. We explore this dimension by designing Bit Fusion, a bit-flexible accelerator, that constitutes an array of bit-level processing elements that dynamically fuse to match the bitwidth of individual DNN layers. This flexibility in the architecture enables minimizing the computation and the communication at the finest granularity possible with no loss in accuracy. We evaluate the benefits of BitFusion using eight real-world feed-forward and recurrent DNNs. The proposed microarchitecture is implemented in Verilog and synthesized in 45 nm technology. Using the synthesis results and cycle accurate simulation, we compare the benefits of Bit Fusion to two state-of-the-art DNN accelerators, Eyeriss and Stripes. In the same area, frequency, and process technology, BitFusion offers 3.9x speedup and 5.1x energy savings over Eyeriss. Compared to Stripes, BitFusion provides 2.6x speedup and 3.9x energy reduction at 45 nm node when BitFusion area and frequency are set to those of Stripes. Scaling to GPU technology node of 16 nm, BitFusion almost matches the performance of a 250-Watt Titan Xp, which uses 8-bit vector instructions, while BitFusion merely consumes 895 milliwatts of power

    Numerical solutions of differential equations on FPGA-enhanced computers

    Get PDF
    Conventionally, to speed up scientific or engineering (S&E) computation programs on general-purpose computers, one may elect to use faster CPUs, more memory, systems with more efficient (though complicated) architecture, better software compilers, or even coding with assembly languages. With the emergence of Field Programmable Gate Array (FPGA) based Reconfigurable Computing (RC) technology, numerical scientists and engineers now have another option using FPGA devices as core components to address their computational problems. The hardware-programmable, low-cost, but powerful “FPGA-enhanced computer” has now become an attractive approach for many S&E applications. A new computer architecture model for FPGA-enhanced computer systems and its detailed hardware implementation are proposed for accelerating the solutions of computationally demanding and data intensive numerical PDE problems. New FPGAoptimized algorithms/methods for rapid executions of representative numerical methods such as Finite Difference Methods (FDM) and Finite Element Methods (FEM) are designed, analyzed, and implemented on it. Linear wave equations based on seismic data processing applications are adopted as the targeting PDE problems to demonstrate the effectiveness of this new computer model. Their sustained computational performances are compared with pure software programs operating on commodity CPUbased general-purpose computers. Quantitative analysis is performed from a hierarchical set of aspects as customized/extraordinary computer arithmetic or function units, compact but flexible system architecture and memory hierarchy, and hardwareoptimized numerical algorithms or methods that may be inappropriate for conventional general-purpose computers. The preferable property of in-system hardware reconfigurability of the new system is emphasized aiming at effectively accelerating the execution of complex multi-stage numerical applications. Methodologies for accelerating the targeting PDE problems as well as other numerical PDE problems, such as heat equations and Laplace equations utilizing programmable hardware resources are concluded, which imply the broad usage of the proposed FPGA-enhanced computers

    Scalable accelerator for nonuniform multi-word log-quantized neural network

    Get PDF
    Department of Electrical EngineeringLogarithmic quantization has many hardware-friendly features, but its lower accuracy in certain conditions has prevented more widespread use. Recently modified schemes have been proposed to solve the accuracy problem without compromising its hardware efficiency by selectively employing multiple words. This however causes variable-latency multiplication, demanding a new hardware architecture to support efficient mapping of large neural network layers as well as various types of convolution layers such as depthwise separable convolution. In this paper we present a novel hardware architecture for nonuniform multi-word log-quantized neural networks that is scalable with the number of processing elements while maximizing data reuse. Our architecture supports depthwise convolution and pointwise convolution as well as 3D convolution, which are important for recent mobile-friendly networks. We also propose a hardware-software cooperative optimization to reduce the impact of variable-latency multiplication on performance. Our experimental results using various convolution layers from MobileNetV2 demonstrate the speed advantage of our architecture and high scalability with the number of PEs, compared with previous architectures for depthwise separable convolution or log quantization. Our results also show that our optimization is very effective in improving the performance of our architecture.clos
    corecore