
    Run Time Approximation of Non-blocking Service Rates for Streaming Systems

    Stream processing is a compute paradigm that promises safe and efficient parallelism. Modern big-data problems are often well suited for stream processing's throughput-oriented nature. Realization of efficient stream processing requires monitoring and optimization of multiple communication links. Most techniques to optimize these links use queueing network models or network flow models, which require some idea of the actual execution rate of each independent compute kernel within the system. What we want to know is how fast each kernel can process data independently of the other kernels with which it communicates. This is known as the "service rate" of the kernel within the queueing literature. Current approaches to divining service rates are static. Modern workloads, however, are often dynamic. Shared cloud systems also present applications with highly dynamic execution environments (multiple users, hardware migration, etc.). It is therefore desirable to continuously re-tune an application during run time (online) in response to changing conditions. Our approach enables online service rate monitoring under most conditions, obviating the need for reliance on steady-state predictions for what are probably non-steady-state phenomena. First, some of the difficulties associated with online service rate determination are examined. Second, the algorithm to approximate the online non-blocking service rate is described. Lastly, the algorithm is implemented within the open source RaftLib framework for validation using a simple microbenchmark as well as two full streaming applications. Comment: technical report.
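
    The central idea can be sketched as follows: accumulate only the time a kernel spends actually processing items, excluding intervals where it is blocked on an empty input queue or a full output queue, and divide items processed by that busy time. The sketch below is a minimal illustration of that general approach, not RaftLib's actual instrumentation; the ServiceRateEstimator type and its methods are hypothetical names.

        // Hedged sketch: approximate a kernel's non-blocking service rate by
        // accumulating only the time spent processing items, excluding time
        // spent blocked on an empty input or a full output queue.
        #include <chrono>
        #include <cstdint>
        #include <cstdio>

        struct ServiceRateEstimator {
            using clock = std::chrono::steady_clock;
            std::uint64_t items_processed = 0;
            std::chrono::nanoseconds busy_time{0};
            clock::time_point start;

            void begin_work()  { start = clock::now(); }   // call once an item has dequeued
            void end_work(std::uint64_t n = 1) {            // call once output has enqueued
                busy_time += std::chrono::duration_cast<std::chrono::nanoseconds>(clock::now() - start);
                items_processed += n;
            }
            double rate_per_sec() const {                   // items per non-blocked second
                double secs = std::chrono::duration<double>(busy_time).count();
                return secs > 0.0 ? items_processed / secs : 0.0;
            }
        };

        int main() {
            ServiceRateEstimator est;
            volatile double sink = 0.0;
            for (int i = 0; i < 1000000; ++i) {
                est.begin_work();
                sink += i * 0.5;   // stand-in for the kernel's per-item computation
                est.end_work();
            }
            std::printf("approx. service rate: %.0f items/s\n", est.rate_per_sec());
            return 0;
        }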

    LSIM User Manual

    Lsim is a gate/switch-level digital logic simulator. It enables users to model digital circuits at both the gate and switch level and incorporates features that support investigation of the simulation task itself. This user's manual describes the procedures used to specify a circuit to lsim and control the simulation of the circuit (i.e., specifying input vectors, running the simulation, and monitoring output signals).

    Performance Modeling of Virtualized Custom Logic Computations

    Virtualization of custom logic computations (i.e., by sharing a fixed function across distinct data streams) provides a means of reusing hardware resources, particularly when resources are limited. This is common practice in traditional processors, where more than one user can share processor resources. In this paper, we virtualize a custom logic block using C-slow techniques to support fine-grain context switching. We then develop and present an analytic model for several performance measures (throughput, latency, input queue occupancy) for both fine-grained and coarse-grained context switching (to a secondary memory). Next, we calibrate the analytic performance model with empirical measurements. We then validate the model via discrete-event simulation and use the model to predict the performance of, and develop optimal schedules for, virtualized logic computations. We present results for a Taylor series expansion of a cosine function with added feedback and an AES encryption cipher.
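
    As a rough point of reference, the usual first-order relations for a C-slowed pipeline shared round-robin by C streams can be written down directly. The sketch below uses those textbook approximations only; it is not the calibrated analytic model presented in the paper, and all names and parameter values are illustrative.

        // Hedged sketch: first-order performance estimates for a C-slowed pipeline
        // shared round-robin by C independent streams. Not the paper's calibrated
        // queueing model, just the usual back-of-envelope relations.
        #include <cstdio>

        struct CSlowEstimate {
            double per_stream_throughput;   // results per second for one stream
            double aggregate_throughput;    // results per second across all streams
            double per_stream_latency;      // seconds from input to result for one stream
        };

        CSlowEstimate estimate(double clock_hz, int pipeline_depth, int c_contexts) {
            CSlowEstimate e;
            e.aggregate_throughput  = clock_hz;                // one result per cycle when all contexts are busy
            e.per_stream_throughput = clock_hz / c_contexts;   // each stream advances once every C cycles
            e.per_stream_latency    = (pipeline_depth * c_contexts) / clock_hz;
            return e;
        }

        int main() {
            CSlowEstimate e = estimate(200e6 /* 200 MHz */, 10 /* stages */, 4 /* contexts */);
            std::printf("per-stream: %.2e results/s, latency %.2e s, aggregate %.2e results/s\n",
                        e.per_stream_throughput, e.per_stream_latency, e.aggregate_throughput);
            return 0;
        }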

    Hierarchical Discrete-Event Simulation on Hypercube Architecture

    This paper presents a model of a hierarchical discrete-event simulation algorithm running on a hypercube architecture. We assume a static allocation of system components to processors in the hypercube. We also assume a global clock algorithm with an event-based time increment. Following development of the performance model, we describe an application of the model in the area of digital systems simulation. The hierarchical levels included are the gate level (NAND, NOR, and NOT gates) and the MSI level (multiplexors, shift registers, etc.). Example values (gathered from simulations running on standard von Neumann architectures) are provided as the model inputs to show the effect of different model parameters and partitioning strategies on simulation performance.
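
    A global clock algorithm with an event-based time increment advances simulated time directly to the earliest pending event rather than stepping cycle by cycle. The sketch below illustrates that time-advance discipline for a trivial gate-level example on a single processor; the gate model, delays, and stimuli are illustrative and are not the paper's hypercube implementation.

        // Hedged sketch: a global-clock, event-driven gate-level simulation loop.
        // The shared clock jumps to the time of the earliest pending event and every
        // event scheduled at that time is evaluated. Gates and stimuli are invented.
        #include <cstdio>
        #include <map>
        #include <vector>

        struct Event { int gate; bool value; };

        int main() {
            // time -> events scheduled at that time (the global event list)
            std::map<long, std::vector<Event>> event_list;
            std::vector<bool> gate_output(3, false);

            event_list[0].push_back({0, true});   // stimulus: gate 0 output rises at t=0
            event_list[5].push_back({1, true});   // stimulus: gate 1 output rises at t=5

            const long gate_delay = 2;
            while (!event_list.empty()) {
                auto it = event_list.begin();     // event-based time increment:
                long now = it->first;             // jump straight to the next event time
                for (const Event& ev : it->second) {
                    gate_output[ev.gate] = ev.value;
                    // gate 2 is a NAND of gates 0 and 1; schedule its new output value
                    bool nand_out = !(gate_output[0] && gate_output[1]);
                    if (ev.gate != 2 && nand_out != gate_output[2])
                        event_list[now + gate_delay].push_back({2, nand_out});
                    std::printf("t=%ld gate %d -> %d\n", now, ev.gate, (int)ev.value);
                }
                event_list.erase(it);
            }
            return 0;
        }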

    A Unified Approach to Mixed-Mode Simulation

    This paper presents a unified approach to mixed-mode simulation. It investigates the algorithms for both logic and circuit simulation, considering their similarities and differences, and presents a general framework for integrating the two algorithms in a uniform manner. The time advance mechanisms and component functional evaluations of the algorithms are shown to be similar in nature, and mechanisms for translating information represented uniquely in the two algorithms are given. The resulting integrated algorithm is capable of performing mixed-mode simulation, in which a circuit is partitioned into discrete and continuous regions and each region is simulated at the appropriate level. In addition, several issues relating to the implementation of mixed-mode simulation on multiprocessors are presented.
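
    One way to picture the unified time advance is a single ordered event queue holding both discrete logic events and timestep requests from continuous (circuit-level) regions, so the two simulators share one notion of current time. The sketch below is only an illustration of that idea under stated assumptions; it is not the framework described in the paper, and the region behaviors are placeholders.

        // Hedged sketch of a unified mixed-mode time advance: discrete logic events
        // and timestep requests from a continuous (circuit-level) region share one
        // ordered event queue, so both simulators advance under a single clock.
        #include <cstdio>
        #include <queue>
        #include <vector>

        enum class Kind { LogicEvent, CircuitStep };
        struct Event {
            double time;
            Kind kind;
            bool operator>(const Event& o) const { return time > o.time; }
        };

        int main() {
            std::priority_queue<Event, std::vector<Event>, std::greater<Event>> q;
            q.push({0.0, Kind::LogicEvent});    // discrete region activity
            q.push({0.0, Kind::CircuitStep});   // continuous region requests its first step

            const double dt = 0.5, t_end = 2.0;
            while (!q.empty()) {
                Event ev = q.top(); q.pop();    // unified time advance for both modes
                if (ev.kind == Kind::LogicEvent) {
                    std::printf("t=%.2f evaluate logic region\n", ev.time);
                    // logic transitions here would be translated into circuit stimuli
                } else {
                    std::printf("t=%.2f integrate circuit region one step\n", ev.time);
                    // voltage threshold crossings here would become logic events
                    if (ev.time + dt <= t_end) q.push({ev.time + dt, Kind::CircuitStep});
                }
            }
            return 0;
        }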

    Collecting Data About Logic Simulation

    Design of high-performance hardware- and software-based gate/switch-level logic simulators requires knowledge about the logic simulation process itself. Unfortunately, little data is publicly available concerning key aspects of this process. An example of this is the lack of published empirical measurements relating to the time distribution of events generated by such simulators. This paper presents a gate/switch-level logic simulator, lsim, which is oriented towards the collection of data about the simulation process. The basic components of lsim are reviewed, and its relevant data gathering facilities are discussed. An example is presented which illustrates the use of lsim in gathering data on event distributions and on communication requirements under alternative logic circuit partitionings.

    LSIM2 User's Manual

    Lsim2 is a gate/switch-level digital logic simulator. It enables users to model digital circuits at both the gate and switch level and incorporates features that support investigation of the simulation task itself. Lsim2 is an augmented version of the original lsim, with the addition of several new MSI-type component models. This user's manual describes procedures for specifying a circuit in lsim2, mechanisms for controlling the simulation, and approaches to modeling systems.

    Split and Merge Functions for Supporting Multiple Processing Pipelines in Mercury BLASTN

    Biosequence similarity search is an important application in computational biology. Mercury BLASTN, an FPGA-based implementation of BLAST for DNA, is one of the alternatives for fast DNA sequence comparison. The re-design of BLAST into a streaming application, combined with a high-throughput hardware pipeline, has enabled Mercury BLAST to emerge as one of the fastest implementations of bio-sequence similarity search. This performance can be further enhanced by exploiting the data-level parallelism present within the application. Here we present a multiple-FPGA Mercury BLASTN design in order to double the speed and throughput of DNA sequence computation. This paper describes a dual Mercury BLASTN design, the detailed design of the split and merge functions, and simulation results.
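
    In outline, a split function deals incoming work across the two pipelines and a merge function restores a single, correctly ordered result stream. The sketch below illustrates that pattern in software with a round-robin split keyed by chunk identifiers; it is a hypothetical illustration, not the Mercury BLASTN hardware interface or its actual dispatch policy.

        // Hedged sketch: a split function that deals database chunks round-robin to
        // two processing pipelines and a merge function that recombines their result
        // streams in original chunk order. Illustrative only.
        #include <algorithm>
        #include <cstdio>
        #include <string>
        #include <utility>
        #include <vector>

        struct Result { int chunk_id; std::string hit; };

        // Split: alternate chunks between pipeline 0 and pipeline 1.
        void split(const std::vector<std::string>& chunks,
                   std::vector<std::pair<int, std::string>> out[2]) {
            for (int i = 0; i < (int)chunks.size(); ++i)
                out[i % 2].push_back({i, chunks[i]});
        }

        // Merge: gather per-pipeline results and restore global chunk order.
        std::vector<Result> merge(const std::vector<Result>& a, const std::vector<Result>& b) {
            std::vector<Result> all(a);
            all.insert(all.end(), b.begin(), b.end());
            std::sort(all.begin(), all.end(),
                      [](const Result& x, const Result& y) { return x.chunk_id < y.chunk_id; });
            return all;
        }

        int main() {
            std::vector<std::string> db = {"ACGTAC", "TTGATT", "GGCCAA", "ATATCG"};
            std::vector<std::pair<int, std::string>> lanes[2];
            split(db, lanes);
            // Pretend each pipeline reported one hit per chunk it received.
            std::vector<Result> r0, r1;
            for (auto& c : lanes[0]) r0.push_back({c.first, "hit in " + c.second});
            for (auto& c : lanes[1]) r1.push_back({c.first, "hit in " + c.second});
            for (const Result& r : merge(r0, r1))
                std::printf("chunk %d: %s\n", r.chunk_id, r.hit.c_str());
            return 0;
        }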

    Performance Tuning of Streaming Applications via Search-space Decomposition

    High-performance streaming applications are typically pipelined and deployed on architecturally diverse (hybrid) systems. Developers of such applications are interested in customizing the components used so as to benefit application performance. We present an efficient and automatic technique for design-space exploration of applications in this problem domain. We solve performance tuning as an optimization problem by formulating cost functions using results from queueing theory. This results in a mixed-integer nonlinear optimization problem, which is NP-hard. We reduce the search complexity by decomposing the search space. We have developed a domain-specific decomposition technique using topological information about the application embodied in the queueing network models. Our analysis includes conditions under which our decomposition preserves optimality. Our preliminary empirical results confirm two benefits: solving problems that are currently not solvable using state-of-the-art solvers and, in some problem instances, improving the solver's initial solution value by over two orders of magnitude.
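
    A queueing-theoretic cost function for a pipeline typically bounds sustained throughput by the slowest stage's service rate; performance tuning then becomes choosing a configuration for each stage under a resource budget. The sketch below illustrates that framing with an exhaustive search over a toy instance; it is not the paper's mixed-integer formulation, its solver, or its decomposition technique, and all numbers are invented for illustration.

        // Hedged sketch: treat a pipelined streaming application as a tandem queueing
        // network whose sustained throughput is limited by its slowest stage, and
        // search stage configurations under a resource budget by brute force.
        #include <cstdio>
        #include <vector>

        struct Option { double service_rate; int resource_cost; };   // one candidate stage implementation

        double best = 0.0;

        void search(const std::vector<std::vector<Option>>& stages, int i,
                    int budget, double min_rate) {
            if (i == (int)stages.size()) { if (min_rate > best) best = min_rate; return; }
            for (const Option& o : stages[i]) {
                if (o.resource_cost <= budget)
                    search(stages, i + 1, budget - o.resource_cost,
                           o.service_rate < min_rate ? o.service_rate : min_rate);
            }
        }

        int main() {
            // Three stages, each with a cheap and a fast implementation (rate, cost).
            std::vector<std::vector<Option>> stages = {
                {{100, 1}, {250, 3}},
                {{150, 2}, {300, 5}},
                {{120, 1}, {400, 4}},
            };
            search(stages, 0, /*budget=*/8, /*min_rate=*/1e18);
            std::printf("best achievable pipeline throughput: %.0f items/s\n", best);
            return 0;
        }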

    Throughput-optimal systolic arrays from recurrence equations

    Many compute-bound software kernels have seen order-of-magnitude speedups on special-purpose accelerators built on specialized architectures such as field-programmable gate arrays (FPGAs). These architectures are particularly good at implementing dynamic programming algorithms that can be expressed as systems of recurrence equations, which in turn can be realized as systolic array designs. To efficiently find good realizations of an algorithm for a given hardware platform, we pursue software tools that can search the space of possible parallel array designs to optimize various design criteria. Most existing design tools in this area produce a design that is latency-space optimal. However, we instead wish to target applications that operate on a large collection of small inputs, e.g., a database of biological sequences. For such applications, overall throughput rather than latency per input is the most important measure of performance. In this work, we introduce a new procedure to optimize throughput of a systolic array subject to resource constraints, in this case the area and bandwidth constraints of an FPGA device. We show that the throughput of an array is dependent on the maximum number of lattice points executed by any processor in the array, which to a close approximation is determined solely by the array’s projection vector. We describe a bounded search process to find throughput-optimal projection vectors and a tool to perform automated design space exploration, discovering a range of array designs that are optimal for inputs of different sizes. We apply our techniques to the Nussinov RNA folding algorithm to generate multiple mappings of this algorithm into systolic arrays. By combining our library of designs with run-time reconfiguration of an FPGA device to dynamically switch among them, we predict significant speedup over a single, latency-space optimal array.
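
    The key quantity is the maximum number of lattice points that any one processing element must execute, since points that differ by a multiple of the projection vector map to the same element. The sketch below counts that maximum for a few candidate projection vectors over a small triangular 2-D domain; the domain, the candidates, and the counting method are illustrative assumptions, not the Nussinov mapping or the bounded search procedure from the paper.

        // Hedged sketch: estimate per-input work for candidate projection vectors of a
        // 2-D triangular iteration domain {(i, j) : 0 <= i <= j < N}. Points that
        // differ by a multiple of the projection vector share a processing element,
        // so the largest class size bounds how often a new input can be accepted.
        #include <algorithm>
        #include <cstdio>
        #include <map>
        #include <utility>

        int max_points_per_pe(int N, int ux, int uy) {
            std::map<std::pair<int, int>, int> count;   // canonical PE coordinate -> #points
            int worst = 0;
            for (int i = 0; i < N; ++i)
                for (int j = i; j < N; ++j) {
                    // slide (i, j) backwards along u until leaving the domain to obtain a
                    // canonical representative of its processing element
                    int x = i, y = j;
                    while (x - ux >= 0 && y - uy >= x - ux && y - uy < N) { x -= ux; y -= uy; }
                    worst = std::max(worst, ++count[{x, y}]);
                }
            return worst;
        }

        int main() {
            const int N = 16;
            const int candidates[][2] = {{1, 0}, {0, 1}, {1, 1}, {1, -1}};
            for (auto& u : candidates)
                std::printf("projection (%d,%d): max %d lattice points on one PE\n",
                            u[0], u[1], max_points_per_pe(N, u[0], u[1]));
            return 0;
        }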