85 research outputs found

    Design and resource management of reconfigurable multiprocessors for data-parallel applications

    Get PDF
    FPGA (Field-Programmable Gate Array)-based custom reconfigurable computing machines have established themselves as low-cost and low-risk alternatives to ASIC (Application-Specific Integrated Circuit) implementations and general-purpose microprocessors in accelerating a wide range of computation-intensive applications. Most often they are Application Specific Programmable Circuiits (ASPCs), which are developer programmable instead of user programmable. The major disadvantages of ASPCs are minimal programmability, and significant time and energy overheads caused by required hardware reconfiguration when the problem size outnumbers the available reconfigurable resources; these problems are expected to become more serious with increases in the FPGA chip size. On the other hand, dominant high-performance computing systems, such as PC clusters and SMPs (Symmetric Multiprocessors), suffer from high communication latencies and/or scalability problems. This research introduces low-cost, user-programmable and reconfigurable MultiProcessor-on-a-Programmable-Chip (MPoPC) systems for high-performance, low-cost computing. It also proposes a relevant resource management framework that deals with performance, power consumption and energy issues. These semi-customized systems reduce significantly runtime device reconfiguration by employing userprogrammable processing elements that are reusable for different tasks in large, complex applications. For the sake of illustration, two different types of MPoPCs with hardware FPUs (floating-point units) are designed and implemented for credible performance evaluation and modeling: the coarse-grain MIMD (Multiple-Instruction, Multiple-Data) CG-MPoPC machine based on a processor IP (Intellectual Property) core and the mixed-mode (MIMD, SIMD or M-SIMD) variant-grain HERA (HEterogeneous Reconfigurable Architecture) machine. In addition to alleviating the above difficulties, MPoPCs can offer several performance and energy advantages to our data-parallel applications when compared to ASPCs; they are simpler and more scalable, and have less verification time and cost. Various common computation-intensive benchmark algorithms, such as matrix-matrix multiplication (MMM) and LU factorization, are studied and their parallel solutions are shown for the two MPoPCs. The performance is evaluated with large sparse real-world matrices primarily from power engineering. We expect even further performance gains on MPoPCs in the near future by employing ever improving FPGAs. The innovative nature of this work has the potential to guide research in this arising field of high-performance, low-cost reconfigurable computing. The largest advantage of reconfigurable logic lies in its large degree of hardware customization and reconfiguration which allows reusing the resources to match the computation and communication needs of applications. Therefore, a major effort in the presented design methodology for mixed-mode MPoPCs, like HERA, is devoted to effective resource management. A two-phase approach is applied. A mixed-mode weighted Task Flow Graph (w-TFG) is first constructed for any given application, where tasks are classified according to their most appropriate computing mode (e.g., SIMD or MIMD). At compile time, an architecture is customized and synthesized for the TFG using an Integer Linear Programming (ILP) formulation and a parameterized hardware component library. Various run-time scheduling schemes with different performanceenergy objectives are proposed. A system-level energy model for HERA, which is based on low-level implementation data and run-time statistics, is proposed to guide performance-energy trade-off decisions. A parallel power flow analysis technique based on Newton\u27s method is proposed and employed to verify the methodology

    Fast time- and frequency-domain finite-element methods for electromagnetic analysis

    Get PDF
    Fast electromagnetic analysis in time and frequency domain is of critical importance to the design of integrated circuits (IC) and other advanced engineering products and systems. Many IC structures constitute a very large scale problem in modeling and simulation, the size of which also continuously grows with the advancement of the processing technology. This results in numerical problems beyond the reach of existing most powerful computational resources. Different from many other engineering problems, the structure of most ICs is special in the sense that its geometry is of Manhattan type and its dielectrics are layered. Hence, it is important to develop structure-aware algorithms that take advantage of the structure specialties to speed up the computation. In addition, among existing time-domain methods, explicit methods can avoid solving a matrix equation. However, their time step is traditionally restricted by the space step for ensuring the stability of a time-domain simulation. Therefore, making explicit time-domain methods unconditionally stable is important to accelerate the computation. In addition to time-domain methods, frequency-domain methods have suffered from an indefinite system that makes an iterative solution difficult to converge fast. The first contribution of this work is a fast time-domain finite-element algorithm for the analysis and design of very large-scale on-chip circuits. The structure specialty of on-chip circuits such as Manhattan geometry and layered permittivity is preserved in the proposed algorithm. As a result, the large-scale matrix solution encountered in the 3-D circuit analysis is turned into a simple scaling of the solution of a small 1-D matrix, which can be obtained in linear (optimal) complexity with negligible cost. Furthermore, the time step size is not sacrificed, and the total number of time steps to be simulated is also significantly reduced, thus achieving a total cost reduction in CPU time. The second contribution is a new method for making an explicit time-domain finite-element method (TDFEM) unconditionally stable for general electromagnetic analysis. In this method, for a given time step, we find the unstable modes that are the root cause of instability, and deduct them directly from the system matrix resulting from a TDFEM based analysis. As a result, an explicit TDFEM simulation is made stable for an arbitrarily large time step irrespective of the space step. The third contribution is a new method for full-wave applications from low to very high frequencies in a TDFEM based on matrix exponential. In this method, we directly deduct the eigenmodes having large eigenvalues from the system matrix, thus achieving a significantly increased time step in the matrix exponential based TDFEM. The fourth contribution is a new method for transforming the indefinite system matrix of a frequency-domain FEM to a symmetric positive definite one. We deduct non-positive definite component directly from the system matrix resulting from a frequency-domain FEM-based analysis. The resulting new representation of the finite-element operator ensures an iterative solution to converge in a small number of iterations. We then add back the non-positive definite component to synthesize the original solution with negligible cost

    CMVSIM user's guide

    Get PDF
    Includes bibliographical references (p. 29-30).Supported by the National Science Foundation. MIP 91-17724A. Lumsdaine, M. Silveira, J. White

    Gated SPAD Arrays for Single-Photon Time-Resolved Imaging and Spectroscopy

    Get PDF
    In this paper, we present the architecture and the experimental characterization of an improved version of a previously developed 32 × 32 Single Photon Avalanche Diodes (SPADs) and Time to Digital Converters (TDCs) array, and two new arrays (with 8 × 8 and 128 × 1 pixels) with the additional capability of actively gating the detectors with sub-nanosecond rise time. The arrays include high performance SPADs (0.04 cps/μm2, 50% peak PDE) and provide down to 410 ps Full-Width at Half-Maximum (FWHM) single shot precision and excellent linearity. We developed a camera to exploit these imagers in time-resolved, single-photon applications

    Parallel VLSI Circuit Analysis and Optimization

    Get PDF
    The prevalence of multi-core processors in recent years has introduced new opportunities and challenges to Electronic Design Automation (EDA) research and development. In this dissertation, a few parallel Very Large Scale Integration (VLSI) circuit analysis and optimization methods which utilize the multi-core computing platform to tackle some of the most difficult contemporary Computer-Aided Design (CAD) problems are presented. The first CAD application that is addressed in this dissertation is analyzing and optimizing mesh-based clock distribution network. Mesh-based clock distribution network (also known as clock mesh) is used in high-performance microprocessor designs as a reliable way of distributing clock signals to the entire chip. The second CAD application addressed in this dissertation is the Simulation Program with Integrated Circuit Emphasis (SPICE) like circuit simulation. SPICE simulation is often regarded as the bottleneck of the design flow. Recently, parallel circuit simulation has attracted a lot of attention. The first part of the dissertation discusses circuit analysis techniques. First, a combination of clock network specific model order reduction algorithm and a port sliding scheme is presented to tackle the challenges in analyzing large clock meshes with a large number of clock drivers. Our techniques run much faster than the standard SPICE simulation and existing model order reduction techniques. They also provide a basis for the clock mesh optimization. Then, a hierarchical multi-algorithm parallel circuit simulation (HMAPS) framework is presented as an novel technique of parallel circuit simulation. The inter-algorithm parallelism approach in HMAPS is completely different from the existing intra-algorithm parallel circuit simulation techniques and achieves superlinear speedup in practice. The second part of the dissertation talks about parallel circuit optimization. A modified asynchronous parallel pattern search (APPS) based method which utilizes the efficient clock mesh simulation techniques for the clock driver size optimization problem is presented. Our modified APPS method runs much faster than a continuous optimization method and effectively reduces the clock skew for all test circuits. The third part of the dissertation describes parallel performance modeling and optimization of the HMAPS framework. The performance models and runtime optimization scheme improve the speed of HMAPS further more. The dynamically adapted HMAPS becomes a complete solution for parallel circuit simulation

    Circuit Optimization Using Efficient Parallel Pattern Search

    Get PDF
    Circuit optimization is extremely important in order to design today's high performance integrated circuits. As systems become more and more complex, traditional optimization techniques are no longer viable due to the complex and simulation intensive nature of the optimization problem. Two examples of such problems include clock mesh skew reduction and optimization of large analog systems, for example Phase locked loops. Mesh-based clock distribution has been employed in many high-performance microprocessor designs due to its favorable properties such as low clock skew and robustness. However, such clock distributions can become quite complex and may consist of hundreds of nonlinear drivers strongly coupled via a large passive network. While the simulation of clock meshes is already very time consuming, tuning such networks under tight performance constraints is an even daunting task. Same is the case with the phase locked loop. Being composed of multiple individual analog blocks, it is an extremely challenging task to optimize the entire system considering all block level trade-offs. In this work, we address these two challenging optimization problems i.e.; clock mesh skew optimization and PLL locking time reduction. The expensive objective function evaluations and difficulty in getting explicit sensitivity information make these problems intractable to standard optimization methods. We propose to explore the recently developed asynchronous parallel pattern search (APPS) method for efficient driver size tuning. While being a search-based method, APPS not only provides the desirable derivative-free optimization capability, but also is amenable to parallelization and possesses appealing theoretically rigorous convergence properties. In this work it is shown how such a method can lead to powerful parallel optimization of these complex problems with significant runtime and quality advantages over the traditional sequential quadratic programming (SQP) method. It is also shown how design-specific properties and speeding-up techniques can be exploited to make the optimization even more efficient while maintaining the convergence of APPS in a practical sense. In addition, the optimization technique is further enhanced by introducing the feature to handle non-linear constraints through the use of penalty functions. The enhanced method is used for optimizing phase locked loops at the system level

    Fast Algorithms for High Frequency Interconnect Modeling in VLSI Circuits and Packages

    Get PDF
    Interconnect modeling plays an important role in design and verification of VLSI circuits and packages. For low frequency circuits, great advances for parasitic resistance and capacitance extraction have been achieved and wide varieties of techniques are available. However, for high frequency circuits and packages, parasitic inductance and impedance extraction still poses a tremendous challenge. Existing algorithms, such as FastImp and FastHenry developed by MIT, are slow and inherently unable to handle multiple dielectrics and magnetic materials. In this research, we solve three problems in interconnect modeling for high frequency circuits and packages. 1) Multiple dielectrics are common in integrated circuits and packages. We propose the first Boundary Element Method (BEM) algorithm for impedance extraction of interconnects with multiple dielectrics. The algorithm uses a novel equivalentcharge formulation to model the extraction problem with significantly fewer unknowns. Then fast matrix-vector multiplication and effective preconditioning techniques are applied to speed up the solution of linear systems. Experimental results show that the algorithm is significantly faster than existing methods with sufficient accuracy. 2) Magnetic materials are widely used in MEMS, RFID and MRAM. We present the first BEM algorithm to extract interconnect inductance with magnetic materials. The algorithm models magnetic characteristics by the Landau Lifshitz Gilbert equation and fictitious magnetic charges. The algorithm is accelerated by approximating magnetic charge effects and by modeling currents with solenoidal basis. The relative error of the algorithm with respect to the commercial tool is below 3%, while the speed is up to one magnitude faster. 3) Since traditional interconnect model includes mutual inductances between pairs of segments, the resulting circuit matrix is very dense. This has been the main bottleneck in the use of the interconnect model. Recently, K = L-1 is used. The RKC model is sparse and stable. We study the practical issues of the RKC model. We validate the RKC model and propose an efficient way to achieve high accuracy extraction by circuit simulations of practical examples

    Solution of partial differential equations on vector and parallel computers

    Get PDF
    The present status of numerical methods for partial differential equations on vector and parallel computers was reviewed. The relevant aspects of these computers are discussed and a brief review of their development is included, with particular attention paid to those characteristics that influence algorithm selection. Both direct and iterative methods are given for elliptic equations as well as explicit and implicit methods for initial boundary value problems. The intent is to point out attractive methods as well as areas where this class of computer architecture cannot be fully utilized because of either hardware restrictions or the lack of adequate algorithms. Application areas utilizing these computers are briefly discussed

    Switched-capacitor networks for image processing : analysis, synthesis, response bounding, and implementation

    Get PDF
    Thesis (Sc. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1994.Includes bibliographical references (p. 279-284).by Mark N. Seidel.Sc.D

    Doctor of Philosophy

    Get PDF
    dissertationAsynchronous design has a very promising potential even though it has largely received a cold reception from industry. Part of this reluctance has been due to the necessity of custom design languages and computer aided design (CAD) flows to design, optimize, and validate asynchronous modules and systems. Next generation asynchronous flows should support modern programming languages (e.g., Verilog) and application specific integrated circuits (ASIC) CAD tools. They also have to support multifrequency designs with mixed synchronous (clocked) and asynchronous (unclocked) designs. This work presents a novel relative timing (RT) based methodology for generating multifrequency designs using synchronous CAD tools and flows. Synchronous CAD tools must be constrained for them to work with asynchronous circuits. Identification of these constraints and characterization flow to automatically derive the constraints is presented. The effect of the constraints on the designs and the way they are handled by the synchronous CAD tools are analyzed and reported in this work. The automation of the generation of asynchronous design templates and also the constraint generation is an important problem. Algorithms for automation of reset addition to asynchronous circuits and power and/or performance optimizations applied to the circuits using logical effort are explored thus filling an important hole in the automation flow. Constraints representing cyclic asynchronous circuits as directed acyclic graphs (DAGs) to the CAD tools is necessary for applying synchronous CAD optimizations like sizing, path delay optimizations and also using static timing analysis (STA) on these circuits. A thorough investigation for the requirements of cycle cutting while preserving timing paths is presented with an algorithm to automate the process of generating them. A large set of designs for 4 phase handshake protocol circuit implementations with early and late data validity are characterized for area, power and performance. Benchmark circuits with automated scripts to generate various configurations for better understanding of the designs are proposed and analyzed. Extension to the methodology like addition of scan insertion using automatic test pattern generation (ATPG) tools to add testability of datapath in bundled data asynchronous circuit implementations and timing closure approaches are also described. Energy, area, and performance of purely asynchronous circuits and circuits with mixed synchronous and asynchronous blocks are explored. Results indicate the benefits that can be derived by generating circuits with asynchronous components using this methodology
    corecore