2,742 research outputs found

    Performance comparison of single-precision SPICE Model-Evaluation on FPGA, GPU, Cell, and multi-core processors

    Automated code generation and performance-tuning techniques for concurrent architectures such as GPUs, Cell and FPGAs can provide integer-factor speedups over multi-core processor organizations for data-parallel, floating-point computation in SPICE model-evaluation. Our Verilog-AMS compiler produces code for parallel evaluation of non-linear circuit models suitable for use in SPICE simulations, where the same model is evaluated several times for all the devices in the circuit. Our compiler uses architecture-specific parallelization strategies (OpenMP for multi-core, PThreads for Cell, CUDA for GPU, statically scheduled VLIW for FPGA) when producing code for these different architectures. We automatically explore different implementation configurations (e.g. unroll factor, vector length) using our performance-tuner to identify the best possible configuration for each architecture. We demonstrate speedups of 3-182× for a Xilinx Virtex-5 LX330T, 1.3-33× for an IBM Cell, and 3-131× for an NVIDIA 9600 GT GPU over a 3 GHz Intel Xeon 5160 implementation for a variety of single-precision device models.
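
    As an illustration of the configuration exploration described above, the sketch below sweeps hypothetical unroll-factor and vector-length candidates and keeps the fastest build; the make knobs, the candidate values, and the benchmark binary are assumptions for illustration, not the authors' actual performance-tuner.

```python
# Hypothetical sketch of an autotuning sweep over implementation
# configurations (unroll factor, vector length), in the spirit of the
# performance-tuner described above. build_and_run() stands in for
# compiling the generated code and timing one model-evaluation pass.
import itertools
import subprocess
import time

UNROLL_FACTORS = [1, 2, 4, 8, 16]       # assumed candidate values
VECTOR_LENGTHS = [32, 64, 128, 256]     # assumed candidate values

def build_and_run(unroll, veclen):
    """Build the generated kernel with the given knobs and return runtime (s)."""
    subprocess.run(["make", f"UNROLL={unroll}", f"VECLEN={veclen}"], check=True)
    start = time.perf_counter()
    subprocess.run(["./model_eval_bench"], check=True)   # hypothetical benchmark binary
    return time.perf_counter() - start

best = None
for unroll, veclen in itertools.product(UNROLL_FACTORS, VECTOR_LENGTHS):
    runtime = build_and_run(unroll, veclen)
    if best is None or runtime < best[0]:
        best = (runtime, unroll, veclen)

print(f"best config: unroll={best[1]}, veclen={best[2]}, {best[0]:.4f} s")
```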

    A Structured Design Methodology for High Performance VLSI Arrays

    With the geometric growth in integrated circuit technology due to transistor scaling, together with the system-on-chip design strategy, the complexity of integrated circuits has increased manifold. Short time to market with high reliability and performance is one of the most competitive challenges. Both custom and ASIC design methodologies have evolved over time to cope with this, but the high manual labor in custom design and the statistical design in ASIC flows are still causes of concern. This work proposes a new circuit design strategy, focused mostly on arrayed structures such as TLB, RF, Cache, and IPCAM, that greatly reduces the manual effort while keeping the design regular and repetitive and still achieving high performance. The method makes the complete design a custom schematic, but built from standard cells; this requires adding some custom cells to the already exhaustive library to optimize the design for performance. Once the schematic is finalized, the designer places these standard cells in a spreadsheet, placing the cells in the critical paths close together. A Perl script then generates a Cadence Encounter-compatible placement file, and the design is routed in Encounter. Since the designer is the best judge of the circuit architecture, placement by the designer allows an optimal design to be achieved. Several designs, including IPCAM, issue logic, TLB, RF, and Cache, were carried out and their performance compared against fully custom and ASIC flows. The TLB, RF, and Cache were part of the HEMES microprocessor. Dissertation/Thesis, Ph.D. Electrical Engineering, 201
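
    The placement-generation step can be pictured with a short script that turns a spreadsheet export into per-cell coordinates; the CSV columns, site dimensions, and output syntax below are illustrative assumptions, not the thesis's Perl script or Encounter's actual file format.

```python
# Illustrative stand-in for the placement-generation step: read cell
# placements from a spreadsheet export (CSV) and emit one
# "cellName x y orientation" record per line. Column names, site
# dimensions, and output syntax are assumptions made for this sketch.
import csv

def generate_placement(csv_path, out_path, site_w=0.2, site_h=1.8):
    """Convert row/column indices from the designer's spreadsheet into coordinates (um)."""
    with open(csv_path) as fin, open(out_path, "w") as fout:
        for row in csv.DictReader(fin):
            x = int(row["col"]) * site_w      # assumed placement-site width in um
            y = int(row["row"]) * site_h      # assumed standard-cell row height in um
            fout.write(f"{row['cell']} {x:.3f} {y:.3f} {row.get('orient', 'N')}\n")

generate_placement("critical_path_cells.csv", "placement.txt")
```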

    Limits on Fundamental Limits to Computation

    An indispensable part of our lives, computing has also become essential to industries and governments. Steady improvements in computer hardware have been supported by periodic doubling of transistor densities in integrated circuits over the last fifty years. Such Moore scaling now requires increasingly heroic efforts, stimulating research in alternative hardware and stirring controversy. To help evaluate emerging technologies and enrich our understanding of integrated-circuit scaling, we review fundamental limits to computation: in manufacturing, energy, physical space, design and verification effort, and algorithms. To outline what is achievable in principle and in practice, we recall how some limits were circumvented and compare loose and tight limits. We also point out that engineering difficulties encountered by emerging technologies may indicate yet-unknown limits. Comment: 15 pages, 4 figures, 1 table

    Design of a 1-chip IBM-3270 protocol handler

    The single-chip design of a 20 MHz IBM-3270 coax protocol handler in a conventional 3 μm CMOS process technology is discussed. The harmonious combination of CMOS circuit tricks and high-level design disciplines allows the 50k-transistor design to be compiled and optimized into a 35 mm² chip in 4 man-weeks. The design methodology stresses the application of high-level silicon constructs and built-in testability.

    Photonics design tool for advanced CMOS nodes

    Recently, the authors have demonstrated large-scale integrated systems with several million transistors and hundreds of photonic elements. Yielding such large-scale integrated systems requires a design-for-manufacture rigour that is embodied in the 10 000 to 50 000 design rules that these designs must comply with in advanced complementary metal-oxide-semiconductor manufacturing. Here, the authors present a photonic design automation tool which allows automatic generation of layouts without design-rule violations. This tool is written in SKILL, the native language of the mainstream electronic design automation software, Cadence. This allows seamless integration of photonic and electronic design in a single environment. The tool leverages intuitive photonic layer definitions, allowing the designer to focus on the physical properties rather than on technology-dependent details. For the first time, the authors present an algorithm for removal of design-rule violations from photonic layouts based on Manhattan discretisation, Boolean and sizing operations. This algorithm is not limited to the implementation in SKILL, and can in principle be implemented in any scripting language. Connectivity is achieved with software-defined waveguide ports and low-level procedures that enable auto-routing of waveguide connections. Comment: 5 pages, 10 figures
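
    The general idea of the clean-up algorithm can be sketched outside SKILL as well; the snippet below illustrates Manhattan discretisation followed by grow/shrink (sizing) passes using the shapely library, with the grid and spacing values chosen arbitrarily for illustration rather than taken from the paper.

```python
# Illustration of the general approach: snap a waveguide outline to the
# manufacturing grid (Manhattan discretisation), then apply grow/shrink
# sizing passes to fill notches and drop slivers narrower than the
# assumed minimum-width/spacing rule. The paper's tool is written in
# SKILL; shapely and all numeric values here are assumptions.
from shapely.geometry import Polygon

GRID = 0.005      # assumed manufacturing grid (um)
MIN_SPACE = 0.07  # assumed minimum-width/spacing rule (um)

def snap(value, grid=GRID):
    return round(value / grid) * grid

def manhattan_discretise(poly):
    """Snap every vertex of the polygon to the manufacturing grid."""
    return Polygon([(snap(x), snap(y)) for x, y in poly.exterior.coords])

def remove_slivers(poly, min_space=MIN_SPACE):
    """Grow-then-shrink (closing) fills narrow notches; shrink-then-grow
    (opening) removes thin slivers. Mitred joins keep corners Manhattan-like.
    The sketch assumes the result stays a single polygon."""
    closed = poly.buffer(min_space / 2, join_style=2).buffer(-min_space / 2, join_style=2)
    opened = closed.buffer(-min_space / 2, join_style=2).buffer(min_space / 2, join_style=2)
    return manhattan_discretise(opened)

# Example: a coarse polygonal approximation of a bend section.
bend = Polygon([(0, 0), (1.0003, 0.0002), (1.4, 0.4), (1.4, 1.4),
                (1.0, 1.8), (0, 1.8)])
print(remove_slivers(manhattan_discretise(bend)).area)
```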

    SRAM Compiler For Automated Memory Layout Supporting Multiple Transistor Process Technologies

    This research details the design of an SRAM compiler for quickly creating SRAM blocks for Cal Poly integrated circuit (IC) designs. The compiler generates memory for two process technologies (IBM 180nm cmrf7sf and ON Semiconductor 600nm SCMOS) and requires a minimum number of specifications from the user for ease of use, while still offering the option to customize the generated SRAM cell for speed or area. By automatically creating SRAM arrays, the compiler saves the user from having to lay out and test memory by hand and allows for quick updates and changes to a design. Memory compilers with various features already exist, but they have several disadvantages: most are expensive, usually generate memory for only one process technology, and don't allow for user-defined custom SRAM cell optimizations. Because this design is free, it is available to students and institutions that would not be able to afford an industry-made compiler. A compiler that offers multiple process technologies allows more freedom to design in other processes if needed or desired. An attempt was made to keep this design modular so that new process technologies could be added with ease; however, different process technologies have different DRC rules, making that goal very difficult to attain. A customizable SRAM cell based on transistor sizing ratios allows for designs optimized for speed, area, or power, and for academic research. Even for an experienced designer, the layout of a single SRAM cell (1 bit) can take an hour. This command-line-based tool can draw a 1Kb SRAM block in seconds and a 1Mb SRAM block in about 15 minutes. In addition, the compiler adds a manually laid out precharge circuit to each of the SRAM columns for an enhanced read operation by ensuring the bit lines have valid logic output values. Finally, an analysis of SRAM cell stability is done to create a robust cell as the default design for the compiler. The default cell design is verified for stability during read and write operations, and has an area of 14.067 μm² for the cmrf7sf process and 246.42 μm² for the SCMOS process. All factors considered, this SRAM compiler design overcomes several of the drawbacks of other existing memory compilers.
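
    A minimal sketch of the array-sizing arithmetic such a compiler performs before drawing layout is shown below; the bit-cell areas are the default-cell figures quoted in the abstract, while the rows/columns heuristic and the function names are assumptions, not the compiler's actual interface.

```python
# Minimal sketch of the array-sizing arithmetic an SRAM compiler performs
# before drawing layout. Cell areas are the default-cell figures quoted
# above; the rows/columns partitioning heuristic is an assumption.
import math

CELL_AREA_UM2 = {"cmrf7sf_180nm": 14.067, "scmos_600nm": 246.42}

def plan_array(bits, process, max_columns=256):
    """Pick a near-square rows x columns organisation and estimate core area."""
    columns = min(max_columns, 2 ** math.ceil(math.log2(math.sqrt(bits))))
    rows = math.ceil(bits / columns)
    core_area = bits * CELL_AREA_UM2[process]   # bit-cell area only, no periphery
    return {"rows": rows, "columns": columns, "core_area_um2": core_area}

print(plan_array(1024, "cmrf7sf_180nm"))    # 1 Kb block
print(plan_array(2 ** 20, "scmos_600nm"))   # 1 Mb block
```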

    A Micro Power Hardware Fabric for Embedded Computing

    Field Programmable Gate Arrays (FPGAs) mitigate many of the problems encountered with the development of ASICs by offering flexibility, faster time-to-market, and amortized NRE costs, among other benefits. While FPGAs are increasingly being used for complex computational applications such as signal and image processing, networking, and cryptology, they are far from ideal for these tasks due to relatively high power consumption and silicon usage overheads compared to direct ASIC implementation. A reconfigurable device that exhibits ASIC-like power characteristics and FPGA-like costs and tool support is desirable to fill this void. In this research, a parameterized, reconfigurable fabric model named the domain-specific fabric (DSF) is developed that exhibits ASIC-like power characteristics for Digital Signal Processing (DSP) style applications. Using this model, the impact of varying different design parameters on power and performance has been studied. Different optimization techniques, such as local search and simulated annealing, are used to determine the appropriate interconnect for a specific set of applications, and a design space exploration tool has been developed to automate and generate a tailored architectural instance of the fabric. The fabric has been synthesized in a 160 nm cell-based ASIC fabrication process from OKI and a 130 nm process from IBM. A detailed power-performance analysis has been completed using signal and image processing benchmarks from the MediaBench benchmark suite and elsewhere, with comparisons to other hardware and software implementations. The optimized fabric implemented in the 130 nm process yields energy within 3X of a direct ASIC implementation, 330X better than a Virtex-II Pro FPGA, and 2016X better than an Intel XScale processor.
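
    The simulated-annealing search over interconnect parameters can be sketched as below; the parameter names, their ranges, and the placeholder cost() model are assumptions standing in for the actual design space exploration tool, which would evaluate each candidate fabric against the benchmark set.

```python
# Sketch of a simulated-annealing search over interconnect parameters,
# in the spirit of the design-space exploration described above. The
# knob names, ranges, and cost() model are placeholders; a real tool
# would map the benchmark set onto each candidate fabric and measure
# power and performance.
import math
import random

PARAM_RANGES = {"channel_width": range(2, 17), "cardinality": range(2, 9)}  # assumed knobs

def cost(cfg):
    """Placeholder energy-delay style cost for a candidate interconnect."""
    return cfg["channel_width"] * 1.7 + cfg["cardinality"] * 2.3 + random.random()

def neighbour(cfg):
    """Perturb one randomly chosen parameter."""
    key = random.choice(list(PARAM_RANGES))
    new = dict(cfg)
    new[key] = random.choice(list(PARAM_RANGES[key]))
    return new

def anneal(steps=2000, t0=10.0, alpha=0.999):
    cfg = {k: random.choice(list(v)) for k, v in PARAM_RANGES.items()}
    cur_c = cost(cfg)
    best, best_c, temp = cfg, cur_c, t0
    for _ in range(steps):
        cand = neighbour(cfg)
        cand_c = cost(cand)
        # Accept improvements, or worse moves with a temperature-scaled probability.
        if cand_c < cur_c or random.random() < math.exp((cur_c - cand_c) / temp):
            cfg, cur_c = cand, cand_c
            if cur_c < best_c:
                best, best_c = cfg, cur_c
        temp *= alpha
    return best, best_c

print(anneal())
```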