
    Efficient emulation of MIMD behavior on SIMD machines

    SIMD computers have proved to be a useful and cost-effective approach to massively parallel computation. On the other hand, there are algorithms which are very inefficient when directly translated into a data-parallel program. This paper presents a number of simple transformations which are able to reduce this SIMD overhead to a moderate constant factor. It also introduces techniques for reducing the remaining overhead using Markov chain models of control flow. The optimization problems involved are NP-hard in general, but there are many useful heuristics, and closed-form optimizations for a probabilistic variant
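    The core trick behind such emulation can be illustrated with a tiny model: each processing element (PE) keeps its own virtual program counter, and the SIMD controller broadcasts instructions that only the matching elements execute. The C sketch below is a minimal, hypothetical illustration of this masked-execution idea; the instruction set, PE count, and scheduling are invented here and are not taken from the paper. The comment in the broadcast loop marks where a Markov-chain model of control flow could guide which instruction to broadcast next.

    /* Minimal illustrative sketch (not from the paper): emulating MIMD-style
     * control flow on a SIMD machine by masked execution.  Every PE keeps its
     * own virtual program counter; on each step the controller broadcasts one
     * opcode, and only the PEs whose PC points at that opcode execute it.
     * The tiny instruction set and all names here are hypothetical. */
    #include <stdio.h>

    #define NUM_PE   4
    #define PROG_LEN 5

    typedef enum { OP_INC, OP_DEC, OP_HALT } Opcode;

    int main(void) {
        /* Each PE runs the same stored program but may be at a different PC. */
        Opcode program[PROG_LEN] = { OP_INC, OP_INC, OP_DEC, OP_INC, OP_HALT };
        int pc[NUM_PE]  = { 0, 1, 2, 0 };   /* divergent control flow */
        int acc[NUM_PE] = { 0, 0, 0, 0 };
        int active = NUM_PE;

        while (active > 0) {
            /* The controller steps through opcodes; a real system could pick
             * the next opcode using frequency estimates (e.g. a Markov model
             * of control flow) to cut the number of wasted broadcasts. */
            for (int op = 0; op < PROG_LEN && active > 0; op++) {
                for (int p = 0; p < NUM_PE; p++) {
                    if (pc[p] != op) continue;          /* masked off this step */
                    switch (program[op]) {
                    case OP_INC:  acc[p]++; pc[p]++;    break;
                    case OP_DEC:  acc[p]--; pc[p]++;    break;
                    case OP_HALT: pc[p] = -1; active--; break;
                    }
                }
            }
        }
        for (int p = 0; p < NUM_PE; p++)
            printf("PE %d: acc = %d\n", p, acc[p]);
        return 0;
    }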

    Dynamically reconfigurable architecture for embedded computer vision systems

    The objective of this research work is to design, develop and implement a new architecture which integrates on the same chip all the processing levels of a complete Computer Vision system, so that execution is efficient without compromising power consumption, while keeping cost low. For this purpose, an analysis and classification of the mathematical operations and algorithms commonly used in Computer Vision are carried out, as well as an in-depth review of the image processing capabilities of current-generation hardware devices. This makes it possible to determine the requirements and the key aspects of an efficient architecture. A representative set of algorithms is employed as a benchmark to evaluate the proposed architecture, which is implemented on an FPGA-based system-on-chip. Finally, the prototype is compared to other related approaches in order to determine its advantages and weaknesses
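    For context, a representative low-level operation in such a benchmark set is a small window convolution. The C sketch below is not taken from the thesis; the image size, kernel coefficients and border handling are arbitrary choices, and it only shows the kind of regular, data-parallel computation that the low-level stage of such an architecture is meant to accelerate.

    /* Illustrative only: a 3x3 convolution of the kind commonly used as a
     * low-level Computer Vision benchmark.  Image size, coefficients and
     * border handling are arbitrary and not taken from the thesis. */
    #include <stdint.h>
    #include <stdio.h>

    #define W 8
    #define H 8

    void convolve3x3(uint8_t in[H][W], int16_t out[H][W], const int8_t k[3][3]) {
        for (int y = 1; y < H - 1; y++) {
            for (int x = 1; x < W - 1; x++) {
                int acc = 0;
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++)
                        acc += k[dy + 1][dx + 1] * in[y + dy][x + dx];
                out[y][x] = (int16_t)acc;   /* border pixels left at zero */
            }
        }
    }

    int main(void) {
        static uint8_t img[H][W];
        static int16_t edges[H][W];
        const int8_t laplacian[3][3] = { { 0, 1, 0 }, { 1, -4, 1 }, { 0, 1, 0 } };

        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++)
                img[y][x] = (uint8_t)(x * 16);   /* simple horizontal ramp */

        convolve3x3(img, edges, laplacian);
        printf("centre response: %d\n", edges[H / 2][W / 2]);
        return 0;
    }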

    Porting and optimizing a routing library for C* on the iPSC/860

    High-level data-parallel languages are easy to use and shield the programmer from machine-specific details. A simple and efficient way of providing an interface to such languages is to develop a machine-independent compiler and a routing library, which isolates the low-level machine-dependent communication functions. The compiler translates the high-level language source code into C (or some other high-level sequential language) code. It also inserts calls to the routing library routines whenever it encounters a statement in the source code that requires communication. C* is a data-parallel language designed for Connection Machine computers by Thinking Machines Corporation. A routing library for C* on the Intel Touchstone Delta was developed at the University of New Hampshire. This project dealt with porting that library to the Intel iPSC/860 system by making alterations to the library routines and the makefile. The original mesh-based communication routines were also converted to hypercube-style communications, and the assembly code generated for these hypercube routines was optimized. Various benchmark C* programs were written, their timings recorded, and speedup curves plotted
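    To illustrate what hypercube-style communication looks like compared with a mesh, the following standalone C sketch (not code from the actual routing library) performs dimension-ordered, or e-cube, routing: a message moves from source to destination by flipping the differing address bits one dimension at a time, so every hop crosses to the neighbour whose node ID differs in exactly one bit. The cube size is an arbitrary example.

    /* Illustrative sketch only (not code from the C* routing library): e-cube
     * (dimension-ordered) routing on a hypercube.  Each hop goes to the
     * neighbour whose node ID differs in exactly one bit. */
    #include <stdio.h>

    #define DIMS 3                        /* 2^3 = 8 nodes, iPSC-style cube */

    /* Print the sequence of nodes a message visits from src to dst. */
    static void ecube_route(int src, int dst) {
        int node = src;
        printf("%d", node);
        for (int d = 0; d < DIMS; d++) {
            int bit = 1 << d;
            if ((node ^ dst) & bit) {     /* this dimension still differs */
                node ^= bit;              /* hop across dimension d */
                printf(" -> %d", node);
            }
        }
        printf("\n");
    }

    int main(void) {
        ecube_route(0, 5);                /* 000 -> 001 -> 101 */
        ecube_route(6, 1);                /* 110 -> 111 -> 101 -> 001 */
        return 0;
    }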

    HW/SW mechanisms for instruction fusion, issue and commit in modern u-processors

    In this thesis we have explored the co-designed paradigm to show alternative processor design points. Specifically, we have provided HW/SW mechanisms for instruction fusion, issue and commit in modern processors. We have implemented a co-designed virtual machine monitor that binary translates x86 instructions into RISC-like micro-ops. The translations are stored as superblocks, which are traces of basic blocks. These superblocks are further optimized using speculative and non-speculative optimizations, and hardware mechanisms exist to take corrective action in case of misspeculation. During the course of this PhD we have made the following contributions. Firstly, we have provided a novel Programmable Functional Unit (PFU) to speed up general-purpose applications. The PFU consists of a grid of functional units, similar to CCA, and a distributed internal register file. The inputs of a macro-op are brought from the Physical Register File to the internal register file using a set of moves and loads. A macro-op fusion algorithm fuses micro-ops at runtime; the algorithm is based on a scheduling step that indicates whether the current fused instruction is beneficial or not. The micro-ops corresponding to a macro-op are stored as control signals in a configuration, and the macro-op carries a configuration ID that helps locate that configuration. A small configuration cache inside the PFU holds these configurations; on a miss in the configuration cache, configurations are loaded from the I-cache. Moreover, to support bulk commit of atomic superblocks that are larger than the ROB, we have proposed a speculative commit mechanism. For this we have proposed a speculative commit register map table that holds the mappings of the speculatively committed instructions; when all the instructions of the superblock have committed, the speculative state is copied to the backend register rename table. Secondly, we have proposed a co-designed in-order processor with two kinds of accelerators. These FU-based accelerators run a pair of fused instructions. We have considered two kinds of instruction fusion: first, we fuse a pair of independent loads into a vector load and execute it on a vector load unit; second, we fuse a pair of dependent simple ALU instructions and execute them in an Interlock Collapsing ALU (ICALU). Moreover, we have evaluated the performance of various code optimizations such as list scheduling, load-store telescoping and load hoisting, among others, and have compared our co-designed processor with small-instruction-window out-of-order processors. Thirdly, we have proposed a co-designed out-of-order processor, reducing complexity in two areas. First, we have co-designed the commit mechanism to enable bulk commit of atomic superblocks. In this solution we get rid of the conventional ROB and instead introduce the Superblock Ordering Buffer (SOB), which ensures that program order is maintained at the granularity of the superblock by bulk committing the program state. The program state consists of the register state and the memory state: the register state is held in a per-superblock register map table, whereas the memory state is held in a gated store buffer and updated in bulk. Furthermore, we have tackled the complexity of the out-of-order issue logic by using FIFOs.
    We have proposed an enhanced steering heuristic that fixes the inefficiencies of the existing dependence-based heuristic. Moreover, a mechanism to release FIFO entries earlier is also proposed, which further improves the performance of the steering heuristic
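    As a rough illustration of the pairwise fusion idea for the ICALU case, the C sketch below is purely hypothetical: it is not the thesis's fusion algorithm, and its simple dependence check stands in for the scheduling-based benefit test described above. It scans a micro-op trace and marks a producer-consumer pair of simple ALU operations for execution as one macro-op.

    /* Illustrative sketch only, not the thesis algorithm: fusing a pair of
     * dependent single-cycle ALU micro-ops into one macro-op, the way an
     * Interlock Collapsing ALU (ICALU) could execute them in one step.  The
     * micro-op encoding and the fusion test are hypothetical stand-ins. */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        int  dst, src1, src2;   /* register numbers */
        bool simple_alu;        /* eligible for fusion (add, sub, logic, ...) */
        bool fused_with_next;   /* set by the fusion pass */
    } MicroOp;

    /* Fuse op[i] with op[i+1] when op[i+1] is a simple ALU op that consumes
     * op[i]'s result; a real pass would also check, via a scheduling step,
     * that the fusion is actually beneficial before committing to it. */
    static void fuse_pairs(MicroOp *ops, int n) {
        for (int i = 0; i + 1 < n; i++) {
            MicroOp *a = &ops[i], *b = &ops[i + 1];
            bool dependent = (b->src1 == a->dst || b->src2 == a->dst);
            if (a->simple_alu && b->simple_alu && dependent) {
                a->fused_with_next = true;   /* emit as one ICALU macro-op */
                i++;                         /* skip the consumed micro-op */
            }
        }
    }

    int main(void) {
        MicroOp trace[] = {
            { 3, 1, 2, true,  false },   /* r3 = r1 + r2 */
            { 4, 3, 5, true,  false },   /* r4 = r3 + r5 -> fuses with above */
            { 6, 7, 8, true,  false },   /* independent, stays un-fused here */
        };
        fuse_pairs(trace, 3);
        for (int i = 0; i < 3; i++)
            printf("op %d fused_with_next = %d\n", i, trace[i].fused_with_next);
        return 0;
    }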

    Parallel Architectures for Planetary Exploration Requirements (PAPER)

    The Parallel Architectures for Planetary Exploration Requirements (PAPER) project is research oriented towards technology insertion issues for NASA's unmanned planetary probes. It was initiated to complement and augment the long-term efforts for space exploration, with particular reference to NASA/LaRC's (NASA Langley Research Center) research needs for planetary exploration missions of the mid and late 1990s. The requirements for space missions as given in the somewhat dated Advanced Information Processing Systems (AIPS) requirements document are contrasted with the new requirements from JPL/Caltech involving sensor data capture and scene analysis. It is shown that more stringent requirements have arisen as a result of technological advancements. Two possible architectures, the AIPS Proof of Concept (POC) configuration and the MAX fault-tolerant dataflow multiprocessor, were evaluated. The main observation was that the AIPS design is biased towards fault tolerance and may not be an ideal architecture for planetary and deep-space probes due to its high cost and complexity. The MAX concept appears to be a promising candidate, except that more detailed information is required, and the feasibility of adding neural computation capability to this architecture needs to be studied. Key impact issues for the architectural design of computing systems meant for planetary missions were also identified