41 research outputs found

    Asynchronous techniques for new generation variation-tolerant FPGA

    PhD Thesis. This thesis presents a practical scenario for asynchronous logic implementation that would benefit modern Field-Programmable Gate Array (FPGA) technology by improving reliability. A method based on Asynchronously-Assisted Logic (AAL) blocks is proposed in order to provide the right degree of variation tolerance, preserve as much of the traditional FPGA structure as possible, and make use of asynchrony only when necessary or beneficial for functionality. The newly proposed AAL introduces extra underlying hard-blocks that support asynchronous interaction only when needed and at minimum overhead. This has the potential to avoid the obstacles to the progress of asynchronous designs, particularly in terms of area and power overheads. The proposed approach provides a solution that is complementary to existing variation tolerance techniques such as the late-binding technique, but improves the reliability of the system as well as reducing the design’s margin headroom when implemented on programmable logic devices (PLDs) or FPGAs. The proposed method suggests the deployment of configurable AAL blocks to reinforce only the variation-critical paths (VCPs) with the help of variation maps, rather than re-mapping and re-routing. Layout-level results for this method show a worst-case increase in the CLB’s overall size of only 6.3%. The proposed strategy retains the structure of the global interconnect resources that occupy the lion’s share of the modern FPGA’s soft fabric, and yet permits the dual-rail completion-detection (DR-CD) protocol without the need to globally double the interconnect resources. Simulation results of global and interconnect voltage variations demonstrate the robustness of the method.

    Proceedings of the 5th International Workshop on Reconfigurable Communication-centric Systems on Chip 2010 - ReCoSoC'10 - May 17-19, 2010 Karlsruhe, Germany. (KIT Scientific Reports ; 7551)

    ReCoSoC is intended to be a periodic annual meeting to expose and discuss gathered expertise as well as state-of-the-art research around SoC related topics through plenary invited papers and posters. The workshop aims to provide a prospective view of tomorrow's challenges in the multibillion transistor era, taking into account the emerging techniques and architectures exploring the synergy between flexible on-chip communication and system reconfigurability.

    FPGA structures for high speed and low overhead dynamic circuit specialization

    A Field Programmable Gate Array (FPGA) is a programmable digital electronic chip. The FPGA does not come with a predefined function from the manufacturer; instead, the developer has to define its function through implementing a digital circuit on the FPGA resources. The functionality of the FPGA can be reprogrammed as desired and hence the name “field programmable”. FPGAs are useful in small volume digital electronic products as the design of a digital custom chip is expensive. Changing the FPGA (also called configuring it) is done by changing the configuration data (in the form of bitstreams) that defines the FPGA functionality. These bitstreams are stored in a memory of the FPGA called configuration memory. The SRAM cells of LookUp Tables (LUTs), Block Random Access Memories (BRAMs) and DSP blocks together form the configuration memory of an FPGA. The configuration data can be modified according to the user’s needs to implement the user-defined hardware. The simplest way to program the configuration memory is to download the bitstreams using a JTAG interface. However, modern techniques such as Partial Reconfiguration (PR) enable us to configure a part of the configuration memory with partial bitstreams during run-time. The reconfiguration is achieved by swapping partial bitstreams into the configuration memory via a configuration interface called Internal Configuration Access Port (ICAP). The ICAP is a hardware primitive (macro) present in the FPGA used to access the configuration memory internally by an embedded processor. The reconfiguration technique adds flexibility to use specialized circuits that are more compact and more efficient than their bulky counterparts. An example of such an implementation is the use of specialized multipliers instead of big generic multipliers in an FIR implementation with constant coefficients. 
To specialize these circuits and reconfigure during the run-time, researchers at the HES group proposed the novel technique called parameterized reconfiguration that can be used to efficiently and automatically implement Dynamic Circuit Specialization (DCS) that is built on top of the Partial Reconfiguration method. It uses the run-time reconfiguration technique that is tailored to implement a parameterized design. An application is said to be parameterized if some of its input values change much less frequently than the rest. These inputs are called parameters. Instead of implementing these parameters as regular inputs, in DCS these inputs are implemented as constants, and the application is optimized for the constants. For every change in parameter values, the design is re-optimized (specialized) during run-time and implemented by reconfiguring the optimized design for a new set of parameters. In DCS, the bitstreams of the parameterized design are expressed as Boolean functions of the parameters. For every infrequent change in parameters, a specialized FPGA configuration is generated by evaluating the corresponding Boolean functions, and the FPGA is reconfigured with the specialized configuration. A detailed study of overheads of DCS and providing suitable solutions with appropriate custom FPGA structures is the primary goal of the dissertation. I also suggest different improvements to the FPGA configuration memory architecture. After offering the custom FPGA structures, I investigated the role of DCS on FPGA overlays and the use of custom FPGA structures that help to reduce the overheads of DCS on FPGA overlays. By doing so, I hope I can convince the developer to use DCS (which now comes with minimal costs) in real-world applications. I start the investigations of overheads of DCS by implementing an adaptive FIR filter (using the DCS technique) on three different Xilinx FPGA platforms: Virtex-II Pro, Virtex-5, and Zynq-SoC. 
Studying how DCS behaves, and what its overhead is, across the evolution of these three FPGA platforms provides a non-trivial basis for discovering the costs of DCS. After that, I propose custom FPGA structures (reconfiguration controllers and reconfiguration drivers) to reduce the main overhead (reconfiguration time) of DCS. These structures not only reduce the reconfiguration time but also help curb the power-hungry part of the DCS system. After these chapters, I study the role of DCS on FPGA overlays. I investigate the effect of the proposed FPGA structures on Virtual-Coarse-Grained Reconfigurable Arrays (VCGRAs). I classify the VCGRA implementations into three types: the conventional VCGRA, the partially parameterized VCGRA and the fully parameterized VCGRA, depending upon the level of parameterization. I have designed two variants of VCGRA grids for HPC image processing applications, namely, the MAC grid and Pixie. Finally, I try to tackle the reconfiguration time overhead at the hardware level of the FPGA by customizing the FPGA configuration memory architecture. In this part of my research, I propose to use a parallel memory structure to improve the reconfiguration time of DCS drastically. However, this improvement comes with a significant overhead of hardware resources, which will need to be solved in future research on commercial FPGA configuration memory architectures.
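
The idea that configuration bits are Boolean functions of the parameters can be illustrated with a small sketch. This is not the dissertation's tool flow: the circuit `out = (a AND p) XOR b`, the parameter name `p`, and the helper names `parameterized_lut` and `specialize` are all hypothetical, chosen only to show how a 2-input LUT truth table might be re-evaluated when a parameter changes.

```python
from typing import Callable, Dict, List

ParamBits = Dict[str, bool]
BitFunction = Callable[[ParamBits], bool]


def parameterized_lut() -> List[BitFunction]:
    """Truth table of a 2-input LUT for out = (a AND p) XOR b.

    Entry order is (a, b) = (0,0), (0,1), (1,0), (1,1).
    Each entry is a Boolean function of the parameter p (hypothetical example).
    """
    entries: List[BitFunction] = []
    for a in (False, True):
        for b in (False, True):
            # Bind a and b now; only the parameter p stays symbolic.
            entries.append(lambda params, a=a, b=b: (a and params["p"]) != b)
    return entries


def specialize(entries: List[BitFunction], params: ParamBits) -> List[int]:
    """Evaluate every bit function for the current parameter values."""
    return [int(f(params)) for f in entries]


if __name__ == "__main__":
    lut = parameterized_lut()
    print(specialize(lut, {"p": False}))  # LUT reduces to out = b
    print(specialize(lut, {"p": True}))   # LUT reduces to out = a XOR b
```

In this toy model, a parameter change costs only the re-evaluation of four Boolean functions plus writing the resulting bits, whereas treating `p` as a regular input would need a 3-input LUT.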

    A Finite Domain Constraint Approach for Placement and Routing of Coarse-Grained Reconfigurable Architectures

    Scheduling, placement, and routing are important steps in Very Large Scale Integration (VLSI) design. Researchers have developed numerous techniques to solve placement and routing problems. As the complexity of Application Specific Integrated Circuits (ASICs) increased over the past decades, so did the demand for improved place and route techniques. The primary objective of these place and route approaches has typically been wirelength minimization due to its impact on signal delay and design performance. With the advent of Field Programmable Gate Arrays (FPGAs), the same place and route techniques were applied to FPGA-based design. However, traditional place and route techniques may not work for Coarse-Grained Reconfigurable Architectures (CGRAs), which are reconfigurable devices offering wider path widths than FPGAs and more flexibility than ASICs, due to the differences in architecture and routing network. Further, the routing network of several types of CGRAs, including the Field Programmable Object Array (FPOA), has deterministic timing as compared to the routing fabric of most ASICs and FPGAs reported in the literature. This necessitates a fresh look at alternative approaches to place and route designs. This dissertation presents a finite domain constraint-based, delay-aware placement and routing methodology targeting an FPOA. The proposed methodology takes advantage of the deterministic routing network of CGRAs to perform a delay aware placement

    3rd Many-core Applications Research Community (MARC) Symposium. (KIT Scientific Reports ; 7598)

    This manuscript includes recent scientific work regarding the Intel Single Chip Cloud computer and describes novel approaches for programming and run-time organization.

    Automated application-specific optimisation of interconnects in multi-core systems

    In embedded computer systems there are often tasks, implemented as stand-alone devices, that are both application-specific and compute intensive. A recurring problem in this area is to design these application-specific embedded systems as close to the power and efficiency envelope as possible. Work has been done on optimizing single-core systems and memory organisation, but current methods for achieving system design goals are proving limited as the system capabilities and system size increase in the multi- and many-core era. To address this problem, this thesis investigates machine learning approaches to managing the design space presented in the interconnect design of embedded multi-core systems. The design space presented is large due to the system scale and level of interconnectivity, and also features interdependent parameters, further complicating analysis. The results presented in this thesis demonstrate that machine learning approaches, particularly wkNN and random forest, work well in handling the complexity of the design space. The benefits of this approach are in automation, saving time and effort in the system design phase as well as energy and execution time in the finished system.
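
As a rough illustration of the weighted k-nearest-neighbour (wkNN) idea mentioned above, the sketch below predicts a metric for an unseen design point from already evaluated ones. The abstract does not state the features, distance, or weighting used in the thesis; inverse-distance weighting, Euclidean distance, and the function name `wknn_predict` are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def wknn_predict(
    train: List[Tuple[Sequence[float], float]],
    query: Sequence[float],
    k: int = 3,
    eps: float = 1e-9,
) -> float:
    """Inverse-distance weighted mean of the k nearest training targets.

    train holds (design_parameters, measured_metric) pairs; both are
    hypothetical stand-ins for interconnect design points and their cost.
    """
    # Keep the k points closest to the query (Euclidean distance).
    nearest = sorted((math.dist(x, query), y) for x, y in train)[:k]
    # Closer points get larger weights; eps avoids division by zero.
    weights = [1.0 / (d + eps) for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


if __name__ == "__main__":
    evaluated = [((0.0,), 1.0), ((1.0,), 3.0), ((10.0,), 100.0)]
    print(wknn_predict(evaluated, (0.5,), k=2))
```

A predictor of this kind lets a design-space search score many candidate configurations without simulating each one.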

    Simulation methodologies for future large-scale parallel systems

    Since the early 2000s, computer systems have seen a transition from single-core to multi-core systems. While single-core systems included only one processor core on a chip, current multi-core processors include up to tens of cores on a single chip, a trend which is likely to continue in the future. Today, multi-core processors are ubiquitous. They are used in all classes of computing systems, ranging from low-cost mobile phones to high-end High-Performance Computing (HPC) systems. Designing future multi-core systems is a major challenge [12]. The primary design tool used by computer architects in academia and industry is architectural simulation. Simulating a computer system executing a program is typically several orders of magnitude slower than running the program on a real system. Therefore, new techniques are needed to speed up simulation and allow the exploration of large design spaces in a reasonable amount of time. One way of increasing simulation speed is sampling. Sampling reduces simulation time by simulating only a representative subset of a program in detail. In this thesis, we present a workload analysis of a set of task-based programs. We then use the insights from this study to propose TaskPoint, a sampled simulation methodology for task-based programs. Task-based programming models can reduce the synchronization costs of parallel programs on multi-core systems and are becoming increasingly important. Finally, we present MUSA, a simulation methodology for simulating applications running on thousands of cores on a hybrid, distributed shared-memory system. The simulation time required for simulation with MUSA is comparable to the time needed for native execution of the simulated program on a production HPC system. The techniques developed in the scope of this thesis permit researchers and engineers working in computer architecture to simulate large workloads, which were infeasible to simulate in the past. 
Our work enables architectural research in the fields of future large-scale shared-memory and hybrid, distributed shared-memory systems.
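
The general principle of sampled simulation, simulating only a representative subset in detail and extrapolating to the rest, can be sketched as follows. This is not the TaskPoint algorithm itself, whose details the abstract does not give. Grouping task instances by type, sampling the first few instances of each type, and the names `sampled_total_time` and `detailed_sim` are assumptions for illustration, and the estimate is the sum of task durations rather than a parallel execution time.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Task = Tuple[str, int]  # (task_type, instance_id), a hypothetical representation


def sampled_total_time(
    tasks: List[Task],
    detailed_sim: Callable[[Task], float],
    samples_per_type: int = 2,
) -> float:
    """Estimate the summed duration of all tasks from a few detailed samples."""
    sampled: Dict[str, List[float]] = defaultdict(list)
    counts: Dict[str, int] = defaultdict(int)
    for task in tasks:
        task_type = task[0]
        counts[task_type] += 1
        # Run the expensive detailed simulation only for the first few instances.
        if len(sampled[task_type]) < samples_per_type:
            sampled[task_type].append(detailed_sim(task))
    # Extrapolate: every instance of a type is charged the sampled mean.
    return sum(counts[t] * sum(d) / len(d) for t, d in sampled.items())


if __name__ == "__main__":
    workload = [("a", i) for i in range(10)] + [("b", i) for i in range(5)]
    cost = {"a": 2.0, "b": 3.0}
    print(sampled_total_time(workload, lambda task: cost[task[0]]))
```

Here only four of the fifteen task instances go through the detailed model; the accuracy of such an estimate depends on how uniform the instances of each type are.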

    Ant colony optimization on runtime reconfigurable architectures

    Parallel Programming with Global Asynchronous Memory: Models, C++ APIs and Implementations

    In the realm of High Performance Computing (HPC), message passing has been the programming paradigm of choice for over twenty years. The durable MPI (Message Passing Interface) standard, with send/receive communication, broadcast, gather/scatter, and reduction collectives, is still used to construct parallel programs where each communication is orchestrated by the developer, based on precise knowledge of data distribution and overheads; collective communications simplify the orchestration but might induce excessive synchronization. Early attempts to bring the shared-memory programming model, with its programming advantages, to distributed computing, referred to as the Distributed Shared Memory (DSM) model, faded away; one of the main issues was combining performance and programmability with the memory consistency model. The recently proposed Partitioned Global Address Space (PGAS) model is a modern revamp of DSM that exposes data placement to enable optimizations based on locality, but it still addresses (simple) data-parallelism only and it relies on expensive sharing protocols. We advocate an alternative programming model for distributed computing based on a Global Asynchronous Memory (GAM), aiming to avoid coherency and consistency problems rather than solving them. We materialize GAM by designing and implementing a distributed smart pointers library, inspired by C++ smart pointers. In this model, public and private pointers (resembling C++ shared and unique pointers, respectively) are moved around instead of messages (i.e., data), thus relieving the user of the burden of minimizing transfers. On top of smart pointers, we propose a high-level C++ template library for writing applications in terms of dataflow-like networks, namely GAM nets, consisting of stateful processors exchanging pointers in a fully asynchronous fashion. 
We demonstrate the validity of the proposed approach, from the expressiveness perspective, by showing how GAM nets can be exploited to implement both standalone applications and higher-level parallel programming models, such as data and task parallelism. As for the performance perspective, preliminary experiments show both close-to-ideal scalability and negligible overhead with respect to state-of-the-art benchmark implementations. For instance, the GAM implementation of a high-quality video restoration filter sustains a 100 fps throughput over 70%-noisy high-quality video streams on a 4-node cluster of Graphics Processing Units (GPUs), with minimal programming effort.