14 research outputs found

    A Future for Integrated Diagnostic Helping

    Get PDF
    International audienceMedical systems used for exploration or diagnostic helping impose high applicative constraints such as real time image acquisition and displaying. A large part of computing requirement of these systems is devoted to image processing. This chapter provides clues to transfer consumers computing architecture approaches to the benefit of medical applications. The goal is to obtain fully integrated devices from diagnostic helping to autonomous lab on chip while taking into account medical domain specific constraints.This expertise is structured as follows: the first part analyzes vision based medical applications in order to extract essentials processing blocks and to show the similarities between consumer’s and medical vision based applications. The second part is devoted to the determination of elementary operators which are mostly needed in both domains. Computing capacities that are required by these operators and applications are compared to the state-of-the-art architectures in order to define an efficient algorithm-architecture adequation. Finally this part demonstrates that it's possible to use highly constrained computing architectures designed for consumers handled devices in application to medical domain. This is based on the example of a high definition (HD) video processing architecture designed to be integrated into smart phone or highly embedded components. This expertise paves the way for the industrialisation of intergraded autonomous diagnostichelping devices, by showing the feasibility of such systems. Their future use would also free the medical staff from many logistical constraints due the deployment of today’s cumbersome systems

    Reconfigurable Systems for Cryptography and Multimedia Applications

    Get PDF

    Coarse-grained reconfigurable array architectures

    Get PDF
    Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code

    Low power digital signal processing

    Get PDF

    Algoritmos para alocação de recursos em arquiteturas reconfiguraveis

    Get PDF
    Orientador: Guido Costa Souza de AraujoTese (doutorado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: Pesquisas recentes na área de arquiteturas reconfiguráveis mostram que elas oferecem um desempenho melhor que os processadores de propósito geral (GPPs - General Purpose Processors), aliado a uma maior flexibilidade que os ASICs (Application Specific Integrated Circuits). Uma mesma arquitetura recongurável pode ser adaptada para implementar aplicações diferentes, permitindo a especialização do hardware de acordo com a demanda computacional da aplicação. Neste trabalho, nos estudamos o projeto de sistemas dedicados baseado em uma arquitetura reconfigurável. Adotamos a abordagem de extensão do conjunto de instruções, na qual o conjunto de instruções de um GPP e acrescido de instruções especializadas para uma aplicação. Estas instruções correspondem a trechos da aplicação e são executadas em um datapath dinamicamente recongurável, adicionado ao hardware do GPP. O tema central desta tese e o problema de compartilhamento de recursos no projeto do datapath reconfigurável. Dado que os trechos da aplicação são modelados como grafos de luxo de dados e controle (Control/Data-Flow Graphs ¿ CDFGs), o problema de combinação de CDFGs consiste em projetar um datapath reconfigurável com área mínima. Nos apresentamos uma demonstração de que este problema e NP-completo. Nossas principais contribuições são dois algoritmos heurísticos para o problema de combinação de CDFGs. O primeiro tem o objetivo de minimizar a área das interconexões do datapath reconfigurável, enquanto que o segundo visa a minimização da área total. Avaliações experimentais mostram que nossa primeira heurística resultou em uma redução media de 26,2% na área das interconexões, em relação ao método mais utilizado na literatura. O erro máximo de nossas soluções foi em media 4,1% e algumas soluções ótimas foram obtidas. Nosso segundo algoritmo teve tempos de execução comparáveis ao método mais rápido conhecido, obtendo uma redução media de 20% na área. Em relação ao melhor método para área conhecido, nossa heurística produziu áreas um pouco menores, alcançando um speed up médio de 2500. O algoritmo proposto também produziu áreas menores, quando comparado a uma ferramenta de síntese comercialAbstract: Recent work in reconfigurable architectures shows that they ofter a better performance than general purpose processors (GPPs), while offering more exibility than ASICs (Application Specific Integrated Circuits). A reconfigurable architecture can be adapted to implement different applications, thus allowing the specialization of the hardware according to the computational demands. In this work we describe an embedded systems project based on a reconfigurable architecture. We adopt an instruction set extension technique, where specialized instructions for an application are included into the instruction set of a GPP. These instructions correspond to sections of the application, and are executed in a dynamically reconfigurable datapath, added to the GPP's hardware. The central focus of this theses is the resource sharing problem in the design of reconfigurable datapaths. Since the application sections are modeled as control/data-ow graphs (CDFGs), the CDFG merging problem consists in designing a reconfigurable datapath with minimum area. We prove that this problem is NP-complete. Our main contributions are two heuristic algorithms to the CDFG merging problem. The first has the goal of minimizing the reconfigurable datapath interconnection area, while the second minimizes its total area. Experimental evaluation showed that our first heuristic produced an average 26.2% area reduction, with respect to the most used method. The maximum error of our solutions was on average 4.1%, and some optimal solutions were found. Our second algorithm approached, in execution times, the fastest previous solution, and produced datapaths with an average area reduction of 20%. When compared to the best known area solution, our approach produced slightly better areas, while achieving an average speedup of 2500. The proposed algorithm also produced smaller areas, when compared to an industry synthesis toolDoutoradoDoutor em Ciência da Computaçã

    Interconnect-aware scheduling and resource allocation for high-level synthesis

    Get PDF
    A high-level architectural synthesis can be described as the process of transforming a behavioral description into a structural description. The scheduling, processor allocation, and register binding are the most important tasks in the high-level synthesis. In the past, it has been possible to focus simply on the delays of the processing units in a high-level synthesis and neglect the wire delays, since the overall delay of a digital system was dominated by the delay of the logic gates. However, with the process technology being scaled down to deep-submicron region, the global interconnect delays can no longer be neglected in VLSI designs. It is, therefore, imperative to include in high-level synthesis the delays on wires and buses used to communicate data between the processing units i.e., inter-processor communication delays. Furthermore, the way the process of register binding is performed also has an impact on the complexity of the interconnect paths required to transfer data between the processing units. Hence, the register binding can no longer ignore its effect on the wiring complexity of resulting designs. The objective of this thesis is to develop techniques for an interconnect-aware high-level synthesis. Under this common theme, this thesis has two distinct focuses. The first focus of this thesis is on developing a new high-level synthesis framework while taking the inter-processor communication delay into consideration. The second focus of this thesis is on the developing of a technique to carry out the register binding and a scheme to reduce the number of registers while taking the complexity of the interconnects into consideration. A novel scheduling and processor allocation technique taking into consideration the inter-processor communication delay is presented. In the proposed technique, the communication delay between a pair of nodes of different types is treated as a non-computing node, whereas that between a pair of nodes of the same type is taken into account by re-adjusting the firing times of the appropriate nodes of the data flow graph (DFG). Another technique for the integration of the placement process into the scheduling and processor allocation in order to determine the actual positions of the processing units in the placement space is developed. The proposed technique makes use of a hybrid library of functional units, which includes both operation-specific and reconfigurable multiple-operation functional units, to maximize the local data transfer. A technique for register binding that results in a reduced number of registers and interconnects is developed by appropriately dividing the lifetime of a token into multiple segments and then binding those having the same source and/or destination into a single register. A node regeneration scheme, in which the idle processing units are utilized to generate multiple copies of the nodes in a given DFG, is devised to reduce the number of registers and interconnects even further. The techniques and schemes developed in this thesis are applied to the synthesis of architectures for a number of benchmark DSP problems and compared with various other commonly used synthesis methods in order to assess their effectiveness. It is shown that the proposed techniques provide superior performance in terms of the iteration period, placement area, and the numbers of the processing units, registers and interconnects in the synthesized architectur

    Design and development from single core reconfigurable accelerators to a heterogeneous accelerator-rich platform

    Get PDF
    The performance of a platform is evaluated based on its ability to deal with the processing of multiple applications of different nature. In this context, the platform under evaluation can be of homogeneous, heterogeneous or of hybrid architecture. The selection of an architecture type is generally based on the set of different target applications and performance parameters, where the applications can be of serial or parallel nature. The evaluation is normally based on different performance metrics, e.g., resource/area utilization, execution time, power and energy consumption. This process can also include high-level performance metrics, e.g., Operations Per Second (OPS), OPS/Watt, OPS/Hz, Watt/Area etc. An example of architecture selection can be related to a wireless communication system where the processing of computationally-intensive signal-processing algorithms has strict execution-time constraints and in this case, a platform with special-purpose accelerators is relatively more suitable than a typical homogeneous platform. A couple of decades ago, it was expensive to plant many special-purpose accelerators on a chip as the cost per unit area was relatively higher than today. The utilization wall is also becoming a limiting factor in homogeneous multicore scaling which means that all the cores on a platform cannot be operated at their maximum frequency due to a possible thermal meltdown. In this case, some of the processing cores have to be turned-off or to be operated at very low frequencies making most of the part of the chip to stay underutilized. A possible solution lies in the use of heterogeneous multicore platforms where many application-specific cores operate at lower frequencies, therefore reducing power dissipation density and increasing other performance parameters. However, to achieve maximum flexibility in processing, a general-purpose flavor can also be introduced by adding a few Reduced Instruction-Set Computing (RISC) cores. A power class of heterogeneous multicore platforms is an accelerator-rich platform where many application-specific accelerators are loosely connected with each other for work load distribution or to execute the tasks independently. This research work spans from the design and development of three different types of template-based Coarse-Grain Reconfigurable Arrays (CGRAs), i.e., CREMA, AVATAR and SCREMA to a Heterogeneous Accelerator-Rich Platform (HARP). The accelerators generated from the three CGRAs could perform different lengths and types of Fast Fourier Transform (FFT), real and complex Matrix-Vector Multiplication (MVM) algorithms. CREMA and AVATAR were fixed CGRAs with eight and sixteen number of Processing Element (PE) columns, respectively. SCREMA could flex between four, eight, sixteen and thirty two number of PE columns. Many case studies were conducted to evaluate the performance of the reconfigurable accelerators generated from these CGRA templates. All of these CGRAs work in a processor/coprocessor model tightly integrated with a Direct Memory Access (DMA) device. Apart from these platforms, a reconfigurable Application-Specific Instruction-set Processor (rASIP) is also designed, tested for FFT execution under IEEE-802.11n timing constraints and evaluated against a processor/coprocessor model. It was designed by integrating AVATAR generated radix-(2, 4) FFT accelerator into the datapath of a RISC processor. The instruction set of the RISC processor was extended to perform additional operations related to AVATAR. As mentioned earlier, the underutilized part of the chip, now-a-days called Dark Silicon is posing many challenges for the designers. Apart from software optimizations, clock gating, dynamic voltage/frequency scaling and other high-level techniques, one way of dealing with this problem is to use many application-specific cores. In an effort to maximize the number of reconfigurable processing resources on a platform, the accelerator-rich architecture HARP was designed and evaluated in terms of different performance metrics. HARP is constructed on a Network-on-Chip (NoC) of 3x3 nodes where with every node, a CGRA of application-specific size is integrated other than the central node which is attached to a RISC processor. The RISC establishes synchronization between the nodes for data transfer and also performs the supervisory control. While using the NoC as the backbone of communication between the cores, it becomes possible for all the cores to address each other and also perform execution simultaneously and independently of each other. The performance of accelerators generated from CREMA, AVATAR and SCREMA templates were evaluated individually and also when attached to HARP's NoC nodes. The individual CGRAs show promising results in their own capacity but when integrated all together in the framework of HARP, interesting comparisons were established in terms of overall execution times, resource utilization, operating frequencies, power and energy consumption. In evaluating HARP, estimates and measurements were also made in some advanced performance metrics, e.g., in MOPS/mW and MOPS/MHz. The overall research work promotes the idea of heterogeneous accelerator-rich platform as a solution to current problems and future needs of industry and academia

    Design and implementation of the morphoSys reconfigurable computing processor

    No full text
    Abstract. In this paper, we describe the implementation of MorphoSys, a reconfigurable processing system targeted at data-parallel and computation-intensive applications. The MorphoSys architecture consists of a reconfigurable component (an array of reconfigurable cells) combined with a RISC control processor and a high bandwidth memory interface. We briefly discuss the system-level model, array architecture, and control processor. Next, we present the detailed design implementation and the various aspects of physical layout of different subblocks of MorphoSys. The physical layout was constrained for 100 MHz operation, with low power consumption, and was implemented using 0.35 m, four metal layer CMOS (3.3 Volts) technology. We provide simulation results for the MorphoSys architecture (based on VHDL model) for some typical data-parallel applications (video compression and automatic target recognition). The results indicate that the MorphoSys system can achieve significantly better performance for most of these applications in comparison with other systems and processors. 1
    corecore