9 research outputs found

    Vesyla-II: An Algorithm Library Development Tool for Synchoros VLSI Design Style

    High-level synthesis (HLS) has been researched for decades and is still limited to fast FPGA prototyping and algorithmic RTL generation. A feasible end-to-end system-level synthesis solution has never been rigorously demonstrated. Modularity and composability are the keys to enabling a system-level synthesis framework that bridges the huge gap between system-level specification and physical-level design. This implies that 1) modules at each abstraction level should be physically composable without any irregular glue logic, and 2) the cost of each module at each abstraction level is accurately predictable. The ultimate reasons that limit how far conventional HLS can go are precisely that it can neither generate modular designs that are physically composable nor accurately predict the cost of its designs. In this paper, we propose Vesyla, not as yet another HLS tool, but as a synthesis tool that positions itself in a promising end-to-end synthesis framework while preserving the ability to generate physically composable modular designs and to accurately predict their cost metrics. We present how Vesyla is constructed, focusing on the novel platform it targets and the internal data structures that highlight its uniqueness. We also show how Vesyla is positioned in the end-to-end synchoros synthesis framework called SiLago.
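    The composability argument above can be made concrete with a small sketch. The following Python fragment is a minimal illustration, under assumed names and composition rules (not Vesyla's actual data structures or cost model), of the idea that exact per-module costs plus glue-free abutment make system-level cost a pure composition of module costs.

```python
# Minimal sketch of the composability idea behind synchoros synthesis:
# if every module's cost metrics are known exactly and modules abut
# without glue logic, system-level cost is a simple composition.
# All names and rules here are illustrative, not Vesyla's actual API.
from dataclasses import dataclass

@dataclass
class Module:
    name: str
    area: float    # e.g. in grid cells of the synchoros fabric
    latency: int   # cycles

def compose(modules: list[Module]) -> Module:
    """Abutment-based composition: areas add up exactly, with no
    glue-logic correction terms; latency rule assumes the modules
    operate in parallel (an assumption of this toy model)."""
    return Module(
        name="+".join(m.name for m in modules),
        area=sum(m.area for m in modules),
        latency=max(m.latency for m in modules),
    )

system = compose([Module("fft", 12.0, 64), Module("fir", 4.0, 16)])
print(system)  # system-level cost is predictable from module costs
```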

    A Compiler Framework for a Coarse-Grained Reconfigurable Array

    The number of transistors on a chip increases with time, giving rise to multiple design challenges. In this context, reconfigurable architectures have emerged to provide high flexibility and low power/energy consumption while delivering high performance. The success of an embedded architecture depends on powerful compiler support, and current studies focus on developing compilers that reduce the designer's effort through automation. In this thesis work, a compiler framework is presented for a scalable Coarse-Grained Reconfigurable Array (CGRA) called SCREMA. The compiler framework replaces the existing GUI compiler and adds automatic placement and routing. The compiler receives a Reverse Polish Notation (RPN) description of the target algorithm from the user, extracts the computational information from the RPN description, and performs placement and routing over the CGRA template. The first configuration stream generated by the compiler is the main processing context. If additional configuration patterns have to be designed, the compiler framework allows them to be implemented in two different design paradigms: a pre-processing context and a canonical context. The pre-processing context is used to align the data in the CGRA to facilitate post-processing; the canonical context allows the user to perform additions in sum-of-products algorithms. The compiler framework has been tested by implementing integer Matrix-Vector Multiplication (MVM) algorithms; specifically, MVM of orders 4, 8, 16 and 32 on CGRA sizes of 4x4, 4x8, 4x16 and 4x32 PEs, respectively. All the implementations are based on the RPN description of the 4th-order MVM; the other tested MVM algorithms additionally require pre-processing and canonical contexts to be designed and implemented. The user effort previously needed to manually Place and Route (P&R) an algorithm on SCREMA is reduced, as the compiler framework provides an automatic P&R mechanism.
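    To illustrate the front end described above, here is a minimal Python sketch that parses an RPN token stream into operation nodes that a placer could then map onto PEs. The token format and node representation are illustrative assumptions, not SCREMA's actual input language or compiler data structures.

```python
# Illustrative sketch (not SCREMA's actual compiler): extracting the
# computation from an RPN description. Each operator application
# becomes a node that a placement pass could later map onto a PE.
OPS = {"+", "-", "*"}

def rpn_to_dataflow(tokens):
    """Parse RPN tokens into (op, lhs, rhs, result) node tuples."""
    stack, nodes, tmp = [], [], 0
    for tok in tokens:
        if tok in OPS:
            rhs, lhs = stack.pop(), stack.pop()
            result = f"t{tmp}"
            tmp += 1
            nodes.append((tok, lhs, rhs, result))
            stack.append(result)
        else:
            stack.append(tok)  # operand: an input value name
    return nodes

# A 4th-order inner product a0*b0 + a1*b1 + a2*b2 + a3*b3 in RPN:
tokens = "a0 b0 * a1 b1 * + a2 b2 * + a3 b3 * +".split()
for node in rpn_to_dataflow(tokens):
    print(node)
```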

    Automatic synthesis of reconfigurable instruction set accelerators


    Domain specific high performance reconfigurable architecture for a communication platform


    Implementation of Data-Driven Applications on Two-Level Reconfigurable Hardware

    Coarse-grained reconfigurable computing architectures have become an important research topic because of their high potential to accelerate a wide range of applications. These architectures exploit the concurrent nature of hardware to accelerate computations, and they can fill the existing gap between FPGAs (fine-grained reconfigurable architectures) and processors. They typically contrast with Application Specific Integrated Circuits (ASICs) in performance (worse) and flexibility (better). Programming reconfigurable computing architectures is a long-standing challenge and remains extremely inconvenient: programmers must be aware of the hardware's features and are assumed to have a good knowledge of hardware description languages such as VHDL and Verilog, rather than of the sequential programming paradigm. Implementing an algorithm on an FPGA is intrinsically more difficult than programming a processor or a GPU. Processor-based implementations "only" require a program to control their pre-synthesized data path, while an FPGA requires that a designer create both a new data path and a new controller for each application. Moreover, conceiving an architecture that best exploits the millions of logic cells and the thousands of dedicated arithmetic resources available in an FPGA is a time-consuming challenge that only talented experts in circuit design can handle. This project is founded on the generic data-driven compute fabric proposed by Prof. J.P. David and implemented by M. Allard, a previous master's student. The architecture is composed of three main components: the Shared Arithmetic Logic Unit (SALU), the Token State Machine (TSM) and the FIFO Bank (FB). It is somewhat similar to Coarse-Grained Reconfigurable Architectures (CGRAs), but it is data-driven: register banks are replaced by FBs and the controllers are TSMs. An operation starts as soon as its operands are available in the FIFOs that contain them. Data travel from FB to FB through the SALU, as programmed in the configuration memory of the TSMs, and final results return to FIFOs.
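    The data-driven firing rule described above can be illustrated with a small sketch. The following Python fragment is a toy model, not the actual SALU/TSM/FB implementation: an operation fires as soon as both of its operand FIFOs are non-empty, and the result is pushed into a destination FIFO.

```python
# Toy model of data-driven execution (illustrative names only): an
# operation fires when both operand FIFOs hold data; the result is
# appended to the destination FIFO, as a TSM's configuration memory
# might encode it.
from collections import deque

fifos = {"a": deque([1, 2, 3]), "b": deque([10, 20, 30]), "out": deque()}

# (operation, left-operand FIFO, right-operand FIFO, destination FIFO)
program = [("add", "a", "b", "out")]

def step():
    """One scheduling step: fire every operation whose operands are ready."""
    fired = False
    for op, l, r, dst in program:
        if fifos[l] and fifos[r]:          # firing rule: operands available
            x, y = fifos[l].popleft(), fifos[r].popleft()
            fifos[dst].append(x + y if op == "add" else None)
            fired = True
    return fired

while step():
    pass
print(list(fifos["out"]))  # [11, 22, 33]
```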

    Conception et implémentation d'un treillis de calcul configurable à deux niveaux [Design and Implementation of a Two-Level Configurable Compute Lattice]

    Nowadays, FPGA technology has become powerful and complex to a degree that only ASICs could reach a few years ago. FPGAs can now include several processors, specialized processing units, on-chip networks for routing internal data, and more. Although an FPGA operates at a lower frequency than a general-purpose processor, the parallel nature of hardware logic still allows algorithms to run much faster. By combining the best aspects of processor-based systems and ASICs, FPGAs have brought great flexibility and the possibility of rapid prototyping to engineers and scientists of all expertises. However, despite the many advances in describing hardware systems at higher levels of abstraction (configurable IP modules, networks on chip, configurable processors, etc.), implementing a complex architecture is still a job for specialists, typically engineers designing digital circuits. Moreover, it is a relatively long process (months or even years) if one considers not only the time to design the architecture but also the time to verify and optimize it. In addition, the available resources grow with each new FPGA generation, leading to ever more elaborate architectures that are ever harder to manage. The FPGA thus offers very high performance and unrivaled computing power, but that power is hard to use because it cannot be extracted in a simple and effective way. The ultimate goal would be to have hardware-level performance with the flexibility and simplicity of software development. This project involves the design and implementation of a new lattice (mesh) architecture for processing algorithms on a large data stream; algorithms with great potential for parallelism benefit most from this lattice. The lattice is configurable at two levels of abstraction. At the lowest level (the hardware level), the architecture consists of various blocks implementing the data and control paths; data propagate from memory to memory through ALUs. These transactions are controlled by the higher level (the software configuration level): the user implements algorithms through small instruction memories placed within the architecture, and can also dynamically reconfigure the behavior of the lattice. Software programmers can thus harness the power of a hardware implementation without having to develop it in detail, sparing them from learning to exploit FPGAs at the hardware level while keeping the internal details of the architecture confidential. This work is part of an industrial partnership with a company financing the project through a MITACS internship. The company manufactures and markets frame grabbers operating at high frequencies and offering high resolutions. These boards already use FPGAs, and an important part of these circuits is unused logic. Customers are asking to use this logic to perform preprocessing algorithms, but most of them would be unable to use it effectively and without harming the rest of the logic required for proper operation of the board. In addition, the partner does not want to disclose the source code of the architecture implemented in the FPGA. The relevance of the proposed project is therefore that boards could be delivered with the architecture in place (in addition to the logic already used), and the customer would only need to specify high-level applications.
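    A minimal sketch of the two-level idea, with all names and encodings assumed for illustration (this is not the thesis's implementation): the hardware level provides fixed memories and ALUs, while the software level is a small instruction memory whose entries route data from memory to memory through an ALU; reconfiguring means rewriting that memory.

```python
# Illustrative two-level model: memories and ALU operations are the
# fixed hardware level; the instruction memory is the software level.
mem = {"m0": [1, 2, 3, 4], "m1": [5, 6, 7, 8], "m2": []}

# Instruction memory entry: (alu_op, source A, source B, destination).
instr_mem = [("mul", "m0", "m1", "m2")]

ALU = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def run(instr_mem):
    """Data propagate from memories to memories through the ALU,
    as directed by the instruction memory."""
    for op, a, b, dst in instr_mem:
        mem[dst] = [ALU[op](x, y) for x, y in zip(mem[a], mem[b])]

run(instr_mem)
print(mem["m2"])                          # [5, 12, 21, 32]
instr_mem[0] = ("add", "m0", "m1", "m2")  # dynamic reconfiguration
run(instr_mem)
print(mem["m2"])                          # [6, 8, 10, 12]
```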

    The VLIW-SuperCISC Compiler: Exploiting Parallelism from C-Based Applications

    A common approach to decreasing embedded application execution time is creating a homogeneous parallel processor architecture. The parallelism of any such architecture is limited to the number of instructions that can be scheduled in the same cycle. This number of instructions scheduled in a cycle, or instruction-level parallelism (ILP), is limited by the ability to extract parallelism from the application. Other techniques attempt to improve performance with hardware acceleration: segments of highly computationally intensive code are extracted and custom hardware is created to replace their software execution. This technique requires many resources and still does not address the segments of code outside the computationally intensive kernel. To solve this problem, hardware acceleration for computationally intensive segments of code, in addition to accelerating the entire application with very long instruction word (VLIW) techniques, is proposed. (1) A compilation flow that targets a 4-wide VLIW processor architecture is presented. This system was used to investigate the available speed-up of VLIW architectures. The architecture was then modified to combine the VLIW processor with the capability to execute application-specific customized instructions. To create the custom instruction hardware, a control and data flow graph (CDFG) framework was created, providing a basis for compiler transformations and hardware generation. In order to remove control flow from segments of code selected for hardware generation, (2) the technique of hardware predication was developed. Hardware predication allows if-then and if-then-else control flow constructs to be transformed into strict data flow through the use of multiplexors. From the transformed CDFGs, (3) a VHDL generation pass was created that translates the compiler data structures into synthesizable VHDL. The resulting architecture contains the VLIW processor and tightly coupled application-specific hardware, and was analyzed for performance changes compared to the initial VLIW architecture and to a traditional processor. Lastly, (4) the architecture was analyzed for power and energy savings; a post-static-timing pass was added to the compilation flow to insert hardware that delays early switching of operations. By measuring only the execution of the hardware function and comparing the performance to the equivalent code executed in software, a performance multiplier of up to 322x is seen when synthesized onto an Altera Stratix II EP2S180F1508C4 FPGA, with an average increase of 63x. For the entire application, the speedup reached nearly 30x and was on average 12x better than a single-processor implementation. Power and energy measurements of the VLIW processor core and the hardware functions for the computational kernels, after 160nm OKI standard-cell ASIC synthesis, show a maximum power savings of 417x relative to execution on the processor, with an average savings of 133x. Combined with the reduced execution time, the power savings have a multiplicative effect on energy: the energy improvement for the hardware functions is several orders of magnitude, ranging from over 1,000x to approximately 60,000x.
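    The hardware predication transformation described in (2) can be sketched as follows. This Python toy models if-conversion, not the compiler's actual CDFG passes: both sides of an if-then-else are evaluated as pure data flow, and a multiplexor selects the result from the predicate.

```python
# Toy model of hardware predication (if-conversion): control flow is
# replaced by strict data flow plus a multiplexor. Names are
# illustrative, not the VLIW-SuperCISC compiler's data structures.
def mux(sel, a, b):
    """Models a 2:1 hardware multiplexor."""
    return a if sel else b

# Original control flow:
#   if x > 0: y = x + c
#   else:     y = x - c
def predicated(x, c):
    sel = x > 0       # predicate computed as data
    t_then = x + c    # then-branch, always evaluated
    t_else = x - c    # else-branch, always evaluated
    return mux(sel, t_then, t_else)

print(predicated(5, 3), predicated(-5, 3))  # 8 -8
```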

    Algoritmos para alocação de recursos em arquiteturas reconfiguráveis [Algorithms for Resource Allocation in Reconfigurable Architectures]

    Advisor: Guido Costa Souza de Araujo. Doctoral thesis, Universidade Estadual de Campinas, Instituto de Computação. Recent work on reconfigurable architectures shows that they offer better performance than general-purpose processors (GPPs) while offering more flexibility than ASICs (Application Specific Integrated Circuits). A reconfigurable architecture can be adapted to implement different applications, allowing the hardware to be specialized according to the application's computational demands. In this work we describe an embedded-systems design approach based on a reconfigurable architecture. We adopt an instruction set extension technique, in which specialized instructions for an application are added to the instruction set of a GPP. These instructions correspond to sections of the application and are executed on a dynamically reconfigurable datapath added to the GPP's hardware. The central focus of this thesis is the resource-sharing problem in the design of reconfigurable datapaths. Since the application sections are modeled as control/data-flow graphs (CDFGs), the CDFG merging problem consists of designing a reconfigurable datapath with minimum area. We prove that this problem is NP-complete. Our main contributions are two heuristic algorithms for the CDFG merging problem. The first aims to minimize the interconnection area of the reconfigurable datapath, while the second minimizes its total area. Experimental evaluation shows that our first heuristic produced an average 26.2% reduction in interconnection area with respect to the most widely used method; the maximum error of our solutions was on average 4.1%, and some optimal solutions were found. Our second algorithm matched the execution times of the fastest known method while producing datapaths with an average area reduction of 20%. Compared to the best known method for area, our heuristic produced slightly smaller areas while achieving an average speedup of 2500. The proposed algorithm also produced smaller areas when compared to a commercial synthesis tool.
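    The resource-sharing intuition behind CDFG merging can be sketched with a deliberately simplified toy. It shares only functional units and ignores the interconnect/multiplexor area that the thesis's first heuristic targets: nodes of the same operation type in two dataflow graphs are shared, so the merged datapath needs the maximum count of each unit type rather than the sum.

```python
# Toy illustration of resource sharing in CDFG merging (not the
# thesis's algorithms): a merged reconfigurable datapath instantiates
# max(count_A, count_B) units of each operation type instead of
# count_A + count_B.
from collections import Counter

def merged_units(cdfg_a, cdfg_b):
    """Each CDFG is abstracted as a list of operation types, one per node."""
    ca, cb = Counter(cdfg_a), Counter(cdfg_b)
    return {op: max(ca[op], cb[op]) for op in ca | cb}

a = ["mul", "mul", "add", "add", "add"]   # datapath for kernel A
b = ["mul", "add", "add", "sub"]          # datapath for kernel B
units = merged_units(a, b)
print(units)                               # {'mul': 2, 'add': 3, 'sub': 1}
print(sum(units.values()), "vs", len(a) + len(b))  # 6 vs 9 units
```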