99 research outputs found

    Instruction-set architecture synthesis for VLIW processors

    Automated design of domain-specific custom instructions

    Combining FPGA prototyping and high-level simulation approaches for Design Space Exploration of MPSoCs

    Modern embedded systems are parallel, component-based, heterogeneous, and finely tuned to the workload they must execute. To improve design reuse, Application-Specific Instruction-set Processors (ASIPs) are often employed as building blocks in such systems, since they can satisfy the required functional and physical constraints (e.g. throughput, latency, power or energy consumption) while remaining highly flexible and adaptable. Composing a multi-processor architecture that includes ASIPs and mapping parallel applications onto it is a design activity that requires an extensive Design Space Exploration (DSE) process to yield cost-effective systems. The work described here aims to define novel methodologies for the application-driven customization of such highly heterogeneous embedded systems. The issue is tackled at different levels, integrating different tools. High-level event-based simulation is a widely used technique whose main strengths are speed and flexibility, but it needs calibration data, both as a preliminary input and periodically during the iteration process, and these data must be acquired with more accurate evaluation methods. Typically, this calibration is performed using instruction-level cycle-accurate simulators, which turn out to be very slow, especially when complete multiprocessor systems must be evaluated or when the calibration granularity is too fine; FPGA approaches have been shown to perform better for this particular application. FPGA-based emulation techniques have recently been proposed as an alternative to software-based simulation, but some further steps are needed before they can be effectively exploited within architectural design space exploration.
First, some kind of technology awareness must be introduced, to enable the translation of emulation results into a pre-estimate of a prospective ASIC implementation of the design. Moreover, architectural DSE requires evaluating and comparing a significant number of candidate design points. In this case, if no countermeasures are taken, the emulation-speed advantage of FPGAs is offset by the time needed to go through the physical synthesis and implementation flow. The FPGA-based prototyping platform developed in this work overcomes these limitations, enabling the use of FPGA-based prototyping for micro-architectural design space exploration of ASIPs. To increase the emulation speed-up, two methods are proposed: the first automatically instantiates additional hardware modules that can reconfigure the prototype at runtime, while the second manipulates application binary code, compiled for a custom VLIW ASIP architecture, transforming it into code executable on a different configuration. This makes it possible to prototype a whole set of ASIP solutions after a single FPGA implementation flow, mitigating the aforementioned overhead. A short overview of the tools used throughout the work is also given, covering basic aspects of the Intel-Silicon Hive ASIP development toolchain and a general description of the SESAME framework, along with a review of state-of-the-art simulation and prototyping techniques for complex multi-processor systems. Each proposed approach is validated through a real-world use case, confirming the validity of the solution.

    A Probabilistic Approach for the System-Level Design of Multi-ASIP Platforms

    Computer aided design of cluster-based ASIPs

    Efficient Implementation of Particle Filters in Application-Specific Instruction-Set Processor

    This thesis considers the problem of implementing particle filters (PFs) in Application-Specific Instruction-set Processors (ASIPs). Due to the diversity and complexity of PFs, implementing them requires both computational efficiency and design flexibility. ASIP design can offer an interesting degree of design flexibility. Hence, our research focuses on improving the throughput of PFs in this flexible ASIP design environment. A general approach is first proposed to characterize the computational complexity of PFs.
Since PFs can be used for a wide variety of applications, we employ two types of blocks, application-specific and algorithm-specific, to distinguish the properties of PFs. Based on profiling results, we identify the likelihood processing and resampling processing blocks as the main bottlenecks among the algorithm-specific blocks. We explore the optimization of these two blocks at the algorithmic and architectural levels. The algorithmic level is a high level of abstraction and therefore offers high potential for speed and throughput improvements. Hence, in this work we begin at the algorithmic level by analyzing the complexity of the likelihood processing and resampling processing blocks, then proceed with their simplification and modification. We simplify the likelihood processing block by proposing a uniform quantization scheme, the Uniform Quantization Likelihood Evaluation (UQLE). The results show a significant improvement in performance without loss of accuracy. The worst case of the UQLE software implementation in fixed-point arithmetic with 32 quantized intervals achieves a 23.7× average speedup over the software implementation of ELE. We also propose two novel resampling algorithms to replace the sequential Systematic Resampling (SR) algorithm in PFs: the reformulated SR and the Parallel Systematic Resampling (PSR) algorithms. The reformulated SR algorithm combines a group of loops into a parallel loop to facilitate parallel implementation in an ASIP. The PSR algorithm makes the iterations independent, thus allowing the resampling algorithm to perform loop iterations in parallel. In addition, our proposed PSR algorithm has lower computational complexity than the SR algorithm. At the architecture level, ASIPs are appealing for the implementation of PFs because they strike a good balance between computational efficiency and design flexibility.
They can provide considerable throughput improvement through the inclusion of custom instructions, while retaining the ease of programming of general-purpose processors. Hence, after identifying the bottlenecks of PFs in the algorithm-specific blocks, we describe customized instructions for the UQLE, reformulated SR, and PSR algorithms in an ASIP. These instructions provide significantly higher throughput than a pure software implementation running on a general-purpose processor. The custom instruction implementation of UQLE with 32 intervals achieves a 34× speedup over the worst case of its software implementation, with 3.75 K additional gates. An implementation of the reformulated SR algorithm is evaluated with four weights calculated in parallel and eight categories defined by uniformly distributed numbers that are compared simultaneously. It achieves a 23.9× speedup over the sequential SR algorithm on a general-purpose processor, at a cost of only 54 K additional gates. For the PSR algorithm, four custom instructions, when configured to support four weights input in parallel, lead to a 53.4× speedup over the floating-point SR implementation on a general-purpose processor, at a cost of 47.3 K additional gates. Finally, we consider the specific application of video tracking and an implementation of a histogram-based PF in an ASIP. We identify the histogram calculation as the main bottleneck in the application-specific blocks. We therefore propose a Parallel Array Histogram Architecture (PAHA) engine for accelerating the histogram calculation in ASIPs. Implementation results show that a 16-way PAHA can achieve a speedup of 43.75× compared to its software implementation on a general-purpose processor.
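
    For readers unfamiliar with the baseline this abstract refers to, the following is a minimal Python sketch of the standard, sequential systematic resampling step as it is commonly described in the particle-filter literature. It is not the thesis's reformulated SR or PSR algorithm, and the function name `systematic_resample` and the optional `u0` parameter are hypothetical names chosen here for illustration.

```python
import random
from typing import List, Optional


def systematic_resample(weights: List[float], u0: Optional[float] = None) -> List[int]:
    """Return the indices of the particles kept by systematic resampling.

    `u0` is an illustrative hook for supplying the single random offset,
    which should lie in [0, 1/n); it is drawn at random when omitted.
    """
    n = len(weights)
    if n == 0:
        return []
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")

    # A single uniform draw positions all n evenly spaced pointers.
    if u0 is None:
        u0 = random.random() / n

    indices = []
    i = 0
    cumulative = weights[0] / total  # running sum of normalized weights
    for j in range(n):
        pointer = u0 + j / n
        # Advance through the cumulative weights until the pointer is covered.
        # `i` and `cumulative` carry over between iterations of the outer loop.
        while pointer > cumulative and i < n - 1:
            i += 1
            cumulative += weights[i] / total
        indices.append(i)
    return indices
```

    In this sketch the values of `i` and `cumulative` are carried from one outer iteration to the next, so the loop cannot be split across parallel lanes as written. This is the kind of loop-carried dependency that the abstract says the PSR algorithm removes by making the iterations independent.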