37 research outputs found

    Efficient Implementation of Particle Filters in Application-Specific Instruction-Set Processor

    Get PDF
    This thesis considers the problem of implementing particle filters (PFs) in Application-Specific Instruction-set Processors (ASIPs). Due to the diversity and complexity of PFs, implementing them requires both computational efficiency and design flexibility, and ASIP design offers a useful degree of flexibility. Our research therefore focuses on improving the throughput of PFs in an ASIP design environment. A general approach is first proposed to characterize the computational complexity of PFs. Since PFs can be used for a wide variety of applications, we employ two types of blocks, application-specific and algorithm-specific, to distinguish the properties of PFs. Based on profiling results, we identify the likelihood processing and resampling blocks as the main bottlenecks among the algorithm-specific blocks, and we explore the optimization of these two blocks at the algorithmic and architectural levels. The algorithmic level offers the greatest potential for speed and throughput improvements, so we begin there by analyzing the complexity of the likelihood processing and resampling blocks, then proceed with their simplification and modification. We simplify the likelihood processing block by proposing a uniform quantization scheme, the Uniform Quantization Likelihood Evaluation (UQLE). The results show a significant performance improvement without loss of accuracy: the worst case of the UQLE software implementation in fixed-point arithmetic with 32 quantized intervals achieves a 23.7× average speedup over the software implementation of ELE. We also propose two novel resampling algorithms to replace the sequential Systematic Resampling (SR) algorithm in PFs: the reformulated SR and Parallel Systematic Resampling (PSR) algorithms. The reformulated SR algorithm combines a group of loops into a single loop to facilitate parallel implementation in an ASIP. The PSR algorithm makes the iterations independent, allowing the resampling loop to execute in parallel, and it has lower computational complexity than the SR algorithm.
    At the architectural level, ASIPs are appealing for the implementation of PFs because they strike a good balance between computational efficiency and design flexibility: they can provide considerable throughput improvement through custom instructions while retaining the ease of programming of general-purpose processors. After identifying the bottlenecks of PFs in the algorithm-specific blocks, we therefore design customized instructions for the UQLE, reformulated SR, and PSR algorithms in an ASIP. These instructions provide significantly higher throughput than a pure software implementation running on a general-purpose processor. The custom-instruction implementation of UQLE with 32 intervals achieves a 34× speedup over the worst case of its software implementation, at a cost of 3.75 K additional gates. An implementation of the reformulated SR algorithm, with four weights calculated in parallel and eight categories defined by uniformly distributed bounds that are compared simultaneously, achieves a 23.9× speedup over the sequential SR algorithm on a general-purpose processor, at a cost of only 54 K additional gates. For the PSR algorithm, four custom instructions configured to accept four weights in parallel lead to a 53.4× speedup over the floating-point SR implementation on a general-purpose processor, at a cost of 47.3 K additional gates. Finally, we consider the specific application of video tracking and implement a histogram-based PF in an ASIP. We identify the histogram calculation as the main bottleneck among the application-specific blocks and therefore propose a Parallel Array Histogram Architecture (PAHA) engine to accelerate histogram calculation in ASIPs. Implementation results show that a 16-way PAHA achieves a 43.75× speedup over its software implementation on a general-purpose processor.
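
    As background, the baseline that both the reformulated SR and PSR algorithms set out to improve is the standard sequential systematic resampling loop. The sketch below, in C, is a generic textbook formulation under the assumption of normalized weights; the function and variable names are illustrative and are not taken from the thesis. The loop-carried dependency on the running cumulative sum is exactly what the reformulated SR and PSR algorithms restructure so that iterations can execute in parallel.

        /* Generic sequential systematic resampling (SR) sketch.
         * Assumes the n weights in w[] are normalized to sum to 1.
         * out_idx[j] receives the index of the particle replicated into slot j. */
        #include <stdlib.h>

        void systematic_resample(const double *w, int *out_idx, int n)
        {
            /* One random offset in [0, 1/n); all other thresholds are evenly spaced. */
            double u = ((double)rand() / ((double)RAND_MAX + 1.0)) / n;
            double cumsum = w[0];
            int i = 0;

            for (int j = 0; j < n; j++) {
                double threshold = u + (double)j / n;
                /* Advance through the cumulative weight sum until it covers the
                 * threshold. cumsum and i carry across iterations, which is what
                 * serializes the loop. */
                while (cumsum < threshold && i < n - 1) {
                    i++;
                    cumsum += w[i];
                }
                out_idx[j] = i;
            }
        }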

    Techniques d'exploration architecturale de design à usage spécifique pour l'accélération de boucles

    Get PDF
    Time to market is a major concern in industry, which is why industry constantly looks for new CAD tools that help reduce design time. Application-specific instruction-set processors (ASIPs) provide flexibility while reaching good performance when they are well designed. One increasingly successful trend is C-based design, in which the system behaviour is specified in a high-level language such as C or SystemC; this specification is then used during the partitioning phase to determine the software and hardware components of the system. Since automatic processor generators are now mature, designers have a new type of architecture, namely ASIPs, that they can rely on during architecture design and that is generated from a high-level specification. In the hardware world, high-level synthesis has long been, and still is, an active research topic, and advances in ESL have led to commercial high-level synthesis tools such as Catapult C and Forte's Cynthesizer. Designers therefore have more tools in their toolbox, but also many more candidate designs to explore, which can have the reverse effect of increasing rather than reducing design time.
    Our doctoral research tackles this issue by proposing methodologies for design-space exploration of application-specific architectures for loop acceleration, in order to reduce design time while meeting performance targets. The thesis starts with the exploration of ASIP designs. We propose a method that targets loop acceleration with tightly coupled specialized instructions executing loop operations. Loops are good candidates for acceleration when they offer parallelization opportunities and those opportunities are well exploited. Hardware components such as specialized instructions can leverage parallelization opportunities at the instruction level. We therefore extract loop parallelization opportunities and execute more concurrent operations in specialized instructions; the method also relies on optimizing data handling within the processor architecture. The main contribution of this method is a new approach to specialized-instruction (SI) design based on loop acceleration, in which loop optimization and transformation are done directly in the SIs instead of in the software code. Another contribution is the design of tightly coupled specialized instructions associated with loops, based on a 5-pattern representation.
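
    To illustrate the kind of transformation this describes, the sketch below shows a plain C loop and the same loop after its body operations have been folded into a specialized instruction that consumes several operand pairs at once. The intrinsic si_mac4() is hypothetical (modeled here in plain C), standing in for an instruction generated by an ASIP flow; it is not an API from the thesis or from any particular tool.

        #include <stdint.h>

        /* Reference loop: one multiply-accumulate per iteration. */
        int32_t dot_ref(const int16_t *a, const int16_t *b, int n)
        {
            int32_t acc = 0;
            for (int i = 0; i < n; i++)
                acc += (int32_t)a[i] * b[i];
            return acc;
        }

        /* Software model of a hypothetical specialized instruction that performs
         * four multiply-accumulates at once; in an ASIP this would be a single
         * custom instruction exposed as a compiler intrinsic. */
        static int32_t si_mac4(const int16_t *a, const int16_t *b, int32_t acc)
        {
            return acc + (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1]
                       + (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
        }

        /* Same loop after mapping four concurrent operations per call
         * (n is assumed to be a multiple of 4 for brevity). */
        int32_t dot_si(const int16_t *a, const int16_t *b, int n)
        {
            int32_t acc = 0;
            for (int i = 0; i < n; i += 4)
                acc = si_mac4(&a[i], &b[i], acc);
            return acc;
        }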

    Techniques for Crafting Customizable MPSoCs

    Get PDF
    Ph.D. (Doctor of Philosophy)

    Static resource models for code generation of embedded processors

    Get PDF
    xii + 129 hlm.; 24 cm

    Advanced Applications of Rapid Prototyping Technology in Modern Engineering

    Get PDF
    Rapid prototyping (RP) technology has been widely known and appreciated for its flexible and customized manufacturing capabilities. The widely studied RP techniques include stereolithography apparatus (SLA), selective laser sintering (SLS), three-dimensional printing (3DP), fused deposition modeling (FDM), 3D plotting, solid ground curing (SGC), multiphase jet solidification (MJS), and laminated object manufacturing (LOM). Different techniques are associated with different materials and/or processing principles and are thus devoted to specific applications. RP technology is no longer used only for prototype building; it has been extended to real industrial manufacturing solutions. Today, RP technology has contributed to almost all engineering areas, including mechanical, materials, industrial, aerospace, electrical, and most recently biomedical engineering. This book aims to present the advanced development of RP technologies in various engineering areas as solutions to real-world engineering problems.

    Customising compilers for customisable processors

    Get PDF
    The automatic generation of instruction set extensions to provide application-specific acceleration for embedded processors has been a productive area of research in recent years. There have been incremental improvements in the quality of the algorithms that discover and select which instructions to add to a processor. The use of automatic algorithms, however, results in instructions that are radically different from those found in conventional, human-designed RISC or CISC ISAs. This has created a gap between the hardware's capabilities and the compiler's ability to exploit them. This thesis proposes and investigates a high-level compiler pass that uses graph-subgraph isomorphism checking to exploit these complex instructions. Operating in a separate pass permits techniques to be applied that are uniquely suited to mapping complex instructions but unsuitable for conventional instruction selection; the existing, mature compiler back-end can then handle the remainder of the compilation. With this method, the high-level pass was able to use 1965 different automatically produced instructions to obtain an initial average speed-up of 1.11x over 179 benchmarks evaluated on a hardware-verified cycle-accurate simulator. This result was improved following an investigation of how the produced instructions were being used by the compiler. It was established that the models the automatic tools were using to develop instructions did not take into account how well the compiler could realistically use them. Adding parameters to the search heuristic to account for compiler issues increased the speed-up from 1.11x to 1.24x. An alternative approach using a redesigned hardware interface was also investigated and achieved a speed-up of 1.26x while reducing hardware and compiler complexity. A complementary, high-level method of exploiting dual memory banks was created to increase memory bandwidth to accommodate the increased data-processing bandwidth provided by extension instructions. Finally, the compiler was considered for a non-conventional role in which, rather than generating code, it applies source-level transformations prior to the generation of extension instructions, thereby shaping the instructions that are generated.
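
    The sketch below gives a deliberately simplified picture of what such a high-level mapping pass does: it walks the compiler's dataflow representation looking for occurrences of a small pattern (here, a multiply feeding an add, as a stand-in for one of the automatically produced complex instructions) that could be rewritten to a single extension instruction. A real graph-subgraph isomorphism check over DAGs with operand sharing and constraints is considerably more involved; the node structure and opcode names are assumptions made for illustration only.

        #include <stddef.h>

        typedef enum { OP_ADD, OP_MUL, OP_LEAF } opcode_t;

        typedef struct node {
            opcode_t op;
            struct node *lhs, *rhs;   /* NULL for leaves */
        } node_t;

        /* Does this node match the (a * b) + c pattern of a fused
         * multiply-add style extension instruction? */
        static int matches_fma(const node_t *n)
        {
            if (n == NULL || n->op != OP_ADD)
                return 0;
            return (n->lhs != NULL && n->lhs->op == OP_MUL) ||
                   (n->rhs != NULL && n->rhs->op == OP_MUL);
        }

        /* Post-order walk counting the sites where the pattern could be
         * mapped; a real pass would rewrite them to the extension opcode. */
        static int count_candidate_sites(const node_t *n)
        {
            if (n == NULL || n->op == OP_LEAF)
                return 0;
            return count_candidate_sites(n->lhs) +
                   count_candidate_sites(n->rhs) +
                   (matches_fma(n) ? 1 : 0);
        }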

    Efficient design-space exploration of custom instruction-set extensions

    Get PDF
    Customization of processors with instruction set extensions (ISEs) is a technique that improves performance through parallelization with a reasonable area overhead, in exchange for additional design effort. This thesis presents a collection of novel techniques that reduce the design effort and cost of generating ISEs by advancing automation and reconfigurability. In addition, these techniques maximize the performance gained as a function of the additionally committed resources. Including ISEs in a processor design implies development at many levels. Most prior work on ISEs solves separate stages of the design: identification, selection, and implementation. However, the interactions between these stages also hold important design trade-offs. In particular, this thesis addresses the lack of interaction between the hardware implementation stage and the two previous stages. Interaction with the implementation stage has mostly been limited to accurately measuring the area and timing requirements of each ISE candidate implemented as a separate hardware module. However, the need to independently generate a hardware datapath for each ISE limits the flexibility of the design and the performance gains; resource sharing is therefore essential in order to create a customized unit with multi-function capabilities. Previously proposed resource-sharing techniques aggressively share resources amongst the ISEs, minimizing the area of the solution at any cost, but it is shown that aggressive sharing leads to large ISE datapath latency. This thesis therefore presents an original heuristic that can be parameterized to control the degree of resource sharing amongst a given set of ISEs, permitting exploration of the implementation trade-offs between instruction latency and area savings. In addition, this thesis introduces a predictive model that quickly exposes the optimal trade-offs of this design space; compared to an exhaustive exploration, the predictive model reduces by two orders of magnitude the number of executions of the resource-sharing algorithm required to find the optimal trade-offs.
    This thesis then presents the first technique to combine the design spaces of ISE selection and resource sharing in ISE datapath synthesis, offering the designer solutions that achieve maximum speedup and maximum resource utilization within the available area. Optimal trade-offs in the design space are found by guiding the selection process to favour ISE combinations that are likely to share resources with low speedup losses. Experimental results show that this combined approach unveils trade-offs between speedup and area that previous selection techniques do not identify; speedups of up to 238% over previous selection techniques were obtained. Finally, multi-cycle ISEs can be pipelined to increase their throughput. However, it is shown that traditional ISE identification techniques do not allow this optimization due to control-flow overhead. To obtain the benefits of overlapping loop executions, this thesis proposes to carefully insert loop control-flow statements into the ISEs, thus allowing the ISE to control the iterations of the loop. The proposed ISEs broaden the scope of instruction-level parallelism and obtain higher speedups than traditional ISEs, primarily through pipelining, the exploitation of spatial parallelism, and the reduction of control-flow and branch overhead. A detailed case study of a real application shows that the proposed method achieves 91% higher speedups than the state of the art, with an area overhead of less than 8% in the hardware implementation.
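
    As a rough illustration of the selection side of this design space, the sketch below implements a plain greedy heuristic that picks ISE candidates by estimated cycles saved per gate until an area budget is exhausted. It is not the selection algorithm of the thesis, which additionally accounts for resource sharing and uses a predictive model; the structure fields and the benefit metric are assumptions made for illustration.

        typedef struct {
            const char *name;
            double cycles_saved;   /* estimated benefit of the ISE           */
            double area_gates;     /* estimated implementation cost in gates */
            int    selected;
        } ise_candidate_t;

        /* Greedily select candidates with the best benefit/cost ratio that
         * still fit within the remaining gate-area budget. */
        void select_ises(ise_candidate_t *cand, int n, double area_budget)
        {
            double used = 0.0;
            for (;;) {
                int best = -1;
                double best_ratio = 0.0;
                for (int i = 0; i < n; i++) {
                    if (cand[i].selected || used + cand[i].area_gates > area_budget)
                        continue;
                    double ratio = cand[i].cycles_saved / cand[i].area_gates;
                    if (ratio > best_ratio) {
                        best_ratio = ratio;
                        best = i;
                    }
                }
                if (best < 0)
                    break;            /* nothing else fits the budget */
                cand[best].selected = 1;
                used += cand[best].area_gates;
            }
        }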