Search CORE

5 research outputs found

A framework for automated and optimized ASIP implementation supporting multiple hardware description languages

Author
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2005
Field of study

Crossref

VLIW プロセッサヲモチイタクミコミシステムノテイショウヒデンリョクカセッケイシュホウ

Author: Kobayashi Yuki
コバヤシユウキ
小林悠記
Publication venue
Publication date
Field of study

Osaka University Knowledge Archive

Efficient Implementation of Particle Filters in Application-Specific Instruction-Set Processor

Author: Gan Qifeng
Publication venue
Publication date: 01/06/2013
Field of study

RÉSUMÉ Cette thèse considère le problème de l’implémentation de filtres particulaires (particle filters PFs) dans des processeurs à jeu d’instructions spécialisé (Application-Specific Instruction-set Processors ASIPs). Considérant la diversité et la complexité des PFs, leur implémentation requiert une grande efficacité dans les calculs et de la flexibilité dans leur conception. La conception de ASIPs peut se faire avec un niveau intéressant de flexibilité. Notre recherche se concentre donc sur l’amélioration du débit des PFs dans un environnement de conception de ASIP. Une approche générale est tout d’abord proposée pour caractériser la complexité computationnelle des PFs. Puisque les PFs peuvent être utilisés dans une vaste gamme d’applications, nous utilisons deux types de blocs afin de distinguer les propriétés des PFs. Le premier type est spécifique à l’application et le deuxième type est spécifique à l’algorithme. Selon les résultats de profilage, nous avons identifié que les blocs du calcul de la probabilité et du rééchantillonnage sont les goulots d’étranglement principaux des blocs spécifiques à l’algorithme. Nous explorons l’optimisation de ces deux blocs aux niveaux algorithmique et architectural. Le niveau algorithmique offre un grand potentiel d’accélération et d’amélioration du débit. Notre travail débute donc à ce niveau par l’analyse de la complexité des blocs du calcul de la probabilité et du rééchantillonnage, puis continue avec leur simplification et modification. Nous avons simplifié le bloc du calcul de la probabilité en proposant un mécanisme de quantification uniforme, l’algorithme UQLE. Les résultats démontrent une amélioration significative d’une implémentation logicielle, sans perte de précision. Le pire cas de l’algorithme UQLE implémenté en logiciel à virgule fixe avec 32 niveaux de quantification atteint une accélération moyenne de 23.7× par rapport à l’implémentation logicielle de l’algorithme ELE. Nous proposons aussi deux nouveaux algorithmes de rééchantillonnage pour remplacer l’algorithme séquentiel de rééchantillonnage systématique (SR) dans les PFs. Ce sont l’algorithme SR reformulé et l’algorithme SR parallèle (PSR). L’algorithme SR reformulé combine un groupe de boucles en une boucle unique afin de faciliter sa parallélisation dans un ASIP. L’algorithme PSR rend les itérations indépendantes, permettant ainsi à l’algorithme de rééchantillonnage de s’exécuter en parallèle. De plus, l’algorithme PSR a une complexité computationnelle plus faible que l’algorithme SR. Du point de vue architectural, les ASIPs offrent un grand potentiel pour l’implémentation de PFs parce qu’ils présentent un bon équilibre entre l’efficacité computationnelle et la flexibilité de conception. Ils permettent des améliorations considérables en débit par l’inclusion d’instructions spécialisées, tout en conservant la facilité relative de programmation de processeurs à usage général. Après avoir identifié les goulots d’étranglement de PFs dans les blocs spécifiques à l’algorithme, nous avons généré des instructions spécialisées pour les algorithmes UQLE, SR reformulé et PSR. Le débit a été significativement amélioré par rapport à une implémentation purement logicielle tournant sur un processeur à usage général. L’implémentation de l’algorithme UQLE avec instruction spécialisée avec 32 intervalles atteint une accélération de 34× par rapport au pire cas de son implémentation logicielle, avec 3.75 K portes logiques additionnelles. Nous avons produit une implémentation de l’algorithme SR reformulé, avec quatre poids calculés en parallèle et huit catégories définies par des bornes uniformément distribuées qui sont comparées simultanément. Elle atteint une accélération de 23.9× par rapport à l’algorithme SR séquentiel dans un processeur à usage général. Le surcoût est limité à 54 K portes logiques additionnelles. Pour l’algorithme PSR, nous avons conçu quatre instructions spécialisées configurées pour supporter quatre poids entrés en parallèle. Elles mènent à une accélération de 53.4× par rapport à une implémentation de l’algorithme SR en virgule flottante sur un processeur à usage général, avec un surcoût de 47.3 K portes logiques additionnelles. Finalement, nous avons considéré une application du suivi vidéo et implémenté dans un ASIP un algorithme de FP basé sur un histogramme. Nous avons identifié le calcul de l’histogramme comme étant le goulot principal des blocs spécifiques à l’application. Nous avons donc proposé une architecture de calcul d’histogramme à réseau parallèle (PAHA) pour ASIPs. Les résultats d’implémentation démontrent qu’un PAHA à 16 voies atteint une accélération de 43.75× par rapport à une implémentation logicielle sur un processeur à usage général.----------ABSTRACT This thesis considers the problem of the implementation of particle filters (PFs) in Application-Specific Instruction-set Processors (ASIPs). Due to the diversity and complexity of PFs, implementing them requires both computational efficiency and design flexibility. ASIP design can offer an interesting degree of design flexibility. Hence, our research focuses on improving the throughput of PFs in this flexible ASIP design environment. A general approach is first proposed to characterize the computational complexity of PFs. Since PFs can be used for a wide variety of applications, we employ two types of blocks, which are application-specific and algorithm-specific, to distinguish the properties of PFs. In accordance with profiling results, we identify likelihood processing and resampling processing blocks as the main bottlenecks in the algorithm-specific blocks. We explore the optimization of these two blocks at the algorithmic and architectural levels. The algorithmic level is at a high level and therefore has a high potential to offer speed and throughput improvements. Hence, in this work we begin at the algorithm level by analyzing the complexity of the likelihood processing and resampling processing blocks, then proceed with their simplification and modification. We simplify the likelihood processing block by proposing a uniform quantization scheme, the Uniform Quantization Likelihood Evaluation (UQLE). The results show a significant improvement in performance without losing accuracy. The worst case of UQLE software implementation in fixed-point arithmetic with 32 quantized intervals achieves 23.7× average speedup over the software implementation of ELE. We also propose two novel resampling algorithms instead of the sequential Systematic Resampling (SR) algorithm in PFs. They are the reformulated SR and Parallel Systematic Resampling (PSR) algorithms. The reformulated SR algorithm combines a group of loops into a parallel loop to facilitate parallel implementation in an ASIP. The PSR algorithm makes the iterations independent, thus allowing the resampling algorithms to perform loop iterations in parallel. In addition, our proposed PSR algorithm has lower computational complexity than the SR algorithm. At the architecture level, ASIPs are appealing for the implementation of PFs because they strike a good balance between computational efficiency and design flexibility. They can provide considerable throughput improvement by the inclusion of custom instructions, while retaining the ease of programming of general-purpose processors. Hence, after identifying the bottlenecks of PFs in the algorithm-specific blocks, we describe customized instructions for the UQLE, reformulated SR, and PSR algorithms in an ASIP. These instructions provide significantly higher throughput when compared to a pure software implementation running on a general-purpose processor. The custom instruction implementation of UQLE with 32 intervals achieves 34× speedup over the worst case of its software implementation with 3.75 K additional gates. An implementation of the reformulated SR algorithm is evaluated with four weights calculated in parallel and eight categories defined by uniformly distributed numbers that are compared simultaneously. It achieves a 23.9× speedup over the sequential SR algorithm in a general-purpose processor. This comes at a cost of only 54 K additional gates. For the PSR algorithm, four custom instructions, when configured to support four weights input in parallel, lead to a 53.4× speedup over the floating-point SR implementation on a general-purpose processor at a cost of 47.3 K additional gates. Finally, we consider the specific application of video tracking, and an implementation of a histogram-based PF in an ASIP. We identify that the histogram calculation is the main bottleneck in the application-specific blocks. We therefore propose a Parallel Array Histogram Architecture (PAHA) engine for accelerating the histogram calculation in ASIPs. Implementation results show that a 16-way PAHA can achieve a speedup of 43.75× when compared to its software implementation in a general-purpose processor

PolyPublie

Cost-Efficient Soft-Error Resiliency for ASIP-based Embedded Systems

Author: Li Tuo
Publication venue: UNSW, Sydney
Publication date: 01/01/2013
Field of study

Recent decades have witnessed the rapid growth of embedded systems. At present, embedded systems are widely applied in a broad range of critical applications including automotive electronics, telecommunication, healthcare, industrial electronics, consumer electronics military and aerospace. Human society will continue to be greatly transformed by the pervasive deployment of embedded systems. Consequently, substantial amount of efforts from both industry and academic communities have contributed to the research and development of embedded systems. Application-specific instruction-set processor (ASIP) is one of the key advances in embedded processor technology, and a crucial component in some embedded systems. Soft errors have been directly observed since the 1970s. As devices scale, the exponential increase in the integration of computing systems occurs, which leads to correspondingly decrease in the reliability of computing systems. Today, major research forums state that soft errors are one of the major design technology challenges at and beyond the 22 nm technology node. Therefore, a large number of soft-error solutions, including error detection and recovery, have been proposed from differing perspectives. Nonetheless, most of the existing solutions are designed for general or high-performance systems which are different to embedded systems. For embedded systems, the soft-error solutions must be cost-efficient, which requires the tailoring of the processor architecture with respect to the feature of the target application. This thesis embodies a series of explorations for cost-efficient soft-error solutions for ASIP-based embedded systems. In this exploration, five major solutions are proposed. The first proposed solution realizes checkpoint recovery in ASIPs. By generating customized instructions, ASIP-implemented checkpoint recovery can perform at a finer granularity than what was previously possible. The fault-free performance overhead of this solution is only 1.45% on average. The recovery delay is only 62 cycles at the worst case. The area and leakage power overheads are 44.4% and 45.6% on average. The second solution explores utilizing two primitive error recovery techniques jointly. This solution includes three application-specific optimization methodologies. This solution generates the optimized error-resilient ASIPs, based on the characteristics of primitive error recovery techniques, static reliability analysis and design constraints. The resultant ASIP can be configured to perform at runtime according to the optimized recovery scheme. This solution can strategically enhance cost-efficiency for error recovery. In order to guarantee cost-efficiency in unpredictable runtime situations, the third solution explores runtime adaptation for error recovery. This solution aims to budget and adapt the error recovery operations, so as to spend the resources intelligently and to tolerate adverse influences of runtime variations. The resultant ASIP can make runtime decisions to determine the activation of spatial and temporal redundancies, according to the runtime situations. At the best case, this solution can achieve almost 50x reliability gain over the state of the art solutions. Given the increasing demand for multi-core computing systems, the last two proposed solutions target error recovery in multi-core ASIPs. The first solution of these two explores ASIP-implemented fine-grained process migration. This solution is a key infrastructure, which allows cost-efficient task management, for realizing cost-efficient soft-error recovery in multi-core ASIPs. The average time cost is only 289 machine cycles to perform process migration. The last solution explores using dynamic and adaptive mapping to assign heterogeneous recovery operations to the tasks in the multi-core context. This solution allows each individual ASIP-based processing core to dynamically adapt its specific error recovery functionality according to the corresponding task's characteristics, in terms of soft error vulnerability and execution time deadline. This solution can significantly improve the reliability of the system by almost two times, with graceful constraint penalty, in comparison to the state-of-the-art counterparts

UNSWorks