8 research outputs found
Algorithms for Improving the Automatically Synthesized Instruction Set of an Extensible Processor
Processors with extensible instruction sets are widely used today as
programmable hardware accelerators for various domains. When extending RISC-V
and other similar extensible processor architectures, the task of designing
specialized instructions arises; it can be solved automatically by
instruction synthesis algorithms. In this paper, we consider algorithms that
can be used in addition to the known approaches and that improve the
synthesized instruction sets in two ways: by recomputing common operations of
a program (operations whose result is consumed by multiple other operations)
inside the clustered synthesized instructions (the common operations
clustering algorithm), and by identifying redundant synthesized instructions,
i.e., instructions that have equivalents among the others (the subsuming
functions algorithm).
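The two improvements can be sketched in miniature. Assuming operations are modelled as a map from each operation to the set of operations consuming its result, and each synthesized instruction as a set (cluster) of operations — a toy representation chosen for illustration, not the paper's actual data structures — a hedged sketch might look like:

```python
def recompute_common_ops(ops, clusters):
    """Common operations clustering (toy version): an operation whose
    result is consumed by operations in more than one cluster is copied
    into every consuming cluster, so each synthesized instruction can
    recompute it locally instead of reading an intermediate result."""
    new_clusters = {cid: set(members) for cid, members in clusters.items()}
    for op, consumers in ops.items():
        consuming = {cid for cid, members in clusters.items()
                     if consumers & members}
        if len(consuming) > 1:  # a common operation: recompute per cluster
            for cid in consuming:
                new_clusters[cid].add(op)
    return new_clusters

def is_redundant(instr, others):
    """Subsuming functions (toy version): an instruction is redundant if
    some other instruction's operation set covers everything it does."""
    return any(instr <= other for other in others)
```

For example, with a common operation 't' consumed by operations in two different clusters, 't' is duplicated into both clusters; and an instruction covering {'add', 'xor'} would be flagged redundant next to one covering {'add', 'xor', 'shl'}.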
Experimental evaluations of the developed algorithms are presented for tests
from the domains of cryptography and three-dimensional graphics. For the
Magma cipher test, the common operations clustering algorithm reduces the
size of the compiled code by 9%, and the subsuming functions algorithm
reduces the size of the synthesized instruction set extension by a factor of
2. For the AES cipher test, the common operations clustering algorithm
reduces the size of the compiled code by 10%, and the subsuming functions
algorithm reduces the size of the synthesized instruction set extension by a
factor of 2.5. Finally, for the instruction set extension from the Volume
Ray-Casting test, the additional use of the subsuming functions algorithm
reduces the problem-specific instruction set extension from 5 to only 2
instructions without losing functionality.
ASAM : Automatic Architecture Synthesis and Application Mapping; dl. 3.2: Instruction set synthesis
No abstract
An Optimal Formulation for Handling SLD in Impairment Aware WDM Optical Networks
The effect of physical layer impairments on routing and wavelength assignment in Wavelength Division Multiplexed (WDM) optical networks has become an important research area. When the quality of an optical signal degrades to an unacceptable level, a regenerator must be used to recover it. Most research has focused on reducing the number of regenerators when handling static and ad-hoc lightpath demands in such networks. In networks handling scheduled lightpath demands (SLDs), each request for communication has a known start time and duration. Handling SLDs in impairment-aware networks has not yet been investigated in depth. We propose an optimal formulation for SLDs that uses a minimum number of regenerators, and we compare it with another recently proposed formulation.
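The abstract does not reproduce the formulation itself, but the constraint any such formulation must enforce can be illustrated with a deliberately simple greedy heuristic: walk a lightpath's fiber spans and regenerate whenever the accumulated distance since the last regeneration point would exceed the transparent reach. The reach model and all names below are assumptions for illustration; the paper's contribution is an optimal formulation, not this heuristic.

```python
def place_regenerators(span_lengths, reach):
    """Greedy sketch: traverse the lightpath's spans (lengths in km) and
    place a regenerator at a node whenever continuing would exceed the
    transparent reach. Returns the indices of the interior nodes where
    regenerators are placed (node i sits before span i)."""
    regens, acc = [], 0.0
    for i, span in enumerate(span_lengths):
        if acc + span > reach:
            regens.append(i)  # regenerate at the node before this span
            acc = 0.0
        acc += span
    return regens
```

With three 400 km spans and a 1000 km reach, one regenerator suffices, placed before the final span; an optimal formulation would guarantee such minimum counts across all demands and time windows simultaneously.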
Efficient Implementation of Particle Filters in Application-Specific Instruction-Set Processor
This thesis considers the problem of the implementation of particle filters (PFs) in Application-Specific Instruction-set Processors (ASIPs). Due to the diversity and complexity of PFs, implementing them requires both computational efficiency and design flexibility. ASIP design can offer an interesting degree of design flexibility. Hence, our research focuses on improving the throughput of PFs in this flexible ASIP design environment.
A general approach is first proposed to characterize the computational complexity of PFs. Since PFs can be used for a wide variety of applications, we employ two types of blocks, which are application-specific and algorithm-specific, to distinguish the properties of PFs. In accordance with profiling results, we identify likelihood processing and resampling processing blocks as the main bottlenecks in the algorithm-specific blocks. We explore the optimization of these two blocks at the algorithmic and architectural levels.
The algorithmic level offers the greatest potential for speed and throughput improvements. Hence, this work begins at the algorithmic level by analyzing the complexity of the likelihood processing and resampling processing blocks, then proceeds with their simplification and modification. We simplify the likelihood processing block by proposing a uniform quantization scheme, the Uniform Quantization Likelihood Evaluation (UQLE). The results show a significant performance improvement without loss of accuracy. The worst case of the UQLE software implementation in fixed-point arithmetic with 32 quantized intervals achieves a 23.7× average speedup over the software implementation of ELE. We also propose two novel resampling algorithms to replace the sequential Systematic Resampling (SR) algorithm in PFs: the reformulated SR algorithm and the Parallel Systematic Resampling (PSR) algorithm. The reformulated SR algorithm combines a group of loops into a single loop to facilitate parallel implementation in an ASIP. The PSR algorithm makes the iterations independent, thus allowing the resampling loop iterations to execute in parallel. In addition, the proposed PSR algorithm has lower computational complexity than the SR algorithm.
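For reference, the sequential baseline that the reformulated SR and PSR algorithms improve on is textbook systematic resampling; a minimal sketch is below (the thesis's parallel variants are not reproduced here, and unnormalized weights are accepted for convenience):

```python
import random

def systematic_resample(weights, u0=None):
    """Sequential Systematic Resampling (SR): draw one offset u0 in
    [0, 1/N), place N evenly spaced pointers over the cumulative weight
    distribution, and return the particle index selected by each pointer.
    The inner while loop is what makes the textbook algorithm sequential."""
    n = len(weights)
    total = sum(weights)
    if u0 is None:
        u0 = random.random() / n  # a single random draw for all pointers
    positions = [(u0 + i / n) * total for i in range(n)]
    out, j, cum = [], 0, weights[0]
    for p in positions:
        while cum < p:  # advance to the weight interval containing p
            j += 1
            cum += weights[j]
        out.append(j)
    return out
```

Fixing u0 makes the output deterministic, which is convenient for testing; heavily weighted particles receive proportionally more copies.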
At the architecture level, ASIPs are appealing for the implementation of PFs because they strike a good balance between computational efficiency and design flexibility. They can provide considerable throughput improvement by the inclusion of custom instructions, while retaining the ease of programming of general-purpose processors. Hence, after identifying the bottlenecks of PFs in the algorithm-specific blocks, we describe customized instructions for the UQLE, reformulated SR, and PSR algorithms in an ASIP. These instructions provide significantly higher throughput when compared to a pure software implementation running on a general-purpose processor. The custom instruction implementation of UQLE with 32 intervals achieves 34Ă speedup over the worst case of its software implementation with 3.75 K additional gates. An implementation of the reformulated SR algorithm is evaluated with four weights calculated in parallel and eight categories defined by uniformly distributed numbers that are compared simultaneously. It achieves a 23.9Ă speedup over the sequential SR algorithm in a general-purpose processor. This comes at a cost of only 54 K additional gates. For the PSR algorithm, four custom instructions, when configured to support four weights input in parallel, lead to a 53.4Ă speedup over the floating-point SR implementation on a general-purpose processor at a cost of 47.3 K additional gates.
Finally, we consider the specific application of video tracking and implement a histogram-based PF in an ASIP. We identify the histogram calculation as the main bottleneck in the application-specific blocks. We therefore propose a Parallel Array Histogram Architecture (PAHA) engine for accelerating the histogram calculation in ASIPs. Implementation results show that a 16-way PAHA achieves a 43.75× speedup over a software implementation on a general-purpose processor.
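PAHA itself is a hardware engine, but its decomposition can be mimicked in software: split the input among independent lanes, build one partial histogram per lane (the work a 16-way PAHA would perform concurrently), then merge. Everything below is an illustrative analogy, not the thesis's architecture:

```python
def parallel_histogram(values, nbins, lanes=16):
    """Multi-lane histogram sketch: each lane histograms a strided slice
    of the input independently (the concurrent part), then the partial
    histograms are merged bin by bin (the reduction part)."""
    partials = []
    for lane in range(lanes):
        chunk = values[lane::lanes]  # this lane's share of the input
        h = [0] * nbins
        for v in chunk:
            h[v] += 1
        partials.append(h)
    # reduction step: sum the partial histograms bin by bin
    return [sum(h[b] for h in partials) for b in range(nbins)]
```

The result is identical to a single sequential pass; the decomposition only matters because, in hardware, the per-lane loops run simultaneously.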
Techniques d'exploration architecturale de design à usage spécifique pour l'accélération de boucles
Time to market is a very important concern in industry, which is why industry constantly looks for new CAD tools that help reduce design time. Application-specific instruction-set processors (ASIPs) provide flexibility and can reach good performance when well designed. One trend that is gaining more and more success is C-based design, which specifies the system behavior in a high-level language such as C or SystemC. The C-based specification is then used during the partitioning phase to determine the software and hardware components of the system. Since automatic processor generators are now mature, designers have a new type of tool they can rely on during architecture design. In the hardware world, high-level synthesis was and still is an active research topic, and advances in ESL have led to commercial high-level synthesis tools such as Catapult C and Forte's Cynthesizer. Designers thus have more tools in their toolbox, but also far more design options to explore, which can have the reverse effect of increasing rather than reducing design time. Our doctoral research tackles this issue by proposing new methodologies for design space exploration of application-specific architectures for loop acceleration, in order to reduce design time while reaching targeted performance.
Our thesis starts with the exploration of ASIP design. We propose a method that targets loop acceleration with tightly coupled specialized instructions executing loop operations. Loops are good candidates for acceleration when the parallelism they offer is well exploited. Hardware components such as specialized instructions can leverage parallelization opportunities at a low level. Thus, we propose to extract loop parallelization opportunities and to execute more concurrent operations in specialized instructions. The main contribution of this method is a new approach to specialized instruction (SI) design for loop acceleration, in which loop optimization and transformation are done directly in the SIs instead of in the software code. Another contribution is the design of tightly coupled specialized instructions associated with loops, based on a 5-pattern representation.
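As a toy illustration of the idea (not the thesis's 5-pattern representation or its generated hardware), fusing several dependent loop-body operations into one callable mimics, in software, what a specialized instruction does in a single ASIP operation. The operation mix below is hypothetical:

```python
def fused_mac_shift(a, b, acc):
    """One hypothetical specialized instruction: multiply, accumulate,
    then shift right - three scalar operations collapsed into one."""
    return (acc + a * b) >> 1

def loop_with_si(xs, ys):
    """The loop rewritten to issue one fused 'instruction' per iteration
    instead of three separate scalar operations."""
    acc = 0
    for a, b in zip(xs, ys):
        acc = fused_mac_shift(a, b, acc)
    return acc
```

In an actual ASIP flow, the fused operation would become a custom datapath executing in one or a few cycles, which is where the speedup comes from.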
Generation of Graph Classes with Efficient Isomorph Rejection
In this thesis, efficient isomorph-free generation of graph classes with the
method of generation by canonical construction path (GCCP) is discussed. GCCP
was invented by McKay in the 1980s; it is a general method to recursively
generate combinatorial objects while avoiding isomorphic copies. In the
introductory chapter, GCCP is discussed and compared to other well-known
generation methods, using the generation of the class of quartic graphs,
simple regular graphs of degree four, as a running example. The programs we
developed based on GCCP generate quartic graphs with 18 vertices more than
twice as efficiently as the well-known software GENREG.
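GCCP's machinery is not reproduced here, but the goal it achieves — exactly one representative per isomorphism class — can be shown with a naive sketch that canonicalizes every candidate graph by brute force over vertex permutations. This is far less efficient than canonical construction paths, which reject non-canonical extensions during the recursion, but it makes the target of isomorph rejection concrete:

```python
from itertools import combinations, permutations

def canonical_form(n, edges):
    """Naive canonical form: the lexicographically smallest relabelled
    edge set over all n! vertex permutations. Two graphs are isomorphic
    iff their canonical forms are equal."""
    best = None
    for p in permutations(range(n)):
        relabel = tuple(sorted(tuple(sorted((p[u], p[v])))
                               for u, v in edges))
        if best is None or relabel < best:
            best = relabel
    return best

def count_graphs(n):
    """Count simple graphs on n vertices up to isomorphism by keeping
    one canonical form per isomorphism class."""
    all_edges = list(combinations(range(n), 2))
    seen = set()
    for k in range(len(all_edges) + 1):
        for subset in combinations(all_edges, k):
            seen.add(canonical_form(n, subset))
    return len(seen)
```

count_graphs(4) returns 11, the number of simple graphs on four vertices up to isomorphism; the brute force is only feasible for tiny n, which is precisely why methods like GCCP matter.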
This thesis also demonstrates how the class of principal graph pairs can be
generated exhaustively and efficiently using GCCP. The definition and
importance of principal graph pairs come from the theory of subfactors, where
each subfactor can be modelled as a principal graph pair. The theory of
subfactors has applications in the theory of von Neumann algebras, operator
algebras, quantum algebras and knot theory, as well as in the design of
quantum computers. While it was initially expected that the classification at
index 3 + √5 would be very complicated, using GCCP to exhaustively generate
principal graph pairs was critical in completing the classification of
small-index subfactors up to index 5¼.
The other set of graph classes considered in this thesis contains graphs
without a given set of cycles. For a given set of graphs H, the Turán number
of H, ex(n,H), is defined to be the maximum number of edges in a graph on n
vertices with no subgraph isomorphic to any graph in H. Denote by EX(n,H) the
set of all extremal graphs with respect to n and H, i.e., graphs with n
vertices, ex(n,H) edges, and no subgraph isomorphic to any graph in H. We
consider this problem when H is a set of cycles. New results for ex(n,C) and
EX(n,C) are obtained using a set of algorithms based on GCCP. Let K be an
arbitrary subset of {C3, C4, C5, ..., C32}. For given n and a set of cycles
C, these algorithms calculate ex(n,C) and the extremal graphs in EX(n,C) by
recursively extending smaller graphs without any cycle in C, where C = K or
C = {C3, C5, C7, ...} ∪ K and n ≤ 64. These results considerably extend the
previous results of the many researchers who have worked on similar problems.

In the last chapter, a new class of canonical relabellings for graphs, hierarchical
canonical labelling, is introduced, in which, if the vertices of a graph G
are canonically labelled by {1, ..., n}, then G\{n} is also canonically
labelled. An efficient hierarchical canonical labelling is presented, and its
application to the generation of combinatorial objects is discussed.