38 research outputs found

    Real-Time neural signal decoding on heterogeneous MPSocs based on VLIW ASIPs

    Get PDF
    An important research problem, at the basis of the development of embedded systems for neuroprosthetic applications, is the development of algorithms and platforms able to extract the patient's motion intention by decoding the information encoded in neural signals. At the state of the art, no portable and reliable integrated solutions implementing such a decoding task have been identified. To this aim, in this paper, we investigate the possibility of using the MPSoC paradigm in this application domain. We perform a design space exploration that compares different custom MPSoC embedded architectures, implementing two versions of a on-line neural signal decoding algorithm, respectively targeting decoding of single and multiple acquisition channels. Each considered design points features a different application configuration, with a specific partitioning and mapping of parallel software tasks, executed on customized VLIW ASIP processing cores. Experimental results, obtained by means of FPGA-based prototyping and post-floorplanning power evaluation on a 40nm technology library, assess the performance and hardware-related costs of the considered configurations. The reported power figures demonstrate the usability of the MPSoC paradigm within the processing of bio-electrical signals and show the benefits achievable by the exploitation of the instruction-level parallelism within tasks

    KAVUAKA: a low-power application-specific processor architecture for digital hearing aids

    Get PDF
    The power consumption of digital hearing aids is very restricted due to their small physical size and the available hardware resources for signal processing are limited. However, there is a demand for more processing performance to make future hearing aids more useful and smarter. Future hearing aids should be able to detect, localize, and recognize target speakers in complex acoustic environments to further improve the speech intelligibility of the individual hearing aid user. Computationally intensive algorithms are required for this task. To maintain acceptable battery life, the hearing aid processing architecture must be highly optimized for extremely low-power consumption and high processing performance.The integration of application-specific instruction-set processors (ASIPs) into hearing aids enables a wide range of architectural customizations to meet the stringent power consumption and performance requirements. In this thesis, the application-specific hearing aid processor KAVUAKA is presented, which is customized and optimized with state-of-the-art hearing aid algorithms such as speaker localization, noise reduction, beamforming algorithms, and speech recognition. Specialized and application-specific instructions are designed and added to the baseline instruction set architecture (ISA). Among the major contributions are a multiply-accumulate (MAC) unit for real- and complex-valued numbers, architectures for power reduction during register accesses, co-processors and a low-latency audio interface. With the proposed MAC architecture, the KAVUAKA processor requires 16 % less cycles for the computation of a 128-point fast Fourier transform (FFT) compared to related programmable digital signal processors. The power consumption during register file accesses is decreased by 6 %to 17 % with isolation and by-pass techniques. The hardware-induced audio latency is 34 %lower compared to related audio interfaces for frame size of 64 samples.The final hearing aid system-on-chip (SoC) with four KAVUAKA processor cores and ten co-processors is integrated as an application-specific integrated circuit (ASIC) using a 40 nm low-power technology. The die size is 3.6 mm2. Each of the processors and co-processors contains individual customizations and hardware features with a varying datapath width between 24-bit to 64-bit. The core area of the 64-bit processor configuration is 0.134 mm2. The processors are organized in two clusters that share memory, an audio interface, co-processors and serial interfaces. The average power consumption at a clock speed of 10 MHz is 2.4 mW for SoC and 0.6 mW for the 64-bit processor.Case studies with four reference hearing aid algorithms are used to present and evaluate the proposed hardware architectures and optimizations. The program code for each processor and co-processor is generated and optimized with evolutionary algorithms for operation merging,instruction scheduling and register allocation. The KAVUAKA processor architecture is com-pared to related processor architectures in terms of processing performance, average power consumption, and silicon area requirements

    A Probabilistic Approach for the System-Level Design of Multi-ASIP Platforms

    Get PDF

    An automated OpenCL FPGA compilation framework targeting a configurable, VLIW chip multiprocessor

    Get PDF
    Modern system-on-chips augment their baseline CPU with coprocessors and accelerators to increase overall computational capacity and power efficiency, and thus have evolved into heterogeneous systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This thesis discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a customised VLIW chip multiprocessor (CMP) architecture, known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on the LE1 CPU. The framework fully automates the compilation flow and supports work-item coalescing to better utilise the CPU cores and alleviate the effects of thread divergence. This thesis discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework on a highly precise cycle-accurate simulator. This is achieved through the execution of 12 benchmarks across 240 different machine configurations, as well as further results utilising an incomplete development branch of the compiler. It is shown that the problems generally scale well with the LE1 architecture, up to eight cores, when the memory system becomes a serious bottleneck. Results demonstrate superlinear performance on certain benchmarks (x9 for the bitonic sort benchmark with 8 dual-issue cores) with further improvements from compiler optimisations (x14 for bitonic with the same configuration

    Low power architectures for streaming applications

    Get PDF

    Automatic Design of Efficient Application-centric Architectures.

    Full text link
    As the market for embedded devices continues to grow, the demand for high performance, low cost, and low power computation grows as well. Many embedded applications perform computationally intensive tasks such as processing streaming video or audio, wireless communication, or speech recognition and must be implemented within tight power budgets. Typically, general purpose processors are not able to meet these performance and power requirements. Custom hardware in the form of loop accelerators are often used to execute the compute-intensive portions of these applications because they can achieve significantly higher levels of performance and power efficiency. Automated hardware synthesis from high level specifications is a key technology used in designing these accelerators, because the resulting hardware is correct by construction, easing verification and greatly decreasing time-to-market in the quickly evolving embedded domain. In this dissertation, a compiler-directed approach is used to design a loop accelerator from a C specification and a throughput requirement. The compiler analyzes the loop and generates a virtual architecture containing sufficient resources to sustain the required throughput. Next, a software pipelining scheduler maps the operations in the loop to the virtual architecture. Finally, the accelerator datapath is derived from the resulting schedule. In this dissertation, synthesis of different types of loop accelerators is investigated. First, the system for synthesizing single loop accelerators is detailed. In particular, a scheduler is presented that is aware of the effects of its decisions on the resulting hardware, and attempts to minimize hardware cost. Second, synthesis of multifunction loop accelerators, or accelerators capable of executing multiple loops, is presented. Such accelerators exploit coarse-grained hardware sharing across loops in order to reduce overall cost. Finally, synthesis of post-programmable accelerators is presented, allowing changes to be made to the software after an accelerator has been created. The tradeoffs between the flexibility, cost, and energy efficiency of these different types of accelerators are investigated. Automatically synthesized loop accelerators are capable of achieving order-of-magnitude gains in performance, area efficiency, and power efficiency over processors, and programmable accelerators allow software changes while maintaining highly efficient levels of computation.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/61644/1/fank_1.pd

    Design of application-specific instruction set processors with asynchronous methodology for embedded digital signal processing applications.

    Get PDF
    Kwok Yan-lun Andy.Thesis submitted in: November 2004.Thesis (M.Phil.)--Chinese University of Hong Kong, 2005.Includes bibliographical references (leaves 133-137).Abstracts in English and Chinese.Abstract --- p.i摘芁 --- p.iiAcknowledgements --- p.iiiList of Figures --- p.viiList of Tables and Examples --- p.xChapter 1. --- Introduction --- p.1Chapter 1.1. --- Motivation --- p.1Chapter 1.2. --- Objective and Approach --- p.4Chapter 1.3. --- Thesis Organization --- p.5Chapter 2. --- Related Work --- p.7Chapter 2.1. --- Coverage --- p.7Chapter 2.2. --- ASIP Design Methodologies --- p.8Chapter 2.3. --- Asynchronous Technology on Processors --- p.12Chapter 2.4. --- Summary --- p.14Chapter 3. --- Asynchronous Design Methodology --- p.15Chapter 3.1. --- Overview --- p.15Chapter 3.2. --- Asynchronous Design Style --- p.17Chapter 3.2.1. --- Micropipelines --- p.17Chapter 3.2.2. --- Fine-grain Pipelining --- p.20Chapter 3.2.3. --- Globally-Asynchronous Locally-Synchronous (GALS) Design --- p.22Chapter 3.3. --- Advantages of GALS in ASIP Design --- p.27Chapter 3.3.1. --- Reuse of Synchronous and Asynchronous IP --- p.27Chapter 3.3.2. --- Fine Tuning of Performance and Power Consumption --- p.27Chapter 3.3.3. --- Synthesis-based Design Flow --- p.28Chapter 3.4. --- Design of GALS Asynchronous Wrapper --- p.28Chapter 3.4.1. --- Handshake Protocol --- p.28Chapter 3.4.2. --- Pausible Clock Generator --- p.29Chapter 3.4.3. --- Port Controllers --- p.30Chapter 3.4.4. --- Performance of the Asynchronous Wrapper --- p.33Chapter 3.5. --- Summary --- p.35Chapter 4. --- Platform Based ASIP Design Methodology --- p.36Chapter 4.1. --- Platform Based Approach --- p.36Chapter 4.1.1. --- The Definition of Our Platform --- p.37Chapter 4.1.2. --- The Definition of the Platform Based Design --- p.37Chapter 4.2. --- Platform Architecture --- p.38Chapter 4.2.1. --- The Nature of DSP Algorithms --- p.38Chapter 4.2.2. --- Design Space of Datapath Optimization --- p.46Chapter 4.2.3. --- Proposed Architecture --- p.49Chapter 4.2.4. --- The Strategy of Realizing an Optimized Datapath --- p.51Chapter 4.2.5. --- Pipeline Organization --- p.59Chapter 4.2.6. --- GALS Partitioning --- p.61Chapter 4.2.7. --- Operation Mechanism --- p.63Chapter 4.3. --- Overall Design Flow --- p.67Chapter 4.4. --- Summary --- p.70Chapter 5. --- Design of the ASIP Platform --- p.72Chapter 5.1. --- Design Goal --- p.72Chapter 5.2. --- Instruction Fetch --- p.74Chapter 5.2.1. --- Instruction fetch unit --- p.74Chapter 5.2.2. --- Zero-overhead loops and Subroutines --- p.75Chapter 5.3. --- Instruction Decode --- p.77Chapter 5.3.1. --- Instruction decoder --- p.77Chapter 5.3.2. --- The Encoding of Parallel and Complex Instructions --- p.80Chapter 5.4. --- Datapath --- p.81Chapter 5.4.1. --- Base Functional Units --- p.81Chapter 5.4.2. --- Functional Unit Wrapper Interface --- p.83Chapter 5.5. --- Register File Systems --- p.84Chapter 5.5.1. --- Memory Hierarchy --- p.84Chapter 5.5.2. --- Register File Organization --- p.85Chapter 5.5.3. --- Address Generation --- p.93Chapter 5.5.4. --- Load and Store --- p.98Chapter 5.6. --- Design Verification --- p.100Chapter 5.7. --- Summary --- p.104Chapter 6. --- Case Studies --- p.105Chapter 6.1. --- Objective --- p.105Chapter 6.2. --- Approach --- p.105Chapter 6.3. --- Based versus Optimized --- p.106Chapter 6.3.1. --- Matrix Manipulation --- p.106Chapter 6.3.2. --- Autocorrelation --- p.109Chapter 6.3.3. --- CORDIC --- p.110Chapter 6.4. --- Optimized versus Advanced Commercial DSPs --- p.113Chapter 6.4.1. --- Introduction to TMS320C62x and SC140 --- p.113Chapter 6.4.2. --- Results --- p.115Chapter 6.5. --- Summary --- p.116Chapter 7. --- Conclusion --- p.118Chapter 7.1. --- When ASIPs encounter asynchronous --- p.118Chapter 7.2. --- Contributions --- p.120Chapter 7.3. --- Future Directions --- p.121Chapter A --- Synthesis of Extended Burst-Mode Asynchronous Finite State Machine --- p.122Chapter B --- Base Instruction Set --- p.124Chapter C --- Special Registers --- p.127Chapter D --- Synthesizable Model of GALS Wrapper --- p.130Reference --- p.13

    Floating Point Arithmetic for Transport Triggered Architectures

    Get PDF
    LaskentajÀrjestelmiin kohdistuu usein suorituskyky- ja virrankulutusvaatimuksia, joita ei pystytÀ saavuttamaan yleiskÀyttöisellÀ prosessorilla. Toistaalta laitteistokiihdyttimien suunnittelu voi vaatia kohtuuttoman paljon työaikaa. Ongelmaa voidaan lÀhestyÀ kÀyttÀmÀllÀ sovellusta varten rÀÀtÀlöityÀ sovelluskohtaista kÀskykantaprosessoria (Application-Specific Instruction set Processor, ASIP), joka on kuitenkin ohjelmoitava. Prosessorin rÀÀtÀlöinnin tÀytyy olla pitkÀlle automatisoitua sÀÀstÀÀkseen kustannuksia. TTA-based Codesign Environment (TCE) on siirtoliipaistuun prosessoriarkkitehtuuriin (Transport Triggered Architecture, TTA) perustuva ASIP-kehitysympÀristö. TTA on arkkitehtuurina helposti rÀÀtÀlöitÀvÀ ja joustaa pienistÀ ytimistÀ suuritehoisiin pitkÀn kÀskysanan suorittimiin. Useat tieteellisen laskennan ja signaalinkÀsittelyn sovellukset, joissa TTA:n skaalautuvuudesta ja kÀskytason rinnakkaisuudesta olisi erityistÀ hyötyÀ, vaativat tuen laitteistokiihdytetylle liukulukulaskennalle. TÀssÀ diplomityössÀ suunniteltiin ja toteutettiin TCE-projektia varten sarja liukulukuyksiköitÀ. Yksiköiden suunnittelussa pyrittiin alustariippumattomuuteen sekÀ korkeaan suorituskykyyn Field Programmable Gate Array alustoilla (FPGA) jopa tinkimÀllÀ tuetusta liukulukustandardista. Yksiköt sisÀltÀvÀt työkalut puolen tarkkuuden liukulukulaskentaan. LisÀksi työssÀ esitetÀÀn erikoiskÀskyihin perustuvat nopeat algoritmit liukulukujakolaskun ja -neliöjuuren laskentaan. Yksiköiden toiminta varmistettiin automaattisella rekisterisiirtotason (Register Transfer Level, RTL) testipenkillÀ. Vertailussa Altera Stratix-II-FPGA:lla yksiköt pÀÀsivÀt lÀhelle Alteran omien liukulukuyksiköiden suorituskykyÀ. Uudemmalla Xilinx Virtex-6-FPGA:lla korkein mahdollinen suorituskyky vaatisi tiheÀmpÀÀ liukuhihnoitusta

    Accuracy-Guaranteed Fixed-Point Optimization in Hardware Synthesis and Processor Customization

    Get PDF
    RÉSUMÉ De nos jours, le calcul avec des nombres fractionnaires est essentiel dans une vaste gamme d’applications de traitement de signal et d’image. Pour le calcul numĂ©rique, un nombre fractionnaire peut ĂȘtre reprĂ©sentĂ© Ă  l’aide de l’arithmĂ©tique en virgule fixe ou en virgule flottante. L’arithmĂ©tique en virgule fixe est largement considĂ©rĂ©e prĂ©fĂ©rable Ă  celle en virgule flottante pour les architectures matĂ©rielles dĂ©diĂ©es en raison de sa plus faible complexitĂ© d’implĂ©mentation. Dans la mise en Ɠuvre du matĂ©riel, la largeur de mot attribuĂ©e Ă  diffĂ©rents signaux a un impact significatif sur des mĂ©triques telles que les ressources (transistors), la vitesse et la consommation d'Ă©nergie. L'optimisation de longueur de mot (WLO) en virgule fixe est un domaine de recherche bien connu qui vise Ă  optimiser les chemins de donnĂ©es par l'ajustement des longueurs de mots attribuĂ©es aux signaux. Un nombre en virgule fixe est composĂ© d’une partie entiĂšre et d’une partie fractionnaire. Il y a une limite infĂ©rieure au nombre de bits allouĂ©s Ă  la partie entiĂšre, de façon Ă  prĂ©venir les dĂ©bordements pour chaque signal. Cette limite dĂ©pend de la gamme de valeurs que peut prendre le signal. Le nombre de bits de la partie fractionnaire, quant Ă  lui, dĂ©termine la taille de l'erreur de prĂ©cision finie qui est introduite dans les calculs. Il existe un compromis entre la prĂ©cision et l'efficacitĂ© du matĂ©riel dans la sĂ©lection du nombre de bits de la partie fractionnaire. Le processus d'attribution du nombre de bits de la partie fractionnaire comporte deux procĂ©dures importantes: la modĂ©lisation de l'erreur de quantification et la sĂ©lection de la taille de la partie fractionnaire. Les travaux existants sur la WLO ont portĂ© sur des circuits spĂ©cialisĂ©s comme plate-forme cible. Dans cette thĂšse, nous introduisons de nouvelles mĂ©thodologies, techniques et algorithmes pour amĂ©liorer l’implĂ©mentation de calculs en virgule fixe dans des circuits et processeurs spĂ©cialisĂ©s. La thĂšse propose une approche amĂ©liorĂ©e de modĂ©lisation d’erreur, basĂ©e sur l'arithmĂ©tique affine, qui aborde certains problĂšmes des mĂ©thodes existantes et amĂ©liore leur prĂ©cision. La thĂšse introduit Ă©galement une technique d'accĂ©lĂ©ration et deux algorithmes semi-analytiques pour la sĂ©lection de la largeur de la partie fractionnaire pour la conception de circuits spĂ©cialisĂ©s. Alors que le premier algorithme suit une stratĂ©gie de recherche progressive, le second utilise une mĂ©thode de recherche en forme d'arbre pour l'optimisation de la largeur fractionnaire. Les algorithmes offrent deux options de compromis entre la complexitĂ© de calcul et le coĂ»t rĂ©sultant. Le premier algorithme a une complexitĂ© polynomiale et obtient des rĂ©sultats comparables avec des approches heuristiques existantes. Le second algorithme a une complexitĂ© exponentielle, mais il donne des rĂ©sultats quasi-optimaux par rapport Ă  une recherche exhaustive. Cette thĂšse propose Ă©galement une mĂ©thode pour combiner l'optimisation de la longueur des mots dans un contexte de conception de processeurs configurables. La largeur et la profondeur des blocs de registres et l'architecture des unitĂ©s fonctionnelles sont les principaux objectifs ciblĂ©s par cette optimisation. Un nouvel algorithme d'optimisation a Ă©tĂ© dĂ©veloppĂ© pour trouver la meilleure combinaison de longueurs de mots et d'autres paramĂštres configurables dans la mĂ©thode proposĂ©e. Les exigences de prĂ©cision, dĂ©finies comme l'erreur pire cas, doivent ĂȘtre respectĂ©es par toute solution. Pour faciliter l'Ă©valuation et la mise en Ɠuvre des solutions retenues, un nouvel environnement de conception de processeur a Ă©galement Ă©tĂ© dĂ©veloppĂ©. Cet environnement, qui est appelĂ© PolyCuSP, supporte une large gamme de paramĂštres, y compris ceux qui sont nĂ©cessaires pour Ă©valuer les solutions proposĂ©es par l'algorithme d'optimisation. L’environnement PolyCuSP soutient l’exploration rapide de l'espace de solution et la capacitĂ© de modĂ©liser diffĂ©rents jeux d'instructions pour permettre des comparaisons efficaces.----------ABSTRACT Fixed-point arithmetic is broadly preferred to floating-point in hardware development due to the reduced hardware complexity of fixed-point circuits. In hardware implementation, the bitwidth allocated to the data elements has significant impact on efficiency metrics for the circuits including area usage, speed and power consumption. Fixed-point word-length optimization (WLO) is a well-known research area. It aims to optimize fixed-point computational circuits through the adjustment of the allocated bitwidths of their internal and output signals. A fixed-point number is composed of an integer part and a fractional part. There is a minimum number of bits for the integer part that guarantees overflow and underflow avoidance in each signal. This value depends on the range of values that the signal may take. The fractional word-length determines the amount of finite-precision error that is introduced in the computations. There is a trade-off between accuracy and hardware cost in fractional word-length selection. The process of allocating the fractional word-length requires two important procedures: finite-precision error modeling and fractional word-length selection. Existing works on WLO have focused on hardwired circuits as the target implementation platform. In this thesis, we introduce new methodologies, techniques and algorithms to improve the hardware realization of fixed-point computations in hardwired circuits and customizable processors. The thesis proposes an enhanced error modeling approach based on affine arithmetic that addresses some shortcomings of the existing methods and improves their accuracy. The thesis also introduces an acceleration technique and two semi-analytical fractional bitwidth selection algorithms for WLO in hardwired circuit design. While the first algorithm follows a progressive search strategy, the second one uses a tree-shaped search method for fractional width optimization. The algorithms offer two different time-complexity/cost efficiency trade-off options. The first algorithm has polynomial complexity and achieves comparable results with existing heuristic approaches. The second algorithm has exponential complexity but achieves near-optimal results compared to an exhaustive search. The thesis further proposes a method to combine word-length optimization with application-specific processor customization. The supported datatype word-length, the size of register-files and the architecture of the functional units are the main target objectives to be optimized. A new optimization algorithm is developed to find the best combination of word-length and other customizable parameters in the proposed method. Accuracy requirements, defined as the worst-case error bound, are the key consideration that must be met by any solution. To facilitate evaluation and implementation of the selected solutions, a new processor design environment was developed. This environment, which is called PolyCuSP, supports necessary customization flexibility to realize and evaluate the solutions given by the optimization algorithm. PolyCuSP supports rapid design space exploration and capability to model different instruction-set architectures to enable effective compari
    corecore