40 research outputs found

    VLSI design of configurable low-power coarse-grained array architecture

    Get PDF
    Biomedical signal acquisition from in- or on-body sensors often requires local (on-node) low-level pre-processing before the data are sent to a remote node for aggregation and further processing. Local processing is required for many different operations, which include signal cleanup (noise removal), sensor calibration, event detection and data compression. In this environment, processing is subject to aggressive energy consumption restrictions, while often operating under real-time requirements. These conflicting requirements impose the use of dedicated circuits addressing a very specific task or the use of domain-specific customization to obtain significant gains in power efficiency. However, economic and time-to-market constraints often make the development or use of application-specific platforms very risky.One way to address these challenges is to develop a sensor node with a general-purpose architecture combining a low-power, low-performance general microprocessor or micro-controller with a coarse-grained reconfigurable array (CGRA) acting as an accelerator. A CGRA consists of a fixed number of processing units (e.g., ALUs) whose function and interconnections are determined by some configuration data.The objective of this work is to create an RTL-level description of a low-power CGRA of ALUs and produce a low-power VLSI (standard cell) implementation, that supports power-saving features.The CGRA implementation should use as few resources as possible and fully exploit the intended operation environment. The design will be evaluated with a set of simple signal processing task

    Hardware / Software Architectural and Technological Exploration for Energy-Efficient and Reliable Biomedical Devices

    Get PDF
    Nowadays, the ubiquity of smart appliances in our everyday lives is increasingly strengthening the links between humans and machines. Beyond making our lives easier and more convenient, smart devices are now playing an important role in personalized healthcare delivery. This technological breakthrough is particularly relevant in a world where population aging and unhealthy habits have made non-communicable diseases the first leading cause of death worldwide according to international public health organizations. In this context, smart health monitoring systems termed Wireless Body Sensor Nodes (WBSNs), represent a paradigm shift in the healthcare landscape by greatly lowering the cost of long-term monitoring of chronic diseases, as well as improving patients' lifestyles. WBSNs are able to autonomously acquire biological signals and embed on-node Digital Signal Processing (DSP) capabilities to deliver clinically-accurate health diagnoses in real-time, even outside of a hospital environment. Energy efficiency and reliability are fundamental requirements for WBSNs, since they must operate for extended periods of time, while relying on compact batteries. These constraints, in turn, impose carefully designed hardware and software architectures for hosting the execution of complex biomedical applications. In this thesis, I develop and explore novel solutions at the architectural and technological level of the integrated circuit design domain, to enhance the energy efficiency and reliability of current WBSNs. Firstly, following a top-down approach driven by the characteristics of biomedical algorithms, I perform an architectural exploration of a heterogeneous and reconfigurable computing platform devoted to bio-signal analysis. By interfacing a shared Coarse-Grained Reconfigurable Array (CGRA) accelerator, this domain-specific platform can achieve higher performance and energy savings, beyond the capabilities offered by a baseline multi-processor system. More precisely, I propose three CGRA architectures, each contributing differently to the maximization of the application parallelization. The proposed Single, Multi and Interleaved-Datapath CGRA designs allow the developed platform to achieve substantial energy savings of up to 37%, when executing complex biomedical applications, with respect to a multi-core-only platform. Secondly, I investigate how the modeling of technology reliability issues in logic and memory components can be exploited to adequately adjust the frequency and supply voltage of a circuit, with the aim of optimizing its computing performance and energy efficiency. To this end, I propose a novel framework for workload-dependent Bias Temperature Instability (BTI) impact analysis on biomedical application results quality. Remarkably, the framework is able to determine the range of safe circuit operating frequencies without introducing worst-case guard bands. Experiments highlight the possibility to safely raise the frequency up to 101% above the maximum obtained with the classical static timing analysis. Finally, through the study of several well-known biomedical algorithms, I propose an approach allowing energy savings by dynamically and unequally protecting an under-powered data memory in a new way compared to regular error protection schemes. This solution relies on the Dynamic eRror compEnsation And Masking (DREAM) technique that reduces by approximately 21% the energy consumed by traditional error correction codes

    Just In Time Assembly (JITA) - A Run Time Interpretation Approach for Achieving Productivity of Creating Custom Accelerators in FPGAs

    Get PDF
    The reconfigurable computing community has yet to be successful in allowing programmers to access FPGAs through traditional software development flows. Existing barriers that prevent programmers from using FPGAs include: 1) knowledge of hardware programming models, 2) the need to work within the vendor specific CAD tools and hardware synthesis. This thesis presents a series of published papers that explore different aspects of a new approach being developed to remove the barriers and enable programmers to compile accelerators on next generation reconfigurable manycore architectures. The approach is entitled Just In Time Assembly (JITA) of hardware accelerators. The approach has been defined to allow hardware accelerators to be built and run through software compilation and run time interpretation outside of CAD tools and without requiring each new accelerator to be synthesized. The approach advocates the use of libraries of pre-synthesized components that can be referenced through symbolic links in a similar fashion to dynamically linked software libraries. Synthesis still must occur but is moved out of the application programmers software flow and into the initial coding process that occurs when programming patterns that define a Domain Specific Language (DSL) are first coded. Programmers see no difference between creating software or hardware functionality when using the DSL. A new run time interpreter is introduced to assemble the individual pre-synthesized hardware accelerators that comprise the accelerator functionality within a configurable tile array of partially reconfigurable slots at run time. Quantitative results are presented that compares utilization, performance, and productivity of the approach to what would be achieved by full custom accelerators created through traditional CAD flows using hardware programming models and passing through synthesis

    Metoda projektovanja namenskih programabilnih hardverskih akceleratora

    Get PDF
    Namenski računarski sistemi se najčesće projektuju tako da mogu da podrže izvršavanje većeg broja željenih aplikacija. Za postizanje što veće efikasnosti, preporučuje se korišćenje specijalizovanih procesora Application Specific Instruction Set Processors–ASIPs, na kojima se izvršavanje programskih instrukcija obavlja u za to projektovanim i nezavisnimhardverskim blokovima (akceleratorima). Glavni razlog za postojanje nezavisnih akceleratora jeste postizanjemaksimalnog ubrzanja izvršavanja instrukcija. Me ¯ dutim, ovakav pristup podrazumeva da je za svaki od blokova potrebno projektovati integrisano (ASIC) kolo, čime se bitno povećava ukupna površina procesora. Metod za smanjenje ukupne površine jeste primena DatapathMerging tehnike na dijagrame toka podataka ulaznih aplikacija. Kao rezultat, dobija se jedan programabilni hardverski akcelerator, sa mogućnosću izvršavanja svih željenih instrukcija. Međutim, ovo ima negativne posledice na efikasnost sistema. često se zanemaruje činjenica da, usled veoma ograničene fleksibilnosti ASIC hardverskih akceleratora, specijalizovani procesori imaju i drugih nedostataka. Naime, u slučaju izmena, ili prosto nadogradnje, specifikacije procesora u završnimfazama projektovanja, neizbežna su velika kašnjenja i dodatni troškovi promene dizajna. U ovoj tezi je pokazano da zahtevi za fleksibilnošću i efikasnošću ne moraju biti međusobno isključivi. Demonstrirano je je da je moguce uneti ograničeni nivo fleksibilnosti hardvera tokom dizajn procesa, tako da dobijeni hardverski akcelerator može da izvršava ne samo aplikacije definisane na samom početku projektovanja, već i druge aplikacije, pod uslovom da one pripadaju istom domenu. Drugim rečima, u tezi je prezentovana metoda projektovanja fleksibilnih namenskih hardverskih akceleratora. Eksperimentalnom evaluacijom pokazano je da su tako dobijeni akceleratori u većini slučajeva samo do 2 x veće površine ili 2 x većeg kašnjenja od akceleratora dobijenih primenom DatapathMerging metode, koja pritom ne pruža ni malo dodatne fleksibilnosti.Typically, embedded systems are designed to support a limited set of target applications. To efficiently execute those applications, they may employ Application Specific Instruction Set Processors (ASIPs) enriched with carefully designed Instructions Set Extension (ISEs) implemented in dedicated hardware blocks. The primary goal when designing ISEs is efficiency, i.e. the highest possible speedup, which implies synthesizing all critical computational kernels of the application dataflow graphs as an Application Specific Integrated Circuit (ASICs). Yet, this can lead to high on-chip area dedicated solely to ISEs. One existing approach to decrease this area by paying a reasonable price of decreased efficiency is to perform datapath merging on input dataflow graphs (DFGs) prior to generating the ASIC. It is often neglected that even higher costs can be accidentally incurred due to the lack of flexibility of such ISEs. Namely, if late design changes or specification upgrades happen, significant time-to-market delays and nonrecurrent costs for redesigning the ISEs and the corresponding ASIPs become inevitable. This thesis shows that flexibility and efficiency are not mutually exclusive. It demonstrates that it is possible to introduce a limited amount of hardware flexibility during the design process, such that the resulting datapath is in fact reconfigurable and thus can execute not only the applications known at design time, but also other applications belonging to the same application-domain. In other words, it proposes a methodology for designing domain-specific reconfigurable arrays out of a limited set of input applications. The experimental results show that resulting arrays are usually around 2£ larger and 2£ slower than ISEs synthesized using datapath merging, which have practically null flexibility beyond the design set of DFGs

    Équifinalité, incertitude et procédures multi-modèle en prévision hydrologique aux sites non-jaugés

    Get PDF
    L’objectif du présent projet de recherche est d’analyser diverses méthodes de prévision hydrologique sur des bassins versants non-jaugés et d'en mieux comprendre les limitations afin d’y apporter des améliorations. Ce projet est divisé en trois parties distinctes, chacune ayant contribué à au moins un article scientifique parmi les sept contenus dans cette thèse. La première partie consiste en la réduction de l’incertitude paramétrique en améliorant le calage des modèles hydrologiques. Une comparaison et une analyse de 10 méthodes d’optimisation automatiques a permis d’identifier les algorithmes les mieux adaptés aux problèmes de calage associés à ce projet. Cela a permis d’identifier des jeux de paramètres plus performants et plus robustes, ce qui s’avère être une hypothèse forte dans certaines méthodes de régionalisation. Dans la même optique, des méthodes de réduction paramétrique ont été testées et une approche de fixage de paramètres par analyse de sensibilité globale a été sélectionnée. Il a été démontré que le modèle hydrologique HSAMI pouvait, avec entre 8 et 11 paramètres fixés sur 23, conserver sa performance en validation ou en régionalisation. Cependant, malgré la réduction de l’équifinalité et l’amélioration de l’identification paramétrique, fixer des paramètres ne contribue pas à augmenter la performance des méthodes de régionalisation. Cette conclusion va à l’encontre du concept de parcimonie généralement accepté dans la communauté scientifique. La seconde partie du projet touche la modélisation multi-modèle, où il a été montré que de pondérer les hydrogrammes simulés de plusieurs modèles améliore la performance de manière significative par rapport au meilleur modèle individuel. Une comparaison des approches classiques de pondération multi-modèle ainsi que le développement d’une nouvelle méthode a permis de sélectionner le meilleur outil pour les besoins du projet. Deux essais ont été effectués en multi-modèle. Premièrement, l’approche multi-modèle a été appliquée en régionalisation avec trois modèles hydrologiques. Cependant, il a été démontré que la robustesse de deux des modèles était insuffisante pour améliorer les performances du troisième. Enfin, une approche novatrice multi-modèle dans laquelle un modèle est lancé avec plusieurs sources de données différentes a montré sa capacité à réduire les erreurs liées aux données météorologiques en simulation. La dernière partie du projet est directement liée à la régionalisation. Il y a été démontré que l’équifinalité paramétrique n’influence que très peu la qualité des prévisions aux sites nonjaugés et que la performance du modèle est plus importante que l’identifiabilité des paramètres, du moins, pour le modèle hydrologique utilisé ici. Une nouvelle approche hybride de régionalisation a également été proposée. Cette nouvelle approche a montré une meilleure performance que les approches classiques. Finalement, les approches de régionalisation ont été mises à l’épreuve dans un laboratoire numérique issu d’un modèle régional de climat. Ceci a permis de les analyser dans un environnement exempt d’erreurs de mesure sur les données météorologiques et physiographiques. La cohérence physique entre la météorologie et l’hydrologie de ce monde virtuel était également respectée. Il a été démontré que les descripteurs physiques ne sont pas suffisants pour prévoir la capacité d’une méthode de régionalisation à bien fonctionner, en plus de montrer que les incertitudes liées à ces mesures sont moins importantes que ce qui rapporté dans la littérature

    FDSOI Design using Automated Standard-Cell-Grained Body Biasing

    Get PDF
    With the introduction of FDSOI processes at competitive technology nodes, body biasing on an unprecedented scale was made possible. Body biasing influences one of the central transistor characteristics, the threshold voltage. By being able to heighten or lower threshold voltage by more than 100mV, the very physics of transistor switching can be manipulated at run time. Furthermore, as body biasing does not lead to different signal levels, it can be applied much more fine-grained than, e.g., DVFS. With the state of the art mainly focused on combinations of body biasing with DVFS, it has thus ignored granularities unfeasible for DVFS. This thesis fills this gap by proposing body bias domain partitioning techniques and for body bias domain partitionings thereby generated, algorithms that search for body bias assignments. Several different granularities ranging from entire cores to small groups of standard cells were examined using two principal approaches: Designer aided pre-partitioning based determination of body bias domains and a first-time, fully automatized, netlist based approach called domain candidate exploration. Both approaches operate along the lines of activation and timing of standard cell groups. These approaches were evaluated using the example of a Dynamically Reconfigurable Processor (DRP), a highly efficient category of reconfigurable architectures which consists of an array of processing elements and thus offers many opportunities for generalization towards many-core architectures. Finally, the proposed methods were validated by manufacturing a test-chip. Extensive simulation runs as well as the test-chip evaluation showed the validity of the proposed methods and indicated substantial improvements in energy efficiency compared to the state of the art. These improvements were accomplished by the fine-grained partitioning of the DRP design. This method allowed reducing dynamic power through supply voltage levels yielding higher clock frequencies using forward body biasing, while simultaneously reducing static power consumption in unused parts.Die Einführung von FDSOI Prozessen in gegenwärtigen Prozessgrößen ermöglichte die Nutzung von Substratvorspannung in nie zuvor dagewesenem Umfang. Substratvorspannung beeinflusst unter anderem eine zentrale Eigenschaft von Transistoren, die Schwellspannung. Mittels Substratvorspannung kann diese um mehr als 100mV erhöht oder gesenkt werden, was es ermöglicht, die schiere Physik des Schaltvorgangs zu manipulieren. Da weiterhin hiervon der Signalpegel der digitalen Signale unberührt bleibt, kann diese Technik auch in feineren Granularitäten angewendet werden, als z.B. Dynamische Spannungs- und Frequenz Anpassung (Engl. Dynamic Voltage and Frequency Scaling, Abk. DVFS). Da jedoch der Stand der Technik Substratvorspannung hauptsächlich in Kombinationen mit DVFS anwendet, werden feinere Granularitäten, welche für DVFS nicht mehr wirtschaftlich realisierbar sind, nicht berücksichtigt. Die vorliegende Arbeit schließt diese Lücke, indem sie Partitionierungsalgorithmen zur Unterteilung eines Entwurfs in Substratvorspannungsdomänen vorschlägt und für diese hierdurch unterteilten Domänen entsprechende Substratvorspannungen berechnet. Hierzu wurden verschiedene Granularitäten berücksichtigt, von ganzen Prozessorkernen bis hin zu kleinen Gruppen von Standardzellen. Diese Entwürfe wurden dann mit zwei verschiedenen Herangehensweisen unterteilt: Chipdesigner unterstützte, vorpartitionierungsbasierte Bestimmung von Substratvorspannungsdomänen, sowie ein erstmals vollautomatisierter, Netzlisten basierter Ansatz, in dieser Arbeit Domänen Kandidaten Exploration genannt. Beide Ansätze funktionieren nach dem Prinzip der Aktivierung, d.h. zu welchem Zeitpunkt welcher Teil des Entwurfs aktiv ist, sowie der Signallaufzeit durch die entsprechenden Entwurfsteile. Diese Ansätze wurden anhand des Beispiels Dynamisch Rekonfigurierbarer Prozessoren (DRP) evaluiert. DRPs stellen eine Klasse hocheffizienter rekonfigurierbarer Architekturen dar, welche hauptsächlich aus einem Feld von Rechenelementen besteht und dadurch auch zahlreiche Möglichkeiten zur Verallgemeinerung hinsichtlich Many-Core Architekturen zulässt. Schließlich wurden die vorgeschlagenen Methoden in einem Testchip validiert. Alle ermittelten Ergebnisse zeigen im Vergleich zum Stand der Technik drastische Verbesserungen der Energieeffizienz, welche durch die feingranulare Unterteilung in Substratvorspannungsdomänen erzielt wurde. Hierdurch konnten durch die Anwendung von Substratvorspannung höhere Taktfrequenzen bei gleicher Versorgungsspannung erzielt werden, während zeitgleich in zeitlich unkritischen oder ungenutzten Entwurfsteilen die statische Leistungsaufnahme minimiert wurde

    Deployment of Deep Neural Networks on Dedicated Hardware Accelerators

    Get PDF
    Deep Neural Networks (DNNs) have established themselves as powerful tools for a wide range of complex tasks, for example computer vision or natural language processing. DNNs are notoriously demanding on compute resources and as a result, dedicated hardware accelerators for all use cases are developed. Different accelerators provide solutions from hyper scaling cloud environments for the training of DNNs to inference devices in embedded systems. They implement intrinsics for complex operations directly in hardware. A common example are intrinsics for matrix multiplication. However, there exists a gap between the ecosystems of applications for deep learning practitioners and hardware accelerators. HowDNNs can efficiently utilize the specialized hardware intrinsics is still mainly defined by human hardware and software experts. Methods to automatically utilize hardware intrinsics in DNN operators are a subject of active research. Existing literature often works with transformationdriven approaches, which aim to establish a sequence of program rewrites and data-layout transformations such that the hardware intrinsic can be used to compute the operator. However, the complexity this of task has not yet been explored, especially for less frequently used operators like Capsule Routing. And not only the implementation of DNN operators with intrinsics is challenging, also their optimization on the target device is difficult. Hardware-in-the-loop tools are often used for this problem. They use latency measurements of implementations candidates to find the fastest one. However, specialized accelerators can have memory and programming limitations, so that not every arithmetically correct implementation is a valid program for the accelerator. These invalid implementations can lead to unnecessary long the optimization time. This work investigates the complexity of transformation-driven processes to automatically embed hardware intrinsics into DNN operators. It is explored with a custom, graph-based intermediate representation (IR). While operators like Fully Connected Layers can be handled with reasonable effort, increasing operator complexity or advanced data-layout transformation can lead to scaling issues. Building on these insights, this work proposes a novel method to embed hardware intrinsics into DNN operators. It is based on a dataflow analysis. The dataflow embedding method allows the exploration of how intrinsics and operators match without explicit transformations. From the results it can derive the data layout and program structure necessary to compute the operator with the intrinsic. A prototype implementation for a dedicated hardware accelerator demonstrates state-of-the art performance for a wide range of convolutions, while being agnostic to the data layout. For some operators in the benchmark, the presented method can also generate alternative implementation strategies to improve hardware utilization, resulting in a geo-mean speed-up of ×2.813 while reducing the memory footprint. Lastly, by curating the initial set of possible implementations for the hardware-in-the-loop optimization, the median timeto- solution is reduced by a factor of ×2.40. At the same time, the possibility to have prolonged searches due a bad initial set of implementations is reduced, improving the optimization’s robustness by ×2.35

    Dependable Embedded Systems

    Get PDF
    This Open Access book introduces readers to many new techniques for enhancing and optimizing reliability in embedded systems, which have emerged particularly within the last five years. This book introduces the most prominent reliability concerns from today’s points of view and roughly recapitulates the progress in the community so far. Unlike other books that focus on a single abstraction level such circuit level or system level alone, the focus of this book is to deal with the different reliability challenges across different levels starting from the physical level all the way to the system level (cross-layer approaches). The book aims at demonstrating how new hardware/software co-design solution can be proposed to ef-fectively mitigate reliability degradation such as transistor aging, processor variation, temperature effects, soft errors, etc. Provides readers with latest insights into novel, cross-layer methods and models with respect to dependability of embedded systems; Describes cross-layer approaches that can leverage reliability through techniques that are pro-actively designed with respect to techniques at other layers; Explains run-time adaptation and concepts/means of self-organization, in order to achieve error resiliency in complex, future many core systems

    Design Space Exploration and Resource Management of Multi/Many-Core Systems

    Get PDF
    The increasing demand of processing a higher number of applications and related data on computing platforms has resulted in reliance on multi-/many-core chips as they facilitate parallel processing. However, there is a desire for these platforms to be energy-efficient and reliable, and they need to perform secure computations for the interest of the whole community. This book provides perspectives on the aforementioned aspects from leading researchers in terms of state-of-the-art contributions and upcoming trends
    corecore