Architectural Exploration of KeyRing Self-Timed Processors
SUMMARY
Over the last decades, gains in processor performance have been constrained by the limits
imposed by the energy consumption of electronic systems: from the very low consumption
required by connected objects, to the electricity budgets of servers, by way of the thermal
limits and battery life of mobile devices. This strong demand for energy-efficient processors,
coupled with the limitations of transistor scaling, which no longer improves performance at
constant power density, leads integrated-circuit designers to explore new microarchitectures
that deliver better performance for a given energy budget. This thesis is part of this trend
and proposes a new processor microarchitecture, called KeyRing, designed with the intent of
reducing processor energy consumption.
The dynamic energy consumption of transistors in integrated circuits is proportional to their
operating frequency. Consequently, design techniques that dynamically reduce the number of
switching transistors are very widely adopted to improve the energy efficiency of processors.
Clock-gating is particularly common in synchronous circuits because it reduces the impact of
the global clock, which is the main source of activity. The KeyRing microarchitecture presented
in this thesis uses a decentralized, asynchronous synchronization method to reduce circuit
activity. It is derived from the AnARM, a processor developed by Octasic on the basis of an
ad hoc asynchronous microarchitecture. Although it is more energy-efficient than synchronous
alternatives, the AnARM is essentially incompatible with standard synthesis and static timing
analysis methods. Moreover, its ad hoc design technique fits only partially within asynchronous
design paradigms. This thesis proposes a rigorous approach to defining the general principles
of this ad hoc design technique, leveraging the asynchronous literature. The resulting KeyRing
microarchitecture is developed together with an automated design method that overcomes the
native incompatibilities between design tools and asynchronous systems. The proposed method
makes it possible to take full advantage of the standard design flows of the microelectronics
industry for the synthesis and verification of KeyRing circuits. This thesis also proposes
experimental protocols whose purpose is to strengthen the causal relation between the KeyRing
microarchitecture and a reduction in processor energy consumption, compared with equivalent
synchronous alternatives.
ABSTRACT
Over recent years, microprocessors have had to increase their performance while keeping
their power envelope within tight bounds, as dictated by the needs of various markets: from
the ultra-low power requirements of the IoT, to the electrical power consumption budget
in enterprise servers, by way of passive cooling and day-long battery life in mobile devices.
This high demand for power-efficient processors, coupled with the limitations of technology
scaling, which no longer provides improved performance at constant power density, is
leading designers to explore new microarchitectures with the goal of extracting more performance
out of a fixed power budget. This work follows this trend by proposing a new
processor microarchitecture, called KeyRing, with a low-power design intent.
The switching activity of integrated circuits (i.e. transistors switching on and off) directly
affects their dynamic power consumption. Circuit-level design techniques such as clock-gating
are widely adopted as they dramatically reduce the impact of the global clock in synchronous
circuits, which constitutes the main source of switching activity. The KeyRing microarchitecture
presented in this work uses an asynchronous clocking scheme that relies on decentralized
synchronization mechanisms to reduce the switching activity of circuits. It is derived from
the AnARM, a power-efficient ARM processor developed by Octasic using an ad hoc asynchronous
microarchitecture. Although the AnARM delivers better power efficiency than synchronous
alternatives, it is for the most part incompatible with standard timing-driven synthesis and
Static Timing Analysis (STA). In addition, its design style does not fit well within the existing
asynchronous design paradigms. This work lays the foundations for a more rigorous
definition of this rather unorthodox design style, using circuits and methods coming from the
asynchronous literature. The resulting KeyRing microarchitecture is developed in combination
with Electronic Design Automation (EDA) methods that alleviate incompatibility issues
related to ad hoc clocking, enabling timing-driven optimizations and verifications of KeyRing
circuits using industry-standard design flows. In addition to bridging the gap with standard
design practices, this work also proposes comprehensive experimental protocols that aim to
strengthen the causal relation between the reported asynchronous microarchitecture and a
reduced power consumption compared with synchronous alternatives.
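The link between switching activity and dynamic power stated above can be illustrated with the textbook CMOS estimate P = alpha * C * Vdd^2 * f. The following is a minimal sketch, not taken from the thesis: the function name and every numeric value (capacitance, supply voltage, frequency, activity factors) are hypothetical and chosen only to show how a lower activity factor, as obtained with clock-gating or decentralized synchronization, scales dynamic power.

```python
def dynamic_power(activity: float, capacitance_f: float, vdd_v: float, freq_hz: float) -> float:
    """Textbook CMOS dynamic power estimate: P = alpha * C * Vdd^2 * f (watts)."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


# Hypothetical figures, for illustration only.
C_TOTAL = 1e-9   # total switched capacitance in farads
VDD = 0.9        # supply voltage in volts
FREQ = 500e6     # operating frequency in hertz

# Same circuit, two assumed activity factors: ungated vs. reduced activity.
always_on = dynamic_power(0.20, C_TOTAL, VDD, FREQ)
gated = dynamic_power(0.05, C_TOTAL, VDD, FREQ)
saving = 1.0 - gated / always_on

print(f"ungated: {always_on:.5f} W, gated: {gated:.5f} W, saving: {saving:.0%}")
```

Because power is linear in the activity factor in this model, cutting activity from 0.20 to 0.05 cuts the estimated dynamic power by the same proportion; static (leakage) power is not modeled here.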
The main achievement of this work is a framework that enables the architectural exploration
of circuits using the KeyRing microarchitecture.
Development of Advanced Closed-Loop Brain Electrophysiology Systems for Freely Behaving Rodents
[ES] Extracellular electrophysiology is a technique widely used in neuroscience research, which studies how the brain works by measuring the electric fields generated by neuronal activity. This is done through electrodes implanted in the brain and connected to electronic devices that amplify and digitize the signals. Of the many animal models used in experimentation, rats and mice are among the most commonly used species.
Currently, electrophysiological experimentation seeks increasingly complex conditions, which are limited by the technology of acquisition devices. Two aspects are of particular interest: closed-loop feedback and behavior under natural conditions. This thesis presents developments aimed at improving different facets of these two problems.
Closed-loop feedback refers to all techniques in which stimuli are produced in response to an event generated by the animal. Latency must match the timescales under study. Modern acquisition systems have latencies on the order of 10 ms. However, responding to fast events, such as action potentials, requires latencies below 1 ms. In addition, the algorithms that detect events or generate stimuli can be complex, integrating several data inputs in real time. Integrating the development of such algorithms into acquisition tools is part of experimental design.
To study natural behaviors, animals must be able to move freely in environments that emulate natural conditions. Experiments of this kind are hindered by the wired nature of acquisition systems. Other physical restrictions, such as implant weight or power consumption limits, can also limit the duration of experiments. Experimentation can be enriched when electrophysiological data are complemented by multiple other sources, for example animal tracking or microscopy. Tools able to integrate data regardless of their origin open the door to new possibilities.
The technological advances presented address these limitations. Devices have been designed with closed-loop latencies below 200 µs that allow hundreds of electrophysiological channels to be combined with other data sources, such as video or tracking. The control software for these devices was designed with flexibility as a goal. Open interfaces and standards were developed to encourage the development of mutually compatible tools.
Two different methods were followed to solve the wiring problems. One was the development of lightweight headstages combined with ultra-thin coaxial cables and active commutators that rely on animal tracking. This development reduces the strain imposed on the animals, allowing large spaces and long-duration experiments while also allowing the use of headstages with advanced features.
In parallel, a different type of headstage was developed, using wireless technology. A specialized digital compression algorithm was created that can reduce the bandwidth to less than 65% of its original size, saving energy. This reduction allows lighter batteries and longer operating times. The algorithm was designed so that it can be implemented on a wide variety of devices.
The developments presented open the door to new experimental possibilities for neuroscience, combining electrophysiological acquisition with behavioral studies under natural conditions and complex real-time stimuli.
[CA] Extracellular electrophysiology is a technique widely used in neuroscience research, which makes it possible to study how the brain works by measuring the electric fields generated by neuronal activity. This is done through electrodes implanted in the brain, connected to electronic devices for amplifying and digitizing the signals. Of the many animal models used in electrophysiological experimentation, rats and mice are among the most used species.
Currently, electrophysiological experimentation seeks increasingly complex conditions, which are limited by the technology of acquisition devices. Two aspects are of special interest: feedback in closed-loop systems and behavior under natural conditions. This thesis presents developments aimed at improving different aspects of these two problems.
Feedback in closed-loop systems refers to all techniques in which stimuli are produced in response to an event generated by the animal. Latency must match the timescales under study. Modern acquisition systems have latencies on the order of 10 ms. However, responding to fast events, such as action potentials, requires latencies below 1 ms. Furthermore, the algorithms that detect events or generate stimuli can be complex, integrating several data inputs in real time. Integrating the development of these algorithms into acquisition tools is part of the design of experiments.
To study natural behaviors, animals must be able to move freely in environments that emulate natural conditions. These experiments are limited by the wired nature of acquisition systems. Other physical restrictions, such as implant weight or energy consumption, can also limit the duration of experiments. Experimentation can be enriched when electrophysiological data are complemented with data from multiple sources, for example animal tracking or microscopy. Tools able to integrate data regardless of their origin open the door to new possibilities.
The technological advances presented address these limitations. Devices have been designed with closed-loop latencies below 200 µs that allow hundreds of electrophysiological channels to be combined with other data sources, such as video or tracking. The control software for these devices was designed with flexibility as a goal. Open interfaces and standards were developed to encourage the development of mutually compatible tools.
Two different methods were followed to solve the wiring problems. One was the development of lightweight headstages combined with ultra-thin coaxial cables and active commutators that rely on animal tracking. This development reduces the strain imposed on the animals to a minimum, allowing large spaces and long-duration experiments while also allowing the use of headstages with advanced features.
In parallel, a different type of headstage was developed, using wireless technology. A specialized digital compression algorithm was created that can reduce the bandwidth to less than 65% of its original size, saving energy. This reduction allows lighter batteries and longer operating times. The algorithm was designed so that it can be implemented on a wide variety of devices.
The developments presented open the door to new experimental possibilities for neuroscience, combining electrophysiological acquisition with behavioral studies under natural conditions and complex real-time stimuli.
[EN] Extracellular electrophysiology is a technique widely used in neuroscience research. It can offer insights into how the brain works by measuring the electrical fields generated by neural activity. This is done through electrodes implanted in the brain and connected to amplification and digitization circuitry. Of the many animal models used in electrophysiology experiments, rodents such as rats and mice are among the most popular species.
Modern electrophysiology experiments seek increasingly complex conditions that are limited by acquisition hardware technology. Two aspects are of special interest: closed-loop feedback and naturalistic behavior. In this thesis, we present developments that aim to improve different facets of these two problems.
Closed-loop feedback encompasses all techniques in which stimuli are produced in response to an event generated by the animal. Latency, the time between the trigger event and stimulus generation, must match the biological timescale being studied. While modern acquisition systems feature latencies on the order of 10ms, responding to fast events such as high-frequency electrical transients created by neuronal activity requires latencies under 1ms. In addition, algorithms for triggering or generating closed-loop stimuli can be complex, integrating multiple inputs in real time. Integration of algorithm development into acquisition tools becomes an important part of experiment design.
For electrophysiology experiments featuring naturalistic behavior, animals must be able to move freely in ecologically meaningful environments that mimic natural conditions. Experiments featuring elements such as large arenas, environmental objects or the presence of other animals are, however, hindered by the wired nature of acquisition systems. Other physical constraints, such as implant weight or power restrictions, can also limit experiment duration. Beyond the technical limits, complex experiments are enriched when electrophysiology data is integrated with multiple sources, for example animal tracking or brain microscopy. Tools that allow data to be mixed independently of its source open new experimental possibilities.
The technological advances presented in this thesis address these topics. We have designed devices with closed-loop latencies under 200 µs while featuring high-bandwidth interfaces. These allow the simultaneous acquisition of hundreds of electrophysiological channels combined with other heterogeneous data sources, such as video or tracking. The control software for these devices was designed with flexibility in mind, allowing easy implementation of closed-loop algorithms. Open interface standards were created to encourage the development of interoperable tools for experimental data integration.
To solve wiring issues in behavioral experiments, we followed two different approaches. One was the design of light headstages, coupled with ultra-thin coaxial cables and active commutator technology, making use of animal tracking. This reduced animal strain to a minimum, allowing large arenas and prolonged experiments with advanced headstages.
A different, wireless headstage was also developed. We created a digital compression algorithm specialized for neural electrophysiological signals that can reduce data bandwidth to less than 65.5% of its original size without introducing distortions. Bandwidth has a large effect on power requirements. Thus, this reduction allows for lighter batteries and extended operational time. The algorithm is designed to be implementable on a wide variety of devices, requiring few hardware resources and adding negligible power requirements to a system.
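The abstract does not describe how the compression algorithm works, so the following is only a generic illustration of distortion-free (lossless) compression of sampled signals, not the thesis's method. It assumes integer samples, encodes each sample as the difference from the previous one, maps signed differences to unsigned values (zigzag), and stores them as variable-length bytes. All function names and the sample values are hypothetical.

```python
def zigzag(n: int) -> int:
    """Map a signed integer to an unsigned one: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z: int) -> int:
    """Inverse of zigzag()."""
    return z // 2 if z % 2 == 0 else -(z + 1) // 2


def encode(samples):
    """Delta-encode integer samples, then pack each delta in 7-bit groups."""
    out = bytearray()
    prev = 0
    for s in samples:
        z = zigzag(s - prev)
        prev = s
        while z >= 0x80:                 # more than 7 bits left: set continuation bit
            out.append((z & 0x7F) | 0x80)
            z >>= 7
        out.append(z)
    return bytes(out)


def decode(data):
    """Exact inverse of encode(): rebuild the original samples."""
    samples, prev, z, shift = [], 0, 0, 0
    for b in data:
        z |= (b & 0x7F) << shift
        if b & 0x80:                     # continuation bit: value spans another byte
            shift += 7
        else:
            prev += unzigzag(z)
            samples.append(prev)
            z, shift = 0, 0
    return samples


# Hypothetical 16-bit samples: slowly varying signal with one fast transient.
samples = [100, 102, 101, 105, 300, 298]
packed = encode(samples)
assert decode(packed) == samples         # lossless round trip
print(len(packed), "bytes instead of", 2 * len(samples))
```

Small sample-to-sample differences fit in a single byte, so slowly varying stretches shrink while fast transients cost more; the achievable ratio depends entirely on the signal and is not meant to reproduce the figure reported above.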
Combined, the developments we present open new possibilities for neuroscience experiments combining electrophysiology acquisition with natural behaviors and complex, real-time stimuli.
The research described in this thesis was carried out at the Polytechnic University of Valencia
(Universitat Politècnica de València), Valencia, Spain in an extremely close collaboration with the
Neuroscience Institute - Spanish National Research Council - Miguel Hernández University (Instituto
de Neurociencias - Consejo Superior de Investigaciones Científicas - Universidad Miguel Hernández),
San Juan de Alicante, Spain. The projects described in chapters 3 and 4 were developed in collaboration
with, and funded by, Open Ephys, Cambridge, MA, USA and OEPS - Eléctronica e produção,
unipessoal lda, Algés, Portugal.
Cuevas López, A. (2021). Development of Advanced Closed-Loop Brain Electrophysiology Systems for Freely Behaving Rodents [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/179718
Proceedings of the 5th International Workshop on Reconfigurable Communication-centric Systems on Chip 2010 - ReCoSoC'10 - May 17-19, 2010 Karlsruhe, Germany. (KIT Scientific Reports ; 7551)
ReCoSoC is intended to be a periodic annual meeting to expose and discuss gathered expertise as well as state of the art research around SoC related topics through plenary invited papers and posters. The workshop aims to provide a prospective view of tomorrow's challenges in the multibillion transistor era, taking into account the emerging techniques and architectures exploring the synergy between flexible on-chip communication and system reconfigurability.
Dynamic reconfiguration frameworks for high-performance reliable real-time reconfigurable computing
The sheer hardware-based computational performance and programming flexibility
offered by reconfigurable hardware like Field-Programmable Gate Arrays (FPGAs)
make them attractive for computing in applications that require high performance,
availability, reliability, real-time processing, and high efficiency. Fueled by fabrication
process scaling, modern reconfigurable devices come with ever greater quantities of
on-chip resources, allowing a more complex variety of applications to be developed.
Thus, the trend is that technology giants like Microsoft, Amazon, and Baidu now
embrace reconfigurable computing devices like FPGAs to meet their critical
computing needs. In addition, the capability to autonomously reprogramme these
devices in the field is being exploited for reliability in application domains like
aerospace, defence, military, and nuclear power stations. In such applications, real-time
computing is important and is often a necessity for reliability. As such, applications and
algorithms resident on these devices must be implemented with sufficient
considerations for real-time processing and reliability.
Often, to manage a reconfigurable hardware device as a computing platform for a
multiplicity of homogeneous and heterogeneous tasks, reconfigurable operating systems
(ROSes) have been proposed to give a software look to hardware-based computation.
The key requirements of a ROS include partitioning, task scheduling and allocation,
task configuration or loading, and inter-task communication and synchronization.
Existing ROSes have met these requirements to varied extents. However, they are
limited in reliability, especially regarding the flexibility of placing the hardware circuits
of tasks on the device’s chip area, a problem arising mainly from the partitioning
approaches used. Indeed, this problem is deeply rooted in the static nature of the on-chip
inter-communication among tasks, hampering the flexibility of runtime task
relocation for reliability.
This thesis proposes the enabling frameworks for reliable, available, real-time,
efficient, secure, and high-performance reconfigurable computing by providing
techniques and mechanisms for reliable runtime reconfiguration, and dynamic inter-circuit communication and synchronization for circuits on reconfigurable hardware.
This work provides task configuration infrastructures for reliable reconfigurable
computing. Key features, especially reliability-enabling functionalities, which have
been given little or no attention in the state of the art, are implemented. These features
include internal register read and write for device diagnosis, a configuration operation
abort mechanism, and tightly integrated selective-area scanning, which aims to
optimize access to the device’s reconfiguration port for both task loading and error
mitigation.
In addition, this thesis proposes a novel reliability-aware inter-task communication
framework that exploits the availability of dedicated clocking infrastructures in a
typical FPGA to provide inter-task communication and synchronization. The clock
buffers and networks of an FPGA use dedicated routing resources, which are distinct
from the general routing resources. As such, deploying these dedicated resources for
communication sidesteps the restriction of static routes and allows a better relocation
of circuits for reliability purposes.
For evaluation, a case study that uses a NASA/JPL spectrometer data processing
application is employed to demonstrate the improved reliability brought about by the
implemented configuration controller and the reliability-aware dynamic
communication infrastructure. It is observed that up to 74% time saving can be achieved
for selective-area error mitigation when compared to state-of-the-art vendor
implementations. Moreover, an improvement in overall system reliability is observed
when the proposed dynamic communication scheme is deployed in the data processing
application.
Finally, one area of reconfigurable computing that has received insufficient
attention is security. Meanwhile, considering the nature of applications which now turn
to reconfigurable computing for accelerating compute-intensive processes, a high
premium is now placed on security, not only of the device but also of the applications,
from loading to runtime execution. To address security concerns, a novel secure and
efficient task configuration technique for task relocation is also investigated, providing
configuration time savings of up to 32% or 83%, depending on the device, and resource
usage savings in excess of 90% compared to the state of the art.
A time-predictable many-core processor design for critical real-time embedded systems
Critical Real-Time Embedded Systems (CRTES) are in charge of controlling fundamental parts of embedded systems, e.g. energy harvesting solar panels in satellites, steering and braking in cars, or flight management systems in airplanes. To do so, CRTES require strong evidence of correct functional and timing behavior. The former guarantees that the system operates correctly in response to its inputs; the latter ensures that its operations are performed within a predefined time budget.
CRTES aim at increasing the number and complexity of functions. Examples include the incorporation of "smarter" Advanced Driver Assistance System (ADAS) functionality in modern cars or advanced collision avoidance systems in Unmanned Aerial Vehicles (UAVs). All these new features, implemented in software, lead to an exponential growth in both performance requirements and software development complexity. Furthermore, there is a strong need to integrate multiple functions into the same computing platform to reduce the number of processing units, mass and space requirements, etc. Overall, there is a clear need to increase the computing power of current CRTES in order to support new sophisticated and complex functionality, and integrate multiple systems into a single platform.
The use of multi- and many-core processor architectures is increasingly seen in the CRTES industry as the solution to cope with the performance demand and cost constraints of future CRTES. Many-cores supply higher performance by exploiting the parallelism of applications while providing better performance per watt, as cores are kept simpler than complex single-core processors. Moreover, the parallelization capabilities allow scheduling multiple functions on the same processor, maximizing hardware utilization.
However, the use of multi- and many-cores in CRTES also brings a number of challenges related to providing evidence of the correct operation of the system, especially in the timing domain. Hence, despite the advantages of many-cores and the fact that they are nowadays a reality in the embedded domain (e.g. Kalray MPPA, Freescale NXP P4080, TI Keystone II), their use in CRTES still requires finding efficient ways of providing reliable evidence of the correct operation of the system.
This thesis investigates the use of many-core processors in CRTES as a means to satisfy performance demands of future complex applications while providing the necessary timing guarantees. To do so, this thesis advances the state of the art in exploiting the parallel capabilities of many-cores in CRTES, contributing in two different computing domains. In the hardware domain, this thesis proposes new many-core designs that enable deriving reliable and tight timing guarantees. In the software domain, we present efficient scheduling and timing analysis techniques to exploit the parallelization capabilities of many-core architectures and to derive tight and trustworthy Worst-Case Execution Time (WCET) estimates of CRTES.
Critical Real-Time Embedded Systems (CRTES) are in charge of controlling fundamental parts of embedded systems, e.g. harvesting energy from solar panels in satellites, steering and braking in cars, or flight control in airplanes. To do so, CRTES require strong evidence of correct functional and timing behavior. The former guarantees that the system operates correctly in response to its inputs; the latter ensures that its operations are performed within previously established time limits. CRTES aim to increase the number and complexity of their functions. Examples include intelligent driver assistance systems in modern cars or advanced collision avoidance systems in unmanned aerial vehicles. All these new features, implemented in software, lead to exponential growth in both performance requirements and software development complexity.
In addition, there is a strong need to integrate multiple functions into a single platform in order to reduce the number of processing units, meet weight and space requirements, etc. Overall, there is a clear need to increase the computing power of current CRTES to support new sophisticated and complex functionality and to integrate multiple systems into a single platform. The use of multi- and many-core architectures is increasingly seen in the CRTES industry as the solution to cope with the demand for higher performance and the cost constraints of future CRTES. Many-core architectures provide higher performance by exploiting application parallelism while providing better performance per watt, since the cores are kept simpler than complex single-core processors. Moreover, the parallelization capabilities allow multiple functions to be scheduled on the same processor, maximizing hardware utilization. However, the use of multi- and many-cores in CRTES also brings certain challenges related to providing evidence of the correct operation of the system, especially in the timing domain. Hence, despite the advantages of many-core processors and the fact that they are a reality in embedded systems (for example Kalray MPPA, Freescale NXP P4080, TI Keystone II), their use in CRTES still requires finding efficient methods of providing reliable evidence of the correct operation of the system. This thesis investigates the use of many-core processors in CRTES as a means to satisfy the performance requirements of complex applications while providing the necessary timing guarantees. To do so, this thesis advances the state of the art toward the exploitation of many-cores in CRTES in two computing domains.
In the hardware domain, this thesis proposes new many-core designs that enable reliable and tight timing guarantees. In the software domain, the thesis presents efficient techniques for task scheduling and timing analysis to exploit the parallelization capabilities of many-core architectures, and also to derive reliable and tight Worst-Case Execution Time (WCET) estimates.
Scratchpad Memory Management For Multicore Real-Time Embedded Systems
Multicore systems will continue to spread in the domain of real-time embedded systems due to the increasing need for high-performance applications. This research discusses some of the challenges associated with employing multicore systems for safety-critical real-time applications. Mainly, this work is concerned with providing: 1) efficient inter-core timing isolation for independent tasks, and 2) predictable task communication for communicating tasks.
Principally, we introduce a new task execution model, based on the 3-phase execution model, that exploits the Direct Memory Access (DMA) controllers available in modern embedded platforms along with ScratchPad Memories (SPMs) to enforce strong timing isolation between tasks. The DMA and the SPMs are explicitly managed to pre-load tasks from main memory into the local (private) scratchpad memories. Tasks are then executed from the local SPMs without accessing main memory. This model allows CPU execution to be overlapped with DMA loading/unloading operations from and to main memory. We show that by co-scheduling task execution on CPUs and using DMA to access memory and I/O, we can efficiently hide access latency to physical resources. In turn, this leads to significant improvements in system schedulability, compared to both the case of unregulated contention for access to physical resources and to previous cache and SPM management techniques for real-time systems.
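The benefit of overlapping CPU execution with DMA transfers can be sketched with a small timeline model. This is a deliberately simplified illustration and not the scheduling analysis of the thesis: it assumes one CPU, one DMA engine, a scratchpad split into two partitions, tasks run in a fixed order, and each task described by hypothetical (load, execute, unload) durations. The function names are invented for this sketch.

```python
def sequential_makespan(tasks):
    """No overlap: every task loads, executes, and unloads back to back."""
    return sum(load + exe + unload for load, exe, unload in tasks)


def pipelined_makespan(tasks):
    """Overlap model: while task i executes from one SPM partition, the DMA
    unloads task i-1 and pre-loads task i+1 into the other partition."""
    if not tasks:
        return 0
    n = len(tasks)
    dma_free = tasks[0][0]        # DMA is busy until the first load completes
    load_done = dma_free          # time at which the next task to run is in the SPM
    cpu_free = 0
    prev_exec_end, prev_unload = None, 0
    for i, (load, exe, unload) in enumerate(tasks):
        exec_start = max(cpu_free, load_done)
        exec_end = exec_start + exe
        if prev_exec_end is not None:
            # The DMA unloads the previous task once that task has finished.
            dma_free = max(dma_free, prev_exec_end) + prev_unload
        if i + 1 < n:
            # Then it pre-loads the next task into the freed partition.
            dma_free += tasks[i + 1][0]
            load_done = dma_free
        cpu_free = exec_end
        prev_exec_end, prev_unload = exec_end, unload
    # The last task still has to be unloaded after it finishes.
    return max(dma_free, cpu_free) + tasks[-1][2]


# Hypothetical durations in arbitrary time units: (load, execute, unload).
tasks = [(2, 5, 1), (2, 5, 1), (2, 5, 1)]
print(sequential_makespan(tasks), pipelined_makespan(tasks))
```

In this example the memory phases of the middle task are completely hidden behind execution, so only the first load and the last unload remain on the critical path. When transfers take longer than execution, the DMA becomes the bottleneck and the hiding is only partial.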
The presented SPM-centric scheduling algorithms and analyses cover single-core, partitioned, and global real-time systems. The proposed scheme is also extended to support large tasks that do not fit entirely into the local SPM. Moreover, the schedulability analysis considers the case of recovering from transient soft errors (bit flips caused by a single event upset) in several levels of memory that cannot be automatically corrected in hardware by the ECC unit. The proposed SPM-centric scheduling is integrated at the OS level; thus it is transparent to applications. The proposed scheme is implemented and evaluated on an FPGA platform and a Commercial-Off-The-Shelf (COTS) platform.
In regards to real-time task communication, two types of communication are considered. 1) Asynchronous inter-task communication, between either sequential tasks (single-threaded) or parallel tasks (multi-threaded). 2) Intra-task communication, where parallel threads of the same application exchange data. A new task scheduling model for parallel tasks (Bundled Scheduling) is proposed to facilitate intra-task communication and reduce synchronization overheads. We show that the proposed bundled scheduling model can be applied to several parallel programming models, such as fork-join and DAG-based applications, leading to improved system schedulability. Finally, intra-task communication is governed by a predictable inter-core communication platform. Specifically, we propose HopliteRT, a lean and predictable Network-on-Chip that connects the private SPMs
Development and certification of mixed-criticality embedded systems based on probabilistic timing analysis
A growing variety of emerging systems replaces or augments the functionality of mechanical subsystems with embedded electronics. Given the number, complexity, and use of these subsystems, their safety is an increasingly important matter. Accordingly, such systems are subject to safety certification, which demonstrates a system's safety through rigorous development processes and hardware/software constraints. The massive increase in the complexity of embedded processors makes the already arduous certification task significantly harder. This thesis addresses the certification challenges of multicore architectures: despite their potential to integrate several applications on a single platform, their inherent complexity imperils their timing predictability and certification. Recently, the Measurement-Based Probabilistic Timing Analysis (MBPTA) technique emerged as an alternative for dealing with hardware/software complexity. The innovation that MBPTA brings is, however, a major departure from current certification procedures and standards. The particular contributions of this thesis are: (i) the definition of certification arguments for mixed-criticality integration on multicore processors; in particular, we propose a set of safety mechanisms and procedures required to comply with functional safety standards. (ii) For timing predictability, we present a quantitative approach to assess the likelihood of execution-time exceedance events with respect to the risk reduction requirements of safety standards. To this end, we build upon the MBPTA approach and present the design of a safety-related source of randomization (SoR), which plays a key role in the platform-level randomization needed by MBPTA. (iii) We evaluate current certification guidance with respect to emerging high-performance design trends such as caches.
Overall, this thesis pushes the certification limits in the use of multicore and MBPTA technology in Critical Real-Time Embedded Systems (CRTES) and paves the way towards their adoption in industry.
A growing variety of emerging systems replace or augment the functionality of mechanical subsystems with embedded electronic components. The increase in the number and complexity of these electronic subsystems, as well as their role, makes their safety a matter of growing importance. So much so that bringing these critical systems to market is subject to rigorous certification processes, in which the safety of the system is guaranteed through strict constraints on the development and design of its hardware and software. This thesis addresses the new challenges and difficulties raised by the introduction of multicore processors in such critical systems: although their higher performance attracts industry interest in integrating multiple applications on a single platform, they entail greater complexity. Their architecture challenges timing analysis by traditional methods and, likewise, their certification is increasingly complex and costly. To deal with these limitations, a novel measurement-based probabilistic timing analysis technique (MBPTA) has recently been developed. The innovation of this technique, however, represents a major cultural change with respect to traditional certification standards and procedures. Along these lines, the contributions of this thesis are grouped into three main axes: (i) the definition of safety arguments for the certification of mixed-criticality applications on multicore platforms. In particular, safety mechanisms and fault diagnosis and reaction techniques are defined in accordance with the IEC 61508 standard on a reference multicore architecture.
Regarding timing analysis, (ii) we present the quantification of the probability of exceeding a time limit and its relation to the risk reduction requirements derived from functional safety standards. To this end, we build on the MBPTA technique and present the design of a safe random number source, a key component for achieving the randomness properties required by MBPTA at platform level. Finally, (iii) we extrapolate current guidelines for the certification of multicore architectures to a commercial 8-core solution and evaluate them with respect to emerging high-performance design trends (caches). With these contributions, this thesis addresses the challenges that the use of multicore processors and MBPTA imply for the certification process of critical real-time systems, thereby facilitating their adoption by industry.
Postprint (published version)
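The notion of an execution-time exceedance probability in contribution (ii) can be made concrete with a small calculation. The sketch below only computes the empirical fraction of measured runs that exceed a timing budget. It is not MBPTA itself, which extrapolates from measurements with extreme value theory to probabilities far smaller than any measurement campaign can observe directly. The function name, the sample values, and the budget are assumptions made for illustration.

```python
from typing import Sequence


def empirical_exceedance(samples: Sequence[float], budget: float) -> float:
    """Fraction of observed execution times strictly above the budget."""
    if not samples:
        raise ValueError("at least one measurement is required")
    exceed = sum(1 for t in samples if t > budget)
    return exceed / len(samples)


if __name__ == "__main__":
    # Hypothetical execution-time measurements in microseconds.
    measured = [10, 12, 11, 15, 20, 13, 12, 11, 14, 30]
    print(empirical_exceedance(measured, budget=14))
```

An estimate of this kind cannot resolve probabilities below one over the number of measurements, which is one reason a statistical extrapolation step is needed before comparing against the very low failure rates that safety standards demand.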
Architectural Support for Hypervisor-Level Intrusion Tolerance in MPSoCs
Increasingly, more aspects of our lives rely on the correctness and safety of computing systems, namely in the embedded and cyber-physical (CPS) domains, which directly affect the physical world. While systems have been pushed to their limits of functionality and efficiency, security threats and generic hardware quality have challenged their safety.
Leveraging the enormous modular power, diversity and flexibility of these systems, often deployed in multi-processor systems-on-chip (MPSoC), requires careful orchestration of complex and heterogeneous resources, a task left to low-level software, e.g., hypervisors. In current architectures, this software forms a single point of failure (SPoF) and a worthwhile target for attacks: once compromised, adversaries can gain access to all information and full control over the platform and the environment it controls, for instance by means of privilege escalation and resource allocation. Currently, solutions to protect low-level software often rely on a simpler, underlying trusted layer which is often a SPoF itself and/or exhibits downgraded performance.
Architectural hybridization allows for the introduction of trusted-trustworthy components, which, combined with fault and intrusion tolerance (FIT) techniques leveraging replication, are capable of safely handling critical operations, thus eliminating SPoFs. Performing quorum-based consensus on all critical operations, in particular privilege management, ensures that no compromised low-level software can single-handedly manipulate privilege escalation or resource allocation to negatively affect other system resources by propagating faults or to further extend an adversary’s control. However, the performance impact of traditional Byzantine fault tolerant state-machine replication (BFT-SMR) protocols is prohibitive in the context of MPSoCs due to the high cost of cryptographic operations and the number of messages exchanged. Furthermore, fault isolation, one of the key prerequisites of FIT, is difficult to achieve given that the whole system resides within one chip in such platforms.
So far, no solution completely and efficiently addresses the SPoF issue in critical low-level management software. Our aim, then, is to devise such a solution that additionally reaps the benefits of the tightly coupled nature of such manycore systems. In this thesis we present two architectures, iBFT and Midir, which use trusted-trustworthy mechanisms and consensus protocols and are capable of protecting all software layers, specifically at the low level, by performing critical operations only when a majority of correct replicas agree to their execution. Moreover, we discuss ways in which these can be used at the application level, using the example of replicated applications sharing critical data structures. It then becomes possible to confine software-level faults and some hardware faults to the individual tiles of an MPSoC, converting tiles into fault containment domains, thus enabling fault isolation and, consequently, making way for high-performance FIT at the lowest level
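The idea of executing a critical operation only when enough replicas agree can be sketched as a simple vote check. This is not the iBFT or Midir protocol, whose details the abstract does not give. It is a minimal illustration assuming that each replica submits one vote and that the quorum size is chosen by the caller as a strict majority of the replica count. All names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def quorum_decision(
    votes: Sequence[Hashable], n_replicas: int, quorum: int
) -> Optional[Hashable]:
    """Return the value backed by at least `quorum` votes, else None.

    Assumption: requiring a strict majority of all replicas means that
    two different values can never both reach the quorum.
    """
    if 2 * quorum <= n_replicas:
        raise ValueError("quorum must be a strict majority of the replicas")
    top = Counter(votes).most_common(1)
    if not top:
        return None  # no votes received yet
    value, count = top[0]
    return value if count >= quorum else None


if __name__ == "__main__":
    # Three replicas, two of which agree to grant a privilege change.
    print(quorum_decision(["grant", "grant", "deny"], n_replicas=3, quorum=2))
```

A single compromised replica voting differently cannot change the outcome in this sketch, which mirrors the goal stated above that no single component can manipulate a critical operation on its own.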
Erreichen von Performance in Netzwerken-On-Chip für Echtzeitsysteme
In many new applications, such as automated driving, high performance requirements have reached safety-critical real-time systems. Consequently, Networks-on-Chip (NoCs) must efficiently host new sets of highly dynamic workloads, e.g., high-resolution sensor fusion and data processing, or autonomous decision making combined with machine learning.
The static platform management used in current safety-critical systems is no longer sufficient to provide the needed level of service. Dynamic platform management could meet the challenge, but it usually lacks the predictability and simplicity necessary for the certification of safety and real-time properties. In this work, we propose a novel, global, and dynamic arbitration for NoCs with real-time QoS requirements. The mechanism decouples admission control from arbitration in routers, thereby simplifying dynamic adaptation and real-time analysis. Consequently, the proposed solution allows the deployment of sophisticated contract-based QoS provisioning without introducing the complicated and hard-to-maintain schemes known from the frequently applied static arbiters.
The presented work introduces an overlay network that synchronizes transmissions using arbitration units called Resource Managers (RMs), which allows global and work-conserving scheduling. The description of resource allocation strategies is supplemented by a protocol design and a verification methodology that bring adaptive control to NoC communication in setups with different QoS requirements and traffic classes. To this end, a formal worst-case timing analysis of the mechanism is proposed, which demonstrates that this solution not only exhibits higher performance in simulation but, even more importantly, consistently reaches smaller formally guaranteed worst-case latencies than other strategies for realistic levels of system utilization.
The approach is not limited to a specific network architecture or topology, as the mechanism does not require modifications to routers and can therefore be used together with the majority of existing manycore systems. The evaluation was carried out using generic performance-optimized router designs as well as two systems-on-chip focused on real-time deployments. The results confirmed that the proposed approach exhibits significantly higher average performance in simulation and execution.
In many new safety-critical applications, such as automated driving, high demands are placed on the performance of real-time systems. Networks-on-Chip (NoCs) must therefore efficiently accommodate new, highly dynamic workloads on a single system, such as high-resolution sensor fusion and data processing or autonomous decision making combined with machine learning. The control of the underlying NoC architecture must protect system safety against faults resulting from the dynamic behavior of the system while at the same time providing the required performance.
In this work, we propose a novel, global, and dynamic control scheme for NoCs with real-time QoS requirements. The scheme decouples admission control from arbitration in routers, which enables dynamic adaptation and simplifies real-time analysis. This allows a sophisticated contract-based resource allocation to be deployed without introducing the complex and hard-to-maintain mechanisms already known from static platform management.
This work presents an overlay network that synchronizes transmissions with the help of arbitration units, the so-called Resource Managers (RMs). This overlay network enables global and work-conserving control. The description of various resource allocation strategies is complemented by a protocol design and methods for verifying the adaptive NoC control under different QoS requirements and traffic classes. To this end, a formal worst-case timing analysis is presented that models the proposed method. The results confirm that the presented solution not only offers higher performance in simulation but also formally guarantees smaller worst-case latencies than other strategies for realistic system utilization levels.
The presented approach is not limited to a particular network architecture or topology, since the mechanism requires no changes to the underlying routers, and it can therefore be used together with existing manycore systems. The evaluation was based on a performance-optimized router design as well as two platforms focused on real-time applications. The results confirmed that the proposed approach delivers significantly higher performance on average in simulation and execution.
- …