149 research outputs found
Towards Intelligent Runtime Framework for Distributed Heterogeneous Systems
Scientific applications strive for increased memory and computing performance, requiring massive amounts of data and time to produce results. Applications utilize large-scale, parallel computing platforms with advanced architectures to accommodate their needs. However, developing performance-portable applications for modern, heterogeneous platforms requires lots of effort and expertise in both the application and systems domains. This is more relevant for unstructured applications whose workflow is not statically predictable due to their heavily data-dependent nature. One possible solution for this problem is the introduction of an intelligent Domain-Specific Language (iDSL) that transparently helps to maintain correctness, hides the idiosyncrasies of lowlevel hardware, and scales applications. An iDSL includes domain-specific language constructs, a compilation toolchain, and a runtime providing task scheduling, data placement, and workload balancing across and within heterogeneous nodes. In this work, we focus on the runtime framework. We introduce a novel design and extension of a runtime framework, the Parallel Runtime Environment for Multicore Applications. In response to the ever-increasing intra/inter-node concurrency, the runtime system supports efficient task scheduling and workload balancing at both levels while allowing the development of custom policies. Moreover, the new framework provides abstractions supporting the utilization of heterogeneous distributed nodes consisting of CPUs and GPUs and is extensible to other devices. We demonstrate that by utilizing this work, an application (or the iDSL) can scale its performance on heterogeneous exascale-era supercomputers with minimal effort. A future goal for this framework (out of the scope of this thesis) is to be integrated with machine learning to improve its decision-making and performance further. As a bridge to this goal, since the framework is under development, we experiment with data from Nuclear Physics Particle Accelerators and demonstrate the significant improvements achieved by utilizing machine learning in the hit-based track reconstruction process
Pre-validation of SoC via hardware and software co-simulation
Abstract. System-on-chips (SoCs) are complex entities consisting of multiple hardware and software components. This complexity presents challenges in their design, verification, and validation. Traditional verification processes often test hardware models in isolation until late in the development cycle. As a result, cooperation between hardware and software development is also limited, slowing down bug detection and fixing.
This thesis aims to develop, implement, and evaluate a co-simulation-based pre-validation methodology to address these challenges. The approach allows for the early integration of hardware and software, serving as a natural intermediate step between traditional hardware model verification and full system validation. The co-simulation employs a QEMU CPU emulator linked to a register-transfer level (RTL) hardware model. This setup enables the execution of software components, such as device drivers, on the target instruction set architecture (ISA) alongside cycle-accurate RTL hardware models.
The thesis focuses on two primary applications of co-simulation. Firstly, it allows software unit tests to be run in conjunction with hardware models, facilitating early communication between device drivers, low-level software, and hardware components. Secondly, it offers an environment for using software in functional hardware verification.
A significant advantage of this approach is the early detection of integration errors. Software unit tests can be executed at the IP block level with actual hardware models, a task previously only possible with costly system-level prototypes. This enables earlier collaboration between software and hardware development teams and smoothens the transition to traditional system-level validation techniques.Järjestelmäpiirin esivalidointi laitteiston ja ohjelmiston yhteissimulaatiolla. Tiivistelmä. Järjestelmäpiirit (SoC) ovat monimutkaisia kokonaisuuksia, jotka koostuvat useista laitteisto- ja ohjelmistokomponenteista. Tämä monimutkaisuus asettaa haasteita niiden suunnittelulle, varmennukselle ja validoinnille. Perinteiset varmennusprosessit testaavat usein laitteistomalleja eristyksissä kehityssyklin loppuvaiheeseen saakka. Tämän myötä myös yhteistyö laitteisto- ja ohjelmistokehityksen välillä on vähäistä, mikä hidastaa virheiden tunnistamista ja korjausta.
Tämän diplomityön tavoitteena on kehittää, toteuttaa ja arvioida laitteisto-ohjelmisto-yhteissimulointiin perustuva esivalidointimenetelmä näiden haasteiden ratkaisemiseksi. Menetelmä mahdollistaa laitteiston ja ohjelmiston varhaisen integroinnin, toimien luonnollisena välietappina perinteisen laitteistomallin varmennuksen ja koko järjestelmän validoinnin välillä. Yhteissimulointi käyttää QEMU suoritinemulaattoria, joka on yhdistetty rekisterinsiirtotason (RTL) laitteistomalliin. Tämä mahdollistaa ohjelmistokomponenttien, kuten laiteajureiden, suorittamisen kohdejärjestelmän käskysarja-arkkitehtuurilla (ISA) yhdessä kellosyklitarkkojen RTL laitteistomallien kanssa.
Työ keskittyy kahteen yhteissimulaation pääsovellukseen. Ensinnäkin se mahdollistaa ohjelmiston yksikkötestien suorittamisen laitteistomallien kanssa, varmistaen kommunikaation laiteajurien, matalan tason ohjelmiston ja laitteistokomponenttien välillä. Toiseksi se tarjoaa ympäristön ohjelmiston käyttämiseen toiminnallisessa laitteiston varmennuksessa.
Merkittävä etu tästä lähestymistavasta on integraatiovirheiden varhainen havaitseminen. Ohjelmiston yksikkötestejä voidaan suorittaa jo IP-lohkon tasolla oikeilla laitteistomalleilla, mikä on aiemmin ollut mahdollista vain kalliilla järjestelmätason prototyypeillä. Tämä mahdollistaa aikaisemman ohjelmisto- ja laitteistokehitystiimien välisen yhteistyön ja helpottaa siirtymistä perinteisiin järjestelmätason validointimenetelmiin
Recent Advances in Embedded Computing, Intelligence and Applications
The latest proliferation of Internet of Things deployments and edge computing combined with artificial intelligence has led to new exciting application scenarios, where embedded digital devices are essential enablers. Moreover, new powerful and efficient devices are appearing to cope with workloads formerly reserved for the cloud, such as deep learning. These devices allow processing close to where data are generated, avoiding bottlenecks due to communication limitations. The efficient integration of hardware, software and artificial intelligence capabilities deployed in real sensing contexts empowers the edge intelligence paradigm, which will ultimately contribute to the fostering of the offloading processing functionalities to the edge. In this Special Issue, researchers have contributed nine peer-reviewed papers covering a wide range of topics in the area of edge intelligence. Among them are hardware-accelerated implementations of deep neural networks, IoT platforms for extreme edge computing, neuro-evolvable and neuromorphic machine learning, and embedded recommender systems
Real-time high-performance computing for embedded control systems
The real-time control systems industry is moving towards the consolidation of multiple computing systems into fewer and more powerful ones, aiming for a reduction in size, weight, and power. The increasing demand for higher performance in other critical domains like autonomous driving has led the industry to recently include embedded GPUs for the implementation of advanced functionalities. The highly parallel architecture of GPUs could also be leveraged in the control systems industry to develop more advanced, energy-efficient, and scalable control systems. However, the closed-source and non-deterministic nature of GPUs complicates the resource provisioning analysis required for the implementation of critical real-time systems. On the other hand, there is no indication of the integration of GPUs in the traditional development cycle of control systems, which is oriented to the use of a model-based design approach. Recently, some model-based design tools vendors have extended their development frameworks with GPU code generation capabilities targeting hybrid computing platforms, so that the model-based design environment now enables the concurrent analysis of more complex and diverse functions by simulation and automating the deployment to the final target. However, there is no indication whether these tools are well-suited for the design and development of time-sensitive systems.
Motivated by these challenges, in this thesis, we contribute to the state of the art of real-time control systems towards the adoption of embedded GPUs by providing tools to facilitate the resource provisioning analysis and the integration in the model-based design development cycle. First, we present a methodology and an automated tool to extract the properties of GPU memory allocators. This tool allows the computation of the real amount of memory used by GPU applications, facilitating a correct resource provisioning analysis. Then, we present a library which allows the characterization of the use of dynamic memory in GPU applications. We use this library to characterize GPU benchmarks and we identify memory allocation patterns that could be modified to improve performance and memory consumption when targeting embedded GPUs. Based on these results, we present a tool to optimize the use of dynamic memory in legacy GPU applications executed on embedded platforms. This tool allows us to minimize the memory consumption and memory management overhead of GPU applications without rewriting them. Afterwards, we analyze the timing of control algorithms executed in embedded GPUs and we identify techniques to achieve an acceptable real-time behavior. Finally, we evaluate model-based design tools in terms of integration with GPU hardware and GPU code generation, and we propose improvements for the model-based generated GPU code. Then, we present a source-to-source transformation tool to automatically apply the proposed improvements.La industria de los sistemas de control en tiempo real avanza hacia la consolidación de múltiples sistemas informáticos en menos y más potentes sistemas, con el objetivo de reducir el tamaño, el peso y el consumo. La creciente demanda de un mayor rendimiento en otros dominios críticos, como la conducción autónoma, ha llevado a la industria a incluir recientemente GPU embebidas para la implementación de funcionalidades avanzadas. La arquitectura altamente paralela de las GPU también podría aprovecharse en la industria de los sistemas de control para desarrollar sistemas de control más avanzados, eficientes energéticamente y escalables. Sin embargo, la naturaleza privativa y no determinista de las GPUs complica el análisis de aprovisionamiento de recursos requerido para la implementación de sistemas críticos en tiempo real. Por otro lado, no hay indicios de la integración de las GPU en el ciclo de desarrollo tradicional de los sistemas de control, que está orientado al uso de un enfoque de diseño basado en modelos. Recientemente, algunos proveedores de herramientas de diseño basado en modelos han ampliado sus entornos de desarrollo con capacidades de generación de código de GPU dirigidas a plataformas informáticas híbridas, de modo que el entorno de diseño basado en modelos ahora permite el análisis simultáneo de funciones más complejas y diversas mediante la simulación y la automatización de la implementación para el objetivo final. Sin embargo, no hay indicación de si estas herramientas son adecuadas para el diseño y desarrollo de sistemas sensibles al tiempo. Motivados por estos desafíos, en esta tesis contribuimos al estado del arte de los sistemas de control en tiempo real hacia la adopción de GPUs integradas al proporcionar herramientas para facilitar el análisis de aprovisionamiento de recursos y la integración en el ciclo de desarrollo de diseño basado en modelos. Primero, presentamos una metodología y una herramienta automatizada para extraer las propiedades de los asignadores de memoria en GPUs. Esta herramienta permite el cómputo de la cantidad real de memoria utilizada por las aplicaciones GPU, facilitando un correcto análisis del aprovisionamiento de recursos. Luego, presentamos una librería que permite la caracterización del uso de memoria dinámica en aplicaciones de GPU. Usamos esta librería para caracterizar una serie de benchmarks GPU e identificamos patrones de asignación de memoria que podrían modificarse para mejorar el rendimiento y el consumo de memoria al utilizar GPUs embebidas. Con base en estos resultados, presentamos también una herramienta para optimizar el uso de la memoria dinámica en aplicaciones de GPU heredadas al ser ejecutadas en plataformas embebidas. Esta herramienta nos permite minimizar el consumo de memoria y la sobrecarga de administración de memoria de las aplicaciones GPU sin necesidad de reescribirlas. Posteriormente, analizamos el tiempo de los algoritmos de control ejecutados en GPUs embebidas e identificamos técnicas para lograr un comportamiento de tiempo real aceptable. Finalmente, evaluamos las herramientas de diseño basadas en modelos en términos de integración con hardware GPU y generación de código GPU, y proponemos mejoras para el código GPU generado por las herramientas basadas en modelos. Luego, presentamos una herramienta de transformación de código fuente para aplicar automáticamente al código generado las mejoras propuestas.Postprint (published version
Online learning on the programmable dataplane
This thesis makes the case for managing computer networks with datadriven methods automated statistical inference and control based on measurement data and runtime observations—and argues for their tight integration with programmable dataplane hardware to make management decisions faster and from more precise data. Optimisation, defence, and measurement of networked infrastructure are each challenging tasks in their own right, which are currently dominated by the use of hand-crafted heuristic methods. These become harder to reason about and deploy as networks scale in rates and number of forwarding elements, but their design requires expert knowledge and care around unexpected protocol interactions. This makes tailored, per-deployment or -workload solutions infeasible to develop. Recent advances in machine learning offer capable function approximation and closed-loop control which suit many of these tasks. New, programmable dataplane hardware enables more agility in the network— runtime reprogrammability, precise traffic measurement, and low latency on-path processing. The synthesis of these two developments allows complex decisions to be made on previously unusable state, and made quicker by offloading inference to the network.
To justify this argument, I advance the state of the art in data-driven defence of networks, novel dataplane-friendly online reinforcement learning algorithms, and in-network data reduction to allow classification of switchscale data. Each requires co-design aware of the network, and of the failure modes of systems and carried traffic. To make online learning possible in the dataplane, I use fixed-point arithmetic and modify classical (non-neural) approaches to take advantage of the SmartNIC compute model and make use of rich device local state. I show that data-driven solutions still require great care to correctly design, but with the right domain expertise they can improve on pathological cases in DDoS defence, such as protecting legitimate UDP traffic. In-network aggregation to histograms is shown to enable accurate classification from fine temporal effects, and allows hosts to scale such classification to far larger flow counts and traffic volume. Moving reinforcement learning to the dataplane is shown to offer substantial benefits to stateaction latency and online learning throughput versus host machines; allowing policies to react faster to fine-grained network events. The dataplane environment is key in making reactive online learning feasible—to port further algorithms and learnt functions, I collate and analyse the strengths of current and future hardware designs, as well as individual algorithms
A survey of techniques for reducing interference in real-time applications on multicore platforms
This survey reviews the scientific literature on techniques for reducing interference in real-time multicore systems, focusing on the approaches proposed between 2015 and 2020. It also presents proposals that use interference reduction techniques without considering the predictability issue. The survey highlights interference sources and categorizes proposals from the perspective of the shared resource. It covers techniques for reducing contentions in main memory, cache memory, a memory bus, and the integration of interference effects into schedulability analysis. Every section contains an overview of each proposal and an assessment of its advantages and disadvantages.This work was supported in part by the Comunidad de Madrid Government "Nuevas Técnicas de Desarrollo de Software de Tiempo Real Embarcado Para Plataformas. MPSoC de Próxima Generación" under Grant IND2019/TIC-17261
Embedded Machine Learning: Emphasis on Hardware Accelerators and Approximate Computing for Tactile Data Processing
Machine Learning (ML) a subset of Artificial Intelligence (AI) is driving the industrial
and technological revolution of the present and future. We envision a world with smart
devices that are able to mimic human behavior (sense, process, and act) and perform
tasks that at one time we thought could only be carried out by humans. The vision
is to achieve such a level of intelligence with affordable, power-efficient, and fast hardware
platforms. However, embedding machine learning algorithms in many application domains
such as the internet of things (IoT), prostheses, robotics, and wearable devices is an ongoing
challenge. A challenge that is controlled by the computational complexity of ML algorithms,
the performance/availability of hardware platforms, and the application\u2019s budget (power
constraint, real-time operation, etc.). In this dissertation, we focus on the design and
implementation of efficient ML algorithms to handle the aforementioned challenges. First, we
apply Approximate Computing Techniques (ACTs) to reduce the computational complexity of
ML algorithms. Then, we design custom Hardware Accelerators to improve the performance
of the implementation within a specified budget. Finally, a tactile data processing application
is adopted for the validation of the proposed exact and approximate embedded machine
learning accelerators.
The dissertation starts with the introduction of the various ML algorithms used for
tactile data processing. These algorithms are assessed in terms of their computational
complexity and the available hardware platforms which could be used for implementation.
Afterward, a survey on the existing approximate computing techniques and hardware
accelerators design methodologies is presented. Based on the findings of the survey, an
approach for applying algorithmic-level ACTs on machine learning algorithms is provided.
Then three novel hardware accelerators are proposed: (1) k-Nearest Neighbor (kNN) based
on a selection-based sorter, (2) Tensorial Support Vector Machine (TSVM) based on Shallow
Neural Networks, and (3) Hybrid Precision Binary Convolution Neural Network (BCNN).
The three accelerators offer a real-time classification with monumental reductions in the
hardware resources and power consumption compared to existing implementations targeting
the same tactile data processing application on FPGA. Moreover, the approximate accelerators
maintain a high classification accuracy with a loss of at most 5%
Virginia Commonwealth University Courses
Listing of courses for the 2022-2023 year
- …