    Manifold Learning Side-Channel Attacks against Masked Cryptographic Implementations

    Masking is a common countermeasure that protects cryptographic implementations against power side-channel attacks. It makes attacks significantly harder because the sensitive intermediate values are randomly split into multiple shares that are processed at different times. The adversary must combine information across different time samples before launching an attack, which is generally accomplished by feature extraction (e.g., Points-Of-Interest (POIs) combination and dimensionality reduction). However, traditional POI combination methods, machine learning, and deep learning techniques are often too time-consuming and require significant computational resources. In this paper, we undertake the first study of manifold learning and its application against masked cryptographic implementations. The leaked information, which manifests as a manifold of high-dimensional power traces, is mapped into a low-dimensional space, and feature extraction is achieved through manifold learning techniques such as ISOMAP, Locally Linear Embedding (LLE), and Laplacian Eigenmaps (LE). Moreover, to reduce the complexity, we construct explicit polynomial mappings for manifold learning to facilitate the dimensionality reduction. Compared to classical machine learning and deep learning techniques, our schemes built from manifold learning techniques are faster, unsupervised, and require only very simple parameter tuning. Their effectiveness has been validated by our detailed experiments.
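
    To make the splitting of sensitive values concrete, here is a minimal sketch of masking, assuming Boolean (XOR) masking of 8-bit values. The abstract does not say which masking scheme is attacked, so the scheme and the names `mask` and `unmask` are illustrative assumptions only.

```python
import secrets
from functools import reduce
from operator import xor


def mask(value: int, n_shares: int = 2, bits: int = 8) -> list[int]:
    """Split `value` into `n_shares` random shares whose XOR equals `value`.

    Assumption: Boolean masking; other schemes (e.g. arithmetic) differ.
    """
    # All shares but the last are uniformly random.
    randoms = [secrets.randbits(bits) for _ in range(n_shares - 1)]
    # The last share is chosen so that the XOR of all shares is `value`.
    last = reduce(xor, randoms, value)
    return randoms + [last]


def unmask(shares: list[int]) -> int:
    """Recombine the shares by XOR-ing them together."""
    return reduce(xor, shares, 0)
```

    In this sketch each share on its own is uniformly random, so leakage from the time sample of a single share says nothing about the value. This is why an attacker has to combine several time samples.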

    The Threat of Offensive AI to Organizations

    AI has provided us with the ability to automate tasks, extract information from vast amounts of data, and synthesize media that is nearly indistinguishable from the real thing. However, positive tools can also be used for negative purposes. In particular, cyber adversaries can use AI to enhance their attacks and expand their campaigns. Although offensive AI has been discussed in the past, there is a need to analyze and understand the threat in the context of organizations. For example, how does an AI-capable adversary impact the cyber kill chain? Does AI benefit the attacker more than the defender? What are the most significant AI threats facing organizations today, and what will their impact be in the future? In this study, we explore the threat of offensive AI to organizations. First, we present the background and discuss how AI changes the adversary’s methods, strategies, goals, and overall attack model. Then, through a literature review, we identify 32 offensive AI capabilities that adversaries can use to enhance their attacks. Finally, through a panel survey spanning industry, government, and academia, we rank the AI threats and provide insights on the adversaries.


    The emergence of new non-volatile memory (NVM) technology and deep neural network (DNN) inference brings challenges related to off-chip memory access. Ensuring crash consistency leads to additional memory operations and exposes memory update operations on the critical execution path. DNN inference on some accelerators suffers from intensive off-chip memory access. The focus of this dissertation is to tackle the issues related to off-chip memory in these high-performance computing systems. The logging operations required for crash consistency impose a significant performance overhead due to the extra memory accesses. To reduce the persistence time of log requests, we introduce a load-aware log entry allocation scheme that allocates each log request to an address whose bank has the lightest workload. To address the problem of intra-record ordering, we propose to buffer log metadata in a non-volatile ADR buffer until the corresponding log can be removed. Moreover, the recently proposed LAD introduces unnecessary logging operations on multicore CPUs. To reduce these unnecessary operations, we have devised two-stage transaction execution and virtual ADR buffers. To meet the low response time requirements and high computational intensity of DNN inference, these computations are often executed on customized accelerators. However, loading data from off-chip memory typically takes longer than computing, thereby reducing performance in some scenarios, especially on edge devices. To address this issue, we propose an optimization of the widely adopted Weight Stationary dataflow that removes redundant accesses to IFMAP in off-chip memory by reordering the loops in the standard convolution operation. Furthermore, to enhance off-chip memory throughput, we introduce load-aware placement of data tiles in off-chip memory, which reduces the intra/inter contention caused by concurrent accesses from multiple tiles and improves the parallelism of the off-chip memory device during access.
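
    The idea of load-aware log entry allocation can be sketched as follows. This is a simplified illustration that assumes the load of a bank is measured by its number of outstanding requests. The class `LogAllocator` and its methods are hypothetical and are not taken from the dissertation.

```python
from dataclasses import dataclass, field


@dataclass
class LogAllocator:
    """Toy model: send each log request to the least-loaded memory bank."""

    n_banks: int
    # Assumed load metric: outstanding (not yet persisted) requests per bank.
    pending: list[int] = field(init=False)

    def __post_init__(self) -> None:
        self.pending = [0] * self.n_banks

    def allocate(self) -> int:
        """Return the bank with the lightest workload and record the request."""
        bank = min(range(self.n_banks), key=self.pending.__getitem__)
        self.pending[bank] += 1
        return bank

    def complete(self, bank: int) -> None:
        """Mark one request on `bank` as persisted."""
        self.pending[bank] -= 1
```

    With outstanding loads `[3, 1, 2, 1]`, the first `allocate()` call in this sketch returns bank 1 (the first of the two least-loaded banks) and the next call returns bank 3.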

    Dependable Embedded Systems

    This Open Access book introduces readers to many new techniques for enhancing and optimizing reliability in embedded systems, which have emerged particularly within the last five years. It presents the most prominent reliability concerns from today’s point of view and briefly recapitulates the progress made in the community so far. Unlike other books that focus on a single abstraction level, such as the circuit level or the system level alone, this book deals with reliability challenges across different levels, from the physical level all the way to the system level (cross-layer approaches). The book aims to demonstrate how new hardware/software co-design solutions can be proposed to effectively mitigate reliability degradation such as transistor aging, processor variation, temperature effects, and soft errors. The book provides readers with the latest insights into novel, cross-layer methods and models with respect to the dependability of embedded systems; describes cross-layer approaches that can leverage reliability through techniques that are proactively designed with respect to techniques at other layers; and explains run-time adaptation and concepts/means of self-organization for achieving error resiliency in complex, future many-core systems.

    On Information-centric Resiliency and System-level Security in Constrained, Wireless Communication

    The Internet of Things (IoT) interconnects many heterogeneous embedded devices, either locally with each other or globally with the Internet. These things are resource-constrained, e.g., powered by battery, and typically communicate via low-power and lossy wireless links. Communication needs to be secured and relies on crypto-operations that are often resource-intensive and in conflict with the device constraints. These challenging operational conditions on the cheapest hardware possible, the unreliable wireless transmission, and the need for protection against common threats of the inter-network pose severe challenges for IoT networks. In this thesis, we advance the current state of the art in two dimensions. Part I assesses Information-centric networking (ICN) for the IoT, a network paradigm that promises enhanced reliability for data retrieval in constrained edge networks. ICN lacks a lower-layer definition, which, however, is the key to enabling device sleep cycles and exclusive wireless media access. This part of the thesis designs and evaluates an effective media access strategy for ICN to reduce the energy consumption and wireless interference on constrained IoT nodes. Part II examines the performance of hardware and software crypto-operations executed on off-the-shelf IoT platforms. A novel system design enables the accessibility and auto-configuration of crypto-hardware through an operating system. One main focus is the generation of random numbers in the IoT. This part of the thesis further designs and evaluates Physical Unclonable Functions (PUFs) to provide novel randomness sources that generate highly unpredictable secrets on low-cost devices that lack hardware-based security features. This thesis takes a practical view on the constrained IoT and is accompanied by real-world implementations and measurements. 
We contribute open source software, automation tools, a simulator, and reproducible measurement results from real IoT deployments using off-the-shelf hardware. The large-scale experiments in an open access testbed provide a direct starting point for future research.

    Ensemble learning with discrete classifiers on small devices

    Machine learning has become an integral part of everyday life, ranging from applications in AI-powered search queries to (partially) autonomous driving. Many of the advances in machine learning and its application have been possible due to increases in computation power, i.e., by reducing manufacturing sizes while maintaining or even increasing energy consumption. However, 2-3 nm manufacturing is within reach, making further miniaturization increasingly difficult while thermal design power limits are simultaneously reached, rendering entire parts of the chip useless for certain computational loads. In this thesis, we investigate discrete classifier ensembles as a resource-efficient alternative that can be deployed to small devices that only require small amounts of energy. Discrete classifiers are classifiers that can be applied -- and oftentimes also trained -- without the need for costly floating-point operations. Hence, they are ideally suited for deployment to small devices with limited resources. The disadvantage of discrete classifiers is that their predictive performance often lags behind that of their floating-point siblings. Here, the combination of multiple discrete classifiers into an ensemble can help to improve the predictive performance while still having a manageable resource consumption. This thesis studies discrete classifier ensembles from a theoretical point of view, an algorithmic point of view, and a practical point of view. In the theoretical investigation, the bias-variance decomposition and the double-descent phenomenon are examined. The bias-variance decomposition of the mean-squared error is revisited and generalized to an arbitrary twice-differentiable loss function, which serves as a guiding tool throughout the thesis. Similarly, the double-descent phenomenon is -- for the first time -- studied comprehensively in the context of tree ensembles and specifically random forests. 
Contrary to established literature, the experiments in this thesis indicate that there is no double-descent in random forests. While the training of ensembles is well-studied in the literature, the deployment to small devices is often neglected. Additionally, the training of ensembles on small devices has not been considered much so far. Hence, the algorithmic part of this thesis focuses on the deployment of discrete classifiers and the training of ensembles on small devices. First, a novel combination of ensemble pruning (i.e., removing classifiers from the ensemble) and ensemble refinement (i.e., re-training of classifiers in the ensemble) is presented, which uses a novel proximal gradient descent algorithm to minimize a combined loss function. The resulting algorithm removes unnecessary classifiers from an already trained ensemble while improving the performance of the remaining classifiers at the same time. Second, this algorithm is extended to the more challenging setting of online learning, in which the algorithm receives training examples one by one. The resulting shrub ensembles algorithm allows the training of ensembles in an online fashion while maintaining a strictly bounded memory consumption. It outperforms existing state-of-the-art algorithms under resource constraints and offers competitive performance in the general case. Third, this thesis studies the deployment of decision tree ensembles to small devices by optimizing their memory layout. The key insight here is that decision trees have a probabilistic inference time because different observations can take different paths from the root to a leaf. By estimating the probability of visiting a particular node in the tree, one can place it favorably in memory to improve caching behavior and thus increase performance without changing the model. Finally, several real-world applications of tree ensembles and Binarized Neural Networks are presented.
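
    The memory-layout idea can be illustrated with a small sketch. It assumes that each inner node carries an estimated probability of taking its left branch, and that nodes are simply stored in order of decreasing visit probability. The thesis's actual placement strategy is not described in the abstract, and the names `Node`, `visit_probabilities`, and `memory_order` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    p_left: float = 0.0            # assumed: estimated probability of going left
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def visit_probabilities(root: Node) -> dict[str, float]:
    """Probability that an observation reaches each node, starting at the root."""
    probs: dict[str, float] = {}
    stack = [(root, 1.0)]
    while stack:
        node, p = stack.pop()
        probs[node.name] = p
        if node.left is not None:
            stack.append((node.left, p * node.p_left))
        if node.right is not None:
            stack.append((node.right, p * (1.0 - node.p_left)))
    return probs


def memory_order(root: Node) -> list[str]:
    """Order nodes so that frequently visited ones are adjacent in memory."""
    probs = visit_probabilities(root)
    return sorted(probs, key=lambda name: -probs[name])


# Example tree: the left subtree is visited 90% of the time.
tree = Node("root", 0.9,
            left=Node("a", 0.7, left=Node("c"), right=Node("d")),
            right=Node("b"))
print(memory_order(tree))  # ['root', 'a', 'c', 'd', 'b']
```

    In this sketch the rarely visited leaf `b` ends up last, so the hot path `root`, `a`, `c` occupies neighboring memory locations. The model itself is unchanged; only the storage order of its nodes differs.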

    SoCRocket - A flexible and extensible Virtual Platform for the development of robust Embedded Systems

    The focus of this work is raising the level of abstraction in the design process, especially for the design of systems based on Virtual Platforms (VPs), Transaction Level Modeling (TLM), and SystemC. A holistic method for efficiently modeling complex embedded systems is presented. The result is accuracy close to that of RTL synthesis at much higher flexibility and simulation speed. The SoCRocket system builds on existing standards and introduces methods for using them efficiently to improve simulation speed and accuracy. Among other things, it is shown how modern multi-channel protocols with split transfers can be modeled with accurate timing by compensating inter-transaction timing, without introducing additional protocol phases. Standardization gaps in the areas of memory modeling and system configuration are closed by standard-open solutions. Furthermore, new infrastructure for modeling signal communication at the transaction level, verifying components, and modeling power consumption is presented. For demonstration, the core components of a hardware library that is prominent in the European space sector were modeled. All components were first verified in unit tests and subsequently integrated into a system prototype. For functional verification, and to measure simulation speed and timing accuracy, the prototype was configured for different abstraction levels and compared to a VHDL-based RISC reference design (LEON3MP). The loosely-timed (LT) system with blocking communication is on average 561 times faster than the RTL reference and shows an average timing deviation of 7.04%. The approximately-timed (AT) system with non-blocking communication is 335 times faster. Here, the average timing deviation is only 3.03%, corresponding to a standard deviation of 0.033 and thus a very high statistical certainty. The various abstraction levels can be used for multi-stage architecture exploration. This is demonstrated using the example of hyperspectral image compression.