45 research outputs found

    Using Efficient Path Profiling to Optimize Memory Consumption of On-Chip Debugging for High-Level Synthesis

    High-Level Synthesis (HLS) for FPGAs is gaining popularity and is increasingly used to handle complex systems with multiple integrated components. To increase performance and efficiency, HLS flows now adopt several advanced optimization techniques. Aggressive optimizations and system-level integration can introduce bugs that are only observable on-chip. Debugging support for circuits generated with HLS is therefore receiving considerable attention. Among the data that can be collected on chip for debugging, one of the most important is the state of the Finite State Machines (FSMs) controlling the components of the circuit. However, tracing this behavior during execution usually requires a large amount of memory. This work proposes an approach that takes advantage of HLS information and of the structure of the FSM to compress control-flow traces and to integrate optimized components for on-chip debugging. The generated checkers analyze the FSM execution on the fly, automatically notifying when a bug is detected, localizing it, and providing data about its cause. The traces are compressed using a software profiling technique, called Efficient Path Profiling (EPP), adapted for the debugging of hardware accelerators generated with HLS. With this technique, the size of the memory used to store control-flow traces can be reduced by up to two orders of magnitude compared to state-of-the-art approaches
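
    As a rough illustration of the idea behind Efficient Path Profiling (the Ball-Larus path-numbering scheme the abstract builds on), the Python sketch below assigns integer increments to the edges of a small acyclic control-flow graph so that every entry-to-exit path sums to a unique identifier; the graph and node names are invented for the example and are not taken from the paper.

        def ball_larus_increments(succs, exit_node):
            """Assign edge increments so each acyclic path gets a unique sum.
            succs: dict node -> ordered list of successor nodes (a DAG)."""
            # Collect nodes in DFS post-order (all successors before the node).
            order, seen = [], set()
            def visit(v):
                if v in seen:
                    return
                seen.add(v)
                for w in succs.get(v, []):
                    visit(w)
                order.append(v)
            for v in succs:
                visit(v)
            num_paths = {exit_node: 1}
            increments = {}
            for v in order:                # leaves first
                if v == exit_node:
                    continue
                total = 0
                for w in succs[v]:
                    increments[(v, w)] = total
                    total += num_paths[w]
                num_paths[v] = total
            return num_paths, increments

        # Hypothetical diamond-shaped CFG: entry -> {a, b} -> exit
        succs = {"entry": ["a", "b"], "a": ["exit"], "b": ["exit"], "exit": []}
        paths, inc = ball_larus_increments(succs, "exit")
        print(paths["entry"])   # 2 distinct paths from entry to exit
        print(inc)              # e.g. {('entry', 'a'): 0, ('entry', 'b'): 1, ...}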

    Side-channel Attacks with Multi-thread Mixed Leakage

    Side-channel attacks are one of the greatest practical threats to security-related applications, because they are capable of breaking ciphers that are assumed to be mathematically secure. Many studies have been devoted to power or electro-magnetic (EM) analysis against desktop CPUs, mobile and embedded CPUs (including ARM, MSP, AVR, etc.) and FPGAs, but they have rarely targeted modern GPUs. Modern GPUs feature a specific single-instruction multiple-thread (SIMT) execution model, which makes their power/EM leakage more sophisticated in practical scenarios. In this paper, we study side-channel attacks with leakage from SIMT systems, and propose leakage models suited to any SIMT system and specifically to CUDA-enabled GPUs. We then instantiate the models with a GPU AES implementation, which is also used for performance evaluations. In addition to the models, we provide optimizations of the attacks that build on them. To evaluate the models and optimizations, we run the GPU AES implementation on a CUDA-enabled GPU and, at the same time, collect its EM leakage. The experimental results show that the proposed models are more efficient and the optimizations are effective as well. Our study suggests that GPU-based cryptographic implementations may be highly vulnerable to microarchitecture-based side-channel attacks. Therefore, GPU-specific countermeasures should be considered for GPU-based cryptographic implementations in practical applications
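
    A hedged sketch of the kind of correlation analysis such leakage models feed into: hypothetical per-thread Hamming-weight leakages are summed to mimic SIMT mixing and correlated against measured traces for every key guess. The intermediate value HW(p XOR k) is a deliberate simplification (a real attack on AES would target the S-box output), and all names and data here are invented for illustration; this is not the paper's actual model.

        import numpy as np

        def hw(x):
            """Hamming weight of each element of a byte array."""
            return np.unpackbits(x.astype(np.uint8)[:, None], axis=1).sum(axis=1)

        def cpa_mixed_leakage(traces, plaintexts):
            """traces: (n_traces, n_samples) measurements.
            plaintexts: (n_traces, n_threads) bytes processed concurrently.
            Returns the key guess whose summed-HW hypothesis correlates best."""
            n_traces, n_threads = plaintexts.shape
            best_guess, best_corr = None, -1.0
            for k in range(256):
                # Mixed-leakage hypothesis: sum of per-thread Hamming weights.
                hyp = np.zeros(n_traces)
                for t in range(n_threads):
                    hyp += hw(plaintexts[:, t] ^ k)
                # Pearson correlation of the hypothesis with every time sample.
                hyp_c = hyp - hyp.mean()
                tr_c = traces - traces.mean(axis=0)
                denom = np.sqrt((hyp_c ** 2).sum() * (tr_c ** 2).sum(axis=0))
                corr = np.abs(hyp_c @ tr_c) / denom
                if corr.max() > best_corr:
                    best_guess, best_corr = k, corr.max()
            return best_guess, best_corr

        # Toy usage with synthetic data (no real measurements involved).
        rng = np.random.default_rng(0)
        pts = rng.integers(0, 256, size=(500, 4), dtype=np.uint8)
        true_key = 0x3C
        leak = sum(hw(pts[:, t] ^ true_key) for t in range(4)).astype(float)
        traces = leak[:, None] + rng.normal(0, 1.0, size=(500, 8))
        print(cpa_mixed_leakage(traces, pts))   # should recover 0x3C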

    Security evaluation against side-channel analysis at compilation time

    Masking is a countermeasure implemented to thwart side-channel attacks. The maturity of high-order masking schemes has reached the level where the concepts are sound and proven. For instance, Rivain and Prouff proposed a full-fledged AES at CHES 2010, although some non-trivial fixes regarding refresh functions were needed. Industry is now adopting such solutions, and for the sake of both quality and certification requirements, masked cryptographic code shall be checked for correctness using the same model as that of the theoretical protection rationale (for instance the probing leakage model). Seminal work was initiated by Barthe et al. at EUROCRYPT 2015 for automated verification at higher orders on concrete implementations. In this paper, we build on this work to perform verification from within a compiler, so as to enable timely feedback to the developer. First, our methodology provides the actual security order of the code at the intermediate representation (IR) level, thereby identifying possible flaws (owing either to source code errors or to compiler optimizations). Second, it allows for an exploitability analysis of the analysed IR code. In this respect, we formally handle all the symbolic expressions in the static single assignment (SSA) representation to build the optimal distinguisher function. This makes it possible to evaluate the most powerful attack, which is a function not only of the masking order d, but also of the number of leaking samples and of the nature of the expressions (e.g., linear vs. non-linear leakages). This scheme allows us to evaluate both the correctness of a masked cryptographic code and its actual security in terms of the number of traces in a given deployment context (characterized by a leakage model of the target CPU and the signal-to-noise ratio of the platform)
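
    A minimal sketch of the kind of check such a verifier performs at order d = 1: for a candidate probed expression in a Boolean-masked computation, enumerate all mask values and verify that the expression's distribution is independent of the secret. The expressions below are toy examples, not the Rivain-Prouff scheme and not the paper's actual IR-level analysis.

        from collections import Counter

        def distribution(expr, secret, n_bits=4):
            """Distribution of expr(secret, mask) over all masks (Boolean masking)."""
            return Counter(expr(secret, m) for m in range(1 << n_bits))

        def first_order_secure(expr, n_bits=4):
            """True if the probed expression reveals nothing about the secret
            in the 1-probe (first-order) probing model."""
            dists = [distribution(expr, s, n_bits) for s in range(1 << n_bits)]
            return all(d == dists[0] for d in dists)

        # share0 = secret ^ mask, share1 = mask  (first-order Boolean masking)
        probe_share0 = lambda s, m: s ^ m          # one share alone: secure
        probe_mask   = lambda s, m: m              # the mask alone: secure
        probe_recomb = lambda s, m: (s ^ m) ^ m    # recombined shares: a flaw

        for name, probe in [("share0", probe_share0),
                            ("mask", probe_mask),
                            ("recombined", probe_recomb)]:
            print(name, "secure:", first_order_secure(probe))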

    Towards Compression at All Levels of the Memory Hierarchy

    Hardware compression techniques are typically simplifications of software compression methods. They must, however, comply with area, power and latency constraints. This study unveils the challenges of adopting compression in memory design. The goal of this analysis is not to summarize proposals, but to highlight the solutions they employ to handle those challenges. An in-depth description of the main characteristics of multiple methods is provided, as well as criteria that can be used as a basis for the assessment of such schemes. Typically, these schemes are not very efficient, and those that do compress well decompress slowly. This work explores their granularity to redefine their perspectives and improve their efficiency, through a concept called Region-Chunk compression. Its goal is to achieve a low (good) compression ratio and low decompression latency. The key observation is that by further sub-dividing the chunks of data being compressed one can reduce data duplication. This concept can be applied to several previously proposed compressors, resulting in a reduction of their average compressed size. In particular, a single-cycle-decompression compressor is boosted to reach a compressibility level competitive with state-of-the-art proposals. Finally, to increase the probability of successfully co-allocating compressed lines, Pairwise Space Sharing (PSS) is proposed. PSS can be applied orthogonally to compaction methods at no extra latency penalty, and with a cost-effective metadata overhead. The proposed system (Region-Chunk+PSS) further enhances the normalized average cache capacity by 2.7% (geometric mean), while featuring short decompression latency
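
    A toy illustration of the intuition behind Region-Chunk compression as described above: sub-dividing the data being compressed into smaller chunks exposes duplication that a coarser granularity misses. The chunk sizes, line contents and size accounting below are invented for the example and do not reproduce the thesis's actual compressor.

        def dedup_compressed_size(line: bytes, chunk_size: int) -> int:
            """Size of a toy dictionary encoding: unique chunks are stored once,
            each chunk position is replaced by a 1-byte index into the dictionary."""
            chunks = [line[i:i + chunk_size] for i in range(0, len(line), chunk_size)]
            unique = set(chunks)
            return len(unique) * chunk_size + len(chunks)   # dictionary + indices

        # A 64-byte cache line: four 16-byte words that differ only in one 4-byte field.
        line = b"".join(b"\x00" * 12 + bytes([i, 0, 0, 0]) for i in range(1, 5))
        assert len(line) == 64

        for size in (16, 8, 4):
            print(f"chunk size {size:2d}: {dedup_compressed_size(line, size)} bytes "
                  f"(original 64 bytes)")
        # Finer chunks expose the shared zero runs, so the toy compressed size shrinks.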

    Energy Aware Runtime Systems for Elastic Stream Processing Platforms

    Following a steady growth in the computational performance required of processors, the multicore revolution started around 20 years ago. This revolution was mainly an answer to the power dissipation constraints restricting the increase of clock frequency in single-core processors. The multicore revolution brought in not only the challenge of parallel programming, i.e. developing software that exploits the full capabilities of many-core architectures, but also the challenge of programming heterogeneous platforms. The question of which processing element a specific computational unit should be mapped onto is well known in the embedded community. With the introduction of general-purpose graphics processing units (GPGPUs) and digital signal processors (DSPs), along with many-core processors on various system-on-chip platforms, heterogeneous parallel platforms are nowadays widespread across several domains, from consumer devices to media processing platforms for telecom operators. Finding a mapping together with a suitable hardware architecture is a process called design-space exploration. This process is very challenging in heterogeneous many-core architectures, which promise to offer benefits in terms of energy efficiency. The main problem is the exponential explosion of the exploration space. With the recent trend of increasing levels of heterogeneity on the chip, selecting the parameters to take into account when mapping software to hardware is still an open research topic in the embedded area. For example, the current Linux scheduler performs poorly when mapping tasks to the computing elements available in hardware: the only metric considered is CPU workload, which, as shown in recent work, does not match the true performance demands of the applications. This may produce an incorrect allocation of resources, resulting in wasted energy. This research originates from the observation that existing approaches do not fully support the dynamic behavior of stream processing applications, especially if that behavior is established only at runtime. It contributes to the general goal of developing energy-efficient solutions for designing streaming applications on heterogeneous and parallel hardware platforms. Streaming applications are nowadays widespread in the software domain. Their distinctive characteristic is that they retrieve multiple streams of data and need to process them in real time. The proposed work develops new approaches to address the challenging problem of efficient runtime coordination of dynamic applications, focusing on energy and performance management
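
    A small sketch of the design-space explosion the abstract refers to: exhaustively enumerating every mapping of tasks onto heterogeneous processing elements and picking the lowest-energy one is only feasible for tiny problem sizes, since the space grows as PEs^tasks. The energy table and names below are invented for illustration, and contention and timing constraints are ignored, which is precisely what makes real design-space exploration hard.

        from itertools import product

        # Hypothetical energy cost (joules) of running each task on each PE type.
        energy = {
            "decode":  {"big_core": 3.0, "little_core": 1.2, "gpu": 2.5},
            "filter":  {"big_core": 2.0, "little_core": 2.8, "gpu": 0.9},
            "encode":  {"big_core": 4.0, "little_core": 6.0, "gpu": 1.5},
            "network": {"big_core": 0.8, "little_core": 0.5, "gpu": 2.0},
        }
        tasks = list(energy)
        pes = ["big_core", "little_core", "gpu"]

        # Brute-force exploration: every assignment of tasks to PEs.
        best = min(product(pes, repeat=len(tasks)),
                   key=lambda m: sum(energy[t][pe] for t, pe in zip(tasks, m)))
        print(dict(zip(tasks, best)))
        print("mappings explored:", len(pes) ** len(tasks))   # 3^4 = 81, exponential in task count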

    Mixed Criticality Systems - A Review (13th Edition, February 2022)

    This review covers research on the topic of mixed criticality systems that has been published since Vestal’s 2007 paper. It covers the period up to the end of 2021. The review is organised into the following topics: introduction and motivation, models, single processor analysis (including job-based, hard and soft tasks, fixed priority and EDF scheduling, shared resources, and static and synchronous scheduling), multiprocessor analysis, related topics, realistic models, formal treatments, systems issues, industrial practice and research beyond mixed-criticality. A list of PhDs awarded for research relating to mixed-criticality systems is also included
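
    For readers unfamiliar with the task model the review is built around, the sketch below encodes Vestal-style tasks with per-criticality-level WCET estimates and runs a standard fixed-priority response-time analysis using the LO-criticality budgets. The task set is invented, and the mixed-criticality analyses surveyed in the review (e.g. AMC) are not reproduced here.

        from dataclasses import dataclass
        from math import ceil

        @dataclass
        class Task:
            name: str
            period: float    # T (also used as the deadline here)
            c_lo: float      # WCET estimate at LO criticality
            c_hi: float      # WCET estimate at HI criticality (>= c_lo for HI tasks)
            crit: str        # "LO" or "HI"

        def response_time(task, higher_prio, budget=lambda t: t.c_lo):
            """Classic fixed-priority response-time iteration, using the chosen
            per-task execution budget (LO-mode budgets by default)."""
            r = budget(task)
            while True:
                r_next = budget(task) + sum(ceil(r / h.period) * budget(h)
                                            for h in higher_prio)
                if r_next == r:
                    return r
                if r_next > task.period:
                    return float("inf")   # unschedulable at this priority
                r = r_next

        # Hypothetical task set, priority order = list order (highest first).
        tasks = [Task("sensor", 10, 2, 4, "HI"),
                 Task("control", 20, 4, 7, "HI"),
                 Task("logging", 50, 8, 8, "LO")]

        for i, t in enumerate(tasks):
            print(t.name, "LO-mode response time:", response_time(t, tasks[:i]))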

    Quantitative Analyses of Software Product Lines

    A software product-line (SPL) is a family of related software systems that are jointly developed and reuse a set of shared assets. Each individual software system in an SPL is called a software product and includes a set of mandatory and optional features, which are independent units of functionality. Software-analysis techniques, such as model checking, analyze a model of a software system to determine whether the software system satisfies its requirements. Because many software-analysis techniques are computationally intensive, and the number of software products in an SPL grows exponentially with the number of features, it tends to be very time consuming to analyze each product of an SPL individually. Family-based analyses have adapted standard software-analysis techniques (e.g., model checking, type checking) to simultaneously analyze all of the software products in an SPL, reusing partial analysis results between different software products to speed up the analysis. However, these family-based analyses verify only the functional requirements of an SPL, and we are interested in analyzing the quality of service that different software products in an SPL would exhibit. Quantitative analyses of a software system model (e.g., of a weighted transition system) can estimate how long a system will take to reach its goal, how much energy a system will consume, and so on. Quantitative analyses are known to be computationally intensive. In this thesis, we investigate whether executing a family-based quantitative analysis on a model of an SPL is faster than individually analyzing every software product of the SPL. First, we present a family-based trace-checking analysis that facilitates the reconfiguration of a dynamic software product line (DSPL), which is a type of SPL in which features can be activated or deactivated at runtime. In three case studies, we assessed whether executing the family-based trace-checking analysis is faster than executing the trace-checking analysis on every software product. Our results indicated that the family-based trace-checking analysis, when combined with simple data abstraction over an SPL model’s quality-attribute values to facilitate sharing of partial analysis results, is between 1.4 and 7.7 times faster than individually analyzing each software product. This suggests that abstraction over the quality-attribute values is key to making family-based trace-checking analysis efficient. Second, we consider an SPL’s maximum long-term average value of a quality attribute (e.g., because it represents the long-term rate of energy consumption of the system). Specifically, the maximum limit-average cost of a weighted transition system represents an upper bound on the long-term average value of a quality attribute over an infinite execution of the system. Because computing the maximum limit-average cost of a software system is computationally intensive, we developed a family-based analysis that simultaneously computes the maximum limit-average cost for each software product in an SPL. We assessed its performance compared to individually analyzing each software product in two case studies. Our results suggest that our family-based analysis will perform best in SPLs in which many products share the same set of strongly connected components. Finally, because both of our family-based analyses require as input a timed (weighted) behaviour model of a software product line, we present a method to learn such a timed (weighted) behaviour model. 
Specifically, the objective is to learn, for each transition t, a regression function that maps a software product to a real-valued weight representing the duration of transition t’s execution in that software product. We apply supervised learning techniques, linear regression and regularized linear regression, to learn such functions. We assessed the accuracy of the learnt models against ground truth in two different SPLs and also compared the accuracy of our method against two state-of-the-art methods: Perfume and a Performance-Influence model. Our results indicate that the accuracy of our learnt models ranged from a mean error of 3.8% to a mean error of 193.0%. Our learnt models were most accurate for those transitions whose execution times had low variance across repeated executions of the transition in the same software product, and in which there was a linear relationship between the transition’s execution time and the presence of features in a software product
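
    A hedged sketch of the learning step described at the end of the abstract: each software product is encoded as a binary feature-presence vector and, per transition, a linear regression maps that vector to the measured transition duration. Plain least squares stands in for the (regularized) linear regression used in the thesis, and the feature names and timings are fabricated for the example.

        import numpy as np

        # Products of a toy SPL, described by which optional features are enabled.
        features = ["cache", "encrypt", "compress"]
        products = [
            {"cache": 1, "encrypt": 0, "compress": 0},
            {"cache": 0, "encrypt": 1, "compress": 0},
            {"cache": 1, "encrypt": 1, "compress": 0},
            {"cache": 0, "encrypt": 0, "compress": 1},
            {"cache": 1, "encrypt": 1, "compress": 1},
        ]
        # Measured duration (ms) of one transition t in each product.
        durations = np.array([1.1, 3.0, 3.2, 1.9, 4.0])

        # Design matrix: intercept column + one column per feature.
        X = np.array([[1.0] + [p[f] for f in features] for p in products])
        coef, *_ = np.linalg.lstsq(X, durations, rcond=None)

        print(dict(zip(["base"] + features, coef.round(2))))
        # Predict the transition's duration for an unseen product (cache + compress).
        new_product = np.array([1.0, 1, 0, 1])
        print("predicted duration (ms):", round(float(new_product @ coef), 2))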

    GPU implementation of video analytics algorithms for aerial imaging

    This work examines several algorithms that together make up parts of an image processing pipeline called Video Mosaicing and Summarization (VMZ). This pipeline takes as input geospatial or biomedical videos and produces large stitched-together frames (mosaics) of the video's subject. The content of these videos presents numerous challenges, such as poor lighting and a rapidly changing scene, and the algorithms of VMZ were chosen carefully to address them. The output of VMZ supports numerous tasks: stabilized imagery allows for easier object tracking, and the mosaics allow a quick understanding of the scene. These use cases for aerial imagery are even more valuable when considered at the edge, where they can be applied while a drone is collecting the data. When executing video analytics algorithms, one of the most important metrics for real-life use is performance: all the accuracy in the world does not guarantee usefulness if the algorithms cannot provide that accuracy in a timely and actionable manner. Thus the goal of this work is to explore means and tools to implement video analytics algorithms, particularly the ones that make up the VMZ pipeline, on GPU devices, making them faster and more available for real-time use. This work presents four algorithms that have been converted to make use of the GPU in the GStreamer environment on NVIDIA GPUs. With GStreamer these algorithms are easily modular and lend themselves well to experimentation and real-life use, even in pipelines beyond VMZ
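
    A minimal sketch of driving a GStreamer pipeline from Python, the environment in which the converted algorithms are said to run. It uses only stock GStreamer elements (videotestsrc, videoconvert, fakesink) as stand-ins; the thesis's actual VMZ and NVIDIA elements are not named in the abstract and are not invented here.

        import gi
        gi.require_version("Gst", "1.0")
        from gi.repository import Gst

        Gst.init(None)

        # A custom GPU-accelerated stage would slot in between the source and sink;
        # stock elements serve as placeholders in this sketch.
        pipeline = Gst.parse_launch(
            "videotestsrc num-buffers=120 ! videoconvert ! fakesink sync=false"
        )
        pipeline.set_state(Gst.State.PLAYING)

        # Block until the stream finishes or errors out, then clean up.
        bus = pipeline.get_bus()
        msg = bus.timed_pop_filtered(
            Gst.CLOCK_TIME_NONE, Gst.MessageType.EOS | Gst.MessageType.ERROR
        )
        if msg and msg.type == Gst.MessageType.ERROR:
            err, debug = msg.parse_error()
            print("pipeline error:", err.message)
        pipeline.set_state(Gst.State.NULL)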

    Efficient fault-injection-based assessment of software-implemented hardware fault tolerance

    With continuously shrinking semiconductor structure sizes and lower supply voltages, the per-device susceptibility to transient and permanent hardware faults is on the rise. A class of countermeasures with growing popularity is Software-Implemented Hardware Fault Tolerance (SIHFT), which avoids expensive hardware mechanisms and can be applied application-specifically. However, SIHFT can, against intuition, cause more harm than good, because its overhead in execution time and memory space also increases the figurative “attack surface” of the system: it turns out that application-specific configuration of SIHFT is in fact a necessity rather than just an advantage. Consequently, target programs need to be analyzed for particularly critical spots to harden. SIHFT-hardened programs need to be measured and compared throughout all development phases of the program to observe reliability improvements or deteriorations over time. Additionally, SIHFT implementations need to be tested. The contributions of this dissertation focus on Fault Injection (FI) as an assessment technique satisfying all these requirements: analysis, measurement and comparison, and test. I describe the design and implementation of an FI tool, named Fail*, that overcomes several shortcomings in the state of the art and enables research on the general drawbacks of simulation-based FI. As demonstrated in four case studies in the context of SIHFT research, Fail* provides novel fine-grained analysis techniques that exploit the newly gained possibility to analyze FI results from complete fault-space exploration. These analysis techniques aid SIHFT design decisions on the level of program modules, functions, variables, source-code lines, or single machine instructions. Based on the experience from the case studies, I address the problem of the large computation effort that accompanies exhaustive fault-space exploration from two different angles: Firstly, I develop a heuristic fault-space pruning technique that allows the total FI-experiment count to be freely traded for result accuracy, while still providing information on all possible fault-space coordinates. Secondly, I speed up individual TAP-based FI experiments by improving the fast-forwarding operation by several orders of magnitude for most workloads. Finally, I dissect current practices in FI-based evaluation of SIHFT-hardened programs, identify three widespread pitfalls in the result interpretation, and advance the state of the art by defining a novel comparison metric
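
    To make the fault-space discussion concrete, here is a toy sketch of the classic def/use pruning idea that complete fault-space exploration tools rely on: for single-bit memory faults, every injection time between a write and the next read of the same location is equivalent to injecting right before that read, and intervals that end in a write need no experiment at all. The dissertation's own heuristic pruning goes further (trading experiment count for accuracy) and is not reproduced here; the trace below is invented.

        def prune_fault_space(trace):
            """trace: list of (time, addr, 'R' or 'W') memory accesses, time-ordered.
            Returns one representative injection per equivalence interval, each with
            the number of raw (time, addr) fault-space points it stands for."""
            last_event = {}            # addr -> time of the previous access
            experiments = []
            for time, addr, kind in trace:
                start = last_event.get(addr, 0)
                if kind == "R":
                    # All injections in (start, time] flip a value that is read here.
                    experiments.append({"addr": addr, "inject_at": time,
                                        "covers": time - start})
                # Injections before a write are overwritten: no experiment needed.
                last_event[addr] = time
            return experiments

        trace = [(2, 0x100, "W"), (5, 0x200, "W"), (9, 0x100, "R"),
                 (12, 0x100, "R"), (15, 0x200, "W"), (20, 0x200, "R")]
        for e in prune_fault_space(trace):
            print(e)
        # 3 experiments cover every live interval; intervals ending in a write need none.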

    Designing Flexible, Energy Efficient and Secure Wireless Solutions for the Internet of Things

    The Internet of Things (IoT) is an emerging concept in which ubiquitous physical objects (things) consisting of sensors, transceivers, processing hardware and software are interconnected via the Internet. The information collected by individual IoT nodes is shared among other, often heterogeneous, devices and over the Internet. This dissertation presents flexible, energy-efficient and secure wireless solutions in the IoT application domain. System and architecture designs are discussed envisioning a near-future world where wireless communication among heterogeneous IoT devices is seamlessly enabled. Firstly, an energy-autonomous wireless communication system for ultra-small, ultra-low-power IoT platforms is presented. To achieve orders-of-magnitude improvements in energy efficiency, a comprehensive system-level framework that jointly optimizes various system parameters is developed. A new synchronization protocol and modulation schemes are specified for energy-scarce ultra-small IoT nodes. Dynamic link adaptation is proposed to guarantee that the ultra-small node always operates in the most energy-efficient mode for a given operating scenario. The outcome is a truly energy-optimized wireless communication system that enables various new applications such as implanted smart-dust devices. Secondly, a configurable Software Defined Radio (SDR) baseband processor is designed and shown to be an efficient platform on which to execute several IoT wireless standards. It is built around a custom SIMD execution model coupled with a scalar unit and several architectural optimizations: streaming registers, variable bitwidth, dedicated ALUs, and an optimized reduction network. Voltage scaling and clock gating are employed to further reduce the power, with more than a 100% time margin reserved for reliable operation in the near-threshold region. Two upper-bound systems are evaluated. A comprehensive power/area estimation indicates that the overhead of realizing SDR flexibility is insignificant. The benefit of baseband SDR is quantified and evaluated. To further augment the benefits of a flexible baseband solution and to address the security issues of IoT connectivity, a light-weight Galois Field (GF) processor is proposed. This processor enables both energy-efficient block coding and symmetric/asymmetric cryptography kernel processing for a wide range of GF sizes (2^m, m = 2, 3, ..., 233) and arbitrary irreducible polynomials. Program-directed connections among primitive GF arithmetic units enable dynamically configured parallelism to efficiently perform either four-way SIMD GF operations, including multiplicative inverse, or a long-bit-width GF product in a single cycle. This demonstrates the feasibility of a unified architecture to enable error correction coding flexibility and secure wireless communication in the low-power IoT domain.
    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    https://deepblue.lib.umich.edu/bitstream/2027.42/137164/1/yajchen_1.pd
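
    A brief sketch of the Galois-field arithmetic such a processor accelerates: multiplication in GF(2^m) as a carry-less multiply with reduction by an irreducible polynomial, plus multiplicative inverse via exponentiation. The code only illustrates the math; it says nothing about the processor's actual datapath, and the chosen field and polynomial (the AES field GF(2^8)) are just a familiar example.

        def gf_mul(a: int, b: int, m: int, poly: int) -> int:
            """Multiply a and b in GF(2^m) with the given irreducible polynomial
            (poly includes the x^m term, e.g. 0x11B for the AES field GF(2^8))."""
            result = 0
            while b:
                if b & 1:
                    result ^= a          # carry-less "add" of the partial product
                b >>= 1
                a <<= 1
                if (a >> m) & 1:         # degree reached m: reduce modulo poly
                    a ^= poly
            return result

        def gf_inv(a: int, m: int, poly: int) -> int:
            """Multiplicative inverse via a^(2^m - 2) (square-and-multiply)."""
            result, exp = 1, (1 << m) - 2
            while exp:
                if exp & 1:
                    result = gf_mul(result, a, m, poly)
                a = gf_mul(a, a, m, poly)
                exp >>= 1
            return result

        # AES field GF(2^8) with x^8 + x^4 + x^3 + x + 1:
        print(hex(gf_mul(0x57, 0x83, 8, 0x11B)))                     # 0xc1 (textbook example)
        print(hex(gf_mul(0x53, gf_inv(0x53, 8, 0x11B), 8, 0x11B)))   # 0x1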