819 research outputs found

    Empirical and Statistical Application Modeling Using on-Chip Performance Monitors

    To analyze the performance of applications and architectures, both programmers and architects desire formal methods to explain anomalous behavior. To this end, we present various methods that use non-intrusive performance-monitoring hardware, only recently available on microprocessors, to provide further explanations of observed behavior. All the methods attempt to characterize and explain the instruction-level parallelism achieved by codes on different architectures. We also present a prototype tool that automates the analysis process to exploit the advantages of the proposed empirical and statistical methods. The empirical, statistical and hybrid methods are discussed and explained, with case study results provided. The given methods add to the wealth of tools available to programmers and architects for understanding the performance of scientific applications. Specifically, the models and tools presented provide new methods for evaluating and categorizing application performance. The empirical memory model quantifies the hierarchical memory performance of applications by inferring the latencies that codes incur after the effects of latency-hiding techniques are realized. The instruction-level model and its extensions model on-chip performance analytically, giving insight into inherent performance bottlenecks in superscalar architectures. The statistical model and its hybrid extension provide other methods of categorizing codes via their statistical variations. The PTERA performance tool automates the use of performance counters by these methods across platforms, making the modeling process easier still. These methods provide alternatives for performance modeling and categorization that were not previously available, in an attempt to utilize the inherent modeling capabilities of performance monitors on commodity processors for scientific applications.

    Hardware-software codesign in a high-level synthesis environment

    Interfacing hardware-oriented high-level synthesis to software development is a computationally hard problem for which no general solution exists. Under special conditions, the hardware-software codesign (system-level synthesis) problem may be analyzed with traditional tools and efficient heuristics. This dissertation introduces a new alternative to the currently used heuristic methods. The new approach combines the results of top-down hardware development with existing basic hardware units (bottom-up libraries) and compiler generation tools. The optimization goal is to maximize operating frequency or minimize cost with reasonable tradeoffs in other properties. The dissertation research provides a unified approach to hardware-software codesign. The improvements over previously existing design methodologies are presented in the framework of an academic CAD environment (PIPE). This CAD environment implements a sufficient subset of the functions of commercial microelectronics CAD packages. The results may be generalized to other general-purpose algorithms or environments. Reference benchmarks are used to validate the new approach. Most of the well-known benchmarks are based on discrete-time numerical simulations, digital filtering applications, and cryptography (an emerging field in benchmarking). As there is a need for high-performance applications, an additional requirement for this dissertation is to investigate the performance and design methods of pipelined hardware-software systems. The results demonstrate that the quality of existing heuristics does not change in the enhanced hardware-software environment.

    DeSyRe: on-Demand System Reliability

    The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a system that is entirely free of defects and faults would heavily, even prohibitively, impact design, manufacturing, and testing costs, as well as system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. To reduce the overheads of fault tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints.

    GPU-based Architecture Modeling and Instruction Set Extension for Signal Processing Applications

    The modeling of embedded systems attempts to estimate performance and costs prior to implementation. Early-stage predictions of performance and power dissipation reduce the need for more costly late-stage design modifications. Workload modeling is an approach in which an abstract application is evaluated against an abstract architecture. The challenge in modeling is the balance between fidelity and simplicity, where fidelity refers to the correctness of the predictions and simplicity relates to the simulation time of the model and its ease of comprehension for the developer. A model named GSLA for performance and power modeling is presented, which extends existing architecture modeling by including GPUs as parallel processing elements. The performance model showed an average fidelity of 93% and the power model an average fidelity of 84% between the models and several application measurements. The GSLA model is very simple: it has only 2 parameters, which can be obtained by automated scripts. Besides the modeling, this thesis addresses lower-level signal processing system improvements by proposing Instruction Set Architecture (ISA) extensions for RISC-V processors. A vehicle classifier neural network model was used as a case study, in which the benefit of Bit Manipulation Instructions (BMI) is shown. The result is a new PopCount instruction extension that is verified in the ETISS simulator. The PopCount extension of the RISC-V ISA more than doubled the performance of the vehicle classifier application. In addition, the design flow for adding a new instruction extension for a reconfigurable platform is presented. The GPU modeling and the RISC-V ISA extension add new features to the state of the art. They improve the modeling features and reduce the execution costs in signal processing platforms.
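
As a rough illustration of why a population-count instruction can matter for a neural network workload, the sketch below computes a dot product of two {-1, +1} vectors packed into integers using XNOR followed by a bit count. That the vehicle classifier uses binarized arithmetic of this kind is an assumption, not something the abstract states; the function names `popcount` and `binary_dot` are hypothetical. The software loop in `popcount` is the work that a single hardware PopCount instruction would replace.

```python
def popcount(x: int) -> int:
    """Count set bits in a non-negative integer (software fallback)."""
    count = 0
    while x:
        x &= x - 1  # clear the lowest set bit
        count += 1
    return count


def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two n-element {-1, +1} vectors packed as bits.

    Encoding (an assumption for this sketch): bit 1 means +1, bit 0 means -1.
    """
    mask = (1 << n) - 1
    # XNOR marks positions where the two vectors agree (product is +1).
    agree = popcount(~(a_bits ^ b_bits) & mask)
    # Agreements contribute +1 each, disagreements -1 each.
    return 2 * agree - n
```

For example, `binary_dot(0b1011, 0b1101, 4)` compares four packed elements and counts the agreeing positions with one `popcount` call instead of four multiply-accumulate steps.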

    Testing embedded software in a simulated environment

    Abstract. In this master's thesis, a simulation environment that can be used to execute an embedded software's unit tests is implemented. The purpose of the simulation is to make the development of the embedded firmware easier, cheaper, and faster. A further purpose is to make remote work easier by enabling unit test and integration test execution on a laptop. This topic has been researched extensively, and many different solutions and tools exist for embedded system simulation. Some of these solutions are introduced in this thesis. After the introduction, two of the solutions are implemented for one embedded system that uses monolithic firmware: emulation based on the Unicorn emulator, and simulation with native execution on a PC. Each solution has advantages and disadvantages. In this case, native execution on a PC was the better choice overall, as test execution was two times faster than in the Unicorn emulator and three times faster than on the embedded device. Native execution was also easier to implement than the Unicorn emulator and could use free compilers such as GCC and Clang. The biggest disadvantage of native execution was its low fidelity, which meant that not everything in the code could be tested.

    Post-Quantum Cryptography for Internet of Things: A Survey on Performance and Optimization

    Due to recent developments in quantum computing, the invention of a large quantum computer is no longer a distant prospect. Quantum computing severely threatens modern cryptography, as the hard mathematical problems underlying classic public-key cryptosystems can be solved easily by a sufficiently large quantum computer. As such, researchers have proposed post-quantum cryptography (PQC) based on problems that even quantum computers cannot efficiently solve. Generally, post-quantum encryption and signatures can be hard to compute. This could be a problem for the Internet of Things (IoT), which usually consists of lightweight devices with limited computational power. In this paper, we survey existing literature on the performance of PQC on resource-constrained devices to understand the severity of this problem. We also review recent proposals to optimize PQC algorithms for resource-constrained devices. Overall, we find that whilst PQC may be feasible for reasonably lightweight IoT, proposals for its optimization seem to lack standardization. As such, we suggest that future research seek coordination in order to ensure an efficient and safe migration toward IoT for the post-quantum era. Comment: 13 pages, 3 figures and 7 tables. Formatted version submitted to ACM Computer Survey

    Sequential decomposition of operations and compilers optimization

    Code optimization is an important area of research that has made remarkable contributions to addressing the challenges of information technology. It has introduced a new trend in hardware as well as in software. Efforts made in this context have led to a new foundation for both compilers and processors. In this report we study different techniques used for the sequential decomposition of mappings without using extra variables. We focus on finding and improving these computation techniques. In particular, we are interested in developing methods and efficient heuristic algorithms to find the decompositions, and in implementing these methods in particular cases. We want to implement these methods in a compiler with the aim of optimizing machine-language code. It is always possible to calculate an operation on K registers by a sequence of assignments using only these K registers. We verified the results and introduced new methods. We described in situ computation of a linear mapping by a sequence of linear assignments over the set of integers and investigated a bound for the algorithm. We introduced a method for the case of Boolean bijective mappings via algebraic operations over polynomials in GF(2). We implemented these methods using Mapl
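
A minimal sketch of the idea of computing a mapping on K registers by a sequence of assignments that uses only those K registers, here with K = 2 over the integers. These two small cases are textbook illustrations chosen for this sketch, not the report's algorithms; the function names are hypothetical, and a Python list stands in for the register file.

```python
def swap_in_situ(regs):
    """Compute (x, y) -> (y, x) using only the two registers, no temporary."""
    regs[0] = regs[0] + regs[1]  # x <- x + y
    regs[1] = regs[0] - regs[1]  # y <- (x + y) - y = original x
    regs[0] = regs[0] - regs[1]  # x <- (x + y) - x = original y
    return regs


def sum_diff_in_situ(regs):
    """Compute (x, y) -> (x + y, x - y) with two linear assignments."""
    regs[0] = regs[0] + regs[1]      # x <- x + y
    regs[1] = regs[0] - 2 * regs[1]  # y <- (x + y) - 2y = x - y
    return regs
```

Each assignment overwrites one register with a linear combination of the current register contents, so the whole mapping is realized as a product of elementary steps without any extra storage. Python integers do not overflow; on fixed-width machine registers the same sequence would need wraparound arithmetic to remain correct.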

    A survey on hardware-based malware detection approaches

    This paper examines the changing landscape of computer security, in which malware poses a major threat. Our focus is on recent and promising hardware-based malware detection approaches. By combining hardware performance counters with machine learning, these approaches offer advantages such as real-time detection, resilience to code variations, minimal performance overhead, resistance to having the protection disabled, and cost-effectiveness. Working through a generic hardware-based detection framework, we analyze the approach in detail and identify the most common methods, algorithms, tools, and datasets it relies on. This survey is intended both as a resource for experts and as a starting point for those new to the field of malware detection. However, challenges remain in detecting malware based on hardware events: accuracy needs to improve, and strategies are needed to address the remaining classification errors. The discussion extends to mixed hardware and software approaches that work together, essential enhancements in hardware monitoring units, and a better understanding of the correlation between hardware events and malware applications.

    Fault Detection Methodology for Caches in Reliable Modern VLSI Microprocessors based on Instruction Set Architectures

    The present PhD thesis introduces a low-cost fault detection methodology for small embedded cache memories that is based on modern Instruction Set Architectures (ISAs) and is applied with Software-Based Self-Test (SBST) routines. The proposed methodology applies March tests through software to detect both storage faults, when applied to caches that comprise Static Random Access Memories (SRAM) only, e.g. L1 caches, and comparison faults, when applied to caches that comprise Content Addressable Memories (CAM) in addition to SRAM, e.g. fully associative Translation Lookaside Buffers (TLBs). The proposed methodology can be applied to all three cache associativity organizations (direct-mapped, set-associative and fully associative) and does not depend on the cache write policy. The methodology leverages existing powerful mechanisms of modern ISAs by utilizing instructions that we call Direct Cache Access (DCA) instructions in this thesis. Moreover, our methodology exploits the native performance monitoring hardware and the trap handling mechanisms that are available in modern microprocessors. In addition, the proposed methodology applies March compare operations when needed (for CAM arrays) and verifies the test result with a compact response to comply with periodic on-line testing needs. Finally, a multithreaded optimization of the proposed methodology that targets multithreaded, multicore architectures is also presented in this thesis.
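
To make the notion of a March test concrete, the sketch below runs one well-known variant, March C-, against a simulated bit-addressable memory. The abstract does not say which March algorithms the thesis uses, so the choice of March C- is an assumption, and the `FaultyMemory` class with its single stuck-at fault is a hypothetical stand-in for a real cache array accessed through DCA instructions.

```python
class FaultyMemory:
    """Hypothetical bit-addressable memory with an optional stuck-at fault."""

    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        self.stuck_at = stuck_at  # (address, value) or None

    def write(self, addr, bit):
        self.cells[addr] = bit

    def read(self, addr):
        # A stuck-at cell always returns its stuck value, whatever was written.
        if self.stuck_at is not None and self.stuck_at[0] == addr:
            return self.stuck_at[1]
        return self.cells[addr]


def march_c_minus(mem, size):
    """Run March C- and return the sorted addresses where a read mismatched."""
    failures = set()
    up = range(size)
    down = range(size - 1, -1, -1)

    def element(order, expect, write):
        # One March element: per cell, optionally read-and-check, then write.
        for addr in order:
            if expect is not None and mem.read(addr) != expect:
                failures.add(addr)
            if write is not None:
                mem.write(addr, write)

    element(up, None, 0)   # any order: w0
    element(up, 0, 1)      # ascending: r0, w1
    element(up, 1, 0)      # ascending: r1, w0
    element(down, 0, 1)    # descending: r0, w1
    element(down, 1, 0)    # descending: r1, w0
    element(up, 0, None)   # any order: r0
    return sorted(failures)
```

Calling `march_c_minus(FaultyMemory(8, stuck_at=(3, 1)), 8)` flags address 3, because the first read element expects 0 there and sees 1. A software-based self-test would issue the same read and write sequence against the cache itself and compact the mismatches into a short response.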

    Optimizing SIMD execution in HW/SW co-designed processors

    SIMD accelerators are ubiquitous in microprocessors from different computing domains. Their high compute power and hardware simplicity improve overall performance in an energy-efficient manner. Moreover, their replicated functional units and simple control mechanism make them amenable to scaling to higher vector lengths. However, code generation for these accelerators has been a challenge since their inception. Compilers generate vector code conservatively to ensure correctness. As a result, they lose significant vectorization opportunities and fail to extract the maximum benefit from SIMD accelerators. This thesis proposes to vectorize the program binary speculatively at runtime, in addition to static vectorization at compile time. Different environments provide the runtime profiling and optimization support required for dynamic vectorization, the most prominent being: 1) Dynamic Binary Translators and Optimizers (DBTO) and 2) Hardware/Software (HW/SW) Co-designed Processors. The HW/SW co-designed environment provides several advantages over DBTOs, such as transparent incorporation of new hardware features and binary compatibility. Therefore, we use a HW/SW co-designed environment to assess the potential of speculative dynamic vectorization. Furthermore, we analyze vector code generation for wider vector units and find that, even though SIMD accelerators are amenable to scaling from the hardware point of view, vector code generation at higher vector lengths is even more challenging. The two major factors impeding vectorization for wider SIMD units are: 1) reduced dynamic instruction stream coverage for vectorization and 2) a large number of permutation instructions. To solve the first problem, we propose Variable Length Vectorization, which iteratively vectorizes for multiple vector lengths to improve dynamic instruction stream coverage. Secondly, to reduce the number of permutation instructions, we propose Selective Writing, which selectively writes to different parts of a vector register and avoids permutations. Finally, we tackle the problem of leakage energy in SIMD accelerators. Since SIMD accelerators occupy a significant amount of real estate on the chip, they become the principal source of leakage if not utilized judiciously. Power gating is one of the most widely used techniques to reduce the leakage energy of functional units. However, power gating has its own energy and performance overhead. We propose to selectively devectorize the vector code when the higher SIMD lanes are used intermittently. This selective devectorization keeps the higher SIMD lanes idle and power gated for the maximum duration, resulting in an overall reduction in leakage energy.