
    Time Protection: the Missing OS Abstraction

    Timing channels enable data leakage that threatens the security of computer systems, from cloud platforms to smartphones and browsers executing untrusted third-party code. Preventing unauthorised information flow is a core duty of the operating system; however, present OSes are unable to prevent timing channels. We argue that OSes must provide time protection in addition to the established memory protection. We examine the requirements of time protection, present a design and its implementation in the seL4 microkernel, and evaluate its efficacy as well as its performance overhead on Arm and x86 processors.

    Assessing the security of hardware-assisted isolation techniques


    SafeBet: Secure, Simple, and Fast Speculative Execution

    Spectre attacks exploit microprocessor speculative execution to read and transmit forbidden data outside the attacker's trust domain and sandbox. Recent hardware schemes allow potentially unsafe speculative accesses but prevent the secret's transmission by delaying most access-dependent instructions even in the common no-attack case, which incurs performance loss and hardware complexity. Instead, we propose SafeBet, which allows only safe accesses, and does not delay most of them, achieving both security and high performance. SafeBet is based on the key observation that speculatively accessing a destination location is safe if an access to that location by the same static trust domain has previously been committed, and potentially unsafe otherwise. We extend this observation to handle inter-trust-domain code and data interactions. SafeBet employs the Speculative Memory Access Control Table (SMACT) to track non-speculative trust-domain code-region/destination pairs. Disallowed accesses wait until reaching commit to trigger well-known replay, with virtually no change to the pipeline. Software simulations using SPEC CPU benchmarks show that SafeBet uses an 8.3-KB SMACT per core to perform within 6% on average (63% at worst) of the unsafe baseline, behind which NDA-restrictive, a previous scheme with security and hardware complexity comparable to SafeBet's, lags by 83% on average.
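    The SMACT lookup described above can be sketched as follows. This is a behavioural model only: the region granularity, the use of a Python set for the table, and the method names are illustrative assumptions, not the paper's hardware design.

```python
class SMACT:
    """Behavioural sketch of a Speculative Memory Access Control Table.

    Tracks (trust domain, code region, data region) triples that have been
    observed at commit; a speculative access is allowed only if its triple
    is already in the table. Granularity and names are assumptions.
    """

    def __init__(self, region_bits=12):
        self.region_bits = region_bits   # e.g. 4 KiB regions
        self.table = set()               # committed (domain, code, data) triples

    def _region(self, addr):
        return addr >> self.region_bits

    def record_commit(self, domain, pc, addr):
        # A non-speculative (committed) access trains the table.
        self.table.add((domain, self._region(pc), self._region(addr)))

    def may_speculate(self, domain, pc, addr):
        # Safe only if this trust domain's code region has previously
        # committed an access to this data region; otherwise the access
        # waits for commit (modelled here as simply returning False).
        return (domain, self._region(pc), self._region(addr)) in self.table
```

    A disallowed access in this model would simply stall until commit, matching the paper's claim that misses trigger the existing replay mechanism rather than new pipeline machinery.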

    Topology-Aware and Dependence-Aware Scheduling and Memory Allocation for Task-Parallel Languages

    We present a joint scheduling and memory allocation algorithm for efficient execution of task-parallel programs on non-uniform memory architecture (NUMA) systems. Task and data placement decisions are based on a static description of the memory hierarchy and on runtime information about inter-task communication. Existing locality-aware scheduling strategies for fine-grained tasks have strong limitations: they are specific to some class of machines or applications, they do not handle task dependences, they require manual program annotations, or they rely on fragile profiling schemes. By contrast, our solution makes no assumption on the structure of programs or on the layout of data in memory. Experimental results, based on the OpenStream language, show that locality of accesses to main memory of scientific applications can be increased significantly on a 64-core machine, resulting in a speedup of up to 1.63× compared to a state-of-the-art work-stealing scheduler.
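    The general shape of such a scheduler can be sketched as follows. The queue-per-node layout, the placement heuristic, and the ring-distance stand-in for a real NUMA distance matrix are simplifying assumptions for illustration, not the OpenStream runtime's actual implementation.

```python
from collections import deque

class NumaScheduler:
    """Minimal sketch of topology-aware task scheduling: each NUMA node
    gets its own queue, a task is enqueued where most of its input data
    lives, and idle workers steal from the closest non-empty queue."""

    def __init__(self, num_nodes):
        self.queues = [deque() for _ in range(num_nodes)]

    def submit(self, task, input_placement):
        # input_placement: {node: bytes of the task's input resident there}.
        # Place the task where the bulk of its inputs already live.
        home = max(input_placement, key=input_placement.get)
        self.queues[home].append(task)

    def next_task(self, node):
        # Prefer the local queue, then steal by increasing distance.
        # Ring distance stands in for the machine's NUMA distance matrix.
        n_nodes = len(self.queues)
        order = sorted(range(n_nodes),
                       key=lambda n: min(abs(n - node), n_nodes - abs(n - node)))
        for n in order:
            if self.queues[n]:
                return self.queues[n].popleft()
        return None
```

    A real runtime would also feed dependence information back into placement; here only the data-affinity half of the decision is shown.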

    Performance Optimization of Memory Intensive Applications on FPGA Accelerator

    The abstract is in the attachment.

    Dynamic Binary Translation for Embedded Systems with Scratchpad Memory

    Embedded software development has recently changed with advances in computing. Rather than fully co-designing software and hardware to perform a relatively simple task, nowadays embedded and mobile devices are designed as a platform where multiple applications can be run, new applications can be added, and existing applications can be updated. In this scenario, traditional constraints in embedded systems design (i.e., performance, memory and energy consumption and real-time guarantees) are more difficult to address. New concerns (e.g., security) have become important and increase software complexity as well. In general-purpose systems, Dynamic Binary Translation (DBT) has been used to address these issues with services such as Just-In-Time (JIT) compilation, dynamic optimization, virtualization, power management and code security. In embedded systems, however, DBT is not usually employed due to performance, memory and power overhead. This dissertation presents StrataX, a low-overhead DBT framework for embedded systems. StrataX addresses the challenges faced by DBT in embedded systems using novel techniques. To reduce DBT overhead, StrataX loads code from NAND-Flash storage and translates it into a Scratchpad Memory (SPM), a software-managed on-chip SRAM with limited capacity. SPM has similar access latency as a hardware cache, but consumes less power and chip area. StrataX manages SPM as a software instruction cache, and employs victim compression and pinning to reduce retranslation cost and capture frequently executed code in the SPM. To prevent performance loss due to excessive code expansion, StrataX minimizes the amount of code inserted by DBT to maintain control of program execution. When a hardware instruction cache is available, StrataX dynamically partitions translated code among the SPM and main memory. With these techniques, StrataX has low performance overhead relative to native execution for MiBench programs. 
Further, it simplifies embedded software and hardware design by operating transparently to applications, without any special hardware support. StrataX achieves sufficiently low overhead to make DBT feasible in embedded systems for addressing important design goals and requirements.
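    The SPM instruction-cache management described above might be sketched as follows. The pinning threshold, the eviction policy, and all names are illustrative assumptions; victim compression, which StrataX also employs, is omitted here.

```python
class SpmCodeCache:
    """Toy model of a scratchpad managed as a software instruction cache
    with pinning: hot translated fragments are pinned so they cannot be
    evicted, and cold unpinned fragments are evicted to make room."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.entries = {}     # fragment id -> (size, exec_count)
        self.pinned = set()

    def touch(self, frag, size, pin_threshold=100):
        """Returns True on an SPM hit, False when (re)translation is needed."""
        if frag in self.entries:
            sz, cnt = self.entries[frag]
            self.entries[frag] = (sz, cnt + 1)
            if cnt + 1 >= pin_threshold:
                self.pinned.add(frag)        # hot code stays resident
            return True
        while self.used + size > self.capacity:
            victim = self._pick_victim()
            if victim is None:
                return False                 # everything pinned: no room
            self.used -= self.entries.pop(victim)[0]
        self.entries[frag] = (size, 1)
        self.used += size
        return False                         # miss: fragment just translated in

    def _pick_victim(self):
        candidates = [f for f in self.entries if f not in self.pinned]
        if not candidates:
            return None
        # Evict the least-executed unpinned fragment.
        return min(candidates, key=lambda f: self.entries[f][1])
```

    The point of the sketch is the interaction of the two policies: eviction keeps the small SPM usable, while pinning prevents frequently executed code from being repeatedly evicted and retranslated.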

    Improving the Performance of Systems Using Emerging Memory Technologies

    Doctoral dissertation, Department of Electrical and Computer Engineering, College of Engineering, Seoul National University, February 2018. Advisor: Kiyoung Choi. Emerging memory technologies such as 3D-stacked memory or STT-RAM have higher density than traditional SRAM technology. As a result, these new memory technologies have recently been integrated with processors on the same chip or in the same package, providing more capacity to the processors than traditional SRAMs. Therefore, to improve the performance of the chip or package, it is important to manage these memories effectively as well as to improve the performance of the processors themselves. This dissertation investigates two approaches to improving the performance of systems in which processors and emerging memories are integrated on a single chip or in a single package. The first part focuses on a system in which 3D-stacked memory is integrated with the processor in a package, assuming that the processor is generic and the memory access pattern is not predefined. A DRAM cache technique is proposed that combines previous approaches in a synergistic way by devising a module called the dirty-block tracker, which maintains the dirtiness of each block in a dirty region. The approach avoids unnecessary tag checking for a write operation if the corresponding block in the cache is not dirty. Simulation results show that the proposed technique achieves significant performance improvement on average over the state-of-the-art DRAM cache technique. The second part focuses on a system in which an accelerator and STT-RAM are integrated on a single chip, assuming that deep neural networks are processed on this system. A high-performance, energy-efficient accelerator is designed considering the characteristics of the neural network.
While negative inputs to ReLU are useless, computing them consumes substantial computing power in deep neural networks. A computation pruning technique is proposed that detects at an early stage that a sum of products will be negative, by adopting an inverted two's-complement representation for weights and a bit-serial sum of products. It can therefore skip a large amount of computation for negative results and simply set the ReLU outputs to zero. Moreover, a DNN accelerator architecture is devised that can efficiently apply the proposed technique. The evaluation shows that the accelerator using computation pruning through early negative detection significantly improves both energy efficiency and performance.
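    The early-negative-detection idea can be sketched as follows. This version bounds the contribution of the remaining bit planes explicitly rather than using the dissertation's inverted two's-complement representation, and all names are illustrative.

```python
def relu_dot_early_exit(xs, ws, nbits=8):
    """Bit-serial, MSB-first dot product with early negative detection.

    xs: non-negative activations (ReLU outputs of the previous layer);
    ws: signed nbits-wide two's-complement weights.
    Returns (relu_output, number_of_bit_planes_processed).
    """
    mask = (1 << nbits) - 1
    partial = 0
    for k in range(nbits - 1, -1, -1):
        # In two's complement the MSB carries negative weight.
        plane_weight = -(1 << k) if k == nbits - 1 else (1 << k)
        for x, w in zip(xs, ws):
            bit = ((w & mask) >> k) & 1
            partial += bit * plane_weight * x
        # Remaining bit planes (all below k) can add at most this much.
        remaining_max = sum(xs) * ((1 << k) - 1)
        if partial + remaining_max < 0:
            # Result is guaranteed negative: skip the rest, ReLU gives 0.
            return 0, nbits - k
    return max(partial, 0), nbits
```

    Because the MSB plane is the only negative contribution, a strongly negative result is often provable after a single bit plane, which is where the energy saving comes from.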