4,419 research outputs found

    Software trace cache

    Get PDF
    We explore the use of compiler optimizations, which optimize the layout of instructions in memory. The target is to enable the code to make better use of the underlying hardware resources regardless of the specific details of the processor/architecture in order to increase fetch performance. The Software Trace Cache (STC) is a code layout algorithm with a broader target than previous layout optimizations. We target not only an improvement in the instruction cache hit rate, but also an increase in the effective fetch width of the fetch engine. The STC algorithm organizes basic blocks into chains trying to make sequentially executed basic blocks reside in consecutive memory positions, then maps the basic block chains in memory to minimize conflict misses in the important sections of the program. We evaluate and analyze in detail the impact of the STC, and code layout optimizations in general, on the three main aspects of fetch performance; the instruction cache hit rate, the effective fetch width, and the branch prediction accuracy. Our results show that layout optimized, codes have some special characteristics that make them more amenable for high-performance instruction fetch. They have a very high rate of not-taken branches and execute long chains of sequential instructions; also, they make very effective use of instruction cache lines, mapping only useful instructions which will execute close in time, increasing both spatial and temporal locality.Peer ReviewedPostprint (published version

    Instruction fetch architectures and code layout optimizations

    Get PDF
    The design of higher performance processors has been following two major trends: increasing the pipeline depth to allow faster clock rates, and widening the pipeline to allow parallel execution of more instructions. Designing a higher performance processor implies balancing all the pipeline stages to ensure that overall performance is not dominated by any of them. This means that a faster execution engine also requires a faster fetch engine, to ensure that it is possible to read and decode enough instructions to keep the pipeline full and the functional units busy. This paper explores the challenges faced by the instruction fetch stage for a variety of processor designs, from early pipelined processors, to the more aggressive wide issue superscalars. We describe the different fetch engines proposed in the literature, the performance issues involved, and some of the proposed improvements. We also show how compiler techniques that optimize the layout of the code in memory can be used to improve the fetch performance of the different engines described Overall, we show how instruction fetch has evolved from fetching one instruction every few cycles, to fetching one instruction per cycle, to fetching a full basic block per cycle, to several basic blocks per cycle: the evolution of the mechanism surrounding the instruction cache, and the different compiler optimizations used to better employ these mechanisms.Peer ReviewedPostprint (published version

    An integrated vector-scalar design on an in-order ARM core

    Get PDF
    In the low-end mobile processor market, power, energy, and area budgets are significantly lower than in the server/desktop/laptop/high-end mobile markets. It has been shown that vector processors are a highly energy-efficient way to increase performance; however, adding support for them incurs area and power overheads that would not be acceptable for low-end mobile processors. In this work, we propose an integrated vector-scalar design for the ARM architecture that mostly reuses scalar hardware to support the execution of vector instructions. The key element of the design is our proposed block-based model of execution that groups vector computational instructions together to execute them in a coordinated manner. We implemented a classic vector unit and compare its results against our integrated design. Our integrated design improves the performance (more than 6×) and energy consumption (up to 5×) of a scalar in-order core with negligible area overhead (only 4.7% when using a vector register with 32 elements). In contrast, the area overhead of the classic vector unit can be significant (around 44%) if a dedicated vector floating-point unit is incorporated. Our block-based vector execution outperforms the classic vector unit for all kernels with floating-point data and also consumes less energy. We also complement the integrated design with three energy/performance-efficient techniques that further reduce power and increase performance. The first proposal covers the design and implementation of chaining logic that is optimized to work with the cache hierarchy through vector memory instructions, the second proposal reduces the number of reads/writes from/to the vector register file, and the third idea optimizes complex memory access patterns with the memory shape instruction and unified indexed vector load.The research leading to these results has received funding from the RoMoL ERC Advanced Grant GA no 321253 and is supported in part by the European Union (FEDER funds) under contract TIN2015-65316-P. This research has been also supported the Agency for Management of University and Research Grants (AGAUR - FI-DGR 2014). O. Palomar is funded by a Royal Society Newton International Fellowship.Peer ReviewedPostprint (author's final draft

    Out Of Control: Overcoming Control-Flow Integrity

    Get PDF
    As existing defenses like ASLR, DEP, and stack cookies are not sufficient to stop determined attackers from exploiting our software, interest in Control Flow Integrity (CFI) is growing. In its ideal form, CFI prevents flows of control that were not intended by the original program, effectively putting a stop to exploitation based on return oriented programming (and many other attacks besides). Two main problems have prevented CFI from being deployed in practice. First, many CFI implementations require source code or debug information that is typically not available for commercial software. Second, in its ideal form, the technique is very expensive. It is for this reason that current research efforts focus on making CFI fast and practical. Specifically, much of the work on practical CFI is applicable to binaries, and improves performance by enforcing a looser notion of control flow integrity. In this paper, we examine the security implications of such looser notions of CFI: are they still able to prevent code reuse attacks, and if not, how hard is it to bypass its protection? Specifically, we show that with two new types of gadgets, return oriented programming is still possible. We assess the availability of our gadget sets, and demonstrate the practicality of these results with a practical exploit against Internet Explorer that bypasses modern CFI implementations

    Execution Integrity with In-Place Encryption

    Full text link
    Instruction set randomization (ISR) was initially proposed with the main goal of countering code-injection attacks. However, ISR seems to have lost its appeal since code-injection attacks became less attractive because protection mechanisms such as data execution prevention (DEP) as well as code-reuse attacks became more prevalent. In this paper, we show that ISR can be extended to also protect against code-reuse attacks while at the same time offering security guarantees similar to those of software diversity, control-flow integrity, and information hiding. We present Scylla, a scheme that deploys a new technique for in-place code encryption to hide the code layout of a randomized binary, and restricts the control flow to a benign execution path. This allows us to i) implicitly restrict control-flow targets to basic block entries without requiring the extraction of a control-flow graph, ii) achieve execution integrity within legitimate basic blocks, and iii) hide the underlying code layout under malicious read access to the program. Our analysis demonstrates that Scylla is capable of preventing state-of-the-art attacks such as just-in-time return-oriented programming (JIT-ROP) and crash-resistant oriented programming (CROP). We extensively evaluate our prototype implementation of Scylla and show feasible performance overhead. We also provide details on how this overhead can be significantly reduced with dedicated hardware support

    Nonidentity Matching-to-Sample with Retarded Adolescents: Stimulus Equivalences and Sample-Comparison Control

    Get PDF
    In Experiment 1, four subjects were trained to match two visual samples (A) and their respective nonidentical visual comparisons (B); i.e., A-B matching. During nonreinforced test trials, all subjects demonstrated stimulus equivalences within the context of sample-comparison reversibility (B-A matching): When B stimuli were used as samples, appropriate responding to A comparisons occurred. A-B and B-A matching persisted given novel stimuli as alternate comparisons. However, the novel comparisons were consistently selected in the presence of nonmatching stimuli: i.e., during trials comprised of a novel comparison, an A or B sample from one stimulus class, and an incorrect comparison from the other, B or A stimuli respectively. In Experiment 2, three groups of subjects were trained under three different mediated transfer paradigms (e.g., A-B, C-B matching). Tests for reversibility (e.g., B0A, B0C matching) and mediated transfer (e.g., A-C, C-A matching)evinced stimulus equivalences for 11 of 12 subjects. The 11 subjects also matched the mediated equivalences given novel comparisons; whereas, they selected the novel comparisons when combined with nonmatching stimuli. Overall, the demonstrated stimulus equivalences favor a concept learning interpretation of non-identity matching-to-sample. Additionally, the trained and mediated matching relations were comprised of complementary sets of S+ and S- rules: Any stimulus of a given class used as a sample designated both the correct and incorrect comparisons
    • …
    corecore