417 research outputs found

    A VHDL model of a superscalar implementation of the DLX instruction set architcture

    Get PDF
    The complexity of today\u27s microprocessors demands that designers have an extensive knowledge of superscalar design techniques; this knowledge is difficult to acquire outside of a professional design team. Presently, there are a limited number of adequate resources available for the student, both in textual and model form. The limited number of options available emphasizes the need for more models and simulators, allowing students the opportunity to learn more about superscalar designs prior to entering the work force. This thesis details the design and implementation of a superscalar version of the DLX instruction set architecture in behavioral VHDL. The branch prediction strategy, instruction issue model, and hazard avoidance techniques are all issues critical to superscalar processor design and are studied in this thesis. Preliminary test results demonstrate that the performance advantage of the superscalar processor is applicable even to short test sequences. Initial findings have shown a performance improvement of 26% to 57% for instruction sequences under 150 instructions

    Integrated shared-memory and message-passing communication in the Alewife multiprocessor

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998.Includes bibliographical references (p. 237-246) and index.by John David Kubiatowicz.Ph.D

    Processor Microarchitecture Security

    Get PDF
    As computer systems grow more and more complicated, various optimizations can unintentionally introduce security vulnerabilities in these systems. The vulnerabilities can lead to user information and data being compromised or stolen. In particular, the ending of both Moore\u27s law and Dennard scaling motivate the design of more exotic microarchitectural optimizations to extract more performance -- further exacerbating the security vulnerabilities. The performance optimizations often focus on sharing or re-using of hardware components within a processor, between different users or programs. Because of the sharing of the hardware, unintentional information leakage channels, through the shared components, can be created. Microarchitectural attacks, such as the high-profile Spectre and Meltdown attacks or the cache covert channels that they leverage, have demonstrated major vulnerabilities of modern computer architectures due to the microarchitectural~optimizations. Key components of processor microarchitectures are processor caches used for achieving high memory bandwidth and low latency for frequently accessed data. With frequently accessed data being brought and stored in caches, memory latency can be significantly reduced when data is fetched from the cache, as opposed to being fetched from the main memory. With limited processor chip area, however, the cache size cannot be very large. Thus, modern processors adopt a cache hierarchy with multiple levels of caches, where the cache close to processor is faster but smaller, and the cache far from processor is slower but larger. This leads to a fundamental property of modern processors: {\em the latency of accessing data in different cache levels and in main memory is different}. As a result, the timing of memory operations when fetching data from different cache levels, e.g., the timing of fetching data from closest-to-processor L1 cache vs. from main memory, can reveal secret-dependent information if attacker is able to observe the timing of these accesses and correlate them to the operation of the victim\u27s code. Further, due to limited size of the caches, memory accesses by a victim may displace attacker\u27s data from the cache, and with knowledge, or reverse-engineering, of the cache architecture, the attacker can learn some information about victim\u27s data based on the modifications to the state of the cache -- which can be observed by the timing~measurements. Caches are not only structures in the processor that can suffer from security vulnerabilities. As an essential mechanism to achieving high performance, cache-like structures are used pervasively in various processor components, such as the translation lookaside buffer (TLB) and processor frontend. Consequently, the vulnerabilities due to timing differences of accessing data in caches or cache-like structures affect many components of the~processor. The main goal of this dissertation is the {\em design of high performance and secure computer architectures}. Since the sophisticated hardware components such as caches, TLBs, value predictors, and processor frontend are critical to ensure high performance, realizing this goal requires developing fundamental techniques to guarantee security in the presence of timing differences of different processor operations. Furthermore, effective defence mechanisms can be only developed after developing a formal and systematic understanding of all the possible attacks that timing side-channels can lead to. To realize the research goals, the main main contributions of this dissertation~are: \begin{itemize}[noitemsep] \item Design and evaluation of a novel three-step cache timing model to understand theoretical vulnerabilities in caches \item Development of a benchmark suite that can test if processor caches or secure cache designs are vulnerable to certain theoretical vulnerabilities. \item Development of a timing vulnerability model to test TLBs and design of hardware defenses for the TLBs to address newly found vulnerabilities. \item Analysis of value predictor attacks and design of defenses for value predictors. \item Evaluation of vulnerabilities in processor frontends based on timing differences in the operation of the frontends. \item Development of a design-time security verification framework for secure processor architectures, using information flow tracking methods. \end{itemize} \newpage This dissertation combines the theoretical modeling and practical benchmarking analysis to help evaluate susceptibility of different architectures and microarchitectures to timing attacks on caches, TLBs, value predictors and processor frontend. Although cache timing side-channel attacks have been studied for more than a decade, there is no evidence that the previously-known attacks exhaustively cover all possible attacks. One of the initial research directions covered by this dissertation was to develop a model for cache timing attacks, which can help lead towards discovering all possible cache timing attacks. The proposed three-step cache timing vulnerability model provides a means to enumerate all possible interactions between the victim and attacker who are sharing a cache-like structure, producing the complete set of theoretical timing vulnerabilities. This dissertation also covers new theoretical cache timing attacks that are unknown prior to being found by the model. To make the advances in security not only theoretical, this dissertation also covers design of a benchmarking suite that runs on commodity processors and helps evaluate their cache\u27s susceptibility to attacks, as well as can run on simulators to test potential or future cache designs. As the dissertation later demonstrates, the three-step timing vulnerability model can be naturally applied to any cache-like structures such as TLBs, and the dissertation encompasses a three-step model for TLBs, uncovering of theoretical new TLB attacks, and proposals for defenses. Building on success of analyzing caches and TLBs for new timing attacks, this dissertation then discusses follow-on research on evaluation and uncovering of new timing vulnerabilities in processor frontends. Since security analysis should be applied not just to existing processor microarchitectural features, the dissertation further analyzes possible future features such as value predictors. Although not currently in use, value predictors are actively being researched and proposed for addition into future microarchitectures. This dissertation shows, however, that they are vulnerable to attacks. Lastly, based on findings of the security issues with existing and proposed processor features, this dissertation explores how to better design secure processors from ground up, and presents a design-time security verification framework for secure processor architectures, using information flow tracking methods

    Acta Cybernetica : Volume 16. Number 2.

    Get PDF

    An OS-Based Alternative to Full Hardware Coherence on Tiled Chip-Multiprocessors

    Get PDF
    Institute for Computing Systems ArchitectureThe interconnect mechanisms (shared bus or crossbar) used in current chip-multiprocessors (CMPs) are expected to become a bottleneck that prevents these architectures from scaling to a larger number of cores. Tiled CMPs offer better scalability by integrating relatively simple cores with a lightweight point-to-point interconnect. However, such interconnects make snooping impractical and, thus, require alternative solutions to cache coherence. This thesis proposes a novel, cost-effective hardware mechanism to support shared-memory parallel applications that forgoes hardware maintained cache coherence. The proposed mech- anism is based on the key ideas that mapping of lines to physical caches is done at the page level with OS support and that hardware supports remote cache accesses. It allows only some controlled migration and replication of data and provides a sufficient degree of flexibility in the mapping through an extra level of indirection between virtual pages and physical tiles. The proposed tiled CMP architecture is evaluated on the SPLASH-2 scientific benchmarks and ALPBench multimedia benchmarks against one with private caches and a distributed direc- tory cache coherence mechanism. Experimental results show that the performance degradation is as little as 0%, and 16% on average, compared to the cache coherent architecture across all benchmarks for 16 and 32 processors

    Storage systems for mobile-cloud applications

    Get PDF
    Mobile devices have become the major computing platform in todays world. However, some apps on mobile devices still suffer from insufficient computing and energy resources. A key solution is to offload resource-demanding computing tasks from mobile devices to the cloud. This leads to a scenario where computing tasks in the same application run concurrently on both the mobile device and the cloud. This dissertation aims to ensure that the tasks in a mobile app that employs offloading can access and share files concurrently on the mobile and the cloud in a manner that is efficient, consistent, and transparent to locations. Existing distributed file systems and network file systems do not satisfy these requirements. Furthermore, current offloading platforms either do not support efficient file access for offloaded tasks or do not offload tasks with file accesses. The first part of the dissertation addresses this issue by designing and implementing an application-level file system named Overlay File System (OFS). OFS assumes a cloud surrogate is paired with each mobile device for task and storage offloading. To achieve high efficiency, OFS maintains and buffers local copies of data sets on both the surrogate and the mobile device. OFS ensures consistency and guarantees that all the reads get the latest data. To effectively reduce the network traffic and the execution delay, OFS uses a delayed-update mechanism, which combines write-invalidate and write-update policies. To guarantee location transparency, OFS creates a unified view of file data. The research tests OFS on Android OS with a real mobile application and real mobile user traces. Extensive experiments show that OFS can effectively support consistent file accesses from computation tasks, no matter where they run. In addition, OFS can effectively reduce both file access latency and network traffic incurred by file accesses. While OFS allows offloaded tasks to access the required files in a consistent and transparent manner, file accesses by offloaded tasks can be further improved. Instead of retrieving the required files from its associated mobile device, a surrogate can discover and retrieve identical or similar file(s) from the surrogates belonging to other users to meet its needs. This is based on two observations: 1) multiple users have the same or similar files, e.g., shared files or images/videos of same object; 2) the need for a certain file content in mobile apps can usually be described by context features of the content, e.g., location, objects in an image, etc.; thus, any file with the required context features can be used to satisfy the need. Since files may be retrieved from surrogates, this solution improves latency and saves wireless bandwidth and power on mobile devices. The second part of the dissertation proposes and develops a Context-Aware File Discovery Service (CAFDS) that implements the idea described above. CAFDS uses a self-organizing map and k-means clustering to classify files into file groups based on file contexts. It then uses an enhanced decision tree to locate and retrieve files based on the file contexts defined by apps. To support diverse file discovery demands from various mobile apps, CAFDS allows apps to add new file contexts and to update existing file contexts dynamically, without affecting the discovery process. To evaluate the effectiveness of CAFDS, the research has implemented a prototype on Android and Linux. The performance of CAFDS was tested against Chord, a DHT based lookup scheme, and SPOON, a P2P file sharing system. The experiments show that CAFDS provides lower end-to-end latency for file search than Chord and SPOON, while providing similar scalability to Chord

    Timing Predictability in Future Multi-Core Avionics Systems

    Full text link

    Automated detection of structured coarse-grained parallelism in sequential legacy applications

    Get PDF
    The efficient execution of sequential legacy applications on modern, parallel computer architectures is one of today’s most pressing problems. Automatic parallelization has been investigated as a potential solution for several decades but its success generally remains restricted to small niches of regular, array-based applications. This thesis investigates two techniques that have the potential to overcome these limitations. Beginning at the lowest level of abstraction, the binary executable, it presents a study of the limits of Dynamic Binary Parallelization (Dbp), a recently proposed technique that takes advantage of an underlying multicore host to transparently parallelize a sequential binary executable. While still in its infancy, Dbp has received broad interest within the research community. This thesis seeks to gain an understanding of the factors contributing to the limits of Dbp and the costs and overheads of its implementation. An extensive evaluation using a parameterizable Dbp system targeting a Cmp with light-weight architectural Tls support is presented. The results show that there is room for a significant reduction of up to 54% in the number of instructions on the critical paths of legacy Spec Cpu2006 benchmarks, but that it is much harder to translate these savings into actual performance improvements, with a realistic hardware-supported implementation achieving a speedup of 1.09 on average. While automatically parallelizing compilers have traditionally focused on data parallelism, additional parallelism exists in a plethora of other shapes such as task farms, divide & conquer, map/reduce and many more. These algorithmic skeletons, i.e. high-level abstractions for commonly used patterns of parallel computation, differ substantially from data parallel loops. Unfortunately, algorithmic skeletons are largely informal programming abstractions and are lacking a formal characterization in terms of established compiler concepts. This thesis develops compiler-friendly characterizations of popular algorithmic skeletons using a novel notion of commutativity based on liveness. A hybrid static/dynamic analysis framework for the context-sensitive detection of skeletons in legacy code that overcomes limitations of static analysis by complementing it with profiling information is described. A proof-of-concept implementation of this framework in the Llvm compiler infrastructure is evaluated against Spec Cpu2006 benchmarks for the detection of a typical skeleton. The results illustrate that skeletons are often context-sensitive in nature. Like the two approaches presented in this thesis, many dynamic parallelization techniques exploit the fact that some statically detected data and control flow dependences do not manifest themselves in every possible program execution (may-dependences) but occur only infrequently, e.g. for some corner cases, or not at all for any legal program input. While the effectiveness of dynamic parallelization techniques critically depends on the absence of such dependences, not much is known about their nature. This thesis presents an empirical analysis and characterization of the variability of both data dependences and control flow across program runs. The cBench benchmark suite is run with 100 randomly chosen input data sets to generate whole-program control and data flow graphs (Cdfgs) for each run, which are then compared to obtain a measure of the variance in the observed control and data flow. The results show that, on average, the cumulative profile information gathered with at least 55, and up to 100, different input data sets is needed to achieve full coverage of the data flow observed across all runs. For control flow, the figure stands at 46 and 100 data sets, respectively. This suggests that profile-guided parallelization needs to be applied with utmost care, as misclassification of sequential loops as parallel was observed even when up to 94 input data sets are used

    Parallel and Distributed Computing

    Get PDF
    The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. Particularly, the topics that are addressed are programmable and reconfigurable devices and systems, dependability of GPUs (General Purpose Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peertopeer networks, largescale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing

    NASA Space Engineering Research Center Symposium on VLSI Design

    Get PDF
    The NASA Space Engineering Research Center (SERC) is proud to offer, at its second symposium on VLSI design, presentations by an outstanding set of individuals from national laboratories and the electronics industry. These featured speakers share insights into next generation advances that will serve as a basis for future VLSI design. Questions of reliability in the space environment along with new directions in CAD and design are addressed by the featured speakers
    • …
    corecore