39 research outputs found

    Transactions Everywhere

    Get PDF
    Arguably, one of the biggest deterrants for software developers who might otherwise choose to write parallel code is that parallelism makes their lives more complicated. Perhaps the most basic problem inherent in the coordination of concurrent tasks is the enforcing of atomicity so that the partial results of one task do not inadvertently corrupt another task. Atomicity is typically enforced through locking protocols, but these protocols can introduce other complications, such as deadlock, unless restrictive methodologies in their use are adopted. We have recently begun a research project focusing on transactional memory [18] as an alternative mechanism for enforcing atomicity, since it allows the user to avoid many of the complications inherent in locking protocols. Rather than viewing transactions as infrequent occurrences in a program, as has generally been done in the past, we have adopted the point of view that all user code should execute in the context of some transaction. To make this viewpoint viable requires the development of two key technologies: effective hardware support for scalable transactional memory, and linguistic and compiler support. This paper describes our preliminary research results on making “transactions everywhere” a practical reality.Singapore-MIT Alliance (SMA

    LEAP Scratchpads: Automatic Memory and Cache Management for Reconfigurable Logic [Extended Version]

    Get PDF
    CORRECTION: The authors for entry [4] in the references should have been "E. S. Chung, J. C. Hoe, and K. Mai".Developers accelerating applications on FPGAs or other reconfigurable logic have nothing but raw memory devices in their standard toolkits. Each project typically includes tedious development of single-use memory management. Software developers expect a programming environment to include automatic memory management. Virtual memory provides the illusion of very large arrays and processor caches reduce access latency without explicit programmer instructions. LEAP scratchpads for reconfigurable logic dynamically allocate and manage multiple, independent, memory arrays in a large backing store. Scratchpad accesses are cached automatically in multiple levels, ranging from shared on-board, RAM-based, set-associative caches to private caches stored in FPGA RAM blocks. In the LEAP framework, scratchpads share the same interface as on-die RAM blocks and are plug-in replacements. Additional libraries support heap management within a storage set. Like software developers, accelerator authors using scratchpads may focus more on core algorithms and less on memory management. Two uses of FPGA scratchpads are analyzed: buffer management in an H.264 decoder and memory management within a processor microarchitecture timing model

    Comparison of high level design methodologies for algorithmic IPs : Bluespec and C-based synthesis

    Get PDF
    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009.Includes bibliographical references (leaves 37-39).High level hardware design of Digital Signal Processing algorithms is an important design problem for decreasing design time and allowing more algorithmic exploration. Bluespec is a Hardware Design Language (HDL) that allows designers to express intended microarchitecture through high-level constructs. C-based design tools directly generate hardware from algorithms expressed in C/C++. This research compares these two design methodologies in developing hardware for Reed-Solomon decoding algorithm under area and performance metrics. This work illustrates that C-based design flow may be effective in early stages of the design development for fast prototyping. However, the Bluespec design flow produces hardware that is more customized for performance and resource constraints. This is because in later stages, designers need to have close control over the hardware structure generated that is a part of HDLs like Bluespec, but is difficult to express under the constraints of sequential C semantics.by Abhinav Agarwal.S.M

    Design tradeoffs for simplicity and efficient verification in the Execution Migration Machine

    Get PDF
    As transistor technology continues to scale, the architecture community has experienced exponential growth in design complexity and significantly increasing implementation and verification costs. Moreover, Moore's law has led to a ubiquitous trend of an increasing number of cores on a single chip. Often, these large-core-count chips provide a shared memory abstraction via directories and coherence protocols, which have become notoriously error-prone and difficult to verify because of subtle data races and state space explosion. Although a very simple hardware shared memory implementation can be achieved by simply not allowing ad-hoc data replication and relying on remote accesses for remotely cached data (i.e., requiring no directories or coherence protocols), such remote-access-based directoryless architectures cannot take advantage of any data locality, and therefore suffer in both performance and energy. Our recently taped-out 110-core shared-memory processor, the Execution Migration Machine (EM[superscript 2]), establishes a new design point. On the one hand, EM[superscript 2] supports shared memory but does not automatically replicate data, and thus preserves the simplicity of directoryless architectures. On the other hand, it significantly improves performance and energy over remote-access-only designs by exploiting data locality at remote cores via fast hardware-level thread migration. In this paper, we describe the design choices made in the EM[superscript 2] chip as well as our choice of design methodology, and discuss how they combine to achieve design simplicity and verification efficiency. Even though EM[superscript 2] is a fairly large design-110 cores using a total of 357 million transistors-the entire chip design and implementation process (RTL, verification, physical design, tapeout) took only 18 man-months

    HieraGen: Automated Generation of Concurrent, Hierarchical Cache Coherence Protocols

    Get PDF

    Automatic generation of highly concurrent, hierarchical and heterogeneous cache coherence protocols from atomic specifications

    Get PDF
    Cache coherence protocols are often specified using only stable states and atomic transactions for a single cache hierarchy level. Designing highly-concurrent, hierarchical and heterogeneous directory cache coherence protocols from these atomic specifications for modern multicore architectures is a complicated task. To overcome these design challenges we have developed the novel *Gen algorithms (ProtoGen, HieraGen and HeteroGen). Using the *Gen algorithms highly-concurrent, hierarchical and heterogeneous cache coherence protocols can be automatically generated for a wide range of atomic input stable state protocol (SSP) speci fications - including the MOESI variants, as well as for protocols that are targeted towards Total Store Order and Release Consistency. In addition, for each *Gen algorithm we have developed and published an eponymous tool. The ProtoGen tool takes as input a single SSP (i.e., no concurrency) generating the corresponding protocol for a multicore architecture with non-atomic transactions. The ProtoGen algorithm automatically enforces the correct interleaving of conflicting coherence transactions for a given atomic coherence protocol specification. HieraGen is a tool for automatically generating hierarchical cache coherence protocols. Its inputs are SSPs for each level of the hierarchy and its output is a highly concurrent hierarchical protocol. HieraGen thus reduces the complexity that architects face by offloading the challenging task of composing protocols and managing concurrency. HeteroGen is a tool for automatically generating heterogeneous protocols that adhere to precise consistency models. As input, HeteroGen takes SSPs of the per-cluster coherence protocols, each of which satisfies its own per-cluster consistency model. The output is a concurrent (i.e., with transient states) heterogeneous protocol that satisfies a precisely defined consistency model that we refer to as a compound consistency model. To validate the correctness of the *Gen algorithms, the generated output protocols were verified for safety and deadlock freedom using a model checker. To verify the correctness of protocols that need to adhere to a specific compound consistency model generated by HeteroGen, novel litmus tests for multiple compound consistency models were developed. The protocols automatically generated using the *Gen tools have a comparable or better performance than manually generated cache coherence protocols, often discovering opportunities to reduce stalls. Thus, the *Gen tools reduce the complexity that architects face by offloading the challenging tasks of composing protocols and managing concurrency

    Scalable reconfigurable computing leveraging latency-insensitive channels

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2013.Cataloged from PDF version of thesis.Includes bibliographical references (p. 190-197).Traditionally, FPGAs have been confined to the limited role of small, low-volume ASIC replacements and as circuit emulators. However, continued Moore's law scaling has given FPGAs new life as accelerators for applications that map well to fine-grained parallel substrates. Examples of such applications include processor modelling, compression, and digital signal processing. Although FPGAs continue to increase in size, some interesting designs still fail to fit in to a single FPGA. Many tools exist that partition RTL descriptions across FPGAs. Unfortunately, existing tools have low performance due to the inefficiency of maintaining the cycle-by-cycle behavior of RTL among discrete FPGAs. These tools are unsuitable for use in FPGA program acceleration, as the purpose of an accelerator is to make applications run faster. This thesis presents latency-insensitive channels, a language-level mechanism by which programmers express points in their their design at which the cycle-by-cycle behavior of the design may be modified by the compiler. By decoupling the timing of portions of the RTL from the high-level function of the program, designs may be mapped to multiple FPGAs without suffering the performance degradation observed in existing tools. This thesis demonstrates, using a diverse set of large designs, that FPGA programs described in terms of latency-insensitive channels obtain significant gains in design feasibility, compilation time, and run-time when mapped to multiple FPGAs.by Kermin Elliott Fleming, Jr.Ph.D

    Rapid designs for cache coherence protocol engines in Bluespec

    Get PDF
    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005.Includes bibliographical references (p. 91-92).In this thesis, we present the framework for Rapid Protocol Engine Development (RaPED). We implemented the framework in Bluespec, which is a high level hardware language based on Term Rewriting Systems (TRSs). The framework is highly parameterized and general, thus allowing designers to design any protocol engine in a short period. Since protocol engines can be developed rapidly, designers can compare different designs instead of freezing the design prematurely in the development process. We used the RaPED to implement a cache coherence protocol for Shen and Arvind's Commit-Reconcile and Fences (CRF) memory model [1]. The CRF allows scalable implementations of shared memory systems by decomposing memory access operations into simpler instructions. However, the focus for Shen's Cachet protocol for the CRF was adaptivity and correctness, it ignored some important implementation issues such as cache-line replacement, efficient buffer management and compatibility with multiword cache lines. In this thesis, we present a protocol called the Multiword Base protocol, which avoids these limitations. We defined the Multi-word CRF (MCRF) memory model to help us to prove the correctness of Multiword Base. The MCRF is a specialization of the CRF with modifications that summarizes the properties of multiword cache lines. We show that Multiword Base is a correct implementation of the CRF by using the MCRF to simulate Multiword Base. Apart from using multiword cache lines, many cache coherence protocols allow a cache to get data directly from another cache. The caches having this property is calling the snoopy caches. In this thesis, we present a CRF variant called the Snoopy CRF (SCRF) memory model, which gives hints to incorporate snoopy caches to the implementations of the CRF.by Man Cheuk Ng.S.M
    corecore