35 research outputs found

    Hardware support for unbounded transactional memory

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004. Includes bibliographical references (p. 107-111). In this thesis, I propose a design for hardware transactional memory where the transaction size is not bounded by a specialized hardware buffer such as a cache. I describe an unbounded transactional memory system called UTM (unbounded transactional memory) that exploits the perceived common case where transactions are small, but still supports transactions of arbitrary size. As in previous hardware transactional memory systems, UTM uses the cache to store speculative state and uses the cache-coherency protocol to detect conflicting transactions. Unlike previous hardware systems, UTM allows the speculative state to overflow from the cache into main memory, thereby allowing a transaction to grow beyond the size limitation of the cache. The clean semantics of UTM allow nested transaction support, nontransactional instructions, immediate aborts, a processor snapshot, and context-switching support: all features not found in previous hardware transactional systems. UTM was implemented in a detailed simulator, and experimental results show that it can be integrated with existing hardware straightforwardly while still performing better than conventional synchronization techniques. by Sean Lie. M.Eng.
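
    The overflow mechanism described in this abstract can be illustrated with a small software model. The sketch below is purely illustrative and does not reflect the actual UTM hardware design: the class and method names are invented, conflict detection through the coherence protocol is omitted, and a fixed-capacity dictionary stands in for the cache while an unbounded dictionary stands in for the overflow area in main memory.

        # Toy Python model of cache-overflowing speculative transactional state.
        # Illustrative only; names and structure are invented and do not reflect
        # the real UTM hardware implementation.

        class ToyUnboundedTxBuffer:
            def __init__(self, cache_capacity=4):
                self.cache_capacity = cache_capacity  # stands in for the processor cache
                self.cache = {}                       # addr -> speculative value held "in cache"
                self.overflow = {}                    # stands in for the overflow area in main memory
                self.memory = {}                      # committed architectural state

            def tx_store(self, addr, value):
                # Speculative state goes to the cache while there is room...
                if addr in self.cache or len(self.cache) < self.cache_capacity:
                    self.cache[addr] = value
                else:
                    # ...and spills into main memory once the cache is full, so the
                    # transaction is not bounded by cache size.
                    self.overflow[addr] = value

            def tx_load(self, addr):
                if addr in self.cache:
                    return self.cache[addr]
                if addr in self.overflow:
                    return self.overflow[addr]
                return self.memory.get(addr, 0)

            def commit(self):
                # Publish all speculative state, cache-resident and overflowed alike.
                self.memory.update(self.cache)
                self.memory.update(self.overflow)
                self.cache.clear()
                self.overflow.clear()

            def abort(self):
                # Discard all speculative state; committed memory is untouched.
                self.cache.clear()
                self.overflow.clear()

    The point of the sketch is only that a fixed-size structure plus an unbounded spill area lets a transaction grow past the cache; everything that makes UTM a hardware design (coherence-based conflict detection, the processor snapshot, context-switching support) is left out.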

    Hardware Transactional Memory

    This work shows how hardware transactional memory (HTM) can be implemented to support transactions of arbitrarily large size, while ensuring that small transactions run efficiently. Our implementation handles small transactions similarly to Herlihy and Moss's scheme in that it holds tentative updates in a cache. Unlike their scheme, which uses a special fully associative cache, ours augments the ordinary processor cache and provides a mechanism to handle cache spills of uncommitted transactional data. Consequently, our scheme runs faster for small transactions while correctly handling transactions of arbitrarily large size. Although transactions are small in the common case, we argue that HTM should not restrict the size of transactions, because doing so complicates the programmer/compiler model and precludes some important programs from exploiting transactional memory. We show that the Linux 2.4.19 kernel can be automatically and efficiently “transactified” if boundless transactions can be supported. Our experimental results show that the largest transaction touches over 7000 64-byte cache lines, whereas 99.94% of the transactions touch fewer than 64 cache lines. We further show that synchronized methods in Java can be easily compiled to our HTM scheme, thereby providing the advantages of nonblocking atomicity (including absence of deadlock) in a straightforward fashion. Our HTM scheme for boundless transactions uses an efficiently implementable hardware snapshot and the ordinary set-associative L2 cache extended with less than two bits per cache line. One of the bits tells whether the cached item is part of a transaction (as in the Herlihy-Moss scheme), and all the lines in an associative set share another bit telling whether a line has overflowed from the cache and is now stored in a special overflow area of main memory. We provide empirical results to show that our scheme does not adversely affect the processor pipeline or hinder speculative execution. Singapore-MIT Alliance (SMA).
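
    The metadata described at the end of this abstract (one transactional bit per cache line and one overflow bit shared by each associative set) can be sketched in software. This is an illustration under assumptions rather than the authors' design: the class names, the set-indexing function, and the replacement behavior are invented, and coherence-based conflict detection is omitted.

        # Sketch of the per-line transactional bit and per-set overflow bit.
        # Software illustration only; all names are invented for this example.

        class CacheLine:
            def __init__(self):
                self.tag = None
                self.data = None
                self.transactional = False  # per-line bit: holds tentative transactional data

        class CacheSet:
            def __init__(self, ways=8):
                self.lines = [CacheLine() for _ in range(ways)]
                self.overflowed = False     # per-set bit: some line of this set spilled to memory

        class ToyTransactionalL2:
            def __init__(self, num_sets=256, ways=8):
                self.sets = [CacheSet(ways) for _ in range(num_sets)]
                self.overflow_area = {}     # stands in for the overflow region of main memory

            def _set_for(self, addr):
                return self.sets[addr % len(self.sets)]

            def tx_write(self, addr, data):
                s = self._set_for(addr)
                for line in s.lines:
                    if line.tag in (None, addr):        # free way or hit
                        line.tag, line.data, line.transactional = addr, data, True
                        return
                # No free way: spill the tentative data and mark the whole set.
                s.overflowed = True
                self.overflow_area[addr] = data

            def tx_read(self, addr):
                s = self._set_for(addr)
                for line in s.lines:
                    if line.tag == addr:
                        return line.data
                if s.overflowed:
                    # Miss in an overflowed set: the line may live in the overflow area.
                    return self.overflow_area.get(addr)
                return None                             # ordinary memory access in a real system

    The design choice the abstract highlights is visible here: because the overflow bit is shared by a whole set, the cache needs less than two extra bits per line, and an overflowed set simply redirects misses to the overflow area rather than limiting transaction size.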

    SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models

    The pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in Natural Language Processing (NLP). Instead of directly training on a downstream task, language models are first pre-trained on large datasets with cross-domain knowledge (e.g., Pile, MassiveText, etc.) and then fine-tuned on task-specific data (e.g., natural language generation, text summarization, etc.). Scaling the model and dataset size has helped improve the performance of LLMs, but unfortunately, this also leads to highly prohibitive computational costs. Pre-training LLMs often requires orders of magnitude more FLOPs than fine-tuning, and the model capacity often remains the same between the two phases. To achieve training efficiency with respect to training FLOPs, we propose to decouple the model capacity between the two phases and introduce Sparse Pre-training and Dense Fine-tuning (SPDF). In this work, we show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training (Sparse Pre-training) and then recover the representational capacity by allowing the zeroed weights to learn (Dense Fine-tuning). We demonstrate that we can induce up to 75% sparsity into a 1.3B-parameter GPT-3 XL model, resulting in a 2.5x reduction in pre-training FLOPs, without a significant loss in accuracy on the downstream tasks relative to the dense baseline. By rigorously evaluating multiple downstream tasks, we also establish a relationship between sparsity, task complexity, and dataset size. Our work presents a promising direction to train large GPT models at a fraction of the training FLOPs using weight sparsity, while retaining the benefits of pre-trained textual representations for downstream tasks. Comment: Accepted to the Uncertainty in Artificial Intelligence (UAI) 2023 Conference; 13 pages, 4 figures (Main Paper) + 5 pages (Supplementary Material).
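
    The SPDF idea itself (a fixed unstructured mask during pre-training, then dense updates during fine-tuning) can be sketched in a few lines. The NumPy example below is a conceptual illustration, not the authors' implementation: the tiny linear model, the data, the random sparsity pattern, and the hyperparameters are all placeholders.

        import numpy as np

        # Conceptual sketch of SPDF: pre-train with a fixed unstructured weight
        # mask, then drop the mask for dense fine-tuning. Not the paper's code;
        # the model, data, and hyperparameters are illustrative placeholders.

        rng = np.random.default_rng(0)
        W = rng.normal(scale=0.02, size=(32, 32))   # one weight matrix of the model
        X = rng.normal(size=(256, 32))              # stand-in "pre-training" inputs
        Y = X @ rng.normal(size=(32, 32))           # stand-in targets
        lr = 0.1

        def mse_grad(W, X, Y):
            # Gradient of 0.5/N * ||X W - Y||^2 with respect to W.
            return X.T @ (X @ W - Y) / len(X)

        # --- Sparse pre-training: freeze ~75% of the weights at zero. ---
        sparsity = 0.75
        mask = rng.random(W.shape) >= sparsity      # True for the ~25% of weights that train
        W *= mask
        for _ in range(500):
            W -= lr * mse_grad(W, X, Y) * mask      # masked update: zeroed weights stay zero

        # --- Dense fine-tuning: drop the mask so the zeroed weights can learn. ---
        Xf = rng.normal(size=(64, 32))              # stand-in task-specific data
        Yf = Xf @ rng.normal(size=(32, 32))
        for _ in range(200):
            W -= lr * mse_grad(W, Xf, Yf)           # dense update restores full capacity

    In the paper the FLOP savings come from skipping computation for the masked weights during pre-training; the mask multiplication here only mimics the effect of sparsity on the weights themselves.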

    Influence of dietary conjugated linoleic acid (CLA) and tetradecylthioacetic acid (TTA) on growth, lipid composition and key enzymes of fatty acid oxidation in liver and muscle of Atlantic cod (Gadus morhua L.)

    The aim of the present study was to determine the effects of conjugated linoleic acid (CLA) and tetradecylthioacetic acid (TTA) on growth performance and lipid and fatty acid metabolism in Atlantic cod. The overall objective was to test the hypotheses that CLA and TTA have beneficial effects in cod culture, including decreased liver size and proportion through decreased lipid content, and increased nutritional quality through effects on fatty acid compositions, including accumulation of the bioactive fatty acids CLA and TTA in flesh. Juvenile cod were fed for three months on fish meal and fish oil diets of essentially commercial formulation, but containing either 0.5% or 1% CLA, or 0.5% TTA. The effects of the functional fatty acids on growth, feed efficiency, body proximate composition, liver weight and lipid composition, fatty acid compositions of flesh and liver, and key enzymes of fatty acid oxidation were determined. Dietary CLA and TTA had no effect on growth parameters in cod juveniles, but viscero- and hepato-somatic indices were increased in fish fed 0.5% CLA and TTA, respectively. Proximate composition of whole fish was not affected by CLA or TTA, and there were no major effects of either functional fatty acid on the lipid contents and compositions of liver and flesh. Dietary CLA and TTA were both incorporated into tissue lipids, with CLA deposited to a greater extent in liver, whereas TTA was deposited to a greater extent in flesh. In liver, acyl-CoA oxidase (ACO) activity, but not carnitine palmitoyltransferase-I (CPT-I) activity, was increased by CLA, whereas dietary TTA increased both ACO and CPT-I activities. In contrast, ACO activity was reduced by both CLA and TTA in red and white muscle, whereas CPT-I activity was generally not affected by CLA or TTA in either muscle tissue. Therefore, the results only partially supported the hypotheses tested, as CLA and TTA had few beneficial effects in Atlantic cod and did not enhance growth parameters or improve feed conversion or potential yield through decreased adiposity or liver lipid deposition. However, nutritional quality could be enhanced, and cod fed CLA and/or TTA could be beneficial in the human diet through provision of bioactive fatty acids with no detrimental effects on n-3 PUFA levels.

    Unbounded Transactional Memory

    Hardware transactional memory should support unbounded transactions: transactions of arbitrary size and duration. We describe a hardware implementation of unbounded transactional memory, called UTM, which exploits the common case for performance without sacrificing correctness on transactions whose footprint can be nearly as large as virtual memory. We performed a cycle-accurate simulation of a simplified architecture, called LTM. LTM is based on UTM but is easier to implement, because it does not change the memory subsystem outside of the processor. LTM allows nearly unbounded transactions, whose footprint is limited only by physical memory size and whose duration is limited only by the length of a timeslice. We assess UTM and LTM through microbenchmarking and by automatically converting the SPECjvm98 Java benchmarks and the Linux 2.4.19 kernel to use transactions instead of locks. We use both cycle-accurate simulation and instrumentation to understand benchmark behavior. Our studies show that the common case is small transactions that commit, even when contention is high, but that some applications contain very large transactions. For example, although 99.9% of transactions in the Linux study touch 54 cache lines or fewer, some transactions touch over 8000 cache lines. Our studies also indicate that hardware support is required, because some applications spend over half their time in critical regions. Finally, they suggest that hardware support for transactions can make Java programs run faster than when run using locks and can increase the concurrency of the Linux kernel by as much as a factor of 4 with no additional programming work.
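
    The lock-to-transaction conversion this abstract evaluates can be illustrated with a small software analogy. The sketch below emulates a transaction with a version counter and an optimistic retry loop; it is not the UTM/LTM hardware mechanism or the authors' automatic conversion tool, and every name in it is invented for illustration.

        import threading

        # Toy analogy of converting a lock-based critical region into a transaction:
        # the same critical-region body runs speculatively and commits only if no
        # other commit intervened. Illustration only; not the hardware scheme.

        counter = 0
        lock = threading.Lock()

        def increment_with_lock():
            # Original style: pessimistic mutual exclusion around the update.
            global counter
            with lock:
                counter += 1

        _version = 0
        _commit_lock = threading.Lock()

        def atomic_update(compute_new):
            # Transactional style: take a snapshot, compute speculatively without
            # holding the lock, then commit only if no conflicting commit happened;
            # on a conflict the speculative result is discarded and the "transaction"
            # retries, mirroring abort-and-retry in HTM.
            global counter, _version
            while True:
                snapshot_version, snapshot_value = _version, counter
                new_value = compute_new(snapshot_value)      # speculative work
                with _commit_lock:
                    if _version == snapshot_version:         # no conflict: commit
                        counter, _version = new_value, _version + 1
                        return
                # conflict detected: retry with a fresh snapshot

        def increment_transactional():
            atomic_update(lambda value: value + 1)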