7 research outputs found

    Design and Code Optimization for Systems with Next-generation Racetrack Memories

    Get PDF
    With the rise of computationally expensive application domains such as machine learning, genomics, and fluids simulation, the quest for performance and energy-efficient computing has gained unprecedented momentum. The significant increase in computing and memory devices in modern systems has resulted in an unsustainable surge in energy consumption, a substantial portion of which is attributed to the memory system. The scaling of conventional memory technologies and their suitability for the next-generation system is also questionable. This has led to the emergence and rise of nonvolatile memory ( NVM ) technologies. Today, in different development stages, several NVM technologies are competing for their rapid access to the market. Racetrack memory ( RTM ) is one such nonvolatile memory technology that promises SRAM -comparable latency, reduced energy consumption, and unprecedented density compared to other technologies. However, racetrack memory ( RTM ) is sequential in nature, i.e., data in an RTM cell needs to be shifted to an access port before it can be accessed. These shift operations incur performance and energy penalties. An ideal RTM , requiring at most one shift per access, can easily outperform SRAM . However, in the worst-cast shifting scenario, RTM can be an order of magnitude slower than SRAM . This thesis presents an overview of the RTM device physics, its evolution, strengths and challenges, and its application in the memory subsystem. We develop tools that allow the programmability and modeling of RTM -based systems. For shifts minimization, we propose a set of techniques including optimal, near-optimal, and evolutionary algorithms for efficient scalar and instruction placement in RTMs . For array accesses, we explore schedule and layout transformations that eliminate the longer overhead shifts in RTMs . We present an automatic compilation framework that analyzes static control flow programs and transforms the loop traversal order and memory layout to maximize accesses to consecutive RTM locations and minimize shifts. We develop a simulation framework called RTSim that models various RTM parameters and enables accurate architectural level simulation. Finally, to demonstrate the RTM potential in non-Von-Neumann in-memory computing paradigms, we exploit its device attributes to implement logic and arithmetic operations. As a concrete use-case, we implement an entire hyperdimensional computing framework in RTM to accelerate the language recognition problem. Our evaluation shows considerable performance and energy improvements compared to conventional Von-Neumann models and state-of-the-art accelerators

    How Fast Can We Play Tetris Greedily With Rectangular Pieces?

    Get PDF
    Consider a variant of Tetris played on a board of width ww and infinite height, where the pieces are axis-aligned rectangles of arbitrary integer dimensions, the pieces can only be moved before letting them drop, and a row does not disappear once it is full. Suppose we want to follow a greedy strategy: let each rectangle fall where it will end up the lowest given the current state of the board. To do so, we want a data structure which can always suggest a greedy move. In other words, we want a data structure which maintains a set of O(n)O(n) rectangles, supports queries which return where to drop the rectangle, and updates which insert a rectangle dropped at a certain position and return the height of the highest point in the updated set of rectangles. We show via a reduction to the Multiphase problem [P\u{a}tra\c{s}cu, 2010] that on a board of width w=Θ(n)w=\Theta(n), if the OMv conjecture [Henzinger et al., 2015] is true, then both operations cannot be supported in time O(n1/2ϵ)O(n^{1/2-\epsilon}) simultaneously. The reduction also implies polynomial bounds from the 3-SUM conjecture and the APSP conjecture. On the other hand, we show that there is a data structure supporting both operations in O(n1/2log3/2n)O(n^{1/2}\log^{3/2}n) time on boards of width nO(1)n^{O(1)}, matching the lower bound up to a no(1)n^{o(1)} factor.Comment: Correction of typos and other minor correction

    Design Space Exploration and Resource Management of Multi/Many-Core Systems

    Get PDF
    The increasing demand of processing a higher number of applications and related data on computing platforms has resulted in reliance on multi-/many-core chips as they facilitate parallel processing. However, there is a desire for these platforms to be energy-efficient and reliable, and they need to perform secure computations for the interest of the whole community. This book provides perspectives on the aforementioned aspects from leading researchers in terms of state-of-the-art contributions and upcoming trends

    Statically analyzing the energy efficiency of software product lines

    Get PDF
    Optimizing software to become (more) energy efficient is an important concern for the software industry. Although several techniques have been proposed to measure energy consumption within software engineering, little work has specifically addressed Software Product Lines (SPLs). SPLs are a widely used software development approach, where the core concept is to study the systematic development of products that can be deployed in a variable way, e.g., to include different features for different clients. The traditional approach for measuring energy consumption in SPLs is to generate and individually measure all products, which, given their large number, is impractical. We present a technique, implemented in a tool, to statically estimate the worst-case energy consumption for SPLs. The goal is to reason about energy consumption in all products of a SPL, without having to individually analyze each product. Our technique combines static analysis and worst-case prediction with energy consumption analysis, in order to analyze products in a feature-sensitive manner: a feature that is used in several products is analyzed only once, while the energy consumption is estimated once per product. This paper describes not only our previous work on worst-case prediction, for comprehensibility, but also a significant extension of such work. This extension has been realized in two different axis: firstly, we incorporated in our methodology a simulated annealing algorithm to improve our worst-case energy consumption estimation. Secondly, we evaluated our new approach in four real-world SPLs, containing a total of 99 software products. Our new results show that our technique is able to estimate the worst-case energy consumption with a mean error percentage of 17.3% and standard deviation of 11.2%.This paper acknowledges the support of the Erasmus+ Key Action 2 (Strategic partnership for higher education) project No. 2020-1-PT01-KA203-078646: SusTrainable-Promoting Sustainability as a Fundamental Driver in Software Development Training and Education

    Generalized strictly periodic scheduling analysis, resource optimization, and implementation of adaptive streaming applications

    Get PDF
    This thesis focuses on addressing four research problems in designing embedded streaming systems. Embedded streaming systems are those systems thatprocess a stream of input data coming from the environment and generate a stream of output data going into the environment. For many embeddedstreaming systems, the timing is a critical design requirement, in which the correct behavior depends on both the correctness of output data and on the time at which the data is produced. An embedded streaming system subjected to such a timing requirement is called a real-time system. Some examples of real-time embedded streaming systems can be found in various autonomous mobile systems, such as planes, self-driving cars, and drones. To handle the tight timing requirements of such real-time embedded streaming systems, modern embedded systems have been equipped with hardware platforms, the so-called Multi-Processor Systems-on-Chip (MPSoC), that contain multiple processors, memories, interconnections, and other hardware peripherals on a single chip, to benefit from parallel execution. To efficiently exploit the computational capacity of an MPSoC platform, a streaming application which is going to be executed on the MPSoC platform must be expressed primarily in a parallel fashion, i.e., the application is represented as a set of parallel executing and communicating tasks. Then, the main challenge is how to schedule the tasks spatially, i.e., task mapping, and temporally, i.e., task scheduling, on the MPSoC platform such that all timing requirements are satisfied while making efficient utilization of available resources (e.g, processors, memory, energy, etc.) on the platform. Another challenge is how to implement and run the mapped and scheduled application tasks on the MPSoC platform. This thesis proposes several techniques to address the aforementioned two challenges.NWOComputer Systems, Imagery and Medi
    corecore