19 research outputs found

    Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling

    Performance gains in memory have traditionally been obtained by increasing memory bus widths and speeds. The diminishing returns of such techniques have led to the proposal of an alternate architecture, the Fully-Buffered DIMM. This new standard replaces the conventional memory bus with a narrow, high-speed interface between the memory controller and the DIMMs. This paper examines how traditional DDRx based memory controller policies for scheduling and row buffer management perform on a Fully-Buffered DIMM memory architecture. The split-bus architecture used by FBDIMM systems results in an average improvement of 7% in latency and 10% in bandwidth at higher utilizations. On the other hand, at lower utilizations, the increased cost of serialization resulted in a degradation in latency and bandwidth of 25% and 10% respectively. The split-bus architecture also makes the system performance sensitive to the ratio of read and write traffic in the workload. In larger configurations, we found that the FBDIMM system performance was more sensitive to usage of the FBDIMM links than to DRAM bank availability. In general, FBDIMM performance is similar to that of DDRx systems, and provides better performance characteristics at higher utilization, making it a relatively inexpensive mechanism for scaling capacity at higher bandwidth requirements. The mechanism is also largely insensitive to scheduling policies, provided certain ground rules are obeyed.
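    The serialization cost described above can be illustrated with a toy latency model. All parameter values and function names below are illustrative assumptions, not figures from the paper: a wide parallel bus moves a cache line in a few slow beats, while a narrow serial link needs many fast beats plus packet framing, and that framing overhead shows up directly as latency when the link is otherwise idle.

    ```python
    # Toy model (illustrative parameters, not from the paper): transfer time
    # in picoseconds for a 64-byte cache line on a wide DDRx-style bus vs. a
    # narrow, faster FBDIMM-style serial link with per-packet framing bytes.

    def parallel_bus_ps(payload_bytes, bus_width_bytes=8, beat_ps=2500):
        # One bus-width beat per cycle; no framing on a dedicated parallel bus.
        beats = -(-payload_bytes // bus_width_bytes)  # ceiling division
        return beats * beat_ps

    def serial_link_ps(payload_bytes, lane_bytes=1, beat_ps=400, frame_bytes=8):
        # The narrow link runs faster per beat but must serialize the payload
        # plus packet framing overhead.
        beats = -(-(payload_bytes + frame_bytes) // lane_bytes)
        return beats * beat_ps

    # At low utilization the serialization and framing cost appears as pure
    # added latency: parallel_bus_ps(64) -> 20000, serial_link_ps(64) -> 28800
    ```

    At high utilization the picture reverses, as the paper reports: the split read/write links carry traffic concurrently, so effective bandwidth improves even though single-transaction latency is worse.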

    DRAMsim: A Memory System Simulator

    As memory accesses become slower with respect to the processor and consume more power with increasing memory size, the focus on memory performance and power consumption has become increasingly important. With the trend to develop multi-threaded, multi-core processors, the demands on the memory system will continue to scale. However, determining the optimal memory system configuration is non-trivial. The memory system performance is sensitive to a large number of parameters. Each of these parameters takes on a number of values, and the parameters interact in ways that make overall trends difficult to discern. A comparison of memory system architectures becomes even harder when we add the dimensions of power consumption and manufacturing cost. Unfortunately, there is a lack of tools in the public domain that support such studies. Therefore, we introduce DRAMsim, a detailed and highly configurable C-based memory system simulator to fill this gap. DRAMsim implements detailed timing models for a variety of existing memories, including SDRAM, DDR, DDR2, DRDRAM and FB-DIMM, with the capability to easily vary their parameters. It also models the power consumption of SDRAM and its derivatives. It can be used as a standalone simulator or as part of a more comprehensive system-level model. We have successfully integrated DRAMsim into a variety of simulators including MASE[15], Sim-alpha[14], BOCHS[2] and GEMS[13]. The simulator can be downloaded from www.ece.umd.edu/dramsim.
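    The core of any DRAM timing model is tracking per-bank row-buffer state: a request to an already-open row pays only the column-access latency, while a miss pays precharge plus activate plus column access. The sketch below is a generic illustration of that idea; the class shape, names, and timing values are assumptions for illustration and are not DRAMsim's actual API.

    ```python
    # Generic row-buffer timing sketch (hypothetical, not DRAMsim's API).
    # Latencies are in nanoseconds and chosen only for illustration.

    class Bank:
        def __init__(self, tRP=15, tRCD=15, tCAS=15):
            self.open_row = None          # no row open after initialization
            self.tRP, self.tRCD, self.tCAS = tRP, tRCD, tCAS

        def access(self, row):
            """Return the latency of accessing `row`, updating bank state."""
            if row == self.open_row:      # row-buffer hit: column access only
                return self.tCAS
            cost = self.tRCD + self.tCAS  # activate the row, then column access
            if self.open_row is not None: # conflict: precharge the old row first
                cost += self.tRP
            self.open_row = row
            return cost
    ```

    A row-buffer management policy (open-page vs. close-page) then amounts to deciding whether to leave `open_row` set after each access, which is exactly the kind of parameter interaction the abstract says makes overall trends hard to discern.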

    The Performance and Energy Consumption of Embedded Real-Time Operating Systems

    This paper presents the modeling of embedded systems with SimBed, an execution-driven simulation testbed that measures the execution behavior and power consumption of embedded applications and RTOSs by executing them on an accurate architectural model of a microcontroller with simulated real-time stimuli. We briefly describe the simulation environment and present a study that compares three RTOSs: µC/OS-II, a popular public-domain embedded real-time operating system; Echidna, a sophisticated, industrial-strength (commercial) RTOS; and NOS, a bare-bones multirate task scheduler reminiscent of typical “roll-your-own” RTOSs found in many commercial embedded systems. The microcontroller simulated in this study is the Motorola M-CORE processor: a low-power, 32-bit CPU core with 16-bit instructions, running at 20 MHz. Our simulations show what happens when RTOSs are pushed beyond their limits, and they depict situations in which unexpected interrupts or unaccounted-for task invocations disrupt timing, even when the CPU is lightly loaded. In general, there appears to be no clear winner in timing accuracy between preemptive systems and cooperative systems. The power-consumption measurements show that RTOS overhead is a factor of two to four higher than it needs to be, compared to the energy consumption of the minimal scheduler. In addition, poorly designed idle loops can cause the system to double its energy consumption—energy that could be saved by a simple hardware sleep mechanism.

    Assortative mixing in Protein Contact Networks and protein folding kinetics

    Starting from linear chains of amino acids, the spontaneous folding of proteins into their elaborate three-dimensional structures is one of the remarkable examples of biological self-organization. We investigated native state structures of 30 single-domain, two-state proteins, from a complex networks perspective, to understand the role of topological parameters in proteins' folding kinetics, at two length scales: as ``Protein Contact Networks (PCNs)'' and their corresponding ``Long-range Interaction Networks (LINs)'' constructed by ignoring the short-range interactions. Our results show that both PCNs and LINs exhibit the exceptional topological property of ``assortative mixing'' that is absent in all other biological and technological networks studied so far. We show that the degree distribution of these contact networks is partly responsible for the observed assortativity. The coefficient of assortativity also shows a positive correlation with the rate of protein folding at both short and long contact scales, whereas the clustering coefficients of only the LINs exhibit a negative correlation. The results indicate that the general topological parameters of these naturally-evolved protein networks can effectively represent the structural and functional properties required for fast information transfer among the residues facilitating biochemical/kinetic functions, such as allostery, stability, and the rate of folding. Comment: Published in Bioinformatics
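    The coefficient of assortativity referred to above is Newman's degree assortativity: the Pearson correlation of the degrees at the two ends of each edge, with every undirected edge counted in both directions. A minimal self-contained implementation of that standard formula (not code from the paper) looks like this:

    ```python
    # Newman's degree assortativity coefficient r for an undirected graph,
    # given as a list of (u, v) edges. r > 0 means high-degree nodes tend to
    # attach to other high-degree nodes (assortative mixing); r < 0 means
    # hubs attach to low-degree nodes (disassortative).

    def degree_assortativity(edges):
        deg = {}
        for u, v in edges:
            deg[u] = deg.get(u, 0) + 1
            deg[v] = deg.get(v, 0) + 1
        # Endpoint-degree pairs, each edge counted in both directions.
        xs, ys = [], []
        for u, v in edges:
            xs += [deg[u], deg[v]]
            ys += [deg[v], deg[u]]
        n = len(xs)
        mean = sum(xs) / n          # xs and ys are permutations: same mean
        cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
        var = sum((x - mean) ** 2 for x in xs) / n
        return cov / var            # same variance on both axes
    ```

    For example, a star graph, where one hub connects only to degree-1 leaves, is maximally disassortative and gives r = -1; the paper's finding is that protein contact networks instead sit at positive r, unlike most other studied real-world networks.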

    ARCHITECTURAL SUPPORT FOR EMBEDDED OPERATING SYSTEMS

    This thesis investigates hardware support for managing time, events, and process scheduling in embedded operating systems. An otherwise normal content-addressable memory that is tailored to handle the most basic functions of a typical RTOS, the CCAM (configurable content-addressable memory) turns what are usually O(n) tasks into O(1) tasks using the parallelism inherent in a hardware search implementation. The mechanism is modelled in the context of the M-CORE embedded microarchitecture, several variations upon µC/OS-II, a popular open-source real-time operating system, and Echidna, a commercial real-time operating system. The mechanism improves the real-time behavior of systems by reducing the overhead of the RTOS by 20% and in some cases reduces energy consumption by 25%. This latter feature is due to the reduced number of instructions fetched and executed, even though the energy cost of one CCAM access is much higher than the energy cost of a single instruction. The performance and energy benefits come with a modest price: an increase in die area of roughly 10%. The CCAM is orthogonal to the instruction set (it is accessed via memory-mapped I/O load/store instructions) and offers features used by most RTOSes.
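    The O(n)-to-O(1) claim can be made concrete by sketching one of the RTOS hot paths a CCAM accelerates: selecting the highest-priority ready task. The task representation below is a hypothetical illustration, not the thesis's actual interface; the point is that software must loop over every task control block, whereas a content-addressable memory performs the same comparison across all entries in parallel, so the CPU sees a single access.

    ```python
    # Software model of the O(n) operation a CCAM replaces (hypothetical task
    # representation): find the highest-priority ready task. A CAM compares
    # all entries simultaneously in hardware, eliminating this loop.

    def pick_next_task(tasks):
        """tasks: list of (priority, ready) tuples; lower value = higher
        priority. Returns the index of the best ready task, or None."""
        best = None
        for tid, (prio, ready) in enumerate(tasks):  # the scan a CAM removes
            if ready and (best is None or prio < tasks[best][0]):
                best = tid
        return best
    ```

    The same pattern applies to the other operations the thesis names, such as finding the next timer to expire or the tasks waiting on a posted event: each is a search over all entries, which is exactly what CAM hardware parallelizes.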

    Understanding and Optimizing High-Speed Serial Memory System Protocols

    Performance improvements in memory systems have traditionally been obtained by scaling data bus width and speed. Maintaining this trend while continuing to satisfy memory capacity demands of server systems is challenging due to the electrical constraints posed by high-speed parallel buses. To satisfy the dual needs of memory bandwidth and memory system capacity, new memory system protocols have been proposed by the leaders in the memory system industry. These protocols replace the conventional memory bus interface between the memory controller and the memory modules with narrow, high-speed, uni-directional point-to-point interfaces. The memory controller communicates with the memory modules using a packet-based protocol, which is translated to the conventional DRAM commands at the memory modules. Memory latency has been widely accepted as one of the key performance bottlenecks in computer architecture. Hence, any changes to memory sub-system architecture and protocol can have a significant impact on overall system performance. In the first part of this dissertation, we carried out an extensive study and analysis of the behavior of the newly proposed memory architecture to identify clearly how it impacts memory sub-system performance and what the key performance limiters are. We then used the insights gained from this analysis to propose two optimization techniques focused on improving the performance of the memory system. We first evaluated the performance of the current de facto serial memory system standard, FBDIMM (Fully Buffered DIMM), with respect to the conventional wide-bus architectures that have been in use for decades. We found that the relative performance of a FBDIMM system with respect to a conventional DDRx system was a strong function of the bandwidth utilization, with FBDIMM systems doing worse in low utilization systems and often out-performing DDRx systems at higher system utilizations.
    More interestingly, we found that many of the memory controller policies that have been in use in DDRx systems performed similarly on a FBDIMM system. Memory latency typically has a significant impact on overall system performance. FBDIMM systems, by using daisy chaining and serialization, increase the default latency cost of a memory transaction. In a longer memory channel, i.e. a channel with 8 DIMMs of memory, inefficient link utilization and memory controller scheduling policies can contribute to a further reduction in system performance. We propose two main optimization techniques to tackle these inefficiencies: reordering data on the return link and buffering at the memory module. Both these policies lower read latency by 10-20% and improve application performance by 2-25%.
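    The return-link reordering idea can be sketched with a minimal queueing model, abstracted away from FBDIMM framing details; the function names and unit timings are illustrative assumptions, not the dissertation's implementation. If response B's data is ready before response A's, letting B use the shared northbound link first avoids link cycles that strict in-order return would leave idle.

    ```python
    # Toy model of a shared return link (illustrative, not from the thesis).
    # ready_times[i] is the cycle at which DIMM i's read data becomes ready;
    # each transfer occupies the link for `xfer` cycles.

    def drain_in_order(ready_times, xfer=1):
        # Strict in-order return: the link stalls waiting for the next
        # response in request order, even if later data is already ready.
        t, finish = 0, []
        for r in ready_times:
            t = max(t, r) + xfer
            finish.append(t)
        return finish

    def drain_reordered(ready_times, xfer=1):
        # Earliest-ready-first: fill otherwise-idle link slots with whatever
        # data is already available.
        t, finish = 0, []
        for r in sorted(ready_times):
            t = max(t, r) + xfer
            finish.append(t)
        return finish
    ```

    With ready times [5, 0, 0], in-order draining finishes at cycles [6, 7, 8], while reordering finishes at [1, 2, 6]: the two early responses no longer wait behind the slow one, which is the latency reduction the reordering optimization targets.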

    Hardware support for real-time operating systems

    (Work done while Paul was at UMD.) The growing complexity of embedded applications and pressure on time-to-market has resulted in the increasing use of embedded real-time operating systems. Unfortunately, RTOSes can introduce a significant performance degradation. This paper presents the Real-Time Task Manager (RTM)—a processor extension that minimizes the performance drawbacks associated with RTOSes. The RTM accomplishes this by supporting, in hardware, a few of the common RTOS operations that are performance bottlenecks: task scheduling, time management, and event management. By exploiting the inherent parallelism of these operations, the RTM completes them in constant time, thereby significantly reducing RTOS overhead. It decreases both the processor time used by the RTOS and the maximum response time by an order of magnitude.