19 research outputs found

    Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling

    Performance gains in memory have traditionally been obtained by increasing memory bus widths and speeds. The diminishing returns of such techniques have led to the proposal of an alternate architecture, the Fully-Buffered DIMM. This new standard replaces the conventional memory bus with a narrow, high-speed interface between the memory controller and the DIMMs. This paper examines how traditional DDRx based memory controller policies for scheduling and row buffer management perform on a Fully-Buffered DIMM memory architecture. The split-bus architecture used by FBDIMM systems results in an average improvement of 7% in latency and 10% in bandwidth at higher utilizations. On the other hand, at lower utilizations, the increased cost of serialization resulted in a degradation in latency and bandwidth of 25% and 10% respectively. The split-bus architecture also makes the system performance sensitive to the ratio of read and write traffic in the workload. In larger configurations, we found that the FBDIMM system performance was more sensitive to usage of the FBDIMM links than to DRAM bank availability. In general, FBDIMM performance is similar to that of DDRx systems, and provides better performance characteristics at higher utilization, making it a relatively inexpensive mechanism for scaling capacity at higher bandwidth requirements. The mechanism is also largely insensitive to scheduling policies, provided certain ground rules are obeyed.
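    The serialization cost described above can be illustrated with a toy latency model. All parameter values and function names below are illustrative assumptions, not figures from the paper: a wide parallel bus moves a cache line in a few slow beats, while a narrow serial link needs many fast beats plus packet framing, and that framing overhead shows up directly as latency when the link is otherwise idle.

    ```python
    # Toy model (illustrative parameters, not from the paper): transfer time
    # in picoseconds for a 64-byte cache line on a wide DDRx-style bus vs. a
    # narrow, faster FBDIMM-style serial link with per-packet framing bytes.

    def parallel_bus_ps(payload_bytes, bus_width_bytes=8, beat_ps=2500):
        # One bus-width beat per cycle; no framing on a dedicated parallel bus.
        beats = -(-payload_bytes // bus_width_bytes)  # ceiling division
        return beats * beat_ps

    def serial_link_ps(payload_bytes, lane_bytes=1, beat_ps=400, frame_bytes=8):
        # The narrow link runs faster per beat but must serialize the payload
        # plus packet framing overhead.
        beats = -(-(payload_bytes + frame_bytes) // lane_bytes)
        return beats * beat_ps

    # At low utilization the serialization and framing cost appears as pure
    # added latency: parallel_bus_ps(64) -> 20000, serial_link_ps(64) -> 28800
    ```

    At high utilization the picture reverses, as the paper reports: the split read/write links carry traffic concurrently, so effective bandwidth improves even though single-transaction latency is worse.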

    DRAMsim: A Memory System Simulator

    As memory accesses become slower with respect to the processor and consume more power with increasing memory size, the focus on memory performance and power consumption has become increasingly important. With the trend to develop multi-threaded, multi-core processors, the demands on the memory system will continue to scale. However, determining the optimal memory system configuration is non-trivial. The memory system performance is sensitive to a large number of parameters. Each of these parameters takes on a number of values, and the parameters interact in ways that make overall trends difficult to discern. A comparison of memory system architectures becomes even harder when we add the dimensions of power consumption and manufacturing cost. Unfortunately, there is a lack of tools in the public domain that support such studies. Therefore, we introduce DRAMsim, a detailed and highly configurable C-based memory system simulator to fill this gap. DRAMsim implements detailed timing models for a variety of existing memories, including SDRAM, DDR, DDR2, DRDRAM and FB-DIMM, with the capability to easily vary their parameters. It also models the power consumption of SDRAM and its derivatives. It can be used as a standalone simulator or as part of a more comprehensive system-level model. We have successfully integrated DRAMsim into a variety of simulators including MASE[15], Sim-alpha[14], BOCHS[2] and GEMS[13]. The simulator can be downloaded from www.ece.umd.edu/dramsim.
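    The core of any DRAM timing model is tracking per-bank row-buffer state: a request to an already-open row pays only the column-access latency, while a miss pays precharge plus activate plus column access. The sketch below is a generic illustration of that idea; the class shape, names, and timing values are assumptions for illustration and are not DRAMsim's actual API.

    ```python
    # Generic row-buffer timing sketch (hypothetical, not DRAMsim's API).
    # Latencies are in nanoseconds and chosen only for illustration.

    class Bank:
        def __init__(self, tRP=15, tRCD=15, tCAS=15):
            self.open_row = None          # no row open after initialization
            self.tRP, self.tRCD, self.tCAS = tRP, tRCD, tCAS

        def access(self, row):
            """Return the latency of accessing `row`, updating bank state."""
            if row == self.open_row:      # row-buffer hit: column access only
                return self.tCAS
            cost = self.tRCD + self.tCAS  # activate the row, then column access
            if self.open_row is not None: # conflict: precharge the old row first
                cost += self.tRP
            self.open_row = row
            return cost
    ```

    A row-buffer management policy (open-page vs. close-page) then amounts to deciding whether to leave `open_row` set after each access, which is exactly the kind of parameter interaction the abstract says makes overall trends hard to discern.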

    The Performance and Energy Consumption of Embedded Real-Time Operating Systems

    This paper presents the modeling of embedded systems with SimBed, an execution-driven simulation testbed that measures the execution behavior and power consumption of embedded applications and RTOSs by executing them on an accurate architectural model of a microcontroller with simulated real-time stimuli. We briefly describe the simulation environment and present a study that compares three RTOSs: µC/OS-II, a popular public-domain embedded real-time operating system; Echidna, a sophisticated, industrial-strength (commercial) RTOS; and NOS, a bare-bones multirate task scheduler reminiscent of typical “roll-your-own” RTOSs found in many commercial embedded systems. The microcontroller simulated in this study is the Motorola M-CORE processor: a low-power, 32-bit CPU core with 16-bit instructions, running at 20 MHz. Our simulations show what happens when RTOSs are pushed beyond their limits, and they depict situations in which unexpected interrupts or unaccounted-for task invocations disrupt timing, even when the CPU is lightly loaded. In general, there appears to be no clear winner in timing accuracy between preemptive systems and cooperative systems. The power-consumption measurements show that RTOS overhead is a factor of two to four higher than it needs to be, compared to the energy consumption of the minimal scheduler. In addition, poorly designed idle loops can cause the system to double its energy consumption—energy that could be saved by a simple hardware sleep mechanism.

    Assortative mixing in Protein Contact Networks and protein folding kinetics

    Starting from linear chains of amino acids, the spontaneous folding of proteins into their elaborate three-dimensional structures is one of the remarkable examples of biological self-organization. We investigated native state structures of 30 single-domain, two-state proteins, from a complex networks perspective, to understand the role of topological parameters in proteins' folding kinetics, at two length scales: as ``Protein Contact Networks (PCNs)'' and their corresponding ``Long-range Interaction Networks (LINs)'' constructed by ignoring the short-range interactions. Our results show that both PCNs and LINs exhibit the exceptional topological property of ``assortative mixing'' that is absent in all other biological and technological networks studied so far. We show that the degree distribution of these contact networks is partly responsible for the observed assortativity. The coefficient of assortativity also shows a positive correlation with the rate of protein folding at both short and long contact scales, whereas the clustering coefficients of only the LINs exhibit a negative correlation. The results indicate that the general topological parameters of these naturally-evolved protein networks can effectively represent the structural and functional properties required for fast information transfer among the residues facilitating biochemical/kinetic functions, such as allostery, stability, and the rate of folding. Comment: Published in Bioinformatics
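    The coefficient of assortativity referred to above is Newman's degree assortativity: the Pearson correlation of the degrees at the two ends of each edge, with every undirected edge counted in both directions. A minimal self-contained implementation of that standard formula (not code from the paper) looks like this:

    ```python
    # Newman's degree assortativity coefficient r for an undirected graph,
    # given as a list of (u, v) edges. r > 0 means high-degree nodes tend to
    # attach to other high-degree nodes (assortative mixing); r < 0 means
    # hubs attach to low-degree nodes (disassortative).

    def degree_assortativity(edges):
        deg = {}
        for u, v in edges:
            deg[u] = deg.get(u, 0) + 1
            deg[v] = deg.get(v, 0) + 1
        # Endpoint-degree pairs, each edge counted in both directions.
        xs, ys = [], []
        for u, v in edges:
            xs += [deg[u], deg[v]]
            ys += [deg[v], deg[u]]
        n = len(xs)
        mean = sum(xs) / n          # xs and ys are permutations: same mean
        cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
        var = sum((x - mean) ** 2 for x in xs) / n
        return cov / var            # same variance on both axes
    ```

    For example, a star graph, where one hub connects only to degree-1 leaves, is maximally disassortative and gives r = -1; the paper's finding is that protein contact networks instead sit at positive r, unlike most other studied real-world networks.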

    ARCHITECTURAL SUPPORT FOR EMBEDDED OPERATING SYSTEMS

    This thesis investigates hardware support for managing time, events, and process scheduling in embedded operating systems. An otherwise normal content-addressable memory that is tailored to handle the most basic functions of a typical RTOS, the CCAM (configurable content-addressable memory) turns what are usually O(n) tasks into O(1) tasks using the parallelism inherent in a hardware search implementation. The mechanism is modelled in the context of the M-CORE embedded microarchitecture, several variations upon µC/OS-II, a popular open-source real-time operating system, and Echidna, a commercial real-time operating system. The mechanism improves the real-time behavior of systems by reducing the overhead of the RTOS by 20% and in some cases reduces energy consumption by 25%. This latter feature is due to the reduced number of instructions fetched and executed, even though the energy cost of one CCAM access is much higher than the energy cost of a single instruction. The performance and energy benefits come with a modest price: an increase in die area of roughly 10%. The CCAM is orthogonal to the instruction set (it is accessed via memory-mapped I/O load/store instructions) and offers features used by most RTOSes.
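    The O(n)-to-O(1) claim can be made concrete by sketching one of the RTOS hot paths a CCAM accelerates: selecting the highest-priority ready task. The task representation below is a hypothetical illustration, not the thesis's actual interface; the point is that software must loop over every task control block, whereas a content-addressable memory performs the same comparison across all entries in parallel, so the CPU sees a single access.

    ```python
    # Software model of the O(n) operation a CCAM replaces (hypothetical task
    # representation): find the highest-priority ready task. A CAM compares
    # all entries simultaneously in hardware, eliminating this loop.

    def pick_next_task(tasks):
        """tasks: list of (priority, ready) tuples; lower value = higher
        priority. Returns the index of the best ready task, or None."""
        best = None
        for tid, (prio, ready) in enumerate(tasks):  # the scan a CAM removes
            if ready and (best is None or prio < tasks[best][0]):
                best = tid
        return best
    ```

    The same pattern applies to the other operations the thesis names, such as finding the next timer to expire or the tasks waiting on a posted event: each is a search over all entries, which is exactly what CAM hardware parallelizes.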

    Understanding and Optimizing High-Speed Serial Memory System Protocols

    Performance improvements in memory systems have traditionally been obtained by scaling data bus width and speed. Maintaining this trend while continuing to satisfy memory capacity demands of server systems is challenging due to the electrical constraints posed by high-speed parallel buses. To satisfy the dual needs of memory bandwidth and memory system capacity, new memory system protocols have been proposed by the leaders in the memory system industry. These protocols replace the conventional memory bus interface between the memory controller and the memory modules with narrow, high-speed, uni-directional point-to-point interfaces. The memory controller communicates with the memory modules using a packet-based protocol, which is translated to the conventional DRAM commands at the memory modules. Memory latency has been widely accepted as one of the key performance bottlenecks in computer architecture. Hence, any changes to memory sub-system architecture and protocol can have a significant impact on overall system performance. In the first part of this dissertation, we carried out an extensive study and analysis of the behavior of the newly proposed memory architecture to identify clearly how it impacts memory sub-system performance and what the key performance limiters are. We then used the insights gained from this analysis to propose two optimization techniques focused on improving the performance of the memory system. We first evaluated the performance of the current de facto serial memory system standard, FBDIMM (Fully Buffered DIMM), with respect to the conventional wide-bus architectures that have been in use for decades. We found that the relative performance of a FBDIMM system with respect to a conventional DDRx system was a strong function of the bandwidth utilization, with FBDIMM systems doing worse in low utilization systems and often out-performing DDRx systems at higher system utilizations.
    More interestingly, we found that many of the memory controller policies that have been in use in DDRx systems performed similarly on a FBDIMM system. Memory latency typically has a significant impact on overall system performance. FBDIMM systems, by using daisy chaining and serialization, increase the default latency cost of a memory transaction. In a longer memory channel, i.e. a channel with 8 DIMMs of memory, inefficient link utilization and memory controller scheduling policies can contribute to a further reduction in system performance. We propose two main optimization techniques to tackle these inefficiencies: reordering data on the return link and buffering at the memory module. Both these policies lower read latency by 10-20% and improve application performance by 2-25%.
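    The return-link reordering idea can be sketched with a minimal queueing model, abstracted away from FBDIMM framing details; the function names and unit timings are illustrative assumptions, not the dissertation's implementation. If response B's data is ready before response A's, letting B use the shared northbound link first avoids link cycles that strict in-order return would leave idle.

    ```python
    # Toy model of a shared return link (illustrative, not from the thesis).
    # ready_times[i] is the cycle at which DIMM i's read data becomes ready;
    # each transfer occupies the link for `xfer` cycles.

    def drain_in_order(ready_times, xfer=1):
        # Strict in-order return: the link stalls waiting for the next
        # response in request order, even if later data is already ready.
        t, finish = 0, []
        for r in ready_times:
            t = max(t, r) + xfer
            finish.append(t)
        return finish

    def drain_reordered(ready_times, xfer=1):
        # Earliest-ready-first: fill otherwise-idle link slots with whatever
        # data is already available.
        t, finish = 0, []
        for r in sorted(ready_times):
            t = max(t, r) + xfer
            finish.append(t)
        return finish
    ```

    With ready times [5, 0, 0], in-order draining finishes at cycles [6, 7, 8], while reordering finishes at [1, 2, 6]: the two early responses no longer wait behind the slow one, which is the latency reduction the reordering optimization targets.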

    Hardware support for real-time operating systems

    (Work done while Paul was at UMD.) The growing complexity of embedded applications and pressure on time-to-market has resulted in the increasing use of embedded real-time operating systems. Unfortunately, RTOSes can introduce a significant performance degradation. This paper presents the Real-Time Task Manager (RTM)—a processor extension that minimizes the performance drawbacks associated with RTOSes. The RTM accomplishes this by supporting, in hardware, a few of the common RTOS operations that are performance bottlenecks: task scheduling, time management, and event management. By exploiting the inherent parallelism of these operations, the RTM completes them in constant time, thereby significantly reducing RTOS overhead. It decreases both the processor time used by the RTOS and the maximum response time by an order of magnitude.