19 research outputs found

    Study of design tradeoffs of DRAM and SRAM memories, using HSPICE computer simulation

    Semiconductor random access memories are complex systems that can be described by performance parameters such as memory cycle time, access delays, storage capacity, bit packing density, chip area and retention time. In this thesis, tradeoffs between cycle time, chip area, and storage size as reflected by bit line capacitance (Cbl) were studied as a function of particular design variables: memory cell capacitance (Cc); CMOS flip-flop sense amplifier (SA) transistor sizes; and the sizes of the precharge (PC) and word line (WL) switches. Performance was optimized using the circuit simulation software HSPICE to observe DRAM and SRAM waveforms. With TSMC 0.18 micron technology, minimum cycle times of 2.11 ns (DRAM) and 1.18 ns (SRAM) were achieved (Cbl = 100 fF) by optimizing the kr values of the SA transistors for a fixed SA area of 1 micrometer and finding the optimum PC switch width (1.6 micrometer). To maintain the same cycle time when the Cbl of both SRAM and DRAM increased by a factor of N, the required total chip area was found to increase by a factor of N^2. For a constant memory capacity, the ratio of the change in the sense amplifier area to the change in memory cycle time for DRAM was found to be between 1.25 and three times that of SRAM, varying somewhat with cycle time. To optimize SRAM cycle time, the criterion of a bit line difference of 10% of 3 V determined when to terminate the connection of the bit line to the SRAM cell, so as to avoid the loading of the parasitic cell capacitance Cc by the larger Cbl

    Energy Efficient Neocortex-Inspired Systems with On-Device Learning

    Shifting the compute workloads from cloud toward edge devices can significantly improve the overall latency for inference and learning. However, this paradigm shift exacerbates the resource constraints on the edge devices. Neuromorphic computing architectures, inspired by neural processes, are natural substrates for edge devices. They offer co-located memory, in-situ training, energy efficiency, high memory density, and compute capacity in a small form factor. Owing to these features, in the recent past, there has been a rapid proliferation of hybrid CMOS/Memristor neuromorphic computing systems. However, most of these systems offer limited plasticity, target either spatial or temporal input streams, and are not demonstrated on large scale heterogeneous tasks. There is a critical knowledge gap in designing scalable neuromorphic systems that can support hybrid plasticity for spatio-temporal input streams on edge devices. This research proposes Pyragrid, a low latency and energy efficient neuromorphic computing system for processing spatio-temporal information natively on the edge. Pyragrid is a full-scale custom hybrid CMOS/Memristor architecture with analog computational modules and an underlying digital communication scheme. Pyragrid is designed for hierarchical temporal memory, a biomimetic sequence memory algorithm inspired by the neocortex. It features a novel synthetic synapse representation that enables dynamic synaptic pathways with reduced memory usage and interconnects. The dynamic growth in the synaptic pathways is emulated in the memristor device physical behavior, while the synaptic modulation is enabled through a custom training scheme optimized for area and power. Pyragrid features data reuse, in-memory computing, and event-driven sparse local computing to reduce data movement by ~44x and improve system throughput and power efficiency by ~3x and ~161x, respectively, over a custom CMOS digital design. The innate sparsity in Pyragrid results in overall robustness to noise and device failure, particularly when processing visual input and predicting time series sequences. Porting the proposed system to edge devices can enhance their computational capability, response time, and battery life

    Predicting power scalability in a reconfigurable platform

    This thesis focuses on the evolution of digital hardware systems. A reconfigurable platform is proposed and analysed based on thin-body, fully-depleted silicon-on-insulator Schottky-barrier transistors with metal gates and silicide source/drain (TBFDSBSOI). These offer the potential for simplified processing that will allow them to reach ultimate nanoscale gate dimensions. Technology CAD was used to show that the threshold voltage in TBFDSBSOI devices will be controllable by gate potentials that scale down with the channel dimensions while remaining within appropriate gate reliability limits. SPICE simulations determined that the magnitude of the threshold shift predicted by TCAD software would be sufficient to control the logic configuration of a simple, regular array of these TBFDSBSOI transistors as well as to constrain its overall subthreshold power growth. Using these devices, a reconfigurable platform is proposed based on a regular 6-input, 6-output NOR LUT block in which the logic and configuration functions of the array are mapped onto separate gates of the double-gate device. A new analytic model of the relationship between power (P), area (A) and performance (T) has been developed based on a simple VLSI complexity metric of the form AT^σ = constant. As σ defines the performance “return” gained as a result of an increase in area, it also represents a bound on the architectural options available in power-scalable digital systems. This analytic model was used to determine that simple computing functions mapped to the reconfigurable platform will exhibit continuous power-area-performance scaling behavior. A number of simple arithmetic circuits were mapped to the array and their delay and subthreshold leakage analysed over a representative range of supply and threshold voltages, thus determining a worst-case range for the device/circuit-level parameters of the model. Finally, an architectural simulation was built in VHDL-AMS. The frequency scaling described by σ, combined with the device/circuit-level parameters, predicts the overall power and performance scaling of parallel architectures mapped to the array
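
    The complexity metric quoted in the abstract above can be written out as a short worked relation. This is only a sketch: it assumes σ is an exponent on T (the extracted text shows the symbols side by side), and the constant c and the area ratio k are illustrative symbols that do not come from the thesis.

```latex
A\,T^{\sigma} = c
\;\Longrightarrow\;
T = \left(\frac{c}{A}\right)^{1/\sigma},
\qquad
\frac{T_2}{T_1} = \left(\frac{A_1}{A_2}\right)^{1/\sigma} = k^{-1/\sigma}
\quad\text{for } A_2 = k\,A_1 .
```

    Under this reading, with σ = 2, doubling the area (k = 2) would shorten T by a factor of about 0.71, which is the sense in which σ sets the performance "return" on added area.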

    Modern DRAM Memory Systems: Performance Analysis and Scheduling Algorithm

    The performance characteristics of modern DRAM memory systems are impacted by two primary attributes: device datarate and row cycle time. Modern DRAM device datarates and row cycle times are scaling at different rates with each successive generation of DRAM devices. As a result, the performance characteristics of modern DRAM memory systems are becoming more difficult to evaluate at the same time that they are increasingly limiting the performance of modern computer systems. In this work, a performance evaluation framework that enables abstract performance analysis of DRAM memory systems is presented. The performance evaluation framework enables the performance characterization of memory systems while fully accounting for the effects of datarates, row cycle times, protocol overheads, device power constraints, and memory system organizations. This dissertation utilizes the described evaluation framework to examine the performance impact of the number of banks per DRAM device, the effects of relatively static DRAM row cycle times and increasing DRAM device datarates, power limitation constraints, and data burst lengths in future generations of DRAM devices. Simulation results obtained in the analysis provide insights into DRAM memory system performance characteristics including, but not limited to, the following observations. The performance benefit of having 16 banks over 8 banks increases with increasing datarate. The average performance benefit reaches 18% at 1 Gbps for both open-page and close-page systems. Close-page systems are greatly limited by DRAM device power constraints, while open-page systems are less sensitive to DRAM device power constraints. Increasing burst lengths of future DRAM devices can adversely impact cache-limited processors despite the increasing bandwidth. Performance losses of greater than 50% are observed. Finally, this dissertation also presents a unique rank hopping DRAM command-scheduling algorithm designed to alleviate the bandwidth constraints in DDR2 and future DDRx SDRAM memory systems. The proposed rank hopping scheduling algorithm schedules DRAM transactions and command sequences to avoid the power limiting constraints and amortizes the rank-to-rank switching overhead. Execution based simulations show that some workloads are able to fully utilize the additional bandwidth and significant performance improvements are observed across a range of workloads

    Hardware Architectures and Implementations for Associative Memories: the Building Blocks of Hierarchically Distributed Memories

    During the past several decades, the semiconductor industry has grown into a global industry with revenues around $300 billion. Intel no longer relies on only transistor scaling for higher CPU performance, but instead, focuses more on multiple cores on a single die. It has been projected that in 2016 most CMOS circuits will be manufactured with 22 nm process. The CMOS circuits will have a large number of defects. Especially when the transistor goes below sub-micron, the original deterministic circuits will start having probabilistic characteristics. Hence, it would be challenging to map traditional computational models onto probabilistic circuits, suggesting a need for fault-tolerant computational algorithms. Biologically inspired algorithms, or associative memories (AMs), the building blocks of the cortical hierarchically distributed memories (HDMs) discussed in this dissertation, exhibit a remarkable match to nano-scale electronics, besides having great fault-tolerance ability. Research on the potential mapping of the HDM onto CMOL (hybrid CMOS/nanoelectronic circuits) nanogrids provides useful insight into the development of non-von Neumann neuromorphic architectures and the semiconductor industry. In this dissertation, we investigated the implementations of AMs on different hardware platforms, including microprocessor based personal computer (PC), PC cluster, field programmable gate arrays (FPGA), CMOS, and CMOL nanogrids. We studied two types of neural associative memory models, with and without temporal information. In this research, we first decomposed the computational models into basic and common operations, such as matrix-vector inner-product and k-winners-take-all (k-WTA). We then analyzed the baseline performance/price ratio of implementing the AMs with a PC. We continued with a similar performance/price analysis of the implementations on more parallel hardware platforms, such as PC cluster and FPGA. However, the majority of the research emphasized the implementations with all digital and mixed-signal full-custom CMOS and CMOL nanogrids. In this dissertation, we draw the conclusion that the mixed-signal CMOL nanogrids exhibit the best performance/price ratio over other hardware platforms. We also highlighted some of the trade-offs between dedicated and virtualized hardware circuits for the HDM models. A simple time-multiplexing scheme for the digital CMOS implementations can achieve throughput comparable to that of the mixed-signal CMOL nanogrids
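
    The two basic operations named in the abstract above, the matrix-vector inner product and k-winners-take-all (k-WTA), can be sketched in a few lines of Python. This is a minimal illustration and not the dissertation's implementation: the function names, the tie-breaking rule (lower index wins), and the small weight matrix are all assumptions made for the example.

```python
def k_winners_take_all(activations, k):
    """Return a binary vector with 1s at the k largest activations.

    Ties are broken in favour of the lower index (an assumption of this sketch).
    """
    if not 0 <= k <= len(activations):
        raise ValueError("k must be between 0 and the number of activations")
    # Order indices by activation (highest first), then by index for ties.
    order = sorted(range(len(activations)), key=lambda i: (-activations[i], i))
    winners = set(order[:k])
    return [1 if i in winners else 0 for i in range(len(activations))]


def recall(weights, x, k):
    """One associative-memory recall step: inner products followed by k-WTA."""
    # Matrix-vector product: one inner product per row of the weight matrix.
    sums = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return k_winners_take_all(sums, k)


# Hypothetical 3x3 binary weight matrix and input pattern.
WEIGHTS = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 1],
]
PATTERN = [1, 0, 1]
```

    With these made-up values, `recall(WEIGHTS, PATTERN, 2)` computes the row sums 2, 0, and 2 and keeps the two largest, so only the first and third outputs fire.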

    Topical Workshop on Electronics for Particle Physics

    The purpose of the workshop was to present results and original concepts for electronics research and development relevant to particle physics experiments as well as accelerator and beam instrumentation at future facilities; to review the status of electronics for the LHC experiments; to identify and encourage common efforts for the development of electronics; and to promote information exchange and collaboration in the relevant engineering and physics communities

    Scale-Out Processors

    Global-scale online services, such as Google’s Web search and Facebook’s social networking, run in large-scale datacenters. Due to their massive scale, these services are designed to scale out (or distribute) their respective loads and datasets across thousands of servers in datacenters. The growing demand for online services forced service providers to build networks of datacenters, which require an enormous capital outlay for infrastructure, hardware, and power consumption. Consequently, efficiency has become a major concern in the design and operation of such datacenters, with processor efficiency being of utmost importance, due to the significant contribution of processors to the overall datacenter performance and cost. Scale-out workloads, which are behind today’s online services, serve independent requests, and have large instruction footprints and little data locality. As such, they benefit from processor designs that feature many cores and a modestly sized Last-Level Cache (LLC), a fast access path to the LLC, and high-bandwidth interfaces to memory. Existing server-class processors with large LLCs and a handful of aggressive out-of-order cores are inefficient in executing scale-out workloads. Moreover, the scaling trajectory for these processors leads to even lower efficiency in future technology nodes. This thesis presents a family of throughput-optimal processors, called Scale-Out Processors, for the efficient execution of scale-out workloads. A unique feature of Scale-Out Processors is that they consist of multiple stand-alone modules, called pods, wherein each module is a server running an operating system and a full software stack. To design a throughput-optimal processor, we developed a methodology based on performance density, defined as throughput per unit area, to quantify how effectively an architecture uses the silicon real estate. The proposed methodology derives a performance-density optimal processor building block (i.e., pod), which tightly couples a number of cores to a small LLC via a fast interconnect. Scale-Out Processors simply consist of multiple pods with no inter-pod connectivity or coherence. Moreover, they deliver the highest throughput in today’s technology and afford near-ideal scalability as process technology advances. We demonstrate that Scale-Out Processors improve datacenters’ efficiency by 4.4x-7.1x over datacenters designed using existing server-class processors
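
    The performance-density metric defined in the abstract above (throughput per unit area) lends itself to a small sketch of how candidate pod configurations might be compared. Everything concrete here is hypothetical: the candidate names, the throughput and area figures, and the helper functions are invented for illustration and are not taken from the thesis.

```python
def performance_density(throughput, area_mm2):
    """Throughput per unit area, following the abstract's definition."""
    if area_mm2 <= 0:
        raise ValueError("area must be positive")
    return throughput / area_mm2


def best_pod(candidates):
    """Pick the candidate configuration with the highest performance density."""
    return max(
        candidates,
        key=lambda pod: performance_density(pod["throughput"], pod["area_mm2"]),
    )


# Hypothetical pod candidates; throughput is in arbitrary units.
PODS = [
    {"name": "few-cores-large-llc", "throughput": 40.0, "area_mm2": 100.0},
    {"name": "many-cores-small-llc", "throughput": 90.0, "area_mm2": 120.0},
    {"name": "many-cores-huge-llc", "throughput": 100.0, "area_mm2": 250.0},
]
```

    With these made-up numbers, the configuration with the highest raw throughput is not the one selected: the larger cache costs more area than it returns in throughput, which is the kind of trade-off the metric is meant to expose.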