5 research outputs found

    Efficient Design and Communication for 3D Stacked Dynamic Memory

    As computer memory increases in size and processors continue to get faster, memory becomes an increasing bottleneck to system performance. To mitigate slow DRAM chip speeds, a new generation of 3D stacked DRAM will allow lower power consumption and higher bandwidth. To communicate between these chips, this paper proposes the use of ring-based standing wave oscillators for fast data transfer. With a fast clocking scheme, multiple channels can share the same bus, reducing the number of TSVs while maintaining similar memory latencies. Simulations with the new clocking scheme and data transfers are performed to show the improvements that can be made in memory communication. Experimental results show that a ring-based clocking scheme can achieve a 2× speedup over current stacked memory chips. Variations of this clocking scheme can also halve power consumption at comparable speeds. These ring-based architectures allow higher memory speeds without adding hardware complexity. This allows the ring-based memory architecture to trade off power, throughput, and latency to improve system performance for different applications.

    A PLL Design Based on a Standing Wave Resonant Oscillator

    In this thesis, we present a new continuously variable high-frequency standing wave oscillator and demonstrate its use in generating the phase-locked clock signal of a digital IC. The ring-based standing wave resonant oscillator is implemented with a plurality of wires connected in a Möbius configuration, with a cross-coupled inverter pair connected across the wires. The oscillation frequency can be modulated by coarse and fine tuning. Coarse tuning is achieved by altering the number of wires in the ring that participate in the oscillation, by driving a digital word to a set of passgates connected to each wire in the ring. Fine tuning is achieved by varying the body bias voltage of both PMOS transistors in the cross-coupled inverter pair that sustains the oscillations in the resonant ring. We validated our PLL design in a 90 nm process technology. 3D parasitic RLCs for our oscillator ring were extracted with the skin effect accounted for. Our PLL provides a frequency locking range from 6 GHz to 9 GHz, with a center frequency of 7.5 GHz. The oscillator alone consumes about 25 mW of power, and the complete PLL consumes 28.5 mW. The observed jitter of the PLL is 2.56 percent. These numbers are significant improvements over the prior art in standing-wave-based PLLs.

    Design of Special Function Units in Modern Microprocessors

    Today’s computing systems demand high performance for applications such as cloud computing, web-based search engines, network applications, and social media tasks. Such software applications make extensive use of hashing and arithmetic operations in their computation. In this thesis, we explore the use of new special function units (SFUs) for modern microprocessors to accelerate such workloads. First, we design an SFU for hashing. Hashing can reduce the complexity of search and lookup from O(p) to O(p/n), where n bins are used and p items are being processed. In modern microprocessors, hashing is done in software. In our work, we propose a novel hardware hash unit design for use in modern microprocessors. Since the hash unit is designed at the hardware level, our approach offers several advantages. First, a hardware-based hash unit executes a single hash instruction to perform a hash operation. In software-based hashing, a hash operation is compiled into multiple instructions, which degrades performance. Second, software-based hashing stores hash data in DRAM (hash table entries can also reside in one of the cache levels). In a hardware-based hash unit, hash data is stored in a dedicated memory module (a hardware hash table), which improves performance. Third, today’s operating systems execute multiple applications (processes) in parallel, which entails high memory utilization. The operating system therefore performs many context switches between processes, which results in many cache misses. In a hardware-based hash unit, cache misses are reduced significantly by the dedicated memory module (hash table). Together, these advantages reduce power consumption and increase overall system performance significantly, with a minimal increase in the microprocessor’s die area. We evaluate our hardware-based hash unit and compare its performance with software-based hashing.
    We start by evaluating our design approach at the micro-architecture level in terms of system performance. After that, we implement our approach at the circuit level to obtain the area overhead. We also analyze our design’s power and delay for each hash operation. These results are compared with a traditional hashing implementation. Then, we present an FPGA-based coprocessor for hash unit acceleration, applied to a virus checking application. Second, we present an SFU to speed up arithmetic operations. We call this arithmetic SFU a programmable arithmetic unit (PAU). In modern microprocessors, heavy arithmetic computations are performed in software. To improve the performance of such computations, the PAU provides a partially reconfigurable methodology for arithmetic applications. The PAU consists of a set of IP blocks connected to a reconfigurable FPGA controller via a fast mesh-based interconnect. The IP blocks in the PAU can be any IP blocks, such as adders, subtractors, multipliers, comparators, and sign extension units. The PAU can have one or more copies of the same IP block (for example, 5 adders and 7 multipliers). The FPGA controller is an on-chip FPGA-based reconfigurable control fabric that is programmed for different applications, enabling different arithmetic applications to be embedded on the PAU. The reconfigurable logic is LUT-based, like a traditional FPGA. The FPGA controller and the IP blocks in the PAU communicate via a high-speed ring data fabric. In our work, we use the PAU as an SFU in modern microprocessors. We compare the performance of different hardware-based arithmetic applications in the PAU with software-based implementations in modern microprocessors.
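
The O(p) to O(p/n) claim in the abstract above can be illustrated with a minimal software sketch. This is not the thesis's hardware hash unit; the function names (`linear_lookup`, `build_bins`, `binned_lookup`) and the modulo hash on integer keys are assumptions chosen only to make the comparison counts easy to follow.

```python
def linear_lookup(items, key):
    """Search an unhashed list; returns (found, comparisons)."""
    comparisons = 0
    for item in items:
        comparisons += 1
        if item == key:
            return True, comparisons
    return False, comparisons


def build_bins(items, n_bins):
    """Distribute integer keys over n_bins using a modulo hash (an assumption)."""
    bins = [[] for _ in range(n_bins)]
    for item in items:
        bins[item % n_bins].append(item)
    return bins


def binned_lookup(bins, key):
    """Search only the bin the key hashes to; returns (found, comparisons)."""
    return linear_lookup(bins[key % len(bins)], key)


if __name__ == "__main__":
    p, n = 1000, 10
    items = list(range(p))           # p items being processed
    bins = build_bins(items, n)      # n bins
    missing_key = p                  # worst case: key is absent
    print(linear_lookup(items, missing_key))   # scans all p items
    print(binned_lookup(bins, missing_key))    # scans about p/n items
```

With evenly distributed keys, the worst-case search touches roughly p/n items instead of p. A skewed hash function would weaken this bound, which is one reason the choice of hash matters in both software and hardware.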

    Ring-Based Resonant Standing Wave Oscillators for 3D Clocking Applications

    Ring-based resonant standing wave oscillators have been shown to be a useful clocking technique that can distribute and generate a high frequency, low skew, low power, and stable clock signal. By using through-silicon-vias, this type of standing wave oscillator can be used to generate the clocking scheme for 3D integrated circuits. In this thesis, we propose the use of such 3D standing wave oscillators and show how independent 3D oscillators in different stacks can synchronize through the use of a redistribution layer stub. Inter-chip clock synchronization is then accomplished without the need for a PLL. In addition, we propose the first 3D ring-based resonant standing wave oscillator bootstrap and reset circuit to initialize and stop oscillation. Using a 3D ring-based resonant standing wave oscillator, we propose a ring-based data fabric for 3D stacked DRAM and compare the results with existing approaches such as High Bandwidth Memory (HBM) or Wide I/O memory. We show that our Memory Architecture using a Ring-based Scheme (MARS) can provide the increases in speed necessary to overcome current memory bottlenecks, and can scale effectively as future 3D stacks become larger. Our MARS can trade off power, throughput, and latency to match different application requirements. By using a narrow bus, and connecting it to all channels, the MARS8 can provide an alternative memory configuration with ∼6.9× lower power consumption than HBM, and ∼2.7× faster speeds than Wide I/O. Using multiple ring topologies in the same stack, the channel count can double from 8 to 16, and then to 32. This is possible since MARS uses about 4× fewer TSVs per channel than HBM or Wide I/O. This provides speeds up to ∼4.2× faster than traditional HBM. This scalable architecture allows higher throughput and faster system performance for next-generation DRAM. The MARS topology proposed in this thesis can be used in a variety of computing systems, from lightweight IoT to large-scale data centers.
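
The channel-scaling argument in the abstract above (about 4× fewer TSVs per channel, with the channel count doubling from 8 to 16 to 32) can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: it normalizes the HBM per-channel TSV count to 1.0 and takes the "about 4×" figure as exactly 4, both of which are assumptions rather than numbers from the thesis.

```python
# Normalized TSV cost per channel. The absolute TSV counts are not given
# in the abstract, so HBM is set to 1.0 as an assumption.
HBM_TSVS_PER_CHANNEL = 1.0
MARS_TSVS_PER_CHANNEL = HBM_TSVS_PER_CHANNEL / 4  # "about 4x fewer", taken as exactly 4


def relative_tsv_budget(channels, tsvs_per_channel):
    """Total TSV cost of a stack, in units of one HBM channel's TSVs."""
    return channels * tsvs_per_channel


if __name__ == "__main__":
    hbm_budget = relative_tsv_budget(8, HBM_TSVS_PER_CHANNEL)
    print(f"HBM,   8 channels: {hbm_budget:.1f}")
    for channels in (8, 16, 32):
        mars_budget = relative_tsv_budget(channels, MARS_TSVS_PER_CHANNEL)
        print(f"MARS, {channels:2d} channels: {mars_budget:.1f}")
```

Under these assumptions, 32 MARS channels need roughly the same TSV budget as 8 HBM channels, which is consistent with the abstract's statement that the channel count can double twice.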