279 research outputs found

    TuRaN: True Random Number Generation Using Supply Voltage Underscaling in SRAMs

    Full text link
    Prior works propose SRAM-based TRNGs that extract entropy from SRAM arrays. SRAM arrays are widely used in a majority of specialized or general-purpose chips that perform the computation to store data inside the chip. Thus, SRAM-based TRNGs present a low-cost alternative to dedicated hardware TRNGs. However, existing SRAM-based TRNGs suffer from 1) low TRNG throughput, 2) high energy consumption, 3) high TRNG latency, and 4) the inability to generate true random numbers continuously, which limits the application space of SRAM-based TRNGs. Our goal in this paper is to design an SRAM-based TRNG that overcomes these four key limitations and thus, extends the application space of SRAM-based TRNGs. To this end, we propose TuRaN, a new high-throughput, energy-efficient, and low-latency SRAM-based TRNG that can sustain continuous operation. TuRaN leverages the key observation that accessing SRAM cells results in random access failures when the supply voltage is reduced below the manufacturer-recommended supply voltage. TuRaN generates random numbers at high throughput by repeatedly accessing SRAM cells with reduced supply voltage and post-processing the resulting random faults using the SHA-256 hash function. To demonstrate the feasibility of TuRaN, we conduct SPICE simulations on different process nodes and analyze the potential of access failure for use as an entropy source. We verify and support our simulation results by conducting real-world experiments on two commercial off-the-shelf FPGA boards. We evaluate the quality of the random numbers generated by TuRaN using the widely-adopted NIST standard randomness tests and observe that TuRaN passes all tests. TuRaN generates true random numbers with (i) an average (maximum) throughput of 1.6Gbps (1.812Gbps), (ii) 0.11nJ/bit energy consumption, and (iii) 278.46us latency

    Massive Data-Centric Parallelism in the Chiplet Era

    Full text link
    Traditionally, massively parallel applications are executed on distributed systems, where computing nodes are distant enough that the parallelization schemes must minimize communication and synchronization to achieve scalability. Mapping communication-intensive workloads to distributed systems requires complicated problem partitioning and dataset pre-processing. With the current AI-driven trend of having thousands of interconnected processors per chip, there is an opportunity to re-think these communication-bottlenecked workloads. This bottleneck often arises from data structure traversals, which cause irregular memory accesses and poor cache locality. Recent works have introduced task-based parallelization schemes to accelerate graph traversal and other sparse workloads. Data structure traversals are split into tasks and pipelined across processing units (PUs). Dalorex demonstrated the highest scalability (up to thousands of PUs on a single chip) by having the entire dataset on-chip, scattered across PUs, and executing the tasks at the PU where the data is local. However, it also raised questions on how to scale to larger datasets when all the memory is on chip, and at what cost. To address these challenges, we propose a scalable architecture composed of a grid of Data-Centric Reconfigurable Array (DCRA) chiplets. Package-time reconfiguration enables creating chip products that optimize for different target metrics, such as time-to-solution, energy, or cost, while software reconfigurations avoid network saturation when scaling to millions of PUs across many chip packages. We evaluate six applications and four datasets, with several configurations and memory technologies, to provide a detailed analysis of the performance, power, and cost of data-local execution at scale. Our parallelization of Breadth-First-Search with RMAT-26 across a million PUs reaches 3323 GTEPS

    WiSync: an architecture for fast synchronization through on-chip wireless communication

    Get PDF
    In shared-memory multiprocessing, fine-grain synchronization is challenging because it requires frequent communication. As technology scaling delivers larger manycore chips, such pattern is expected to remain costly to support.; In this paper, we propose to address this challenge by using on-chip wireless communication. Each core has a transceiver and an antenna to communicate with all the other cores. This environment supports very low latency global communication. Our architecture, called WiSync, uses a per-core Broadcast Memory (BM). When a core writes to its BM, all the other 100+ BMs get updated in less than 10 processor cycles. We also use a second wireless channel with cheaper transfers to execute barriers efficiently. WiSync supports multiprogramming, virtual memory, and context switching. Our evaluation with simulations of 128-threaded kernels and 64-threaded applications shows that WiSync speeds-up synchronization substantially. Compared to using advanced conventional synchronization, WiSync attains an average speedup of nearly one order of magnitude for the kernels, and 1.12 for PARSEC and SPLASH-2.Peer ReviewedPostprint (author's final draft
    • 

    corecore