TuRaN: True Random Number Generation Using Supply Voltage Underscaling in SRAMs
Prior works propose SRAM-based TRNGs that extract entropy from SRAM arrays.
SRAM arrays are widely used to store data on-chip in most specialized and
general-purpose computing chips. Thus,
SRAM-based TRNGs present a low-cost alternative to dedicated hardware TRNGs.
However, existing SRAM-based TRNGs suffer from 1) low TRNG throughput, 2) high
energy consumption, 3) high TRNG latency, and 4) the inability to generate true
random numbers continuously, which limits the application space of SRAM-based
TRNGs. Our goal in this paper is to design an SRAM-based TRNG that overcomes
these four key limitations and thus extends the application space of
SRAM-based TRNGs. To this end, we propose TuRaN, a new high-throughput,
energy-efficient, and low-latency SRAM-based TRNG that can sustain continuous
operation. TuRaN leverages the key observation that accessing SRAM cells
results in random access failures when the supply voltage is reduced below the
manufacturer-recommended supply voltage. TuRaN generates random numbers at high
throughput by repeatedly accessing SRAM cells with reduced supply voltage and
post-processing the resulting random faults using the SHA-256 hash function. To
demonstrate the feasibility of TuRaN, we conduct SPICE simulations on different
process nodes and analyze the potential of access failure for use as an entropy
source. We verify and support our simulation results by conducting real-world
experiments on two commercial off-the-shelf FPGA boards. We evaluate the
quality of the random numbers generated by TuRaN using the widely adopted NIST
standard randomness tests and observe that TuRaN passes all tests. TuRaN
generates true random numbers with (i) an average (maximum) throughput of
1.6 Gbps (1.812 Gbps), (ii) 0.11 nJ/bit energy consumption, and (iii) 278.46 us
latency.
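The access-then-hash flow described in the abstract can be illustrated with a minimal Python sketch. Everything below other than the use of SHA-256 is an assumption made for illustration: the function names, the 512-bit block size, and the 10% failure probability are hypothetical, and the access failures are simulated with a software pseudo-random generator, so the output of this sketch is not true randomness. On real hardware, `read_undervolted_block` would be replaced by reads of SRAM cells operated below the recommended supply voltage.

```python
import hashlib
import random

BLOCK_BITS = 512  # hypothetical number of raw SRAM bits fed to each hash call


def read_undervolted_block(rng, n_bits=BLOCK_BITS, fail_prob=0.1):
    """Simulate reading n_bits that were written as 0.

    Each bit flips to 1 with probability fail_prob, standing in for a random
    access failure. fail_prob is an illustrative value, not a measured one.
    """
    return [1 if rng.random() < fail_prob else 0 for _ in range(n_bits)]


def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, most significant bit first.

    Assumes len(bits) is a multiple of 8, which holds for BLOCK_BITS.
    """
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)


def condition(bits):
    """Post-process one raw block into 32 output bytes with SHA-256."""
    return hashlib.sha256(pack_bits(bits)).digest()


def generate(n_bytes, rng):
    """Repeatedly access the (simulated) SRAM and hash until n_bytes exist."""
    out = bytearray()
    while len(out) < n_bytes:
        out += condition(read_undervolted_block(rng))
    return bytes(out[:n_bytes])


if __name__ == "__main__":
    # Seeded only so the demo is repeatable; a real TRNG has no seed.
    print(generate(16, random.Random(1)).hex())
```

Hashing a block that is larger than the 256-bit digest is one common way to condition a biased source; the ratio of raw bits to output bits used by TuRaN itself is not stated in the abstract.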
Massive Data-Centric Parallelism in the Chiplet Era
Traditionally, massively parallel applications are executed on distributed
systems, where computing nodes are distant enough that the parallelization
schemes must minimize communication and synchronization to achieve scalability.
Mapping communication-intensive workloads to distributed systems requires
complicated problem partitioning and dataset pre-processing. With the current
AI-driven trend of having thousands of interconnected processors per chip,
there is an opportunity to re-think these communication-bottlenecked workloads.
This bottleneck often arises from data structure traversals, which cause
irregular memory accesses and poor cache locality.
Recent works have introduced task-based parallelization schemes to accelerate
graph traversal and other sparse workloads. Data structure traversals are split
into tasks and pipelined across processing units (PUs). Dalorex demonstrated
the highest scalability (up to thousands of PUs on a single chip) by having the
entire dataset on-chip, scattered across PUs, and executing the tasks at the PU
where the data is local. However, it also raised questions on how to scale to
larger datasets when all the memory is on chip, and at what cost.
To address these challenges, we propose a scalable architecture composed of a
grid of Data-Centric Reconfigurable Array (DCRA) chiplets. Package-time
reconfiguration enables creating chip products that optimize for different
target metrics, such as time-to-solution, energy, or cost, while software
reconfigurations avoid network saturation when scaling to millions of PUs
across many chip packages. We evaluate six applications and four datasets, with
several configurations and memory technologies, to provide a detailed analysis
of the performance, power, and cost of data-local execution at scale. Our
parallelization of Breadth-First Search with RMAT-26 across a million PUs
reaches 3323 GTEPS.
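The idea of splitting a traversal into tasks and executing each task at the PU where its data is local can be sketched in software. The following Python model is only an illustration under stated assumptions: the modulo `owner` mapping, the per-PU task queues, and the round-robin scheduling loop are hypothetical choices, not the partitioning or network used by Dalorex or DCRA.

```python
from collections import deque


def owner(vertex, num_pus):
    """Hypothetical placement: vertex v and its edge list live on PU v % num_pus."""
    return vertex % num_pus


def data_local_bfs(adj, root, num_pus):
    """Breadth-first search where every task runs on the PU that owns its vertex.

    adj maps each vertex to a list of neighbours. Returns the distance of each
    reachable vertex from root and the number of tasks sent between PUs.
    """
    dist = [dict() for _ in range(num_pus)]      # per-PU local distance table
    queues = [deque() for _ in range(num_pus)]   # per-PU incoming task queue
    queues[owner(root, num_pus)].append((root, 0))
    messages = 0

    while any(queues):
        for pu in range(num_pus):
            q = queues[pu]
            # Process only the tasks that were queued when this PU's turn began.
            for _ in range(len(q)):
                v, d = q.popleft()
                if v in dist[pu] and dist[pu][v] <= d:
                    continue  # an equal or shorter path was already recorded
                dist[pu][v] = d
                for w in adj.get(v, ()):
                    dst = owner(w, num_pus)
                    queues[dst].append((w, d + 1))  # task goes to the data
                    if dst != pu:
                        messages += 1

    merged = {}
    for local in dist:
        merged.update(local)
    return merged, messages


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
    print(data_local_bfs(graph, root=0, num_pus=2))
```

Because a task may arrive with a longer distance than one already stored, the model keeps the smaller value and drops the stale task, which lets PUs proceed without a global barrier per BFS level.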
WiSync: an architecture for fast synchronization through on-chip wireless communication
In shared-memory multiprocessing, fine-grain synchronization is challenging because it requires frequent communication. As technology scaling delivers larger manycore chips, such a pattern is expected to remain costly to support.
In this paper, we propose to address this challenge by using on-chip wireless communication. Each core has a transceiver and an antenna to communicate with all the other cores. This environment supports very low latency global communication. Our architecture, called WiSync, uses a per-core Broadcast Memory (BM). When a core writes to its BM, all the other 100+ BMs get updated in less than 10 processor cycles. We also use a second wireless channel with cheaper transfers to execute barriers efficiently. WiSync supports multiprogramming, virtual memory, and context switching. Our evaluation with simulations of 128-threaded kernels and 64-threaded applications shows that WiSync speeds up synchronization substantially. Compared to using advanced conventional synchronization, WiSync attains an average speedup of nearly one order of magnitude for the kernels, and 1.12 for PARSEC and SPLASH-2.
Peer reviewed. Postprint (author's final draft).
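The per-core Broadcast Memory in the WiSync abstract behaves like a replicated store: a write by one core updates every core's copy, and reads are served locally. A minimal functional model in Python is sketched below. The class and method names are hypothetical, and the model ignores timing, channel arbitration, and the second wireless channel used for barriers, none of which are specified in the abstract.

```python
class BroadcastMemory:
    """Functional model of per-core replicated memory (names are hypothetical)."""

    def __init__(self, num_cores, num_words):
        # One private replica per core, all initialised to zero.
        self.replicas = [[0] * num_words for _ in range(num_cores)]

    def write(self, core, addr, value):
        """Core `core` writes a word; the broadcast updates every replica.

        Writes are applied one at a time here, which models a single shared
        channel that carries one broadcast at a time.
        """
        for replica in self.replicas:
            replica[addr] = value

    def read(self, core, addr):
        """A read touches only the reading core's own replica."""
        return self.replicas[core][addr]


if __name__ == "__main__":
    bm = BroadcastMemory(num_cores=4, num_words=2)
    bm.write(core=2, addr=1, value=7)  # e.g. core 2 sets a shared flag
    print([bm.read(c, 1) for c in range(4)])
```

In this model a flag written by one core becomes visible to all others without any of them issuing a remote request, which is the property that makes fine-grain synchronization cheap in the scheme the abstract describes.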