Memory-Centric Computing
Memory-centric computing aims to enable computation capability in and near
all places where data is generated and stored. As such, it can greatly reduce
the large negative performance and energy impact of data access and data
movement, by fundamentally avoiding data movement and reducing data access
latency & energy. Many recent studies show that memory-centric computing can
greatly improve system performance and energy efficiency. Major industrial
vendors and startup companies have also recently introduced memory chips that
have sophisticated computation capabilities.
This talk describes promising ongoing research and development efforts in
memory-centric computing. We classify such efforts into two major fundamental
categories: 1) processing using memory, which exploits analog operational
properties of memory structures to perform massively-parallel operations in
memory, and 2) processing near memory, which integrates processing capability
in memory controllers, the logic layer of 3D-stacked memory technologies, or
memory chips to enable high-bandwidth and low-latency memory access to
near-memory logic. We show both types of architectures (and their combination)
can enable orders of magnitude improvements in performance and energy
consumption of many important workloads, such as graph analytics, databases,
machine learning, video processing, climate modeling, and genome analysis. We
discuss adoption challenges for the memory-centric computing paradigm and
conclude with some research & development opportunities.
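To make the processing-using-memory category above more concrete, the toy model below mimics how activating multiple memory rows at once can yield a bulk bitwise operation across an entire row of cells in a single command. The class and method names (Subarray, bulk_and) are illustrative assumptions, and the model only reproduces the result of such an operation, not any real device behavior described in the talk.

    import numpy as np

    class Subarray:
        """Toy functional model of a memory subarray with row-wide bitwise ops.

        This only mimics the *result* of multi-row activation (as in
        Ambit-style bulk bitwise operations); it does not model charge
        sharing, timing, or any real analog behavior.
        """

        def __init__(self, rows, row_size_bits):
            self.cells = np.zeros((rows, row_size_bits), dtype=np.uint8)

        def write_row(self, r, bits):
            self.cells[r] = np.asarray(bits, dtype=np.uint8)

        def bulk_and(self, r1, r2, dst):
            # One "command" operates on an entire row in parallel,
            # instead of moving the data to a CPU word by word.
            self.cells[dst] = self.cells[r1] & self.cells[r2]

        def bulk_or(self, r1, r2, dst):
            self.cells[dst] = self.cells[r1] | self.cells[r2]

    # Usage: AND two 8-Kbit rows with a single row-wide operation.
    sa = Subarray(rows=4, row_size_bits=8192)
    sa.write_row(0, np.random.randint(0, 2, 8192))
    sa.write_row(1, np.random.randint(0, 2, 8192))
    sa.bulk_and(0, 1, dst=2)

The point of the sketch is the granularity: the unit of work is a full memory row rather than a word shuttled across the memory channel, which is where the data-movement savings come from.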
A sub-mW IoT-endnode for always-on visual monitoring and smart triggering
This work presents a fully-programmable Internet of Things (IoT) visual
sensing node that targets sub-mW power consumption in always-on monitoring
scenarios. The system features a spatial-contrast binary
pixel imager with focal-plane processing. The sensor, when working at its
lowest power mode ( at 10 fps), provides as output the number of
changed pixels. Based on this information, a dedicated camera interface,
implemented on a low-power FPGA, wakes up an ultra-low-power parallel
processing unit to extract context-aware visual information. We evaluate the
smart sensor on three always-on visual triggering application scenarios.
Triggering accuracy comparable to RGB image sensors is achieved at nominal
lighting conditions, while consuming an average power between and
, depending on context activity. The digital sub-system is extremely
flexible, thanks to a fully-programmable digital signal processing engine, but
still achieves 19x lower power consumption compared to MCU-based cameras with
significantly lower on-board computing capabilities.
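A hedged sketch of the triggering idea: the imager reports only the number of changed pixels per frame, and a lightweight interface decides whether to wake the parallel processing unit. The threshold value, the simulated sensor reading, and the function names below are illustrative assumptions, not taken from the paper.

    import random
    import time

    WAKE_THRESHOLD = 200   # assumed changed-pixel count that triggers wake-up
    FPS = 10               # lowest-power-mode frame rate mentioned in the abstract

    def read_changed_pixel_count():
        # Stand-in for the imager's focal-plane "changed pixels" output.
        return random.randint(0, 1000)

    def wake_processing_unit(changed):
        # Stand-in for powering up the parallel processing cluster.
        print(f"wake-up: {changed} pixels changed, running visual trigger")

    def monitoring_loop(frames=20):
        # Poll the per-frame changed-pixel count; wake the processor only
        # when enough activity is observed, keeping average power low.
        for _ in range(frames):
            changed = read_changed_pixel_count()
            if changed >= WAKE_THRESHOLD:
                wake_processing_unit(changed)
            time.sleep(1.0 / FPS)

    monitoring_loop()

The design choice this illustrates is that the expensive, fully programmable part of the system is duty-cycled: it is only powered when the cheap focal-plane statistic indicates context activity.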
Accelerating Time Series Analysis via Processing using Non-Volatile Memories
Time Series Analysis (TSA) is a critical workload for consumer-facing
devices. Accelerating TSA is vital for many domains as it enables the
extraction of valuable information and the prediction of future events. The
state-of-the-art algorithm in TSA is the subsequence Dynamic Time Warping
(sDTW) algorithm. However, sDTW's computational complexity increases
quadratically with the time series' length, resulting in two performance
implications. First, the amount of data parallelism available is significantly
higher than the small number of processing units enabled by commodity systems
(e.g., CPUs). Second, sDTW is bottlenecked by memory because it 1) has low
arithmetic intensity and 2) incurs a large memory footprint. To tackle these
two challenges, we leverage Processing-using-Memory (PuM) by performing in-situ
computation where data resides, using the memory cells. PuM provides a
promising solution to alleviate data movement bottlenecks and exposes immense
parallelism.
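For reference, a minimal CPU-only version of the subsequence DTW recurrence that this work targets is sketched below: each cell of the cost matrix depends on its left, upper, and upper-left neighbors, which is what makes the computation quadratic in the time series length and memory-bound. The variable names are ours, not MATSA's.

    import numpy as np

    def sdtw(query, series):
        """Subsequence DTW: align `query` against any window of `series`.

        The first row is initialized to zero so an alignment may start at
        any position of `series` (this is what makes it *subsequence* DTW).
        """
        m, n = len(query), len(series)
        D = np.full((m + 1, n + 1), np.inf)
        D[0, :] = 0.0                      # free start anywhere in the series
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = (query[i - 1] - series[j - 1]) ** 2
                # Classic DTW recurrence: O(m*n) cells, each with low
                # arithmetic intensity and three neighbor reads.
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[m, 1:]                    # matching cost ending at each position

    # Example: distance profile of a short query against a longer series.
    profile = sdtw(np.sin(np.linspace(0, 3, 32)), np.sin(np.linspace(0, 20, 512)))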
In this work, we present MATSA, the first MRAM-based Accelerator for Time
Series Analysis. The key idea is to exploit magneto-resistive memory crossbars
to enable energy-efficient and fast time series computation in memory. MATSA
provides the following key benefits: 1) it leverages high levels of parallelism
in the memory substrate by exploiting column-wise arithmetic operations, and 2)
it significantly reduces data movement costs by performing computation using
the memory cells. We evaluate three versions of MATSA to match the requirements
of different environments (e.g., embedded, desktop, or HPC computing) based on
MRAM technology trends. We perform a design space exploration and demonstrate
that our HPC version of MATSA can improve performance by 7.35x/6.15x/6.31x and
energy efficiency by 11.29x/4.21x/2.65x over server CPU, GPU, and PNM
architectures, respectively.
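To illustrate the kind of parallelism a column-wise in-memory substrate can exploit, the vectorized sketch below updates an entire anti-diagonal of the sDTW matrix at once, with NumPy standing in for row/column-parallel crossbar arithmetic. This wavefront mapping is our simplification for exposition, not MATSA's actual dataflow.

    import numpy as np

    def sdtw_wavefront(query, series):
        """Same recurrence as the sketch above, but all cells on an
        anti-diagonal are independent, so each wavefront step becomes a few
        element-wise vector operations (the style of bulk arithmetic an
        in-memory crossbar substrate can perform in parallel)."""
        query = np.asarray(query, dtype=float)
        series = np.asarray(series, dtype=float)
        m, n = len(query), len(series)
        D = np.full((m + 1, n + 1), np.inf)
        D[0, :] = 0.0
        for d in range(2, m + n + 1):          # anti-diagonal index i + j = d
            i = np.arange(max(1, d - n), min(m, d - 1) + 1)
            j = d - i
            cost = (query[i - 1] - series[j - 1]) ** 2
            best = np.minimum(np.minimum(D[i - 1, j], D[i, j - 1]), D[i - 1, j - 1])
            D[i, j] = cost + best              # one vector update per wavefront
        return D[m, 1:]

Wavefront parallelism is a common way to expose the independence in this dynamic-programming recurrence; the specific mapping onto MRAM crossbar columns in MATSA may differ from this simplification.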
DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference
Neural personalized recommendation is the cornerstone of a wide collection
of cloud services and products, constituting a significant portion of the
compute demand on cloud infrastructure. Thus, improving the execution efficiency of neural
recommendation directly translates into infrastructure capacity saving. In this
paper, we devise a novel end-to-end modeling infrastructure, DeepRecInfra, that
adopts an algorithm and system co-design methodology to custom-design systems
for recommendation use cases. Leveraging the insights from the recommendation
characterization, a new dynamic scheduler, DeepRecSched, is proposed to
maximize latency-bounded throughput by taking into account characteristics of
inference query size and arrival patterns, recommendation model architectures,
and underlying hardware systems. By doing so, system throughput is doubled
across the eight industry-representative recommendation models. Finally,
design, deployment, and evaluation in an at-scale production datacenter show over
30% latency reduction across a wide variety of recommendation models running on
hundreds of machines.
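A hedged sketch of the kind of decision a latency-aware scheduler such as DeepRecSched might make: route each inference query based on its size and the latency target, since small queries are dominated by fixed offload overheads while large ones amortize them. The cost model, constants, and function names below are illustrative assumptions, not the paper's actual policy.

    from dataclasses import dataclass

    @dataclass
    class Query:
        num_items: int          # candidate items to rank in this inference query

    # Assumed (made-up) cost model: fixed per-request overhead plus a
    # per-item cost; the accelerator amortizes its overhead over large batches.
    CPU_FIXED_MS, CPU_PER_ITEM_MS = 0.5, 0.020
    ACC_FIXED_MS, ACC_PER_ITEM_MS = 2.0, 0.002
    SLA_MS = 10.0               # latency-bounded throughput target (assumed)

    def estimate_latency(fixed_ms, per_item_ms, items):
        return fixed_ms + per_item_ms * items

    def schedule(query: Query) -> str:
        """Pick where to run one query, given its size and the latency target."""
        cpu = estimate_latency(CPU_FIXED_MS, CPU_PER_ITEM_MS, query.num_items)
        acc = estimate_latency(ACC_FIXED_MS, ACC_PER_ITEM_MS, query.num_items)
        # Small queries stay on the CPU because the fixed offload cost
        # dominates; large queries are offloaded when the estimate fits the SLA.
        if acc < cpu and acc <= SLA_MS:
            return "accelerator"
        return "cpu"

    for size in (64, 512, 4096):
        print(size, schedule(Query(size)))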