84 research outputs found
Multianalytical provenance analysis of Eastern Ross Sea LGM till sediments (Antarctica): Petrography, geochronology, and thermochronology detrital data
To reveal the provenance of detrital sediments supplied by the West Antarctic Ice Sheet (WAIS), 19 glaciomarine cores of Last Glacial Maximum age were analyzed from the Eastern Ross Sea and Sulzberger Bay. Analytical techniques included petrographic analysis of gravel-sized clasts, geochronology (zircon U-Pb: Zrn-UPb), and thermochronology (apatite fission track: AFT) of sand-sized fractions. Petrographic analysis revealed a similarity with the lithologies presently exposed in western Marie Byrd Land (MBL), with major roles played by low-grade metamorphic rocks and granitoids. Furthermore, Zrn-UPb and AFT data allowed us to identify the formation and cooling ages of the sedimentary source area, consisting of a Cambrian-Precambrian basement (i.e., the Swanson Formation in western MBL) that underwent at least two episodes of magma intrusion, migmatization, and cooling during Devonian-Carboniferous and Cretaceous-Paleocene times. The scarcity of volcanic clasts in the Ross Sea region along the front of the West Antarctic ice streams, in association with the occurrence of Oligocene-Pliocene AFT dates, suggests a localized tectonic exhumation of portions of MBL, as already documented for the opposite side of the West Antarctic Rift System in the Transantarctic Mountains. Furthermore, a Zrn-UPb and AFT population of Late Triassic-Jurassic age indicates the presence of unexposed rocks that formed or were metamorphosed at that time in the sedimentary source area, which may lie within the MacAyeal and Bindschadler Ice Stream catchment areas.
Stella Nera: Achieving 161 TOp/s/W with Multiplier-free DNN Acceleration based on Approximate Matrix Multiplication
From classical HPC to deep learning, MatMul is at the heart of today's
computing. The recent Maddness method approximates MatMul without the need for
multiplication by using a hash-based version of product quantization (PQ)
indexing into a look-up table (LUT). Stella Nera is the first Maddness
accelerator and it achieves 15x higher area efficiency (GMAC/s/mm^2) and more
than 25x higher energy efficiency (TMAC/s/W) than direct MatMul accelerators
implemented in the same technology. The hash function is a decision tree, which
allows for an efficient hardware implementation as the multiply-accumulate
operations are replaced by decision tree passes and LUT lookups. The entire
Maddness MatMul can be broken down into parts that allow an effective
implementation with small computing units and memories, allowing it to reach
extreme efficiency while remaining generically applicable for MatMul tasks. In
a commercial 14nm technology and scaled to 3nm, we achieve an energy efficiency
of 161 TOp/s/W with a Top-1 accuracy on CIFAR-10 of more than 92.5% using
ResNet9. Comment: 6 pages, 7 figures, preprint under review
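The LUT-based approximate matrix multiplication that Maddness builds on can be sketched in a few lines. Maddness encodes inputs with a learned decision-tree hash; the pure-Python sketch below substitutes plain nearest-prototype encoding for that hash (an assumption for illustration, not the paper's method), but shows the core idea: multiply-accumulate operations are replaced by prototype lookups into precomputed tables.

```python
# Sketch of product-quantization (PQ) approximate matmul, the scheme behind
# Maddness. The nearest-prototype search below stands in for the paper's
# learned decision-tree hash; everything else (LUT precompute, lookup-and-add
# instead of multiply-accumulate) follows the general PQ recipe.

def pq_matmul(A, B, prototypes):
    """Approximate A @ B. A: m x d rows, B: d x n columns,
    prototypes[s]: list of K prototype vectors for subspace s,
    where subspace s covers a contiguous slice of the d dimensions."""
    n_sub = len(prototypes)
    d, n = len(B), len(B[0])
    sub = d // n_sub
    # Precompute LUT[s][k][j] = prototype_k of subspace s, dotted with
    # the matching slice of column j of B. Done once, offline.
    lut = [[[sum(p[t] * B[s * sub + t][j] for t in range(sub))
             for j in range(n)]
            for p in protos]
           for s, protos in enumerate(prototypes)]
    out = []
    for row in A:
        acc = [0.0] * n
        for s, protos in enumerate(prototypes):
            chunk = row[s * sub:(s + 1) * sub]
            # Encode: index of nearest prototype (Maddness: tree-hash pass).
            k = min(range(len(protos)),
                    key=lambda i: sum((chunk[t] - protos[i][t]) ** 2
                                      for t in range(sub)))
            # Accumulate via LUT lookup instead of multiplications.
            acc = [a + lut[s][k][j] for j, a in enumerate(acc)]
        out.append(acc)
    return out
```

When a row's subspace chunks coincide with stored prototypes, the result is exact; otherwise the quantization error depends on how well the prototypes cover the input distribution.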
Near-Memory Parallel Indexing and Coalescing: Enabling Highly Efficient Indirect Access for SpMV
Sparse matrix vector multiplication (SpMV) is central to numerous
data-intensive applications, but requires streaming indirect memory accesses
that severely degrade both processing and memory throughput in state-of-the-art
architectures. Near-memory hardware units, decoupling indirect streams from
processing elements, partially alleviate the bottleneck, but rely on low DRAM
access granularity, which is highly inefficient for modern DRAM standards like
HBM and LPDDR. To fully address the end-to-end challenge, we propose a
low-overhead data coalescer combined with a near-memory indirect streaming unit
for AXI-Pack, an extension to the widespread AXI4 protocol packing narrow
irregular stream elements onto wide memory buses. Our combined solution
leverages the memory-level parallelism and coalescence of streaming indirect
accesses in irregular applications like SpMV to maximize the performance and
bandwidth efficiency attained on wide memory interfaces. Our solution delivers
an average speedup of 8x in effective indirect access, often reaching the full
memory bandwidth. As a result, we achieve an average end-to-end speedup on SpMV
of 3x. Moreover, our approach demonstrates remarkable on-chip efficiency,
requiring merely 27kB of on-chip storage and a very compact implementation area
of 0.2-0.3mm^2 in a 12nm node. Comment: 6 pages, 6 figures. Submitted to DATE 202
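The indirect access pattern the abstract describes is visible in a plain CSR sparse matrix-vector product. The sketch below uses the conventional CSR names (row_ptr, col_idx, vals), which are an assumption of standard practice rather than the paper's interface; the data-dependent gather x[col_idx[i]] is exactly the narrow, irregular stream that the proposed near-memory coalescer packs onto wide memory buses.

```python
# Minimal CSR SpMV, highlighting the streaming indirect access that
# degrades DRAM bandwidth efficiency on wide interfaces like HBM/LPDDR.

def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A @ x with A stored in CSR form."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for i in range(row_ptr[r], row_ptr[r + 1]):
            # Indirect, data-dependent load: each x[col_idx[i]] touches an
            # unpredictable address, so naive hardware fetches a full wide
            # burst per element unless accesses are coalesced near memory.
            acc += vals[i] * x[col_idx[i]]
        y.append(acc)
    return y
```

For example, the 2x3 matrix [[2, 0, 1], [0, 3, 0]] is stored as row_ptr = [0, 2, 3], col_idx = [0, 2, 1], vals = [2, 1, 3].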
Ara2: Exploring Single- and Multi-Core Vector Processing with an Efficient RVV1.0 Compliant Open-Source Processor
Vector processing is highly effective in boosting processor performance and
efficiency for data-parallel workloads. In this paper, we present Ara2, the
first fully open-source vector processor to support the RISC-V V 1.0 frozen
ISA. We evaluate Ara2's performance on a diverse set of data-parallel kernels
for various problem sizes and vector-unit configurations, achieving an average
functional-unit utilization of 95% on the most computationally intensive
kernels. We pinpoint performance boosters and bottlenecks, including the scalar
core, memories, and vector architecture, providing insights into the main
vector architecture's performance drivers. Leveraging the openness of the
design, we implement Ara2 in a 22nm technology, characterize its PPA metrics on
various configurations (2-16 lanes), and analyze its microarchitecture and
implementation bottlenecks. Ara2 achieves a state-of-the-art energy efficiency
of 37.8 DP-GFLOPS/W (0.8V) and 1.35GHz of clock frequency (critical path: ~40
FO4 gates). Finally, we explore the performance and energy-efficiency
trade-offs of multi-core vector processors: we find that multiple vector cores
help overcome the scalar core issue-rate bound that limits short-vector
performance. For example, a cluster of eight 2-lane Ara2 (16 FPUs) achieves
more than 3x better performance than a 16-lane single-core Ara2 (16 FPUs) when
executing a 32x32x32 matrix multiplication, with 1.5x improved energy
efficiency
Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters
While parallel architectures based on clusters of Processing Elements (PEs)
sharing L1 memory are widespread, there is no consensus on how lean their PE
should be. Architecting PEs as vector processors holds the promise to greatly
reduce their instruction fetch bandwidth, mitigating the Von Neumann Bottleneck
(VNB). However, due to their historical association with supercomputers,
classical vector machines include micro-architectural tricks to improve the
Instruction Level Parallelism (ILP), which increases their instruction fetch
and decode energy overhead. In this paper, we explore for the first time vector
processing as an option to build small and efficient PEs for large-scale
shared-L1 clusters. We propose Spatz, a compact, modular 32-bit vector
processing unit based on the integer embedded subset of the RISC-V Vector
Extension version 1.0. A Spatz-based cluster with four Multiply-Accumulate
Units (MACUs) needs only 7.9 pJ per 32-bit integer multiply-accumulate
operation, 40% less energy than an equivalent cluster built with four Snitch
scalar cores. We analyzed Spatz's performance by integrating it into MemPool,
a large-scale many-core shared-L1 cluster. The Spatz-based MemPool system
achieves up to 285 GOPS when running a 256x256 32-bit integer matrix
multiplication, 70% more than the equivalent Snitch-based MemPool system. In
terms of energy efficiency, the Spatz-based MemPool system achieves up to 266
GOPS/W when running the same kernel, more than twice the energy efficiency of
the Snitch-based MemPool system, which reaches 128 GOPS/W. Those results show
the viability of lean vector processors as high-performance and
energy-efficient PEs for large-scale clusters with tightly-coupled L1 memory. Comment: 9 pages. Accepted for publication in the 2022 International Conference on Computer-Aided Design (ICCAD 2022)
Convolutional neural networks for vision neuroscience: significance, developments, and outstanding issues
Darkside: A Heterogeneous RISC-V Compute Cluster for Extreme-Edge On-Chip DNN Inference and Training
On-chip DNN inference and training at the Extreme-Edge (TinyML) impose strict latency, throughput, accuracy and flexibility requirements. Heterogeneous clusters are promising solutions to meet the challenge, combining the flexibility of DSP-enhanced cores with the performance and energy boost of dedicated accelerators. We present Darkside, a System-on-Chip with a heterogeneous cluster of 8 RISC-V cores enhanced with 2-b to 32-b mixed-precision integer arithmetic. To boost performance and efficiency on key compute-intensive Deep Neural Network (DNN) kernels, the cluster is enriched with three digital accelerators: a specialized engine for low-data-reuse depthwise convolution kernels (up to 30 MAC/cycle); a minimal overhead datamover to marshal 1-b to 32-b data on-the-fly; a 16-b floating point Tensor Product Engine (TPE) for tiled matrix-multiplication acceleration. Darkside is implemented in 65nm CMOS technology. The cluster achieves a peak integer performance of 65 GOPS and a peak efficiency of 835 GOPS/W when working on 2-b integer DNN kernels. When targeting floating-point tensor operations, the TPE provides up to 18.2 GFLOPS of performance or 300 GFLOPS/W of efficiency – enough to enable on-chip floating-point training at competitive speed coupled with ultra-low power quantized inference
Effects of mitotane on the hypothalamic-pituitary-adrenal axis in patients with adrenocortical carcinoma
Objective: Mitotane, a drug used to treat adrenocortical cancer (ACC), inhibits multiple enzymatic steps of adrenocortical steroid biosynthesis, potentially causing adrenal insufficiency. Recent in vitro studies have also documented a direct inhibitory effect of mitotane at the pituitary level. The present study aimed to assess the hypothalamic-pituitary-adrenal axis in patients with ACC receiving mitotane. Design and methods: We prospectively enrolled 16 patients on adjuvant treatment with mitotane after radical surgical resection of ACC, who underwent standard hormone evaluation and h-CRH stimulation. A group of 10 patients with primary adrenal insufficiency (PAI) served as controls for the CRH test. Results: We demonstrated a close correlation between cortisol-binding globulin (CBG) and plasma mitotane levels, and a non-significant trend between mitotane dose and either serum or salivary cortisol in ACC patients. We did not find any correlation between the dose of cortisone acetate and either ACTH or cortisol levels. ACTH levels were significantly higher in patients with PAI than in patients with ACC, both in baseline conditions (88.99 (11.04-275.00) vs 24.53 (6.16-121.88) pmol/L, P = 0.031) and following CRH (158.40 (34.32-275.00) vs 67.43 (8.8-179.52) pmol/L, P = 0.016). Conclusions: The observation of lower ACTH levels in patients with ACC than in patients with PAI, both in basal conditions and after CRH stimulation, suggests that mitotane may exert an inhibitory effect on ACTH secretion at the pituitary level. Overall, the present study shows that mitotane affects the HPA axis at multiple levels and that no single biomarker can be used for the assessment of adrenal insufficiency.