84 research outputs found

    Multianalytical provenance analysis of Eastern Ross Sea LGM till sediments (Antarctica): Petrography, geochronology, and thermochronology detrital data

    To reveal the provenance of detrital sediments supplied by the West Antarctic Ice Sheet (WAIS), 19 glaciomarine cores of Last Glacial Maximum age from the Eastern Ross Sea and Sulzberger Bay were analyzed. Analytical techniques included petrographic analysis of gravel-sized clasts, and geochronology (zircon U-Pb: Zrn-UPb) and thermochronology (apatite fission track: AFT) of sand-sized fractions. Petrographic analysis revealed a similarity with the lithologies presently exposed in western Marie Byrd Land (MBL), with major roles played by low-grade metamorphic rocks and granitoids. Furthermore, Zrn-UPb and AFT data made it possible to identify the ages of formation and cooling of the sedimentary source area, which consists of Cambrian-Precambrian basement (i.e., Swanson Formation in western MBL) that underwent at least two episodes of magma intrusion, migmatization and cooling during Devonian-Carboniferous and Cretaceous-Paleocene times. The scarcity of volcanic clasts in the region of the Ross Sea along the front of the West Antarctica Ice Streams, in association with the occurrence of Oligocene-Pliocene AFT dates, suggests a localized tectonic exhumation of portions of MBL, as already documented for the opposite side of the West Antarctic Rift System in the Transantarctic Mountains. Furthermore, a Zrn-UPb and AFT population of Late Triassic-Jurassic age indicates the presence of unexposed rocks that formed or were metamorphosed at that time in the sedimentary source area, which could be located in the MacAyeal Ice Stream and Bindschadler Ice Stream catchment areas.

    Stella Nera: Achieving 161 TOp/s/W with Multiplier-free DNN Acceleration based on Approximate Matrix Multiplication

    From classical HPC to deep learning, MatMul is at the heart of today's computing. The recent Maddness method approximates MatMul without the need for multiplication by using a hash-based version of product quantization (PQ) indexing into a look-up table (LUT). Stella Nera is the first Maddness accelerator and it achieves 15x higher area efficiency (GMAC/s/mm^2) and more than 25x higher energy efficiency (TMAC/s/W) than direct MatMul accelerators implemented in the same technology. The hash function is a decision tree, which allows for an efficient hardware implementation as the multiply-accumulate operations are replaced by decision tree passes and LUT lookups. The entire Maddness MatMul can be broken down into parts that allow an effective implementation with small computing units and memories, allowing it to reach extreme efficiency while remaining generically applicable for MatMul tasks. In a commercial 14nm technology and scaled to 3nm, we achieve an energy efficiency of 161 TOp/s/W with a Top-1 accuracy on CIFAR-10 of more than 92.5% using ResNet9. Comment: 6 pages, 7 figures, preprint under review
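
The LUT-based idea described in this abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the Maddness or Stella Nera implementation: the tree depth, the split dimensions, the thresholds, the hand-picked prototypes and all function names (`tree_hash`, `build_luts`, `approx_matmul`) are hypothetical. In the real method these parameters are learned from data.

```python
# Illustrative sketch only: tree shape, prototypes and all names are assumptions.

def tree_hash(sub, split_dims, thresholds):
    """Walk a balanced binary decision tree and return the leaf index.

    Each level compares one coordinate of the sub-vector with the threshold
    of the current node, so encoding needs comparisons but no multiplications.
    """
    leaf = 0
    for level, dim in enumerate(split_dims):
        go_right = sub[dim] > thresholds[level][leaf]
        leaf = 2 * leaf + int(go_right)
    return leaf


def build_luts(prototypes, B, sub_len):
    """Offline step: precompute prototype-times-B products for every codebook."""
    n_cols = len(B[0])
    luts = []
    for c, protos in enumerate(prototypes):
        rows = B[c * sub_len:(c + 1) * sub_len]  # rows of B seen by codebook c
        luts.append([
            [sum(p[j] * rows[j][m] for j in range(sub_len)) for m in range(n_cols)]
            for p in protos
        ])
    return luts


def approx_matmul(A, luts, split_dims, thresholds, sub_len):
    """Online step: hash each sub-vector, then look up and accumulate."""
    n_cols = len(luts[0][0])
    out = []
    for a in A:
        acc = [0] * n_cols
        for c, lut in enumerate(luts):
            leaf = tree_hash(a[c * sub_len:(c + 1) * sub_len], split_dims, thresholds)
            for m in range(n_cols):
                acc[m] += lut[leaf][m]  # additions only, no multiplications
        out.append(acc)
    return out


# Toy setup (assumed values): 2 codebooks, depth-2 trees, 4 prototypes each.
PROTOS = [[-1, -1], [-1, 1], [1, -1], [1, 1]]
prototypes = [PROTOS, PROTOS]
split_dims = [0, 1]                 # coordinate tested at each tree level
thresholds = [[0.0], [0.0, 0.0]]    # one threshold per node and level
B = [[1, 2], [3, 4], [5, 6], [7, 8]]
luts = build_luts(prototypes, B, sub_len=2)

A = [[1, 1, 1, 1], [0.5, 2, -3, 0.1]]
print(approx_matmul(A, luts, split_dims, thresholds, sub_len=2))
```

The first row of `A` coincides with prototypes, so its result is exact. The second row is snapped to the nearest sign pattern, which shows why accuracy depends on how well the learned prototypes cover the input distribution.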

    Near-Memory Parallel Indexing and Coalescing: Enabling Highly Efficient Indirect Access for SpMV

    Sparse matrix vector multiplication (SpMV) is central to numerous data-intensive applications, but requires streaming indirect memory accesses that severely degrade both processing and memory throughput in state-of-the-art architectures. Near-memory hardware units, decoupling indirect streams from processing elements, partially alleviate the bottleneck, but rely on low DRAM access granularity, which is highly inefficient for modern DRAM standards like HBM and LPDDR. To fully address the end-to-end challenge, we propose a low-overhead data coalescer combined with a near-memory indirect streaming unit for AXI-Pack, an extension to the widespread AXI4 protocol packing narrow irregular stream elements onto wide memory buses. Our combined solution leverages the memory-level parallelism and coalescence of streaming indirect accesses in irregular applications like SpMV to maximize the performance and bandwidth efficiency attained on wide memory interfaces. Our solution delivers an average speedup of 8x in effective indirect access, often reaching the full memory bandwidth. As a result, we achieve an average end-to-end speedup on SpMV of 3x. Moreover, our approach demonstrates remarkable on-chip efficiency, requiring merely 27kB of on-chip storage and a very compact implementation area of 0.2-0.3mm^2 in a 12nm node. Comment: 6 pages, 6 figures. Submitted to DATE 202
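
To make the indirect-access pattern concrete, here is a minimal Python sketch of SpMV over a matrix in compressed sparse row (CSR) form, followed by a toy software model of coalescing. The CSR layout is standard; the function `count_wide_accesses`, its windowing scheme and the block size are assumptions chosen for illustration and do not describe the paper's hardware.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a CSR matrix.

    The read x[col_idx[k]] is the indirect access: its address is only
    known after col_idx[k] has itself been loaded from memory.
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


def count_wide_accesses(indices, elems_per_block, window):
    """Toy coalescing model (an assumption, not the paper's design).

    Indices are examined in groups of `window`; within a group, all indices
    that fall into the same wide memory block share a single access.
    """
    total = 0
    for start in range(0, len(indices), window):
        group = indices[start:start + window]
        total += len({i // elems_per_block for i in group})
    return total


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]
print(spmv_csr(row_ptr, col_idx, values, x))

# Eight narrow reads versus the number of wide blocks touched after coalescing.
stream = [0, 1, 2, 3, 8, 9, 16, 1]
print(len(stream), count_wide_accesses(stream, elems_per_block=8, window=4))
```

In this model a larger window exposes more reuse within each wide block, at the cost of more buffering, which is the kind of trade-off a hardware coalescer has to balance.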

    Ara2: Exploring Single- and Multi-Core Vector Processing with an Efficient RVV1.0 Compliant Open-Source Processor

    Vector processing is highly effective in boosting processor performance and efficiency for data-parallel workloads. In this paper, we present Ara2, the first fully open-source vector processor to support the RISC-V V 1.0 frozen ISA. We evaluate Ara2's performance on a diverse set of data-parallel kernels for various problem sizes and vector-unit configurations, achieving an average functional-unit utilization of 95% on the most computationally intensive kernels. We pinpoint performance boosters and bottlenecks, including the scalar core, memories, and vector architecture, providing insights into the main vector architecture's performance drivers. Leveraging the openness of the design, we implement Ara2 in a 22nm technology, characterize its PPA metrics on various configurations (2-16 lanes), and analyze its microarchitecture and implementation bottlenecks. Ara2 achieves a state-of-the-art energy efficiency of 37.8 DP-GFLOPS/W (0.8V) and 1.35GHz of clock frequency (critical path: ~40 FO4 gates). Finally, we explore the performance and energy-efficiency trade-offs of multi-core vector processors: we find that multiple vector cores help overcome the scalar core issue-rate bound that limits short-vector performance. For example, a cluster of eight 2-lane Ara2 (16 FPUs) achieves more than 3x better performance than a 16-lane single-core Ara2 (16 FPUs) when executing a 32x32x32 matrix multiplication, with 1.5x improved energy efficiency.

    Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters

    While parallel architectures based on clusters of Processing Elements (PEs) sharing L1 memory are widespread, there is no consensus on how lean their PE should be. Architecting PEs as vector processors holds the promise to greatly reduce their instruction fetch bandwidth, mitigating the Von Neumann Bottleneck (VNB). However, due to their historical association with supercomputers, classical vector machines include micro-architectural tricks to improve the Instruction Level Parallelism (ILP), which increases their instruction fetch and decode energy overhead. In this paper, we explore for the first time vector processing as an option to build small and efficient PEs for large-scale shared-L1 clusters. We propose Spatz, a compact, modular 32-bit vector processing unit based on the integer embedded subset of the RISC-V Vector Extension version 1.0. A Spatz-based cluster with four Multiply-Accumulate Units (MACUs) needs only 7.9 pJ per 32-bit integer multiply-accumulate operation, 40% less energy than an equivalent cluster built with four Snitch scalar cores. We analyzed Spatz' performance by integrating it within MemPool, a large-scale many-core shared-L1 cluster. The Spatz-based MemPool system achieves up to 285 GOPS when running a 256x256 32-bit integer matrix multiplication, 70% more than the equivalent Snitch-based MemPool system. In terms of energy efficiency, the Spatz-based MemPool system achieves up to 266 GOPS/W when running the same kernel, more than twice the energy efficiency of the Snitch-based MemPool system, which reaches 128 GOPS/W. These results show the viability of lean vector processors as high-performance and energy-efficient PEs for large-scale clusters with tightly-coupled L1 memory. Comment: 9 pages. Accepted for publication in the 2022 International Conference on Computer-Aided Design (ICCAD 2022)

    Darkside: A Heterogeneous RISC-V Compute Cluster for Extreme-Edge On-Chip DNN Inference and Training

    On-chip DNN inference and training at the Extreme-Edge (TinyML) impose strict latency, throughput, accuracy and flexibility requirements. Heterogeneous clusters are promising solutions to meet the challenge, combining the flexibility of DSP-enhanced cores with the performance and energy boost of dedicated accelerators. We present Darkside, a System-on-Chip with a heterogeneous cluster of 8 RISC-V cores enhanced with 2-b to 32-b mixed-precision integer arithmetic. To boost performance and efficiency on key compute-intensive Deep Neural Network (DNN) kernels, the cluster is enriched with three digital accelerators: a specialized engine for low-data-reuse depthwise convolution kernels (up to 30 MAC/cycle); a minimal-overhead datamover to marshal 1-b to 32-b data on-the-fly; a 16-b floating point Tensor Product Engine (TPE) for tiled matrix-multiplication acceleration. Darkside is implemented in 65nm CMOS technology. The cluster achieves a peak integer performance of 65 GOPS and a peak efficiency of 835 GOPS/W when working on 2-b integer DNN kernels. When targeting floating-point tensor operations, the TPE provides up to 18.2 GFLOPS of performance or 300 GFLOPS/W of efficiency – enough to enable on-chip floating-point training at competitive speed coupled with ultra-low power quantized inference.

    Effects of mitotane on the hypothalamic-pituitary-adrenal axis in patients with adrenocortical carcinoma

    Objective: Mitotane, a drug used to treat adrenocortical cancer (ACC), inhibits multiple enzymatic steps of adrenocortical steroid biosynthesis, potentially causing adrenal insufficiency. Recent studies in vitro have also documented a direct inhibitory effect of mitotane at the pituitary level. The present study aimed to assess the hypothalamic-pituitary-adrenal axis in patients with ACC receiving mitotane. Design and methods: We prospectively enrolled 16 patients on adjuvant treatment with mitotane after radical surgical resection of ACC, who underwent standard hormone evaluation and h-CRH stimulation. A group of 10 patients with primary adrenal insufficiency (PAI) served as controls for the CRH test. Results: We demonstrated a close correlation between cortisol-binding globulin (CBG) and plasma mitotane levels, and a non-significant trend between mitotane dose and either serum or salivary cortisol in ACC patients. We did not find any correlation between the dose of cortisone acetate and either ACTH or cortisol levels. ACTH levels were significantly higher in patients with PAI than in patients with ACC, both in baseline conditions (88.99 (11.04-275.00) vs 24.53 (6.16-121.88) pmol/L, P = 0.031) and following CRH (158.40 (34.32-275.00) vs 67.43 (8.8-179.52) pmol/L, P = 0.016). Conclusions: The observation of lower ACTH levels in patients with ACC than in patients with PAI, both in basal conditions and after CRH stimulation, suggests that mitotane may exert an inhibitory effect on ACTH secretion at the pituitary level. In conclusion, the present study shows that mitotane affects the HPA axis at multiple levels and that no single biomarker can be used for the assessment of adrenal insufficiency.