Genomic co-processor for long read assembly
Genomics data is transforming medicine and our understanding of life in fundamental ways; however, its growth is far outpacing Moore's Law. Third-generation sequencing technologies produce reads 100X longer than second-generation technologies and reveal a much broader mutation spectrum of disease and evolution. However, these technologies incur prohibitively high computational costs. To unlock the vast potential of exponentially growing genomics data, domain-specific acceleration provides one of the few remaining ways to continue scaling compute performance and efficiency, since general-purpose architectures struggle to handle the huge amount of data involved in genome alignment. The aim of this project is to implement a genomic co-processor targeting HPC FPGAs, starting from the Darwin FPGA co-processor. In this scenario, the final objective is the simulation and implementation of the algorithms described by Darwin on Alveo boards, exploiting High Bandwidth Memory (HBM) to increase performance.
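The dynamic-programming alignment kernel that such a co-processor parallelizes can be sketched as a minimal Smith-Waterman recurrence; the scoring parameters below are illustrative assumptions, not Darwin's, and a real FPGA design would evaluate the anti-diagonals of this matrix in a systolic array rather than in nested Python loops.

```python
# Minimal local-alignment (Smith-Waterman) score, the DP kernel that
# genome-alignment accelerators parallelize. Scores are assumed values.

def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local-alignment score between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # DP matrix, clamped at 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```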
Verification of GossipSub in ACL2s
GossipSub is a popular new peer-to-peer network protocol designed to
disseminate messages quickly and efficiently by allowing peers to forward the
full content of messages only to a dynamically selected subset of their
neighboring peers (mesh neighbors) while gossiping about messages they have
seen with the rest. Peers decide which of their neighbors to graft or prune
from their mesh locally and periodically using a score for each neighbor.
Scores are calculated using a score function that depends on mesh-specific
parameters, weights and counters relating to a peer's performance in the
network. Since a GossipSub network's performance ultimately depends on the
performance of its peers, an important question arises: Is the score
calculation mechanism effective in weeding out non-performing or even
intentionally misbehaving peers from meshes? We answered this question in the
negative in our companion paper by reasoning about GossipSub using our formal,
official and executable ACL2s model. Based on our findings, we synthesized and
simulated attacks against GossipSub which were confirmed by the developers of
GossipSub, FileCoin, and Eth2.0, and publicly disclosed in MITRE
CVE-2022-47547. In this paper, we present a detailed description of our model.
We discuss design decisions, security properties of GossipSub, reasoning about
those properties in the context of our model, attack generation, and lessons
we learned while writing it.
(In Proceedings ACL2-2023, arXiv:2311.0837)
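The scoring-and-pruning mechanism the abstract questions can be sketched as a weighted sum of per-peer counters, in the spirit of GossipSub v1.1's peer-scoring design; the counter names, weights, and threshold below are simplified assumptions for illustration, not the paper's ACL2s model.

```python
# Illustrative sketch of GossipSub-style neighbor scoring and mesh pruning.
# Counter names and weights are assumptions, not the protocol's exact P1-P6
# parameters or the ACL2s model.

def score(counters, weights):
    """Weighted sum of a peer's performance counters."""
    return sum(weights[k] * counters.get(k, 0.0) for k in weights)

def prune_mesh(mesh, weights, threshold=0.0):
    """Keep only mesh neighbors whose score meets the threshold."""
    return {peer: c for peer, c in mesh.items()
            if score(c, weights) >= threshold}

weights = {
    "time_in_mesh": 0.1,        # positive weight (assumed)
    "first_deliveries": 1.0,    # rewards useful forwarding (assumed)
    "invalid_messages": -10.0,  # penalizes misbehavior (assumed)
}
mesh = {
    "peer_a": {"time_in_mesh": 50, "first_deliveries": 8, "invalid_messages": 0},
    "peer_b": {"time_in_mesh": 50, "first_deliveries": 0, "invalid_messages": 2},
}
kept = prune_mesh(mesh, weights)  # peer_b's penalties drive its score negative
```

The paper's negative answer amounts to showing that an adversary can keep counters like these looking healthy while still degrading the network.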
Inclusive-PIM: Hardware-Software Co-design for Broad Acceleration on Commercial PIM Architectures
Continual demand for memory bandwidth has made it worthwhile for memory
vendors to reassess processing in memory (PIM), which enables higher bandwidth
by placing compute units in/near-memory. As such, memory vendors have recently
proposed commercially viable PIM designs. However, these proposals are largely
driven by the needs of (a narrow set of) machine learning (ML) primitives.
While such proposals are reasonable given the growing importance of ML, memory
is a pervasive component, and so there is a case for a more inclusive PIM
design that can accelerate primitives across domains.
In this work, we ascertain the capabilities of commercial PIM proposals to
accelerate various primitives across domains. We begin by outlining a
set of characteristics, termed PIM-amenability-test, which aid in assessing if
a given primitive is likely to be accelerated by PIM. Next, we apply this test
to primitives under study to ascertain efficient data-placement and
orchestration to map the primitives to underlying PIM architecture. We observe
here that, even though primitives under study are largely PIM-amenable,
existing commercial PIM proposals do not realize their performance potential
for these primitives. To address this, we identify bottlenecks that arise in
PIM execution and propose hardware and software optimizations which stand to
broaden the acceleration reach of commercial PIM designs (improving average PIM
speedups from 1.12x to 2.49x relative to a GPU baseline). Overall, while we
believe emerging commercial PIM proposals add a necessary and complementary
design point in the application acceleration space, hardware-software co-design
is necessary to deliver their benefits broadly.
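A test of this kind typically asks whether a primitive is memory-bandwidth bound and maps onto PIM's bank-local compute; the sketch below is a hedged illustration of that idea, with thresholds and criteria that are assumptions rather than the paper's exact PIM-amenability-test.

```python
# Hedged sketch of a PIM-amenability check. The arithmetic-intensity
# threshold and the inter-bank-communication criterion are illustrative
# assumptions, not the authors' definition.

def pim_amenable(flops, bytes_moved, intensity_cutoff=10.0,
                 needs_inter_bank_comm=False):
    """A primitive is likely PIM-amenable when it is bandwidth bound
    (low FLOPs per byte of memory traffic) and needs little
    communication across PIM banks."""
    intensity = flops / bytes_moved
    return intensity < intensity_cutoff and not needs_inter_bank_comm

# Streaming vector add: ~1 FLOP per 12 bytes -> bandwidth bound.
vector_add_ok = pim_amenable(flops=1, bytes_moved=12)
# Dense GEMM tile with heavy reuse: compute bound, better on the GPU.
gemm_ok = pim_amenable(flops=512, bytes_moved=12)
```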
Signal and Power Integrity Challenges for High Density System-on-Package
As the increasing desire for more compact, portable devices outpaces Moore’s law, innovation in packaging and system design has played a significant role in the continued miniaturization of electronic systems. Integrating more active and passive components into the package itself, as is the case for system-on-package (SoP), has shown very promising results in overall size reduction and increased performance of electronic systems. With this ability to shrink electrical systems come the many challenges of sustaining, let alone improving, reliability and performance. The fundamental signal, power, and thermal integrity issues are discussed in detail, along with published techniques from around the industry to mitigate these issues in SoP applications.
Portability and Scalability of OpenMP Offloading on State-of-the-art Accelerators
Over the last decade, most of the increase in computing power has been gained
by advances in accelerated many-core architectures, mainly in the form of
GPGPUs. While accelerators achieve phenomenal performances in various computing
tasks, their utilization requires code adaptations and transformations. Thus,
OpenMP, the most common standard for multi-threading in scientific computing
applications, introduced offloading capabilities between host (CPUs) and
accelerators since v4.0, with increasing support in the successive v4.5, v5.0,
v5.1, and the latest v5.2 versions. Recently, two state-of-the-art GPUs - the
Intel Ponte Vecchio Max 1100 and the NVIDIA A100 GPUs - were released to the
market, with the oneAPI and GNU LLVM-backed compilation for offloading,
correspondingly. In this work, we present early performance results of OpenMP
offloading capabilities to these devices while specifically analyzing the
portability of advanced directives (using SOLLVE's OMPVV test suite) and the
scalability of the hardware on a representative scientific mini-app (the LULESH
benchmark). Our results show that the vast majority of the offloading
directives in v4.5 and 5.0 are supported in the latest oneAPI and GNU
compilers; however, the support in v5.1 and v5.2 is still lacking. From the
performance perspective, we found that PVC is up to 37% better than the A100 on
the LULESH benchmark, showing better performance in both computation and data
movement.
Novel Rail Clamp Architectures and Their Systematic Design
Rail clamp circuits are widely used for electrostatic discharge (ESD) protection in semiconductor products today. A step-by-step design procedure for the traditional RC- and single-inverter-based rail clamp circuit, and the design, simulation, implementation, and operation of two novel rail clamp circuits, are described for use in the ESD protection of complementary metal-oxide-semiconductor (CMOS) circuits. The step-by-step design procedure for the traditional circuit is technology-node independent, can be fully automated, and aims to achieve a minimal-area design that meets specified leakage and ESD specifications under all valid process, voltage, and temperature (PVT) conditions. The first novel rail clamp circuit employs a comparator inside the traditional circuit to reduce the value of the time constant needed. The second circuit uses a dynamic time constant approach in which the value of the time constant is dynamically adjusted after the clamp is triggered. Important metrics for the two new circuits, such as ESD performance, latch-on immunity, clamp recovery time, supply noise immunity, fastest power-on time supported, and area, are evaluated over an industry-standard PVT space using SPICE simulations and measurements on a fabricated 40 nm test chip. (Doctoral Dissertation, Electrical Engineering)
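The design tension the traditional circuit faces can be sketched as simple arithmetic: the RC time constant must hold the clamp FET on for the whole ESD event yet stay well below the normal supply ramp so the clamp never fires in mission mode. All component values and timing windows below are illustrative assumptions, not the thesis's design numbers.

```python
# Back-of-the-envelope timing window for a traditional RC rail clamp.
# Component values, margins, and event durations are assumed for
# illustration only.

def rc_time_constant(r_ohms, c_farads):
    """Tau of the RC trigger network."""
    return r_ohms * c_farads

def clamp_timing_ok(tau, esd_duration, supply_ramp):
    """Crude design window: tau comfortably above the ESD event,
    comfortably below the power-on ramp (margins are assumptions)."""
    return tau >= 2 * esd_duration and tau <= supply_ramp / 10

tau = rc_time_constant(200e3, 5e-12)  # 200 kOhm, 5 pF -> ~1 us (assumed)
ok = clamp_timing_ok(tau, esd_duration=0.4e-6, supply_ramp=1e-3)
```

The first novel circuit's comparator relaxes the lower bound of this window, which is what allows a smaller (and cheaper-in-area) time constant.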
High-Density Solid-State Memory Devices and Technologies
This Special Issue aims to examine high-density solid-state memory devices and technologies from various standpoints in an attempt to foster their continuous success in the future. Considering that broadening of the range of applications will likely offer different types of solid-state memories their chance in the spotlight, the Special Issue is not focused on a specific storage solution but rather embraces all the most relevant solid-state memory devices and technologies currently on stage. The subjects dealt with in this Special Issue are likewise widespread, ranging from process and design issues/innovations to the experimental and theoretical analysis of device operation, and from the performance and reliability of memory devices and arrays to the exploitation of solid-state memories in pursuit of new computing paradigms.
Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models
Deep learning recommendation models (DLRMs) are used across many
business-critical services at Facebook and are the single largest AI
application in terms of infrastructure demand in its data-centers. In this
paper we discuss the SW/HW co-designed solution for high-performance
distributed training of large-scale DLRMs. We introduce a high-performance
scalable software stack based on PyTorch and pair it with the new evolution of
Zion platform, namely ZionEX. We demonstrate the capability to train very large
DLRMs with up to 12 trillion parameters and show that we can attain a 40X
speedup in time to solution over previous systems. We achieve this by (i)
designing the ZionEX platform with a dedicated scale-out network, provisioned
with high bandwidth, optimal topology, and efficient transport; (ii)
implementing an optimized PyTorch-based training stack supporting both model
and data parallelism; (iii) developing sharding algorithms capable of
hierarchically partitioning the embedding tables along row and column
dimensions and load balancing them across multiple workers; (iv) adding
high-performance core operators while retaining flexibility to support
optimizers with fully deterministic updates; and (v) leveraging
reduced-precision communications, a multi-level memory hierarchy
(HBM+DDR+SSD), and pipelining. Furthermore, we develop and briefly comment on
the distributed data ingestion and other supporting services required for
robust and efficient end-to-end training in production environments.
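One ingredient of point (iii), balancing embedding tables across workers, can be sketched as a greedy longest-processing-time assignment; the table names, costs, and cost model below are illustrative assumptions, not Facebook's sharding algorithm.

```python
# Sketch of greedy load balancing of embedding-table shards across workers
# (LPT heuristic). Table costs are assumed, not a real DLRM's.
import heapq

def balance_tables(table_costs, num_workers):
    """Assign each table, largest first, to the least-loaded worker."""
    heap = [(0.0, w) for w in range(num_workers)]  # (load, worker id)
    heapq.heapify(heap)
    placement = {}
    for name, cost in sorted(table_costs.items(), key=lambda kv: -kv[1]):
        load, w = heapq.heappop(heap)
        placement[name] = w
        heapq.heappush(heap, (load + cost, w))
    return placement

tables = {"user": 8.0, "item": 6.0, "ad": 5.0, "query": 3.0}
place = balance_tables(tables, num_workers=2)  # loads end up 11.0 / 11.0
```

A production system would replace the scalar cost with a model of lookup rate, row count, and communication volume, and would also consider splitting a single hot table along its row or column dimension as the abstract describes.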
PartIR: Composing SPMD Partitioning Strategies for Machine Learning
Training of modern large neural networks (NN) requires a combination of
parallelization strategies encompassing data, model, or optimizer sharding.
When strategies increase in complexity, it becomes necessary for partitioning
tools to be 1) expressive, allowing the composition of simpler strategies, and
2) predictable to estimate performance analytically. We present PartIR, our
design for a NN partitioning system. PartIR is focused on an incremental
approach to rewriting and is hardware-and-runtime agnostic. We present a simple
but powerful API for composing sharding strategies and a simulator to validate
them. The process is driven by high-level programmer-issued partitioning
tactics, which can be both manual and automatic. Importantly, the tactics are
specified separately from the model code, making them easy to change. We
evaluate PartIR on several different models to demonstrate its predictability,
expressibility, and ability to reach peak performance.
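The abstract does not show PartIR's API, so the following is a hypothetical sketch of its central idea only: partitioning tactics declared separately from model code and applied incrementally as rewrites. All names (`Shard`, `apply_tactics`, the axis labels) are assumptions, not PartIR's actual interface.

```python
# Hypothetical sketch of tactic-driven, incremental partitioning in the
# spirit of PartIR. Class and function names are invented for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class Shard:
    """One tactic: shard a named tensor axis across a device-mesh axis."""
    tensor_axis: str
    mesh_axis: str

def apply_tactics(partitioning, tactics):
    """Apply tactics one at a time, each rewriting the current plan,
    so simple strategies compose into a complex one."""
    for t in tactics:
        partitioning = {**partitioning, t.tensor_axis: t.mesh_axis}
    return partitioning

# Compose data parallelism with parameter (model) sharding, kept entirely
# separate from the model code itself:
tactics = [Shard("batch", "data"), Shard("hidden", "model")]
plan = apply_tactics({}, tactics)
```

Keeping the tactic list outside the model, as here, is what makes a strategy easy to change and, because each rewrite is explicit, amenable to analytic cost estimation by a simulator.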