Genomic co-processor for long read assembly
Genomics data is transforming medicine and our understanding of life in fundamental ways; however, its growth is far outpacing Moore's Law. Third-generation sequencing technologies produce reads 100X longer than second-generation technologies and reveal a much broader mutation spectrum of disease and evolution. However, these technologies incur prohibitively high computational costs. To unlock the vast potential of exponentially growing genomics data, domain-specific acceleration provides one of the few remaining ways to continue scaling compute performance and efficiency, since general-purpose architectures struggle to handle the huge amount of data involved in genome alignment. The aim of this project is to implement a genomic co-processor targeting HPC FPGAs, starting from the Darwin FPGA co-processor. In this scenario, the final objective is the simulation and implementation of the algorithms described by Darwin on Alveo boards, exploiting High Bandwidth Memory (HBM) to increase performance.
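The dynamic-programming alignment kernel that such a co-processor parallelizes can be sketched as a minimal Smith-Waterman recurrence; the scoring parameters below are illustrative assumptions, not Darwin's, and a real FPGA design would evaluate the anti-diagonals of this matrix in a systolic array rather than in nested Python loops.

```python
# Minimal local-alignment (Smith-Waterman) score, the DP kernel that
# genome-alignment accelerators parallelize. Scores are assumed values.

def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local-alignment score between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # DP matrix, clamped at 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```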
Verification of GossipSub in ACL2s
GossipSub is a popular new peer-to-peer network protocol designed to
disseminate messages quickly and efficiently by allowing peers to forward the
full content of messages only to a dynamically selected subset of their
neighboring peers (mesh neighbors) while gossiping about messages they have
seen with the rest. Peers decide which of their neighbors to graft or prune
from their mesh locally and periodically using a score for each neighbor.
Scores are calculated using a score function that depends on mesh-specific
parameters, weights and counters relating to a peer's performance in the
network. Since a GossipSub network's performance ultimately depends on the
performance of its peers, an important question arises: Is the score
calculation mechanism effective in weeding out non-performing or even
intentionally misbehaving peers from meshes? We answered this question in the
negative in our companion paper by reasoning about GossipSub using our formal,
official and executable ACL2s model. Based on our findings, we synthesized and
simulated attacks against GossipSub which were confirmed by the developers of
GossipSub, FileCoin, and Eth2.0, and publicly disclosed in MITRE
CVE-2022-47547. In this paper, we present a detailed description of our model.
We discuss design decisions, security properties of GossipSub, reasoning about
those properties in the context of our model, attack generation, and lessons
we learned while writing it.
(In Proceedings ACL2-2023, arXiv:2311.0837)
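The scoring-and-pruning mechanism the abstract questions can be sketched as a weighted sum of per-peer counters, in the spirit of GossipSub v1.1's peer-scoring design; the counter names, weights, and threshold below are simplified assumptions for illustration, not the paper's ACL2s model.

```python
# Illustrative sketch of GossipSub-style neighbor scoring and mesh pruning.
# Counter names and weights are assumptions, not the protocol's exact P1-P6
# parameters or the ACL2s model.

def score(counters, weights):
    """Weighted sum of a peer's performance counters."""
    return sum(weights[k] * counters.get(k, 0.0) for k in weights)

def prune_mesh(mesh, weights, threshold=0.0):
    """Keep only mesh neighbors whose score meets the threshold."""
    return {peer: c for peer, c in mesh.items()
            if score(c, weights) >= threshold}

weights = {
    "time_in_mesh": 0.1,        # positive weight (assumed)
    "first_deliveries": 1.0,    # rewards useful forwarding (assumed)
    "invalid_messages": -10.0,  # penalizes misbehavior (assumed)
}
mesh = {
    "peer_a": {"time_in_mesh": 50, "first_deliveries": 8, "invalid_messages": 0},
    "peer_b": {"time_in_mesh": 50, "first_deliveries": 0, "invalid_messages": 2},
}
kept = prune_mesh(mesh, weights)  # peer_b's penalties drive its score negative
```

The paper's negative answer amounts to showing that an adversary can keep counters like these looking healthy while still degrading the network.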
Inclusive-PIM: Hardware-Software Co-design for Broad Acceleration on Commercial PIM Architectures
Continual demand for memory bandwidth has made it worthwhile for memory
vendors to reassess processing in memory (PIM), which enables higher bandwidth
by placing compute units in/near-memory. As such, memory vendors have recently
proposed commercially viable PIM designs. However, these proposals are largely
driven by the needs of (a narrow set of) machine learning (ML) primitives.
While such proposals are reasonable given the growing importance of ML, memory
is a pervasive component, and so there is a case for a more inclusive PIM
design that can accelerate primitives across domains.
In this work, we ascertain the capabilities of commercial PIM proposals to
accelerate various primitives across domains. We begin by outlining a
set of characteristics, termed PIM-amenability-test, which aid in assessing if
a given primitive is likely to be accelerated by PIM. Next, we apply this test
to primitives under study to ascertain efficient data-placement and
orchestration to map the primitives to underlying PIM architecture. We observe
here that, even though primitives under study are largely PIM-amenable,
existing commercial PIM proposals do not realize their performance potential
for these primitives. To address this, we identify bottlenecks that arise in
PIM execution and propose hardware and software optimizations which stand to
broaden the acceleration reach of commercial PIM designs (improving average PIM
speedups from 1.12x to 2.49x relative to a GPU baseline). Overall, while we
believe emerging commercial PIM proposals add a necessary and complementary
design point in the application acceleration space, hardware-software co-design
is necessary to deliver their benefits broadly.
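A test of this kind typically asks whether a primitive is memory-bandwidth bound and maps onto PIM's bank-local compute; the sketch below is a hedged illustration of that idea, with thresholds and criteria that are assumptions rather than the paper's exact PIM-amenability-test.

```python
# Hedged sketch of a PIM-amenability check. The arithmetic-intensity
# threshold and the inter-bank-communication criterion are illustrative
# assumptions, not the authors' definition.

def pim_amenable(flops, bytes_moved, intensity_cutoff=10.0,
                 needs_inter_bank_comm=False):
    """A primitive is likely PIM-amenable when it is bandwidth bound
    (low FLOPs per byte of memory traffic) and needs little
    communication across PIM banks."""
    intensity = flops / bytes_moved
    return intensity < intensity_cutoff and not needs_inter_bank_comm

# Streaming vector add: ~1 FLOP per 12 bytes -> bandwidth bound.
vector_add_ok = pim_amenable(flops=1, bytes_moved=12)
# Dense GEMM tile with heavy reuse: compute bound, better on the GPU.
gemm_ok = pim_amenable(flops=512, bytes_moved=12)
```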
Signal and Power Integrity Challenges for High Density System-on-Package
As the increasing desire for more compact, portable devices outpaces Moore’s law, innovation in packaging and system design has played a significant role in the continued miniaturization of electronic systems. Integrating more active and passive components into the package itself, as is the case for system-on-package (SoP), has shown very promising results in overall size reduction and increased performance of electronic systems. With this ability to shrink electrical systems come the many challenges of sustaining, let alone improving, reliability and performance. The fundamental signal, power, and thermal integrity issues are discussed in detail, along with published techniques from around the industry to mitigate these issues in SoP applications.
Portability and Scalability of OpenMP Offloading on State-of-the-art Accelerators
Over the last decade, most of the increase in computing power has been gained
by advances in accelerated many-core architectures, mainly in the form of
GPGPUs. While accelerators achieve phenomenal performances in various computing
tasks, their utilization requires code adaptations and transformations. Thus,
OpenMP, the most common standard for multi-threading in scientific computing
applications, introduced offloading capabilities between host (CPUs) and
accelerators since v4.0, with increasing support in the successive v4.5, v5.0,
v5.1, and the latest v5.2 versions. Recently, two state-of-the-art GPUs - the
Intel Ponte Vecchio Max 1100 and the NVIDIA A100 GPUs - were released to the
market, with the oneAPI and GNU LLVM-backed compilation for offloading,
correspondingly. In this work, we present early performance results of OpenMP
offloading capabilities to these devices while specifically analyzing the
portability of advanced directives (using SOLLVE's OMPVV test suite) and the
scalability of the hardware on a representative scientific mini-app (the LULESH
benchmark). Our results show that the vast majority of the offloading
directives in v4.5 and 5.0 are supported in the latest oneAPI and GNU
compilers; however, the support in v5.1 and v5.2 is still lacking. From the
performance perspective, we found that PVC is up to 37% better than the A100 on
the LULESH benchmark, showing better performance in both computation and data
movement.
Novel Rail Clamp Architectures and Their Systematic Design
Rail clamp circuits are widely used for electrostatic discharge (ESD) protection in semiconductor products today. A step-by-step design procedure for the traditional RC- and single-inverter-based rail clamp circuit, and the design, simulation, implementation, and operation of two novel rail clamp circuits, are described for use in the ESD protection of complementary metal-oxide-semiconductor (CMOS) circuits. The step-by-step design procedure for the traditional circuit is technology-node independent, can be fully automated, and aims to achieve a minimal-area design that meets specified leakage and ESD specifications under all valid process, voltage, and temperature (PVT) conditions. The first novel rail clamp circuit employs a comparator inside the traditional circuit to reduce the value of the time constant needed. The second circuit uses a dynamic time constant approach in which the value of the time constant is dynamically adjusted after the clamp is triggered. Important metrics for the two new circuits, such as ESD performance, latch-on immunity, clamp recovery time, supply noise immunity, fastest power-on time supported, and area, are evaluated over an industry-standard PVT space using SPICE simulations and measurements on a fabricated 40 nm test chip. (Doctoral Dissertation, Electrical Engineering)
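The design tension the traditional circuit faces can be sketched as simple arithmetic: the RC time constant must hold the clamp FET on for the whole ESD event yet stay well below the normal supply ramp so the clamp never fires in mission mode. All component values and timing windows below are illustrative assumptions, not the thesis's design numbers.

```python
# Back-of-the-envelope timing window for a traditional RC rail clamp.
# Component values, margins, and event durations are assumed for
# illustration only.

def rc_time_constant(r_ohms, c_farads):
    """Tau of the RC trigger network."""
    return r_ohms * c_farads

def clamp_timing_ok(tau, esd_duration, supply_ramp):
    """Crude design window: tau comfortably above the ESD event,
    comfortably below the power-on ramp (margins are assumptions)."""
    return tau >= 2 * esd_duration and tau <= supply_ramp / 10

tau = rc_time_constant(200e3, 5e-12)  # 200 kOhm, 5 pF -> ~1 us (assumed)
ok = clamp_timing_ok(tau, esd_duration=0.4e-6, supply_ramp=1e-3)
```

The first novel circuit's comparator relaxes the lower bound of this window, which is what allows a smaller (and cheaper-in-area) time constant.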
High-Density Solid-State Memory Devices and Technologies
This Special Issue aims to examine high-density solid-state memory devices and technologies from various standpoints in an attempt to foster their continuous success in the future. Considering that broadening of the range of applications will likely offer different types of solid-state memories their chance in the spotlight, the Special Issue is not focused on a specific storage solution but rather embraces all the most relevant solid-state memory devices and technologies currently on stage. The subjects dealt with in this Special Issue are likewise widespread, ranging from process and design issues/innovations to the experimental and theoretical analysis of device operation, and from the performance and reliability of memory devices and arrays to the exploitation of solid-state memories in pursuit of new computing paradigms.
Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models
Deep learning recommendation models (DLRMs) are used across many
business-critical services at Facebook and are the single largest AI
application in terms of infrastructure demand in its data-centers. In this
paper we discuss the SW/HW co-designed solution for high-performance
distributed training of large-scale DLRMs. We introduce a high-performance
scalable software stack based on PyTorch and pair it with the new evolution of
Zion platform, namely ZionEX. We demonstrate the capability to train very large
DLRMs with up to 12 trillion parameters and show that we can attain a 40X
speedup in time to solution over previous systems. We achieve this by (i)
designing the ZionEX platform with a dedicated scale-out network, provisioned
with high bandwidth, optimal topology, and efficient transport; (ii)
implementing an optimized PyTorch-based training stack supporting both model
and data parallelism; (iii) developing sharding algorithms capable of
hierarchically partitioning the embedding tables along row and column
dimensions and load balancing them across multiple workers; (iv) adding
high-performance core operators while retaining flexibility to support
optimizers with fully deterministic updates; and (v) leveraging
reduced-precision communications, a multi-level memory hierarchy
(HBM+DDR+SSD), and pipelining. Furthermore, we develop and briefly comment on
the distributed data ingestion and other supporting services required for
robust and efficient end-to-end training in production environments.
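One ingredient of point (iii), balancing embedding tables across workers, can be sketched as a greedy longest-processing-time assignment; the table names, costs, and cost model below are illustrative assumptions, not Facebook's sharding algorithm.

```python
# Sketch of greedy load balancing of embedding-table shards across workers
# (LPT heuristic). Table costs are assumed, not a real DLRM's.
import heapq

def balance_tables(table_costs, num_workers):
    """Assign each table, largest first, to the least-loaded worker."""
    heap = [(0.0, w) for w in range(num_workers)]  # (load, worker id)
    heapq.heapify(heap)
    placement = {}
    for name, cost in sorted(table_costs.items(), key=lambda kv: -kv[1]):
        load, w = heapq.heappop(heap)
        placement[name] = w
        heapq.heappush(heap, (load + cost, w))
    return placement

tables = {"user": 8.0, "item": 6.0, "ad": 5.0, "query": 3.0}
place = balance_tables(tables, num_workers=2)  # loads end up 11.0 / 11.0
```

A production system would replace the scalar cost with a model of lookup rate, row count, and communication volume, and would also consider splitting a single hot table along its row or column dimension as the abstract describes.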
PartIR: Composing SPMD Partitioning Strategies for Machine Learning
Training of modern large neural networks (NN) requires a combination of
parallelization strategies encompassing data, model, or optimizer sharding.
When strategies increase in complexity, it becomes necessary for partitioning
tools to be 1) expressive, allowing the composition of simpler strategies, and
2) predictable to estimate performance analytically. We present PartIR, our
design for a NN partitioning system. PartIR is focused on an incremental
approach to rewriting and is hardware-and-runtime agnostic. We present a simple
but powerful API for composing sharding strategies and a simulator to validate
them. The process is driven by high-level programmer-issued partitioning
tactics, which can be both manual and automatic. Importantly, the tactics are
specified separately from the model code, making them easy to change. We
evaluate PartIR on several different models to demonstrate its predictability,
expressibility, and ability to reach peak performance.
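The abstract does not show PartIR's API, so the following is a hypothetical sketch of its central idea only: partitioning tactics declared separately from model code and applied incrementally as rewrites. All names (`Shard`, `apply_tactics`, the axis labels) are assumptions, not PartIR's actual interface.

```python
# Hypothetical sketch of tactic-driven, incremental partitioning in the
# spirit of PartIR. Class and function names are invented for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class Shard:
    """One tactic: shard a named tensor axis across a device-mesh axis."""
    tensor_axis: str
    mesh_axis: str

def apply_tactics(partitioning, tactics):
    """Apply tactics one at a time, each rewriting the current plan,
    so simple strategies compose into a complex one."""
    for t in tactics:
        partitioning = {**partitioning, t.tensor_axis: t.mesh_axis}
    return partitioning

# Compose data parallelism with parameter (model) sharding, kept entirely
# separate from the model code itself:
tactics = [Shard("batch", "data"), Shard("hidden", "model")]
plan = apply_tactics({}, tactics)
```

Keeping the tactic list outside the model, as here, is what makes a strategy easy to change and, because each rewrite is explicit, amenable to analytic cost estimation by a simulator.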