334 research outputs found
D-STEM: a Design led approach to STEM innovation
Advances in the Science, Technology, Engineering and Maths (STEM) disciplines offer opportunities for designers to propose and make products with advanced, enhanced and engineered properties and functionalities. In turn, these advanced characteristics are becoming increasingly necessary as resources become ever more strained by 21st-century demands, such as ageing populations, connected communities, depleting raw materials, waste management and energy supply. We need to make things that are smarter and that make our lives easier, better and simpler. The products of tomorrow need to do more with less. The issue is how to maximize the potential for exploiting opportunities offered by STEM developments, and how best to enable designers to strengthen their position within the innovation ecosystem. As a society, we need designers able to navigate emerging developments from the STEM community to a level that enables understanding and knowledge of new material properties, the skill set to facilitate absorption into the design 'toolbox', and the agility to identify, manage and contextualise innovation opportunities emerging from STEM developments. This paper proposes the blueprint for a new design-led approach to STEM innovation that begins to redefine studio culture for the 21st century.
Allo: A Programming Model for Composable Accelerator Design
Special-purpose hardware accelerators are increasingly pivotal for sustaining
performance improvements in emerging applications, especially as the benefits
of technology scaling continue to diminish. However, designers currently lack
effective tools and methodologies to construct complex, high-performance
accelerator architectures in a productive manner. Existing high-level synthesis
(HLS) tools often require intrusive source-level changes to attain satisfactory
quality of results. Despite the introduction of several new accelerator design
languages (ADLs) aiming to enhance or replace HLS, their advantages are more
evident in relatively simple applications with a single kernel. Existing ADLs
prove less effective for realistic hierarchical designs with multiple kernels,
even if the design hierarchy is flattened.
In this paper, we introduce Allo, a composable programming model for
efficient spatial accelerator design. Allo decouples hardware customizations,
including compute, memory, communication, and data type from algorithm
specification, and encapsulates them as a set of customization primitives. Allo
preserves the hierarchical structure of an input program by combining
customizations from different functions in a bottom-up, type-safe manner. This
approach facilitates holistic optimizations that span across function
boundaries. We conduct comprehensive experiments on commonly-used HLS
benchmarks and several realistic deep learning models. Our evaluation shows
that Allo can outperform state-of-the-art HLS tools and ADLs on all test cases
in the PolyBench suite. For the GPT2 model, the inference latency of the
Allo-generated accelerator is 1.7x lower than that of the NVIDIA A100 GPU, with
5.4x higher energy efficiency, demonstrating the capability of Allo to handle
large-scale designs.
Comment: Accepted to PLDI'2
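The decoupling described above can be pictured with a small sketch. This is a hypothetical illustration of the idea only, not Allo's actual API: the algorithm (a plain matrix multiply) is written once, and hardware customizations are recorded separately as composable primitives that can be combined bottom-up.

```python
# Hypothetical sketch of decoupled accelerator customization (illustrative
# only, not Allo's real API): the algorithm is specified once, and hardware
# customizations are recorded as composable primitives applied on top of it.

def gemm(A, B):
    """Plain algorithm specification: dense matrix multiply."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C

class Schedule:
    """Records customizations without touching the algorithm body."""
    def __init__(self, fn):
        self.fn = fn
        self.primitives = []
    def split(self, loop, factor):      # tile a loop for parallelism
        self.primitives.append(("split", loop, factor))
        return self
    def pipeline(self, loop):           # pipeline a loop's iterations
        self.primitives.append(("pipeline", loop))
        return self
    def compose(self, other):           # bottom-up composition across functions
        self.primitives.extend(other.primitives)
        return self

s = Schedule(gemm).split("j", 8).pipeline("p")
print(s.primitives)
```

The property mirrored here is the one the abstract emphasizes: `gemm` stays unchanged while customizations accumulate in the schedule, and sub-schedules for different functions could be composed without flattening the hierarchy.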
Decoupled Model Schedule for Deep Learning Training
Recent years have seen an increase in the development of large deep learning
(DL) models, which makes training efficiency crucial. Common practice
struggles with the trade-off between usability and performance. On one hand,
DL frameworks such as PyTorch use dynamic graphs to facilitate model developers
at a price of sub-optimal model training performance. On the other hand,
practitioners propose various approaches to improving the training efficiency
by sacrificing some of the flexibility, ranging from making the graph static
for more thorough optimization (e.g., XLA) to customizing optimization towards
large-scale distributed training (e.g., DeepSpeed and Megatron-LM).
In this paper, we aim to address the tension between usability and training
efficiency through separation of concerns. Inspired by DL compilers that
decouple the platform-specific optimizations of a tensor-level operator from
its arithmetic definition, this paper proposes a schedule language to decouple
model execution from definition. Specifically, the schedule works on a PyTorch
model and uses a set of schedule primitives to convert the model for common
model training optimizations such as high-performance kernels, effective 3D
parallelism, and efficient activation checkpointing. Compared to existing
optimization solutions, we optimize the model as needed through high-level
primitives, thus preserving programmability and debuggability for users to
a large extent. Our evaluation results show that by scheduling the existing
hand-crafted optimizations in a systematic way, we are able to improve training
throughput by up to 3.35x on a single machine with 8 NVIDIA V100 GPUs, and by
up to 1.32x on multiple machines with up to 64 GPUs, when compared to the
out-of-the-box performance of DeepSpeed and Megatron-LM.
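The separation of a model's definition from its execution schedule can be sketched as follows. The class and primitive names here are illustrative assumptions, not the paper's actual schedule language: a schedule attaches optimizations to named submodules of an existing model while the definition itself stays untouched.

```python
# Hedged sketch of a decoupled model schedule (hypothetical API): primitives
# like replace/checkpoint/shard are recorded against submodule paths, leaving
# the original model definition unmodified.

model = {
    "embed": "Embedding",
    "block0.attn": "Attention",
    "block0.mlp": "MLP",
}

class ModelSchedule:
    def __init__(self, model):
        self.model = dict(model)   # definition is read, never rewritten
        self.plan = {}
    def replace(self, path, kernel):
        # swap a submodule for a high-performance kernel implementation
        self.plan[path] = ("replace", kernel)
        return self
    def checkpoint(self, path):
        # recompute this submodule's activations in the backward pass
        self.plan[path] = ("checkpoint",)
        return self
    def shard(self, path, axis):
        # split a parameter across devices for distributed parallelism
        self.plan[path] = ("shard", axis)
        return self

sch = (ModelSchedule(model)
       .replace("block0.attn", "flash_attention")
       .checkpoint("block0.mlp"))
print(sch.plan)
```

Because optimizations live in the `plan` rather than the model, they can be applied as needed and removed for debugging, which is the usability property the abstract claims.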
Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference
Recent advancements in large language models (LLMs) boasting billions of
parameters have generated a significant demand for efficient deployment in
inference workloads. The majority of existing approaches rely on temporal
architectures that reuse hardware units for different network layers and
operators. However, these methods often encounter challenges in achieving low
latency due to considerable memory access overhead. This paper investigates the
feasibility and potential of model-specific spatial acceleration for LLM
inference on FPGAs. Our approach involves the specialization of distinct
hardware units for specific operators or layers, facilitating direct
communication between them through a dataflow architecture while minimizing
off-chip memory accesses. We introduce a comprehensive analytical model for
estimating the performance of a spatial LLM accelerator, taking into account
the on-chip compute and memory resources available on an FPGA. Through our
analysis, we can determine the scenarios in which FPGA-based spatial
acceleration can outperform its GPU-based counterpart. To enable more
productive implementations of an LLM model on FPGAs, we further provide a
library of high-level synthesis (HLS) kernels that are composable and reusable.
This library will be made available as open-source. To validate the
effectiveness of both our analytical model and HLS library, we have implemented
BERT and GPT2 on an AMD Alveo U280 FPGA device. Experimental results
demonstrate our approach can achieve up to 13.4x speedup when compared to
previous FPGA-based accelerators for the BERT model. For GPT generative
inference, we attain a 2.2x speedup compared to DFX, an FPGA overlay, in the
prefill stage, while achieving a 1.9x speedup and a 5.7x improvement in energy
efficiency compared to the NVIDIA A100 GPU in the decode stage.Comment: Accepted for publication in the FCCM'24 Journal Track and will appear
in ACM Transactions on Reconfigurable Technology and Systems (TRETS
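The kind of analytical model described can be sketched in roofline style: each operator's latency is bounded by whichever of compute throughput or memory bandwidth saturates first. The hardware numbers below are illustrative assumptions, not U280 or A100 specifications.

```python
# Minimal roofline-style sketch of a spatial-accelerator analytical model:
# a layer's latency is the max of its compute time and memory-transfer time.
# All hardware figures here are assumed round numbers for illustration.

def stage_latency(flops, bytes_moved, peak_flops, mem_bw):
    """Return (latency_seconds, bound) for one operator or layer."""
    compute_t = flops / peak_flops
    memory_t = bytes_moved / mem_bw
    return max(compute_t, memory_t), ("compute" if compute_t >= memory_t else "memory")

D = 4096                 # hidden size (assumed)
W_BYTES = 2 * D * D      # fp16 weight matrix streamed from off-chip memory
PEAK = 100e12            # peak FLOP/s (assumed)
BW = 2e12                # memory bandwidth in B/s (assumed)

# Decode: one token at a time -> matrix-vector work, memory-bound.
decode_t, decode_bound = stage_latency(2 * D * D, W_BYTES, PEAK, BW)
# Prefill: 512 tokens at once -> matrix-matrix work, weights reused, compute-bound.
prefill_t, prefill_bound = stage_latency(2 * 512 * D * D, W_BYTES, PEAK, BW)
print(decode_bound, prefill_bound)
```

This toy model reproduces the regime split behind the abstract's results: generative decode is dominated by weight traffic (where a spatial dataflow that keeps data on-chip helps most), while prefill is compute-dominated.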
Formal Verification of Source-to-Source Transformations for HLS
Presented at: 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '24)
[Abstract]: High-level synthesis (HLS) can greatly facilitate the description of complex hardware implementations, by raising the level of abstraction up to a classical imperative language such as C/C++, usually augmented with vendor-specific pragmas and APIs. Despite productivity improvements, attaining high performance for the final design remains a challenge, and higher-level tools like source-to-source compilers have been developed to generate programs targeting HLS toolchains. These tools may generate highly complex HLS-ready C/C++ code, reducing the programming effort and enabling critical optimizations. However, whether these HLS-friendly programs are produced by a human or a tool, validating their correctness or exposing bugs otherwise remains a fundamental challenge. In this work we target the problem of efficiently checking the semantic equivalence between two programs written in C/C++ as a means of ensuring the correctness of the description provided to the HLS toolchain, by proving that an optimized code version fully preserves the semantics of the unoptimized one. We introduce a novel formal verification approach that combines concrete and abstract interpretation with a hybrid symbolic analysis. Notably, our approach is mostly agnostic to how control-flow, data storage, and dataflow are implemented in the two programs. It can prove equivalence under complex bufferization and loop/syntax transformations, for a rich class of programs with statically interpretable control-flow. We present our techniques and their complete end-to-end implementation, demonstrating how our system can verify the correctness of highly complex programs generated by source-to-source compilers for HLS, and detect bugs that may elude co-simulation.
This work was supported in part by an Intel ISRA award; U.S. NSF awards #1750399 and #2019306; ACE, one of seven centers in JUMP 2.0, an SRC program sponsored by DARPA; and Grant PID2022-136435NB-I00, funded by MCIN/AEI/10.13039/501100011033 and by "ERDF A way of making Europe", EU. We are particularly thankful to Jin Yang, Jeremy Casas, and Zhenkun Yang from Intel for their support and guidance on the ISRA project. We also thank Lana Josipović and the anonymous reviewers for their feedback on earlier versions of this manuscript.
SpanGNN: Towards Memory-Efficient Graph Neural Networks via Spanning Subgraph Training
Graph Neural Networks (GNNs) have superior capability in learning graph data.
Full-graph GNN training generally has high accuracy, however, it suffers from
large peak memory usage and encounters the Out-of-Memory problem when handling
large graphs. To address this memory problem, a popular solution is mini-batch
GNN training. However, mini-batch GNN training increases the training variance
and sacrifices the model accuracy. In this paper, we propose a new
memory-efficient GNN training method using spanning subgraph, called SpanGNN.
SpanGNN trains GNN models over a sequence of spanning subgraphs, which are
constructed incrementally from an initially empty edge set. To limit the peak
memory consumption, SpanGNN selects a set of edges from the original graph to
incrementally update the spanning subgraph between epochs. To ensure model
accuracy, we introduce two types of edge sampling strategies (i.e.,
variance-reduced and noise-reduced) that help SpanGNN select high-quality edges
for the GNN learning. We conduct experiments with SpanGNN on widely used
datasets, demonstrating SpanGNN's advantages in model performance and low
peak memory usage.
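The epoch-by-epoch subgraph growth can be sketched as below. This is a simplification: edges are sampled uniformly here, whereas the paper uses variance-reduced and noise-reduced strategies, and the training step is only marked by a comment.

```python
# Simplified sketch of spanning-subgraph growth (uniform sampling stands in
# for the paper's variance-/noise-reduced strategies).
import random

def grow_spanning_subgraph(edges, epochs, budget, seed=0):
    """Yield (epoch, subgraph_edges): start from an empty edge set and add
    `budget` sampled edges per epoch, so peak memory tracks the current
    subgraph rather than the full graph."""
    rng = random.Random(seed)
    remaining = list(edges)
    rng.shuffle(remaining)
    subgraph = []
    for epoch in range(epochs):
        batch, remaining = remaining[:budget], remaining[budget:]
        subgraph.extend(batch)           # incremental update between epochs
        # train_one_epoch(model, subgraph) would run here
        yield epoch, list(subgraph)

edges = [(u, v) for u in range(10) for v in range(u + 1, 10)]  # toy 10-node graph
sizes = [len(sub) for _, sub in grow_spanning_subgraph(edges, epochs=3, budget=5)]
print(sizes)
```

Each epoch sees a slightly denser subgraph than the last, which is the mechanism by which SpanGNN trades a bounded memory footprint against full-graph training.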
Downregulation of E-Cadherin enhances proliferation of head and neck cancer through transcriptional regulation of EGFR
Background: Epidermal growth factor receptor (EGFR) has been reported to downregulate E-cadherin (E-cad); however, whether the downregulation of E-cad has any effect on EGFR expression has not been elucidated. Our previous studies found an inverse correlation between EGFR and E-cad expression in tissue specimens of squamous cell carcinoma of the head and neck (SCCHN). To understand the biological mechanisms underlying this clinical observation, we knocked down E-cad expression using E-cad siRNA in four SCCHN cell lines.
Results: Downregulation of E-cad upregulated EGFR expression compared with control siRNA-transfected cells after 72 hours. Cellular membrane localization of EGFR was also increased. Consequently, downstream signaling molecules of the EGFR pathway, p-AKT and p-ERK, were increased at 72 hours after transfection with E-cad siRNA. Reverse transcriptase-polymerase chain reaction (RT-PCR) showed EGFR mRNA was upregulated by E-cad siRNA as early as 24 hours. In addition, RT-PCR revealed this upregulation was due to an increase in EGFR mRNA stability, but not protein stability. A sulforhodamine B (SRB) assay indicated that growth of E-cad knockdown cells was enhanced up to 2-fold over that of control siRNA-transfected cells at 72 hours post-transfection. The effect of E-cad reduction on cell proliferation was blocked by treating the E-cad siRNA-transfected cells with 1 μM of the EGFR-specific tyrosine kinase inhibitor erlotinib.
Conclusion: Our results suggest for the first time that reduction of E-cad results in transcriptional upregulation of EGFR. They also suggest that loss of E-cad may induce proliferation of SCCHN by activating EGFR and its downstream signaling pathways.
Preparation and Characterization of Folate Targeting Magnetic Nanomedicine Loaded with Cisplatin
We used aldehyde sodium alginate (ASA) as a modifier to improve the surface activity and stability of magnetic nanoparticles, and folic acid (FA) as the targeting molecule. Fe3O4 nanoparticles were prepared by the chemical coprecipitation method. FA was activated and coupled with diaminopolyethylene glycol (NH2-PEG-NH2). ASA was combined with the Fe3O4 nanoparticles, and FA-PEG was connected to ASA by Schiff's base formation. Then Cl- in cisplatin was replaced by the hydroxyl group in ASA, yielding an FA- and ASA-modified cisplatin-loaded magnetic nanomedicine (CDDP-FA-ASA-MNPs). This nanomedicine was characterized by transmission electron microscopy, dynamic light scattering, phase analysis light scattering, and vibrating sample magnetometry. The uptake of the magnetic nanomedicine by nasopharyngeal and laryngeal carcinoma cells with positive or negative folate receptor expression was observed by Prussian blue iron staining and transmission electron microscopy. We found that CDDP-FA-ASA-MNPs have good water solubility and stability. The mean diameter of the Fe3O4 core was 8.17 ± 0.24 nm, the hydrodynamic diameter was 110.90 ± 1.70 nm, and the zeta potential was -26.45 ± 1.26 mV. Maximum saturation magnetization was 22.20 emu/g. CDDP encapsulation efficiency was 49.05 ± 1.58% (mg/mg), and drug loading was 14.31 ± 0.49% (mg/mg). In vitro, CDDP-FA-ASA-MNPs were selectively taken up by HNE-1 and Hep-2 cells, which express the folate receptor positively.