334 research outputs found

    Navigation service with perspectives of digital technology developments in maritime sector

    Get PDF

    D-STEM: a Design led approach to STEM innovation

    Get PDF
    Advances in the Science, Technology, Engineering and Maths (STEM) disciplines offer opportunities for designers to propose and make products with advanced, enhanced and engineered properties and functionalities. In turn, these advanced characteristics are becoming increasingly necessary as resources become ever more strained by 21st-century demands, such as ageing populations, connected communities, depleting raw materials, waste management and energy supply. We need to make things that are smarter and that make our lives easier, better and simpler. The products of tomorrow need to do more with less. The issue is how to maximize the potential for exploiting opportunities offered by STEM developments, and how best to enable designers to strengthen their position within the innovation ecosystem. As a society, we need designers able to navigate emerging developments from the STEM community to a level that enables understanding and knowledge of new material properties, the skill set to facilitate absorption into the design ‘toolbox’, and the agility to identify, manage and contextualise innovation opportunities emerging from STEM developments. This paper proposes the blueprint for a new design-led approach to STEM innovation that begins to redefine studio culture for the 21st century.

    Allo: A Programming Model for Composable Accelerator Design

    Full text link
    Special-purpose hardware accelerators are increasingly pivotal for sustaining performance improvements in emerging applications, especially as the benefits of technology scaling continue to diminish. However, designers currently lack effective tools and methodologies to construct complex, high-performance accelerator architectures in a productive manner. Existing high-level synthesis (HLS) tools often require intrusive source-level changes to attain satisfactory quality of results. Despite the introduction of several new accelerator design languages (ADLs) aiming to enhance or replace HLS, their advantages are more evident in relatively simple applications with a single kernel. Existing ADLs prove less effective for realistic hierarchical designs with multiple kernels, even if the design hierarchy is flattened. In this paper, we introduce Allo, a composable programming model for efficient spatial accelerator design. Allo decouples hardware customizations, including compute, memory, communication, and data type, from algorithm specification, and encapsulates them as a set of customization primitives. Allo preserves the hierarchical structure of an input program by combining customizations from different functions in a bottom-up, type-safe manner. This approach facilitates holistic optimizations that span across function boundaries. We conduct comprehensive experiments on commonly used HLS benchmarks and several realistic deep learning models. Our evaluation shows that Allo can outperform state-of-the-art HLS tools and ADLs on all test cases in PolyBench. For the GPT2 model, the inference latency of the Allo-generated accelerator is 1.7x faster than the NVIDIA A100 GPU with 5.4x higher energy efficiency, demonstrating the capability of Allo to handle large-scale designs. Comment: Accepted to PLDI'2
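    The abstract does not show Allo's surface syntax, so the following is a minimal hypothetical Python sketch of the key idea only: the algorithm is written once with no hardware detail, and customizations are recorded separately as composable primitives. All names here (`Schedule`, `split`, `pipeline`) are illustrative assumptions, not Allo's actual API.

```python
# Hypothetical sketch of "decouple customization from algorithm":
# the algorithm below knows nothing about hardware.

def gemm(A, B, N):
    """Algorithm specification: plain matrix multiply."""
    C = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i][j] += A[i][k] * B[k][j]
    return C

class Schedule:
    """Customizations are recorded on the side and composed,
    leaving the algorithm text untouched."""
    def __init__(self, fn):
        self.fn = fn
        self.customizations = []  # (primitive, target, argument)

    def split(self, loop, factor):
        self.customizations.append(("split", loop, factor))
        return self

    def pipeline(self, loop):
        self.customizations.append(("pipeline", loop, None))
        return self

# Compose customizations bottom-up without editing gemm itself.
s = Schedule(gemm).split("j", 8).pipeline("k")
```

    In a real flow, the recorded primitives would drive code generation for the spatial design; here they merely accumulate, which is enough to show why cross-function composition is possible when the algorithm stays untouched.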

    Decoupled Model Schedule for Deep Learning Training

    Full text link
    Recent years have seen an increase in the development of large deep learning (DL) models, which makes training efficiency crucial. Common practice struggles with the trade-off between usability and performance. On one hand, DL frameworks such as PyTorch use dynamic graphs to facilitate model developers, at the price of sub-optimal model training performance. On the other hand, practitioners propose various approaches to improving training efficiency by sacrificing some flexibility, ranging from making the graph static for more thorough optimization (e.g., XLA) to customizing optimization towards large-scale distributed training (e.g., DeepSpeed and Megatron-LM). In this paper, we aim to address the tension between usability and training efficiency through separation of concerns. Inspired by DL compilers that decouple the platform-specific optimizations of a tensor-level operator from its arithmetic definition, this paper proposes a schedule language to decouple model execution from definition. Specifically, the schedule works on a PyTorch model and uses a set of schedule primitives to convert the model for common training optimizations such as high-performance kernels, effective 3D parallelism, and efficient activation checkpointing. Compared to existing optimization solutions, we optimize the model as needed through high-level primitives, thereby preserving programmability and debuggability for users to a large extent. Our evaluation results show that by scheduling the existing hand-crafted optimizations in a systematic way, we are able to improve training throughput by up to 3.35x on a single machine with 8 NVIDIA V100 GPUs, and by up to 1.32x on multiple machines with up to 64 GPUs, when compared to the out-of-the-box performance of DeepSpeed and Megatron-LM.
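    The schedule-language idea above can be sketched in a few lines. The real system operates on PyTorch modules; to stay self-contained this sketch represents the model as a plain dict of named submodules, and the `Schedule` class and primitive names are hypothetical stand-ins, not the paper's actual API.

```python
# Hedged sketch: the model is defined once, and a separate schedule
# rewrites named submodules for training optimizations.

class Schedule:
    def __init__(self, model):
        self.model = dict(model)   # copy: name -> op description
        self.applied = []

    def replace(self, name, new_op):
        """Swap a submodule for a high-performance kernel."""
        self.model[name] = new_op
        self.applied.append(("replace", name))
        return self

    def checkpoint(self, name):
        """Mark a submodule for activation checkpointing."""
        self.model[name] = ("checkpointed", self.model[name])
        self.applied.append(("checkpoint", name))
        return self

# The model definition stays untouched; optimizations live in the schedule.
model = {"attn": "softmax_attention", "mlp": "dense_mlp"}
sch = Schedule(model).replace("attn", "flash_attention").checkpoint("mlp")
```

    Because the original definition is never edited, a user can debug the unscheduled model and apply optimizations only as needed, which is the usability/performance separation the abstract describes.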

    Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference

    Full text link
    Recent advancements in large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads. The majority of existing approaches rely on temporal architectures that reuse hardware units for different network layers and operators. However, these methods often encounter challenges in achieving low latency due to considerable memory access overhead. This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs. Our approach involves the specialization of distinct hardware units for specific operators or layers, facilitating direct communication between them through a dataflow architecture while minimizing off-chip memory accesses. We introduce a comprehensive analytical model for estimating the performance of a spatial LLM accelerator, taking into account the on-chip compute and memory resources available on an FPGA. Through our analysis, we can determine the scenarios in which FPGA-based spatial acceleration can outperform its GPU-based counterpart. To enable more productive implementations of an LLM on FPGAs, we further provide a library of high-level synthesis (HLS) kernels that are composable and reusable. This library will be made available as open-source. To validate the effectiveness of both our analytical model and HLS library, we have implemented BERT and GPT2 on an AMD Alveo U280 FPGA device. Experimental results demonstrate our approach can achieve up to 13.4x speedup when compared to previous FPGA-based accelerators for the BERT model. For GPT generative inference, we attain a 2.2x speedup compared to DFX, an FPGA overlay, in the prefill stage, while achieving a 1.9x speedup and a 5.7x improvement in energy efficiency compared to the NVIDIA A100 GPU in the decode stage. Comment: Accepted for publication in the FCCM'24 Journal Track and will appear in ACM Transactions on Reconfigurable Technology and Systems (TRETS).
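    The analytical model itself is not given in the abstract; the toy roofline-style estimator below only illustrates the kind of comparison such a model enables, under the simplifying assumption that a dataflow accelerator fully overlaps compute with off-chip transfers, so latency is bounded by the slower of the two. The function and its parameters are illustrative, not the paper's model.

```python
def spatial_latency_est(ops, bytes_offchip, num_pes, freq_hz, bw_bytes_per_s):
    """Toy estimate for a spatial (dataflow) accelerator: compute time is
    total ops over aggregate PE throughput, memory time is off-chip traffic
    over bandwidth, and perfect overlap means the max of the two dominates.
    The paper's actual model additionally accounts for per-layer on-chip
    compute and memory resources."""
    t_compute = ops / (num_pes * freq_hz)        # seconds
    t_memory = bytes_offchip / bw_bytes_per_s    # seconds
    return max(t_compute, t_memory)
```

    An estimator of this shape makes the abstract's claim concrete: spatial designs that keep intermediate activations on-chip shrink `bytes_offchip`, so decode-stage workloads that are memory-bound on a temporal architecture can become compute-bound on the FPGA.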

    Formal Verification of Source-to-Source Transformations for HLS

    Get PDF
    Presented at the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '24). Abstract: High-level synthesis (HLS) can greatly facilitate the description of complex hardware implementations, by raising the level of abstraction up to a classical imperative language such as C/C++, usually augmented with vendor-specific pragmas and APIs. Despite productivity improvements, attaining high performance for the final design remains a challenge, and higher-level tools like source-to-source compilers have been developed to generate programs targeting HLS toolchains. These tools may generate highly complex HLS-ready C/C++ code, reducing the programming effort and enabling critical optimizations. However, whether these HLS-friendly programs are produced by a human or a tool, validating their correctness or exposing bugs otherwise remains a fundamental challenge. In this work we target the problem of efficiently checking the semantic equivalence between two programs written in C/C++ as a means of ensuring the correctness of the description provided to the HLS toolchain, by proving that an optimized code version fully preserves the semantics of the unoptimized one. We introduce a novel formal verification approach that combines concrete and abstract interpretation with a hybrid symbolic analysis. Notably, our approach is mostly agnostic to how control-flow, data storage, and dataflow are implemented in the two programs. It can prove equivalence under complex bufferization and loop/syntax transformations, for a rich class of programs with statically interpretable control-flow. We present our techniques and their complete end-to-end implementation, demonstrating how our system can verify the correctness of highly complex programs generated by source-to-source compilers for HLS, and detect bugs that may elude co-simulation. This work was supported in part by an Intel ISRA award; U.S. NSF awards #1750399 and #2019306; ACE, one of seven centers in JUMP 2.0, an SRC program sponsored by DARPA; and Grant PID2022-136435NB-I00, funded by MCIN/AEI/10.13039/501100011033 and by "ERDF A way of making Europe", EU. We are particularly thankful to Jin Yang, Jeremy Casas, and Zhenkun Yang from Intel for their support and guidance on the ISRA project. We also thank Lana Josipović and the anonymous reviewers for their feedback on earlier versions of this manuscript.
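    The paper's hybrid symbolic analysis is far beyond an abstract-sized example. The sketch below only illustrates the property being proven: an unoptimized loop and a source-to-source rewrite of it (unrolled by two) are compared on randomly sampled inputs. Randomized sampling is a much weaker substitute for the paper's formal proof, and all function names here are illustrative.

```python
import random

def unopt_sum_sq(xs):
    """Unoptimized reference: straightforward accumulation."""
    acc = 0
    for x in xs:
        acc += x * x
    return acc

def opt_sum_sq(xs):
    """An 'optimized' rewrite (unrolled by 2), the kind of loop/syntax
    transformation whose equivalence a verifier must establish."""
    acc = 0
    n = len(xs)
    for i in range(0, n - 1, 2):
        acc += xs[i] * xs[i] + xs[i + 1] * xs[i + 1]
    if n % 2:
        acc += xs[-1] * xs[-1]
    return acc

def equivalent_on_samples(f, g, trials=100, seed=0):
    """Randomized check of semantic equivalence: far weaker than formal
    verification (it can miss bugs), but it states the same property."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 16))]
        if f(xs) != g(xs):
            return False
    return True
```

    A formal approach replaces the sampled inputs with a symbolic analysis that covers all executions at once, which is what lets it catch bugs that elude co-simulation.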

    SpanGNN: Towards Memory-Efficient Graph Neural Networks via Spanning Subgraph Training

    Full text link
    Graph Neural Networks (GNNs) have superior capability in learning graph data. Full-graph GNN training generally has high accuracy; however, it suffers from large peak memory usage and encounters the out-of-memory problem when handling large graphs. To address this memory problem, a popular solution is mini-batch GNN training. However, mini-batch GNN training increases the training variance and sacrifices model accuracy. In this paper, we propose a new memory-efficient GNN training method using spanning subgraphs, called SpanGNN. SpanGNN trains GNN models over a sequence of spanning subgraphs, which are constructed incrementally from an empty edge set. To overcome the excessive peak memory consumption problem, SpanGNN selects a set of edges from the original graph to incrementally update the spanning subgraph between consecutive epochs. To ensure model accuracy, we introduce two types of edge sampling strategies (i.e., variance-reduced and noise-reduced) to help SpanGNN select high-quality edges for GNN learning. We conduct experiments with SpanGNN on widely used datasets, demonstrating SpanGNN's advantages in model performance and low peak memory usage.
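    The epoch-wise construction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: uniform random sampling stands in for its variance-reduced and noise-reduced strategies, and all names are hypothetical.

```python
import random

def incremental_spanning_subgraphs(edges, epochs, edges_per_epoch, seed=0):
    """Start from an empty edge set and add a sampled batch of new edges
    each epoch, so no single epoch materializes the full graph at once.
    Returns the edge list used for each epoch."""
    rng = random.Random(seed)
    remaining = list(edges)
    rng.shuffle(remaining)           # stand-in for a sampling strategy
    subgraph = []
    schedule = []
    for _ in range(epochs):
        batch = remaining[:edges_per_epoch]
        remaining = remaining[edges_per_epoch:]
        subgraph.extend(batch)       # grow the spanning subgraph
        schedule.append(list(subgraph))
    return schedule
```

    Training on `schedule[e]` at epoch `e` keeps each epoch's edge count, and hence peak memory, bounded, while later epochs see an increasingly complete graph.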

    Downregulation of E-Cadherin enhances proliferation of head and neck cancer through transcriptional regulation of EGFR

    Get PDF
    Background: Epidermal growth factor receptor (EGFR) has been reported to downregulate E-cadherin (E-cad); however, whether the downregulation of E-cad has any effect on EGFR expression has not been elucidated. Our previous studies found an inverse correlation between EGFR and E-cad expression in tissue specimens of squamous cell carcinoma of the head and neck (SCCHN). To understand the biological mechanisms underlying this clinical observation, we knocked down E-cad expression using E-cad siRNA in four SCCHN cell lines. Results: Downregulation of E-cad upregulated EGFR expression compared with control siRNA-transfected cells after 72 hours. Cellular membrane localization of EGFR was also increased. Consequently, downstream signaling molecules of the EGFR pathway, p-AKT and p-ERK, were increased at 72 hours after transfection with E-cad siRNA. Reverse transcriptase-polymerase chain reaction (RT-PCR) showed that EGFR mRNA was upregulated by E-cad siRNA as early as 24 hours. In addition, RT-PCR revealed this upregulation was due to an increase in EGFR mRNA stability, but not protein stability. A sulforhodamine B (SRB) assay indicated that growth of E-cad knockdown cells was enhanced up to 2-fold over that of control siRNA-transfected cells at 72 hours post-transfection. The effect of E-cad reduction on cell proliferation was blocked by treating the E-cad siRNA-transfected cells with 1 μM of the EGFR-specific tyrosine kinase inhibitor erlotinib. Conclusion: Our results suggest for the first time that reduction of E-cad results in transcriptional upregulation of EGFR. They also suggest that loss of E-cad may induce proliferation of SCCHN by activating EGFR and its downstream signaling pathways.

    Preparation and Characterization of Folate Targeting Magnetic Nanomedicine Loaded with Cisplatin

    Get PDF
    We used aldehyde sodium alginate (ASA) as a modifier to improve the surface activity and stability of magnetic nanoparticles, and folic acid (FA) as the targeting molecule. Fe3O4 nanoparticles were prepared by a chemical coprecipitation method. FA was activated and coupled with diaminopolyethylene glycol (NH2-PEG-NH2). ASA was combined with the Fe3O4 nanoparticles, and FA-PEG was connected to ASA by Schiff's base formation. Then the Cl- in cisplatin was replaced by the hydroxyl group in ASA, and an FA- and ASA-modified cisplatin-loaded magnetic nanomedicine (CDDP-FA-ASA-MNPs) was prepared. This nanomedicine was characterized by transmission electron microscopy, dynamic light scattering, phase analysis light scattering and vibrating sample magnetometry. The uptake of the magnetic nanomedicine by nasopharyngeal and laryngeal carcinoma cells with positive or negative folate receptor expression was observed by Prussian blue iron staining and transmission electron microscopy. We found that CDDP-FA-ASA-MNPs have good water solubility and stability. The mean diameter of the Fe3O4 core was 8.17 ± 0.24 nm, the hydrodynamic diameter was 110.90±1.70 nm, and the zeta potential was -26.45±1.26 mV. The maximum saturation magnetization was 22.20 emu/g. CDDP encapsulation efficiency was 49.05±1.58% (mg/mg), and drug loading was 14.31±0.49% (mg/mg). In vitro, CDDP-FA-ASA-MNPs were selectively taken up by HNE-1 cells and Hep-2 cells, which express the folate receptor positively.