Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim
NVDLA is an open-source deep neural network (DNN) accelerator which has
received considerable attention from the community since its introduction by Nvidia.
It is a full-featured hardware IP and can serve as a good reference for
conducting research and development of SoCs with integrated accelerators.
However, an expensive FPGA board is required to do experiments with this IP in
a real SoC. Moreover, since NVDLA is clocked at a lower frequency on an FPGA,
it would be hard to do accurate performance analysis with such a setup. To
overcome these limitations, we integrate NVDLA into a real RISC-V SoC on the
Amazon cloud FPGA using FireSim, a cycle-exact FPGA-accelerated simulator. We
then evaluate the performance of NVDLA by running the YOLOv3 object-detection
algorithm. Our results show that NVDLA can sustain 7.5 fps when running YOLOv3.
We further analyze the performance by showing that sharing the last-level cache
with NVDLA can result in up to 1.56x speedup. We then identify that sharing the
memory system with the accelerator can result in unpredictable execution time
for the real-time tasks running on this platform. We believe this is an
important issue that must be addressed in order for on-chip DNN accelerators to
be incorporated in real-time embedded systems.
Comment: Presented at the 2nd Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2'19).
Synetgy: Algorithm-hardware Co-design for ConvNet Accelerators on Embedded FPGAs
Using FPGAs to accelerate ConvNets has attracted significant attention in
recent years. However, FPGA accelerator design has not leveraged the latest
progress of ConvNets. As a result, the key application characteristics such as
frames-per-second (FPS) are ignored in favor of simply counting GOPs, and
results on accuracy, which is critical to application success, are often not
even reported. In this work, we adopt an algorithm-hardware co-design approach
to develop a ConvNet accelerator called Synetgy and a novel ConvNet model
called DiracDeltaNet. Both the accelerator and ConvNet are tailored
to FPGA requirements. DiracDeltaNet, as the name suggests, is a ConvNet with
only 1x1 convolutions, while 3x3 spatial convolutions are replaced by more
efficient shift operations. DiracDeltaNet achieves competitive accuracy on
ImageNet (88.7% top-5), but with 42x fewer parameters and 48x
fewer OPs than VGG16. We further quantize DiracDeltaNet's weights and
activations to 4 bits, with less than 1% accuracy loss. These quantizations
exploit well the nature of FPGA hardware. In short, DiracDeltaNet's small model
size, low computational OP count, low precision and simplified operators allow
us to co-design a highly customized computing unit for an FPGA. We implement
the computing units for DiracDeltaNet on an Ultra96 SoC system through
high-level synthesis. Our accelerator's final top-5 accuracy of 88.1% on
ImageNet is higher than that of all previously reported embedded FPGA
accelerators. In addition, the accelerator reaches an inference speed of 66.3
FPS on the ImageNet classification task, surpassing prior works with similar
accuracy by at least 11.6x.
Comment: Updated to the latest results.
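The two techniques the abstract names, channel-wise shift operations and 4-bit uniform quantization, can be illustrated with a minimal NumPy sketch. This is a hypothetical illustration under our own naming and parameter choices, not the Synetgy kernels themselves:

```python
import numpy as np

def shift_layer(x, shifts):
    """Shift each channel of x (shape C, H, W) by its (dy, dx) offset,
    zero-filling at the borders. A shift layer replaces the
    multiply-accumulates of a spatial convolution with pure data movement."""
    c, h, w = x.shape
    out = np.zeros_like(x)
    for ch, (dy, dx) in enumerate(shifts):
        ys, ye = max(dy, 0), h + min(dy, 0)
        xs, xe = max(dx, 0), w + min(dx, 0)
        out[ch, ys:ye, xs:xe] = x[ch, ys - dy:ye - dy, xs - dx:xe - dx]
    return out

def quantize_uniform(x, bits=4):
    """Symmetric uniform quantization to signed `bits`-bit integers,
    returning the integer codes and the scale used to dequantize."""
    qmax = 2 ** (bits - 1) - 1                     # 7 for 4 bits
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

x = np.arange(18, dtype=np.float32).reshape(2, 3, 3)
y = shift_layer(x, [(0, 1), (-1, 0)])   # ch 0: right by 1; ch 1: up by 1
w = np.array([-1.0, -0.3, 0.0, 0.5, 1.0], dtype=np.float32)
q, s = quantize_uniform(w, bits=4)
w_hat = q * s                           # dequantized approximation of w
```

Because the shift is pure indexing and the 4-bit codes fit in narrow integer datapaths, both operations map naturally onto FPGA fabric, which is the co-design point the abstract makes.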
Extensive analysis of D7S486 in primary gastric cancer supports TESTIN as a candidate tumor suppressor gene
Background: A high frequency of loss of heterozygosity (LOH) has been found at D7S486 in primary gastric cancer (GC). We previously identified a high-frequency LOH region on 7q31 in primary GC from China, with D7S486 as the most frequently affected locus. This study aimed to determine which genes in this region are affected by the LOH and may serve as tumor suppressor genes (TSGs). A high-throughput single nucleotide polymorphism (SNP) microarray fabricated in-house was used to analyze the LOH status around D7S486 on 7q31 in 75 patients with primary GC. Western blot, immunohistochemistry, and RT-PCR were used to assess TESTIN (TES) protein and mRNA expression in 50 and 140 primary GC samples, respectively. An MTS assay was used to investigate the effect of TES overexpression on the proliferation of GC cell lines. Mutation and methylation analyses were performed to explore possible mechanisms of TES inactivation in GC.
Results: LOH analysis identified five candidate genes (ST7, FOXP2, MDFIC, TES and CAV1) with LOH frequencies higher than 30%. However, only TES showed the potential to be a TSG associated with GC. Among 140 pairs of GC samples, decreased TES mRNA levels were found in 96 (68.6%) tumor tissues compared with matched non-tumor tissues (p < 0.001). Reduced TES protein levels were detected by Western blot in 36 (72.0%) of the 50 tumor tissues (p = 0.001), and immunohistochemical staining agreed with the RT-PCR and Western blot results. Downregulation of TES correlated with tumor differentiation (p = 0.035) and prognosis (p = 0.035, log-rank test). TES overexpression inhibited the growth of three GC cell lines. Hypermethylation of the TES promoter was a frequent event in primary GC and GC cell lines; however, no specific mutation was observed in the coding region of the TES gene.
Conclusions: Collectively, these results support a role for TES as a TSG in gastric carcinogenesis, with TES inactivated primarily by LOH and CpG island methylation.
Full Stack Optimization of Transformer Inference: a Survey
Recent advances in state-of-the-art DNN architecture design have been moving
toward Transformer models. These models achieve superior accuracy across a wide
range of applications. This trend has been consistent over the past several
years since Transformer models were originally introduced. However, the amount
of compute and bandwidth required for inference of recent Transformer models is
growing at a significant rate, and this has made their deployment in
latency-sensitive applications challenging. As such, there has been an
increased focus on making Transformer models more efficient, with methods that
range from changing the architecture design, all the way to developing
dedicated domain-specific accelerators. In this work, we survey different
approaches for efficient Transformer inference, including: (i) analysis and
profiling of the bottlenecks in existing Transformer architectures and their
similarities and differences with previous convolutional models; (ii)
implications of Transformer architecture on hardware, including the impact of
non-linear operations such as Layer Normalization, Softmax, and GELU, as well
as linear operations, on hardware design; (iii) approaches for optimizing a
fixed Transformer architecture; (iv) challenges in finding the right mapping
and scheduling of operations for Transformer models; and (v) approaches for
optimizing Transformer models by adapting the architecture using neural
architecture search. Finally, we perform a case study by applying the surveyed
optimizations on Gemmini, the open-source, full-stack DNN accelerator
generator, and we show how each of these approaches can yield improvements,
compared to previous benchmark results on Gemmini. Among other things, we find
that a full-stack co-design approach with the aforementioned methods can result
in up to 88.7x speedup with minimal performance degradation for Transformer
inference.
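To make concrete the non-linear operations the survey singles out as hardware bottlenecks, here is a minimal NumPy sketch of numerically stable Softmax and the common tanh approximation of GELU. These are standard textbook formulations, not code from the survey or from Gemmini:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax: subtract the row max before
    exponentiating so large logits do not overflow. The exp and
    row-wise divide are awkward on matmul-oriented accelerators."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    """tanh approximation of GELU, widely used in Transformer stacks;
    the polynomial-plus-tanh evaluation is what makes it costly
    relative to a plain linear layer."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x ** 3)))

scores = np.array([[2.0, 1.0, 0.1]])
probs = softmax(scores)                     # each row sums to 1
acts = gelu(np.array([-1.0, 0.0, 1.0]))
```

Unlike matrix multiplies, both functions need transcendental evaluations and (for softmax) a reduction followed by a division, which is why accelerator designs often add lookup tables or polynomial approximation units for them.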