Hier-RTLMP: A Hierarchical Automatic Macro Placer for Large-scale Complex IP Blocks
In a typical RTL to GDSII flow, floorplanning or macro placement is a
critical step in achieving decent quality of results (QoR). Moreover, in
today's physical synthesis flows (e.g., Synopsys Fusion Compiler or Cadence
Genus iSpatial), a floorplan .def with macro and IO pin placements is typically
needed as an input to the front-end physical synthesis. Recently, with the
increasing complexity of IP blocks, and in particular with auto-generated RTL
for machine learning (ML) accelerators, the number of hard macros in a single
RTL block can easily run into the hundreds. This makes the task of
generating an automatic floorplan (.def) with IO pin and macro placements for
front-end physical synthesis even more critical and challenging. The so-called
peripheral approach of forcing macros to the periphery of the layout is no
longer viable when the ratio of the sum of the macro perimeters to the
floorplan perimeter is large, since this increases the required stacking depth
of macros. In this paper, we develop a novel multilevel physical planning
approach that exploits the hierarchy and dataflow inherent in the design RTL,
and describe its realization in a new hierarchical macro placer, Hier-RTLMP.
Hier-RTLMP borrows from traditional approaches used in manual system-on-chip
(SoC) floorplanning to create an automatic macro placement for use with large
IP blocks containing very large numbers of hard macros. Empirical studies
demonstrate substantial improvements over the previous RTL-MP macro placement
approach, and promising post-route improvements relative to a leading
commercial place-and-route tool.
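The viability criterion mentioned in the abstract (the ratio of the sum of the macro perimeters to the floorplan perimeter) can be sanity-checked with a few lines of arithmetic. The sketch below is not from the Hier-RTLMP code base; the macro sizes and the stacking-depth heuristic are illustrative assumptions only.

# Minimal sketch (hypothetical helper, not Hier-RTLMP code): estimate how crowded
# the periphery becomes when macros are forced to the block boundary.

def peripheral_crowding(macros, floorplan_w, floorplan_h):
    """macros: list of (width, height) tuples for the hard macros."""
    macro_perimeter_sum = sum(2 * (w + h) for w, h in macros)
    floorplan_perimeter = 2 * (floorplan_w + floorplan_h)
    ratio = macro_perimeter_sum / floorplan_perimeter
    # Crude proxy: if the summed macro perimeters exceed the floorplan perimeter
    # several times over, macros must be stacked in multiple peripheral rings.
    est_stacking_depth = max(1, round(ratio))
    return ratio, est_stacking_depth

# Example (made-up numbers): 300 macros of 100x80 um in a 3000x3000 um block.
ratio, depth = peripheral_crowding([(100, 80)] * 300, 3000, 3000)
print(f"perimeter ratio ~ {ratio:.1f}, estimated stacking depth ~ {depth}")

With several hundred macros, the ratio quickly exceeds one, which is the regime in which the abstract argues the peripheral approach breaks down.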
BERT4CTR: An Efficient Framework to Combine Pre-trained Language Model with Non-textual Features for CTR Prediction
Although deep pre-trained language models have shown promising benefits in a
wide range of industrial scenarios, including Click-Through-Rate (CTR)
prediction, integrating pre-trained language models, which handle only
textual signals, into a prediction pipeline with non-textual features remains
challenging.
To date, two directions have been explored for integrating multi-modal inputs
when fine-tuning pre-trained language models. The first fuses the outputs of
the language model with the non-textual features through an aggregation layer,
resulting in an ensemble framework in which the cross-information between
textual and non-textual inputs is learned only in the aggregation layer. The
second splits non-textual features into fine-grained fragments and transforms
the fragments into new tokens that are combined with the textual ones, so that
they can be fed directly to the transformer layers of the language model.
However, this approach increases the complexity of learning and inference
because of the numerous additional tokens.
To address these limitations, we propose in this work a novel framework,
BERT4CTR, whose Uni-Attention mechanism benefits from the interactions between
non-textual and textual features while keeping training and inference costs
low through dimensionality reduction. Comprehensive experiments on both public
and commercial data demonstrate that BERT4CTR significantly outperforms
state-of-the-art frameworks for handling multi-modal inputs and is well suited
to CTR prediction.
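For readers unfamiliar with the first of the two fusion directions discussed in the abstract, the sketch below shows a minimal late-fusion (aggregation-layer) baseline in PyTorch. It is an illustration under assumed module names and dimensions, not BERT4CTR's Uni-Attention mechanism.

# Hypothetical late-fusion baseline, not BERT4CTR itself.
import torch
import torch.nn as nn

class LateFusionCTR(nn.Module):
    def __init__(self, text_dim=768, tabular_dim=64, hidden=128):
        super().__init__()
        self.tabular_proj = nn.Linear(tabular_dim, hidden)
        self.aggregator = nn.Sequential(
            nn.Linear(text_dim + hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, text_embedding, tabular_features):
        # text_embedding: pooled output of a pre-trained language model (e.g. [CLS]).
        # Cross-information between modalities is learned only in this aggregator,
        # which is exactly the limitation the abstract points out.
        tab = torch.relu(self.tabular_proj(tabular_features))
        fused = torch.cat([text_embedding, tab], dim=-1)
        return torch.sigmoid(self.aggregator(fused)).squeeze(-1)

The second direction instead tokenizes the non-textual features so attention can mix modalities in every layer, at the cost of longer sequences; BERT4CTR's contribution is to get cross-modal interaction without paying that full token-length penalty.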
Assessment of Reinforcement Learning for Macro Placement
We provide an open, transparent implementation and assessment of Google
Brain's deep reinforcement learning approach to macro placement and its
Circuit Training (CT) implementation on GitHub. We implement in open source
key "blackbox" elements of CT and clarify discrepancies between CT and the
Nature paper. New testcases on open enablements are developed and released. We
assess CT alongside multiple alternative macro placers, with all evaluation
flows and related scripts public on GitHub. Our experiments also encompass
academic mixed-size placement benchmarks, as well as ablation and stability
studies. We comment on the impact of the Nature paper and CT, as well as
directions for future research.
Performance Analysis of DNN Inference/Training with Convolution and non-Convolution Operations
Today's performance analysis frameworks for deep learning accelerators suffer
from two significant limitations. First, although modern convolutional neural
networks (CNNs) consist of many types of layers other than convolution,
especially during training, these frameworks largely focus on convolution
layers only. Second, these frameworks are generally targeted towards inference,
and lack support for training operations. This work proposes a novel
performance analysis framework, SimDIT, for general ASIC-based systolic
hardware accelerator platforms. The modeling effort of SimDIT comprehensively
covers convolution and non-convolution operations of both CNN inference and
training on a highly parameterizable hardware substrate. SimDIT is integrated
with a backend silicon implementation flow and provides detailed end-to-end
performance statistics (i.e., data access cost, cycle counts, energy, and
power) for executing CNN inference and training workloads. SimDIT-enabled
performance analysis reveals that, on a 64×64 processing array, non-convolution
operations constitute 59.5% of the total runtime for the ResNet-50 training workload.
In addition, by optimally distributing available off-chip DRAM bandwidth and
on-chip SRAM resources, SimDIT achieves an 18× performance improvement over a
generic static resource allocation for ResNet-50 inference.
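As a rough illustration of the kind of analytical estimate such a framework produces, the sketch below compares compute-bound and bandwidth-bound cycle counts for a single convolution layer on a square processing array. The formulas are generic systolic-array approximations under the stated assumptions, not SimDIT's actual cost model.

# Back-of-the-envelope estimate; all parameters and simplifications are assumptions.

def conv_layer_cycles(out_h, out_w, out_c, in_c, k, array_dim=64,
                      dram_bw_bytes_per_cycle=32, bytes_per_elem=1):
    macs = out_h * out_w * out_c * in_c * k * k
    # Compute bound: one MAC per PE per cycle on an array_dim x array_dim array.
    compute_cycles = macs / (array_dim * array_dim)
    # Memory bound: assume every input, weight, and output element crosses DRAM
    # exactly once (ignores SRAM reuse, which a detailed model would capture).
    traffic = bytes_per_elem * (out_h * out_w * out_c
                                + in_c * out_c * k * k
                                + (out_h + k - 1) * (out_w + k - 1) * in_c)
    memory_cycles = traffic / dram_bw_bytes_per_cycle
    return max(compute_cycles, memory_cycles)

# Example: a ResNet-50-like 3x3 convolution, 56x56x64 -> 56x56x64.
print(f"~{conv_layer_cycles(56, 56, 64, 64, 3):,.0f} cycles")

Such per-layer estimates make it easy to see why non-convolution operations and resource allocation, rather than raw MAC throughput alone, dominate end-to-end training time on these accelerators.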