ERNet Family: Hardware-Oriented CNN Models for Computational Imaging
  Using Block-Based Inference by Huang, Chao-Tsung
Chao-Tsung Huang
National Tsing Hua University, Taiwan, R.O.C.
ABSTRACT
Convolutional neural networks (CNNs) demand huge DRAM band-
width for computational imaging tasks, and block-based processing
has recently been applied to greatly reduce the bandwidth. How-
ever, the induced additional computation for feature recomputing or
the large SRAM for feature reusing will degrade the performance
or even forbid the usage of state-of-the-art models. In this paper, we
address these issues by considering the overheads and hardware con-
straints in advance when constructing CNNs. We investigate a novel
model family—ERNet—which includes temporary layer expansion
as another means for increasing model capacity. We analyze three
ERNet variants in terms of hardware requirement and introduce a
hardware-aware model optimization procedure. Evaluations on Full
HD and 4K UHD applications will be given to show the effective-
ness in terms of image quality, pixel throughput, and SRAM usage.
The results also show that, for block-based inference, ERNet can
outperform the state-of-the-art FFDNet and EDSR-baseline models
for image denoising and super-resolution respectively.
Index Terms— Convolutional neural network, computational
imaging, block-based inference, ultra-high-definition
1. INTRODUCTION
Convolutional neural networks (CNNs) have recently demonstrated
superior quality for not only computer vision but also computational
imaging applications. Their success in the computer vision field,
such as image recognition [1, 2] and object detection [3], has led to
the rising wave of deep learning. On the other hand, CNNs have
also taken the image quality of computational imaging to a new
level, e.g. image denoising [4, 5], super-resolution (SR) [6, 7, 8, 9],
style transfer [10, 11], and algorithm mimicking [12]. Therefore,
they have great potential to bring an image pipeline revolution on
cameras and displays for our daily use if their edge inference with
state-of-the-art quality is possible.
The most effective way of enabling high-quality inference at
resource-limited edge devices is to construct hardware-friendly CNN
models. Most previous works focused on computer vision models
and aimed to reduce complexity based on model sparsity. For exam-
ple, MobileNet [13] applies depth-wise convolution to perform low-
rank computation, and SqueezeNet [14] temporarily reduces model
width and then expands back for residual connections. Furthermore,
MobileNetV2 [15] moves the connections to thinner layers to reduce
storage and results in an expansion-reduction structure for its build-
ing block. On the other hand, the complexity can also be reduced by
pruning small weights [16].
This work was supported by the Ministry of Science and Technology,
Taiwan, R.O.C. under Grant no. MOST 108-2218-E-007-020.
(a) ResNet [2] (b) MobileNet [13] (c) SqueezeNet [14]
(d) MobileNetV2 [15] (e) WDSR-B [9] (f) Proposed ERNet
Fig. 1. Building blocks of different networks with their model
widths (feature channel numbers) and convolution filters. (a) Residual
blocks with the same width and only 3×3 filters. (b) Depth-wise
convolution. (c) Feature squeezing and then expanding. (d-e) Feature
expanding and then reducing, both with 1×1 filters. (f) Three
variants of the proposed expansion-reduction structure.
However, most of these complexity-reducing techniques do
not apply to computational imaging models because the sparsity
assumption could become invalid. For instance, weight pruning
and depth-wise convolution were found to cause serious quality
degradation for denoising and SR [17]. Moreover, the huge DRAM
bandwidth demanded by high-end applications, which is a major
bottleneck for edge inference, was not considered at all. Recently,
a network called WDSR [9] was proposed to restructure SR models
using also feature expansion/reduction and low-rank convolution for
better tradeoffs between quality and complexity; nevertheless, the
bandwidth issue was still not discussed.
For computational imaging models, the conventional layer-by-
layer inference flow demands huge DRAM bandwidth for feature
maps. Block-based processing can eliminate this traffic by computing
all convolution layers together for each small image block.
For example, the layer fusion in [18] proposes a pyramid inference
flow and suggests reusing the overlapped features between moving
blocks. In contrast, the truncated-pyramid inference flow in [17]
recomputes these features to save SRAM area. However, the
recomputing overhead grows quickly with model depth and thus
could forbid the usage of deep networks.
In this paper, we aim to investigate efficient model construc-
tion for high-resolution computational imaging tasks. In particular,
we focus on the models designed for block-based processing flows
to enable bandwidth-efficient edge inference. Unlike the previous
works which passively slim or restructure existing networks, we ac-
tively construct hardware-efficient models by using the expansion-
reduction structure in Fig. 1(f) and considering the expansion ratio
as a major model hyperparameter. The main contribution of this
paper is to extend the basic idea of ERNet in our previous work
[17] (E3R1 variant for only feature recomputing) to a comprehensive
study which further includes two more variants, the feature reusing
scheme, SRAM requirement analysis, and extensive evaluations for
denoising and SR applications.
The rest of this paper is organized as follows. In Section 2, two
block-based inference flows using feature recomputing and reusing,
respectively, are introduced. In Section 3, we discuss the ERNet
family in terms of model structure, SRAM requirement, and model
optimization. Extensive evaluations on Full-HD and 4K-UHD use
cases will be presented in Section 4 to show the advantages over the
state-of-the-art FFDNet [5] and EDSR-baseline [8] for denoising and
SR respectively. Finally, concluding remarks are given in Section 5.
2. BLOCK-BASED INFERENCE FLOWS
The conventional inference flow is to compute features layer-by-
layer for one whole image. It reuses parameters efficiently but con-
sumes huge DRAM bandwidth for storing and reading back the fea-
tures in the internal layers. For example, such bandwidth for sup-
porting FFDNet (12-layer, 96-channel) at 4K UHD 30fps is up to
$$\frac{3840}{2} \cdot \frac{2160}{2} \cdot 96\,\text{Bytes} \cdot (12-1) \cdot 30\,\text{Hz} \cdot 2 = 131\,\text{GB/s}, \tag{1}$$
where 8-bit features are assumed. To eliminate such bandwidth, we
consider block-based inference which partitions one image into sev-
eral blocks and performs layer-by-layer inference for each block.
Then the features can be stored in on-chip buffers, instead of DRAM.
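The arithmetic in Eq. (1) can be checked with a short Python sketch. The function and parameter names are illustrative assumptions, not from the paper. It assumes one byte per feature, features at half the image resolution in each dimension, and one DRAM write plus one read for every internal feature map.

```python
def feature_bandwidth_bytes_per_s(width, height, channels, layers, fps,
                                  bytes_per_feature=1):
    """DRAM traffic for storing and reading back internal feature maps."""
    pixels = width * height
    internal_maps = layers - 1   # feature maps between consecutive layers
    write_and_read = 2           # each map is written once and read once
    return (pixels * channels * bytes_per_feature
            * internal_maps * fps * write_and_read)

# FFDNet (12 layers, 96 channels) on 4K UHD at 30 fps, half-resolution features
bw = feature_bandwidth_bytes_per_s(3840 // 2, 2160 // 2, 96, 12, 30)
print(f"{bw / 1e9:.0f} GB/s")
```

Doubling the frame rate or the channel count doubles the traffic, which is why layer-by-layer inference scales poorly to high-end imaging targets.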
However, the block-based inference induces two overheads. The
first one is on-chip block buffers to store intermediate features. It
will affect the efficiency of parameter reuse since each block needs
to reload the whole model again; therefore, a small block size could
deteriorate computing performance. The second one is the need to
handle the overlapped features between block boundaries since the
receptive field is larger than one pixel. To simplify the discussion,
we will consider the case in which we have a sufficiently large block
size and focus on the feature overlapping issue in the following.
Two different approaches can be applied to handle the overlapped
features. One is to recompute them for each block as in [17],
and the other is to reuse them with additional storage, similar
to the layer fusion in [18]. To illustrate the corresponding inference
flows in one block, we show their 1-D cross-sectional views in Fig.
2. The feature recomputing results in a truncated-pyramid flow for
which the processed block will get smaller when going to deeper
layers. In contrast, the feature reusing forms an oblique-cuboid flow
which has the same input and output block size.
Now we can analyze their overheads for evaluating design trade-
offs. The recomputing approach demands additional computation
(green part in Fig. 2(a)), and the amount will increase quickly as
model depth goes deeper. For example, consider a plain network
with only 3×3 filters. The ratio of the recomputing overhead to the
original complexity is $\frac{2}{3} \cdot \frac{\beta(3-4\beta)}{(1-2\beta)^2}$, where $\beta = \frac{D}{S}$ for model depth $D$
and block width $S$. With a 128×128 block, the ratio is only 0.5 for
a 20-layer plain network but it will go up to 2.6 for a 40-layer one.
Therefore, deep networks could be unfavorable.
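The two ratios quoted above can be reproduced with a minimal Python sketch. The closed form is the expression from the text. The discrete count is an assumed model added here for comparison: each 3×3 layer shrinks the valid region by two pixels per dimension, so layer i of an S×S input block produces (S − 2i)² outputs, while only (S − 2D)² outputs per layer are useful. Function names are hypothetical.

```python
def overhead_closed_form(depth, block):
    """Recomputing overhead ratio (2/3) * b(3 - 4b) / (1 - 2b)^2 with b = D/S."""
    beta = depth / block
    return (2.0 / 3.0) * beta * (3.0 - 4.0 * beta) / (1.0 - 2.0 * beta) ** 2

def overhead_discrete(depth, block):
    """Assumed per-layer count for a truncated pyramid of 3x3 layers."""
    computed = sum((block - 2 * i) ** 2 for i in range(1, depth + 1))
    useful = depth * (block - 2 * depth) ** 2
    return computed / useful - 1.0

for d in (20, 40):
    print(d, round(overhead_closed_form(d, 128), 2),
          round(overhead_discrete(d, 128), 2))
```

Under these assumptions the discrete count stays close to the closed form, and both grow sharply as 2D approaches S, where the output block vanishes.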
On the other hand, the reusing approach avoids additional computing
by storing the already-computed features for neighboring blocks.
Assume we move the blocks from left to right for a plain network.
For reuse in the vertical direction, we need to store two horizontal
lines of features of width $W$ for each of the input and internal layers.
Similarly, another two vertical stripes of block height $S$ are required for
the horizontal reuse. For a plain $D$-layer, $C$-channel network with
an image channel number $C_{in}$, the size of the line buffers will be as
high as $2(W+S)(C_{in}+(D-1)C)$. For example, FFDNet requires
up to 4.0MB of SRAM as the line buffers for 4K UHD resolution.
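The line buffer expression can be evaluated with a short Python sketch. The concrete values below are assumptions chosen for illustration, since this section does not state them: 8-bit features, a half-resolution 4K width W = 1920, a block height S = 64, and 12 input channels. The function name is hypothetical.

```python
def line_buffer_bytes(width, block_height, in_channels, depth, channels,
                      bytes_per_feature=1):
    """Size of 2 (W + S) (C_in + (D - 1) C) features for a plain network."""
    lines = 2 * (width + block_height)              # two rows plus two stripes
    stored_channels = in_channels + (depth - 1) * channels
    return lines * stored_channels * bytes_per_feature

# Assumed FFDNet-like setting: 12 layers, 96 channels, values as stated above
size = line_buffer_bytes(width=1920, block_height=64, in_channels=12,
                         depth=12, channels=96)
print(f"{size / 2**20:.2f} MiB")
```

With these assumed values the result is on the order of the 4.0MB quoted above. Because the size scales with the product of depth and width, widening or deepening a plain network is costly for feature reusing.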
(a) Feature recomputing (b) Feature reusing
Fig. 2. Cross-sectional views for two block-based flows with 3×3
filters. Each circle represents an input/output pixel or a feature. The
inference in one block is performed from the input block (red) at the
bottom to the output block (orange) at the top.
(a) ERModule (b) Connected ERModules
(c) DnERNet-12ch (d) SR4ERNet
Fig. 3. ERNet family. (a) Three variants of building blocks (ERModules):
E3R1, E1R3, and E3R3. (b) Connecting several ERModules
for a fractional expansion ratio $R\frac{N}{B}$. (c) Exemplar denoising
network using ERNet. (d) Exemplar SR×4 network using ERNet.
3. ERNet FAMILY
In the following, we will propose a novel CNN family—ERNet—to
construct hardware-efficient models which are aware of the overheads
resulting from block-based inference. We will first introduce
the model structures and their SRAM requirement and then present
hardware-oriented model optimization procedures.
3.1. Model structure
Suppose we want to build a high-quality network and start from a
small one which has few layers and few feature channels. The con-
ventional approach to add model capacity is to increase model depth
D and/or model width C; however, it could cause significant over-
heads for block-based inference. For example, the block buffer size
is proportional to C and the line buffer size for feature reusing is
nearly proportional to D · C. Also, the computation overhead for
feature recomputing is increased very quickly for a large depth D.
To overcome this difficulty, we propose to use temporary layer
expansion as an additional means for model construction. Instead
of simply adding depth or width, we can pump capacity into the
network by temporarily expanding the channels as shown in Fig.
1(f). Since the channel reduction is performed immediately after
the expansion, the effective model width is not enlarged at all. In
other words, we now have the expansion ratio as an additional model
hyperparameter besides depth D and width C.
Table 1. SRAM requirement for ERModules.
Table 2. Performance targets and computing capability.
With this idea, we devise three variants of building blocks named
ERModules as shown in Fig. 3(a). They contain at least one 3×3 filter
to increase the receptive field and use another 1×1 or 3×3 filter for
the expansion-reduction purpose. For increasing model flexibility,
we may use fractional expansion ratios. However, this could cause
low hardware utilization for highly-parallel acceleration. Instead, we
consider only integer expansion ratios in every ERModule and construct
a larger building block by connecting $B$ of them as shown in
Fig. 3(b). Now we can have an equivalent fractional expansion ratio
$R_E = R\frac{N}{B}$ (the mixed number $R + N/B$) by setting the expansion ratios of the first $N$ ERModules
to $R+1$ and those of the rest to $R$.
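The assignment of integer ratios to connected ERModules can be sketched in a few lines of Python. The function names are hypothetical, and the example values (B = 4, R = 2, N = 1) are chosen only for illustration.

```python
from fractions import Fraction

def module_expansion_ratios(num_modules, base_ratio, num_boosted):
    """Integer ratios for B connected ERModules: first N use R + 1, rest use R."""
    if not 0 <= num_boosted < num_modules:
        raise ValueError("N must satisfy 0 <= N < B")
    boosted = [base_ratio + 1] * num_boosted
    regular = [base_ratio] * (num_modules - num_boosted)
    return boosted + regular

def effective_ratio(ratios):
    """Average expansion ratio over the connected modules, equal to R + N/B."""
    return Fraction(sum(ratios), len(ratios))

ratios = module_expansion_ratios(num_modules=4, base_ratio=2, num_boosted=1)
print(ratios, effective_ratio(ratios))
```

Every module keeps an integer ratio, so a highly parallel datapath stays fully utilized, while the average over the B modules provides the fractional granularity.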
Based on the connected ERModules, we construct two ERNets
for denoising and SR×4, respectively, in Fig. 3(c-d). The DnERNet-
12ch uses the same downsampling strategy as FFDNet [5] with an
additional skip connection, and the SR4ERNet replaces the ResBlocks
in EDSR-baseline with ERModules. We use only 32-ch features
for the input and output of ERModules to save on-chip buffers.
Finally, we have two model hyperparameters to build networks: B
for increasing depth and RE for pumping complexity.
3.2. SRAM requirement
The computing overheads for the feature recomputing flow are
mainly related to the number of 3×3 layers. However, the required
line buffer sizes for the feature reusing flow are quite different for the
three ERModule variants. Their SRAM requirement is summarized
in Table 1. For comparison, we also include the conventional building
block CONV3×3 ($R^{0.5}C$-ch to $R^{0.5}C$-ch 3×3 filter). Note that
the E3R1 and E1R3 variants need 11% more computation for
one equivalent 3×3 layer due to the 1×1 reduction and expansion
filters, respectively.
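The 11% figure follows from counting multiply-accumulate operations (MACs) per output pixel. The sketch below assumes that E3R1 expands with a 3×3 filter and reduces with a 1×1 filter, E1R3 does the opposite, and E3R3 uses 3×3 filters for both, as the variant names suggest. Biases and activations are ignored, and the helper names are hypothetical.

```python
def conv_macs(in_ch, out_ch, kernel):
    """MACs per output pixel of a kernel x kernel convolution."""
    return kernel * kernel * in_ch * out_ch

def module_macs(variant, c, r):
    """MACs per pixel of one ERModule with width c and expansion ratio r."""
    rc = r * c
    expand_k, reduce_k = {"E3R1": (3, 1), "E1R3": (1, 3), "E3R3": (3, 3)}[variant]
    return conv_macs(c, rc, expand_k) + conv_macs(rc, c, reduce_k)

C, R = 32, 4
equivalent_3x3 = 9 * R * C * C   # one CONV3x3 layer at width sqrt(R) * C
for variant, layers in (("E3R1", 1), ("E1R3", 1), ("E3R3", 2)):
    print(variant, module_macs(variant, C, R) / (layers * equivalent_3x3))
```

The extra 1×1 filter adds one ninth of the 3×3 cost, giving 10/9 for E3R1 and E1R3, while E3R3 has no such overhead per equivalent 3×3 layer.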
ERModules can use smaller block buffers than the CONV3×3
building block as expected. For the feature reusing flow, the required
line buffer size is determined by the input channel number of each
3×3 filter. For example, the 3×3 filter of E3R1 has only $C$-ch input
features while that of E1R3 works on expanded $RC$-ch features;
therefore, E1R3 will require $R\times$ the line buffers. For fair comparison,
we consider the normalized line buffer sizes per 3×3 layer.
Then the four building blocks in Table 1 can be ranked in order of
the normalized sizes: E3R1, CONV3×3, E3R3, and E1R3, with
corresponding ratios $1 : R^{0.5} : \frac{1+R}{2} : R$. As a result, we
prefer E3R1 for the feature reusing flow. Incidentally,
MobileNetV2 [15] will require the same large line buffers as E1R3
since it applies depth-wise 3×3 filters on expanded channels.
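The ranking of normalized line buffer sizes can be checked with a small Python sketch. It encodes the ratios from the text, each being the average number of input channels per 3×3 filter relative to C. The function name and the example ratio R = 4 are assumptions for illustration.

```python
import math

def normalized_line_buffer(r):
    """Line buffer size per 3x3 layer, relative to E3R1."""
    return {
        "E3R1": 1.0,                # 3x3 filter reads C channels
        "CONV3x3": math.sqrt(r),    # 3x3 filter reads sqrt(R) * C channels
        "E3R3": (1.0 + r) / 2.0,    # two 3x3 filters read C and R * C channels
        "E1R3": float(r),           # 3x3 filter reads R * C channels
    }

sizes = normalized_line_buffer(4)
print(sorted(sizes, key=sizes.get))
```

For any R greater than 1 the order is the same, since the geometric mean of 1 and R is below their arithmetic mean, which in turn is below R.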
3.3. Model optimization
We have two hyperparameters, $B$ and $R_E$, to build ERNets, and they
can be used to optimize the models for given hardware constraints.
In particular, we can find a maximum expansion ratio $R_E$ for every
considered ERModule number $B$ such that the total computation
complexity is smaller than or equal to the given computing capability.
Thus we can perform model scanning in terms of $B$, or the corresponding
depth $D$, and pick the best model using validation quality.
After that, we can further polish the model with a heavier training
setting to improve quality.
(a) HD30 (Target A) (b) UHD30 (Target B)
Fig. 4. DnERNet-12ch scanning for the feature recomputing flow.
The performance targets are (a) Full HD 30fps (HD30) and (b) 4K
UHD 30fps (UHD30). Validation with ten DIV2K images [19].
(a) Denoising, HD40 (Target C) (b) SR×4, HD60 (Target E)
Fig. 5. ERNet-E3R1 scanning for the feature reusing flow. The
performance targets are (a) denoising at HD40 with ≤4.0MB of line
buffers (LB4.0) and (b) SR×4 at HD60 with LB4.8.
For feature recomputing, we show two examples in Fig. 4 for
DnERNet-12ch scanning to support Full HD and 4K UHD applications
(performance targets A and B in Table 2). The three variants all
follow a similar trend: deeper networks are no longer always preferred
because of their higher recomputing overheads. Similarly, we show
two examples for feature reusing in Fig. 5 to support Full-HD de-
noising and SR tasks (targets C and E in Table 2). Here, besides the
computing constraint, we have additional limits on the physical line
buffer size and thus use them to set upper bounds for model depth.
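The scanning step described in this section can be sketched in Python under a deliberately simplified cost model. Everything below is an assumption for illustration rather than the procedure used in the paper: only E3R1 modules are counted (10·R·C² MACs per pixel each, ignoring the head and tail layers), the recomputing overhead uses the plain-network ratio from Section 2 with depth equal to B, and the budget is a hypothetical per-pixel MAC limit. The bound `max_modules` stands in for a depth limit such as one derived from a line buffer constraint.

```python
def recompute_overhead(depth, block):
    """Plain-network recomputing overhead ratio with beta = D / S."""
    beta = depth / block
    return (2.0 / 3.0) * beta * (3.0 - 4.0 * beta) / (1.0 - 2.0 * beta) ** 2

def max_expansion(num_modules, budget, width=32, block=128, recompute=True):
    """Largest (R, N) for B E3R1 modules under a per-pixel MAC budget."""
    if 2 * num_modules >= block:
        return None                      # output block would vanish
    factor = 1.0
    if recompute:
        factor += recompute_overhead(num_modules, block)
    macs_per_unit = 10 * width * width * factor   # one module at R = 1
    units = int(budget // macs_per_unit)          # units = B * R + N
    if units < num_modules:
        return None                      # even R = 1 everywhere does not fit
    return divmod(units, num_modules)    # quotient is R, remainder is N

def scan(budget, max_modules):
    """Candidate (B, R, N) triples; each would then be trained and validated."""
    candidates = []
    for b in range(1, max_modules + 1):
        fit = max_expansion(b, budget)
        if fit is not None:
            candidates.append((b, *fit))
    return candidates

print(scan(budget=1_024_000, max_modules=12))
```

In this toy model a larger B leaves fewer expansion units per module, both because the budget is shared and because the recomputing factor grows, which mirrors the trend that deeper networks are not always preferred.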
4. EVALUATION
4.1. Training and hardware settings
We use the same datasets and training settings as in
[17]. We also train conventional model structures for comparison.
We use 96-ch FFDNet as the denoising baseline; however, we re-
move BN layers and add a skip connection to improve training speed
and stability. The resultant model is called FFDNet*. For the SR
baseline, we directly use the 64-ch EDSR-baseline in [8] which has
the same model capability as SRResNet [7].
The performance targets and computing capability are listed in
Table 2. We focus on high-performance Full-HD and 4K-UHD ap-
plications, and the computing capability is set to the same level as
the eCNN processor in [17]. For each feature reusing target, we set
the additional line buffer limit as the line buffer size of the corre-
sponding baseline model; in addition, we consider only the E3R1
variant for its smaller line buffer usage.
Table 3. PSNR (dB) of polished models for denoising.
Table 4. PSNR (dB) of polished models for SR×4.
4.2. Image quality
For denoising, we summarize the PSNR values of the polished mod-
els on test datasets in Table 3. We also include two reference num-
bers from the benchmark BM3D [20] and the original FFDNet [5].
We use twelve and five layers of FFDNet* as the baseline models
for Full-HD and 4K-UHD throughputs respectively. The E3R1 and
E1R3 variants consistently show about 0.15-0.44 dB of PSNR gains
over the baselines for all performance targets. However, the E3R3
variant has significant quality drops because it has far fewer nonlinear
layers, in particular for target A.
Similarly, we list the results for SR×4 in Table 4 and also in-
clude VDSR [6] and SRResNet as reference numbers. In this case,
all ERModule variants show similar performance gains since they
are all sufficiently deep for the training dataset. In contrast, EDSR-
baseline-B1 is up to 1.2 dB worse for the targets B and E due to its
shallow five-layer depth.
4.3. Performance for feature recomputing
We assume there are three block buffers deployed as the feature
operands in [17]. Then the bubble chart in Fig. 6 presents the per-
formance benefits of the ERNets for the feature recomputing flow,
where we only show the E3R1 variant for simplicity. The bubble
area represents the size of the deployed block buffers, and ERNets
can use smaller block buffers for their fewer input/output channels in
building blocks. Therefore, ERNets provide better image quality and
smaller block buffers than the conventional models while delivering
the same or even higher pixel throughputs.
(a) Denoising (DnERNet-12ch) (b) SR×4 (SR4ERNet)
Fig. 6. Performance with the feature recomputing flow for (a) denoising
and (b) SR×4 applications. (BB: Block buffer)
4.4. Performance for feature reusing
(a) Denoising (DnERNet-12ch) (b) SR×4 (SR4ERNet)
Fig. 7. Performance with the feature reusing flow for (a) denoising
and (b) SR×4 applications. (LB: Line buffer)
For the feature reusing flow, we show the performance comparison
in the bubble chart in Fig. 7 and use the bubble area to represent the line
buffer size since it is a major hardware cost. Here we include two additional
models, DnERNet-12ch-E3R1-B23R4N0 and SR4ERNet-
E3R1-B34R4N0, to show that they can deliver similar image quality
while using smaller line buffers than their counterparts, DnERNet-
12ch-E3R1-B28R3N9 and SR4ERNet-E3R1-B61R3N25, in our op-
timization procedure. From the results, we can draw a similar con-
clusion that ERNets provide better quality and smaller line buffers
with similar or even higher throughputs.
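The model names above can be unpacked with a short Python sketch. It assumes that a suffix of the form BxRyNz encodes the ERModule number B, the base ratio R, and the number N of modules using R + 1 from Section 3.1; this reading is inferred from the names and is not stated explicitly. The helper name is hypothetical.

```python
import re

def parse_config(name):
    """Extract (B, R, N) from a suffix such as 'B28R3N9' (assumed naming)."""
    match = re.search(r"B(\d+)R(\d+)N(\d+)$", name)
    if match is None:
        raise ValueError(f"no BxRyNz suffix in {name!r}")
    return tuple(int(group) for group in match.groups())

for name in ("DnERNet-12ch-E3R1-B28R3N9", "DnERNet-12ch-E3R1-B23R4N0"):
    b, r, n = parse_config(name)
    print(name, "modules:", b, "R_E:", round(r + n / b, 2))
```

Under this reading, the additional models use fewer modules with a larger expansion ratio. Since the E3R1 line buffer grows with the number of 3×3 layers, that is consistent with their smaller line buffers.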
4.5. ReLU before residual connection
In Fig. 3, we only apply ReLU in the middle of the expansion and re-
duction filters. Alternatively, we can apply another ReLU before the
residual connection to increase non-linearity. We find that this alternative
provides similar image quality in general and can compensate
for shallow model depth. For example, it raises PSNR by 0.12 dB for
the five-layer E3R3-B5R2N0 for denoising in Table 3.
5. CONCLUSION
In this paper, we discuss a novel model family—ERNet—to con-
struct hardware-oriented CNN models. It is devised for block-
based inference flows which reduce excessive DRAM bandwidth
by feature recomputing or reusing. When building high-quality
networks, it additionally considers temporary expansion-reduction
layers which do not induce the hardware overheads for model deep-
ening or widening. According to the experiments for denoising
and SR tasks, it can achieve better image quality and even higher
pixel throughputs than the conventional model structures while us-
ing smaller block buffers and line buffers. We also believe that
ERNets can be extended to enhance image quality and hardware
performance for more CNN tasks on resource-limited edge devices.
6. REFERENCES
[1] K. Simonyan and A. Zisserman, “Very deep convolutional
networks for large-scale image recognition,” in International
Conference on Learning Representations (ICLR), 2015.
[2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learn-
ing for image recognition,” in IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2016.
[3] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, “Object detection
with deep learning: A review,” IEEE Transactions on Neural
Networks and Learning Systems, 2019.
[4] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond
a Gaussian denoiser: Residual learning of deep CNN for image
denoising,” IEEE Transactions on Image Processing, vol. 26,
2017.
[5] K. Zhang, W. Zuo, and L. Zhang, “FFDNet: Toward a fast
and flexible solution for CNN-based image denoising,” IEEE
Transactions on Image Processing, vol. 27, 2018.
[6] J. Kim, J. K. Lee, and K. M. Lee, “Accurate image
super-resolution using very deep convolutional networks,” in
IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2016.
[7] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham,
A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi,
“Photo-realistic single image super-resolution using a genera-
tive adversarial network,” in IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2017.
[8] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced
deep residual networks for single image super-resolution,”
arXiv:1707.02921, 2017.
[9] J. Yu, Y. Fan, J. Yang, N. Xu, Z. Wang, X. Wang, and T.
Huang, “Wide activation for efficient and accurate image
super-resolution,” arXiv:1808.08718, 2018.
[10] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for
real-time style transfer and super-resolution,” in European
Conference on Computer Vision (ECCV), 2016.
[11] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired
image-to-image translation using cycle-consistent adversarial
networks,” in IEEE International Conference on Computer Vi-
sion (ICCV), 2017.
[12] Q. Chen, J. Xu, and V. Koltun, “Fast image processing with
fully-convolutional networks,” in IEEE International Confer-
ence on Computer Vision (ICCV), 2017.
[13] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang,
T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Effi-
cient convolutional neural networks for mobile vision applica-
tions,” arXiv:1704.04861, 2017.
[14] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J.
Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accu-
racy with 50x fewer parameters and <0.5MB model size,”
arXiv:1602.07360, 2017.
[15] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C.
Chen, “MobileNetV2: Inverted residuals and linear bottle-
necks,” in IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 2018.
[16] S. Han, H. Mao, and W. J. Dally, “Deep compression: Com-
pressing deep neural networks with pruning, trained quanti-
zation and Huffman coding,” in International Conference on
Learning Representations (ICLR), 2016.
[17] C.-T. Huang, Y.-C. Ding, H.-C. Wang, C.-W. Weng, K.-P. Lin,
L.-W. Wang, and L.-D. Chen, “eCNN: A block-based and
highly-parallel CNN accelerator for edge inference,” in Pro-
ceedings of the 52nd Annual IEEE/ACM International Sympo-
sium on Microarchitecture (MICRO), 2019.
[18] M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-
layer CNN accelerators,” in Proceedings of the 49th An-
nual IEEE/ACM International Symposium on Microarchitec-
ture (MICRO), 2016.
[19] E. Agustsson and R. Timofte, “NTIRE 2017 challenge on sin-
gle image super-resolution: Dataset and study,” in IEEE Con-
ference on Computer Vision and Pattern Recognition Work-
shops (CVPRW), 2017.
[20] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image
denoising by sparse 3-D transform-domain collaborative filter-
ing,” IEEE Transactions on Image Processing, vol. 16, 2007.
