A Machine Learning Imaging Core using Separable FIR-IIR Filters by Asama, Masayoshi et al.
A MACHINE LEARNING IMAGING CORE
USING SEPARABLE FIR-IIR FILTERS
Masayoshi Asama 1 Leo F. Isikdogan 1 Sushma Rao 1 Bhavin V. Nayak 1 Gilad Michael 1
ABSTRACT
We propose fixed-function neural network hardware that is designed to perform pixel-to-pixel image transforma-
tions in a highly efficient way. We use a fully trainable, fixed-topology neural network to build a model that can
perform a wide variety of image processing tasks. Our model uses compressed skip lines and hybrid FIR-IIR
blocks to reduce the latency and hardware footprint. Our proposed Machine Learning Imaging Core, dubbed
MagIC, uses a silicon area of ∼3 mm² (in TSMC 16 nm), which is orders of magnitude smaller than a comparable
pixel-wise dense prediction model. MagIC requires no DDR bandwidth, no SRAM, and practically no external
memory. Each MagIC core consumes 56 mW (215 mW max power) at 500 MHz and achieves an energy-efficient
throughput of 23 TOPS/W/mm². MagIC can be used as a multi-purpose image processing block in an imaging
pipeline, approximating compute-heavy image processing applications, such as image deblurring, denoising, and
colorization, within the power and silicon area limits of mobile devices.
1 INTRODUCTION
Many convolutional neural network (CNN) architectures
that make dense, pixel-wise predictions, such as FCN (Long
et al., 2015), U-Net (Ronneberger et al., 2015), and their
variants, use very long skip lines. Those skip lines are cru-
cial for recovering the details lost during downsampling.
However, hardware implementations of those networks re-
quire a large memory to hold those skip lines. The skip lines
are often stored in external memory such as SRAM or DDR,
which dramatically increases the cost in terms of silicon
area footprint. Furthermore, the skip connections cause ad-
ditional latency, which would hinder real-time applications
of the system.
In real-time imaging systems, images are acquired line by
line in raster-scan order. Therefore, an efficient hardware
implementation of a CNN that runs in such a system
needs to be fully-pipelined. However, it is a challenging
task to implement a CNN topology that has long skip con-
nections in fully-pipelined hardware in a cost-effective way.
The main reason is that skip lines need to compensate for
all vertical delays of the entire network.
1 Intel Corporation, Santa Clara, CA. Correspondence to: Leo
F. Isikdogan <leo.f.isikdogan@intel.com>, Masayoshi Asama
<masayoshi.asama@intel.com>.
[Figure 1 diagram panels: Vertical IIR (Spatially Recurrent); Horizontal FIR (1x5 Conv); Pointwise FIR (1x1xC Conv)]
Figure 1. We break down a convolution layer into separable convolutions
in all three dimensions and replace the vertical component
with an IIR filter. The vertical IIR reduces the vertical delay in a
line-based hardware implementation, leading to significant savings
in silicon area.
In a line-based system, the vertical line delay is accumulated
in every convolution layer. For example, a 3x3 spatial
window would cause a 1-line delay, whereas two consecutive
3x3 convolutions would result in a 2-line delay. The
problem with the long skip lines is that once the data on
one end of the skip line is generated, it needs to be held in
memory until the data in the receiving end of the skip con-
nection is ready. The more layers a connection skips over,
the more lines need to be kept in memory. Therefore, the
size of the total memory required increases with the length
of the skip line. The memory requirements for the skip lines
can aggregate quickly and become a significant contributor
to the total silicon area needed to implement the network.
Moreover, the latency caused by the accumulated line delay
can also be problematic in latency-sensitive applications
such as autonomous driving systems.
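As a rough illustration of how skip-line memory grows with the delay it has to cover, the sketch below estimates the buffer size for one skip connection. The feature-map width, channel count, bit depth, and delay values are illustrative assumptions, not figures from this paper, and the helper name `skip_buffer_bits` is hypothetical.

```python
def skip_buffer_bits(delay_lines, width, channels, bits_per_value=8):
    """Bits needed to hold skip-line data until the receiving end is ready.

    Assumes one full line of `width * channels` values is buffered for
    every line of vertical delay that the skip connection spans.
    """
    return delay_lines * width * channels * bits_per_value


# Hypothetical example: 1920-pixel-wide, 18-channel, 8-bit features.
short_skip = skip_buffer_bits(delay_lines=2, width=1920, channels=18)
long_skip = skip_buffer_bits(delay_lines=40, width=1920, channels=18)

# Under these assumptions the buffer grows linearly with the delay spanned.
print(short_skip, long_skip, long_skip // short_skip)
```

Under this simple model, the buffer size is proportional to the number of delay lines, which is why the longest skip line dominates the memory budget.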
arXiv:2001.00630v1 [eess.IV] 2 Jan 2020
A naive way to reduce the vertical delay in such a model
would be to reduce the number of pooling and convolution
layers. However, this would reduce the receptive field,
which dictates the size of the area in the input that can affect
a single pixel at the output. The more layers and scales a
neural network has and the larger its kernels are, the larger the
receptive field becomes.
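The dependence of the receptive field on depth, kernel size, and downscaling can be sketched with the standard recurrence for stacked convolution and pooling layers. The layer stacks below are hypothetical examples chosen for illustration; they are not the MagIC topology.

```python
def receptive_field(layers):
    """Receptive field, in input pixels along one axis, of a layer stack.

    `layers` is a list of (kernel_size, stride) pairs in forward order.
    Each layer widens the field by (kernel_size - 1) times the product
    of all preceding strides.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Hypothetical stacks: two 3-tap convolutions versus a deeper stack
# with two stride-4 pooling layers.
shallow = [(3, 1), (3, 1)]
deep = [(3, 1), (4, 4), (3, 1), (4, 4), (3, 1)]

print(receptive_field(shallow), receptive_field(deep))
```

Removing the pooling layers from the deeper stack shrinks the field considerably, which is the trade-off described above.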
For image processing tasks, a large receptive field is needed
to be able to process medium and low spatial frequencies.
Furthermore, many tasks that lie at the intersection of image
processing and computer vision need a sizeable receptive
field to be able to make context-aware decisions. For exam-
ple, an image colorization model would need semantic cues
that span a large portion of its input to detect the sky and
process the pixels accordingly.
We propose a hardware-friendly neural network topology
that maintains a large receptive field without producing large
vertical delay lines. Our method significantly reduces the
memory requirements while reducing end-to-end latency by
replacing some of the finite impulse response (FIR) filters
with infinite impulse response (IIR) filters and compressing
the skip lines. We use this model to implement a Machine
Learning Imaging Core (MagIC) as fixed-function hardware
having configurable parameters.
MagIC can potentially work in multiple locations in an imag-
ing pipeline to implement or complement certain features in
pre-processing, post-processing, or anywhere in between in
the pipe. For example, MagIC can
• improve the image quality by learning a mapping be-
tween the outputs of low-cost and high-end image sig-
nal processors (ISPs);
• approximate compute-heavy image processing opera-
tors, such as denoising and deblurring algorithms;
• recover missing color information from the context,
such as converting RCCC (Red/Clear) images used in
advanced driver assistance systems to full color RGB
images;
• process single or stereo camera input to create depth
maps;
• demosaic non-traditional color filter array images, such
as hybrid RGB-IR and spatially varying exposures.
The flexibility of MagIC would help address customer-
specific requests without changing the underlying ISP hard-
ware. This would help create low-cost, yet powerful imaging
systems.
2 RELATED WORK
Our machine learning imaging core can be considered a
type of multi-purpose image signal processor. Image signal
processors (ISPs) typically implement a fully pipelined im-
age processing architecture that makes use of line buffers to
store all intermediate data between different stages of pro-
cessing. This architectural pattern provides highly efficient
image processing pipelines that achieve high throughput.
Prior work on building optimized imaging pipelines, such as
Darkroom (Hegarty et al., 2014) and FlexISP (Heide et al.,
2014), mainly focused on implementing particular image
processing algorithms efficiently on hardware. Although
defining image processing operators as fixed-function ASIC
blocks generally improves the performance of a system,
many algorithms can still be too complex to run in
low-power environments.
Recent work (Chen et al., 2017; Gharbi et al., 2017) showed
that many sophisticated image processing operators could
be efficiently approximated using convolutional neural net-
works. Practically, a U-Net-like (Ronneberger et al., 2015)
neural network topology that outputs pixel-wise labels can
approximate virtually any image processing operator, al-
though implementing the U-Net as-is in hardware would be
very costly.
Our work implements a convolutional neural network as a
fully-pipelined, multi-purpose imaging block. Prior work
on fully-pipelined hardware implementations of neural net-
works (Wu et al., 2019b; Whatmough et al., 2019) focused
on computer vision tasks. For example, the VisionISP
pipeline (Wu et al., 2019a) used a trainable vision scaler
(TVS), which implemented a shallow convolutional neural
network as an ISP block to perform vision-aware image
downscaling. It is indeed technically possible to train TVS
to approximate some image processing operators. However,
TVS was designed to be a pre-processor for computer vision
tasks and would not have sufficient model capacity as a stan-
dalone module to perform advanced imaging tasks, such as
image colorization (Zhang et al., 2016) and advanced image
deblurring (Nah et al., 2017).
Another fixed-topology neural network hardware, called
FixyNN (Whatmough et al., 2019), froze the first layers of a
MobileNet (Howard et al., 2017) model and implemented it
as a fixed feature extractor. Although this approach worked
well for image classification, using such a feature extractor as
an image processing block would be challenging. Images
in an ISP pipeline can have very different characteristics,
such as having different lens shading and noise profiles. The
location of the block in a pipe can also change the proper-
ties of the input. For example, images would look different
before and after denoising and sharpening. A model would
need to have configurable parameters to have the flexibil-
ity to adapt to different types of inputs. Furthermore, the
amount of downscaling done in image classification feature
extractors makes them unfeasible for learning pixel-to-pixel
transformations.
[Figure 2 diagram. Encoder path: Input (H x W x 6); Residual Convolution Block (Group = 3); Max Pooling; Residual Convolution Block (Group = 36); Max Pooling; Residual Hybrid FIR-IIR Convolution Block. Skip lines: Channel Compression (1x1 Conv) at each scale, plus DPCM Encoding and DPCM Decoding (5 bits per pixel). Decoder path: Upscaling, Concat, 3x3 Conv, 1x1 Conv; Output (H x W x 6). Feature map labels include H x W x 18, H x W x 6, H/4 x W/4 x 18, H/4 x W/4 x 36, H/4 x W/4 x 12, H/4 x W/4 x 6, H/16 x W/16 x 36, H/16 x W/16 x 72, and H/16 x W/16 x 12.
Inset, Residual Convolution Block: Channel Expansion (1x1 Conv) from H x W x C to H x W x 2C, followed by Group Convolution Blocks with residual additions; each Group Convolution Block is a Group Convolution (3x3) followed by a Pointwise Conv (1x1).
Inset, Residual Hybrid FIR-IIR Convolution Block: Channel Expansion (1x1 Conv) from H x W x C to H x W x 2C, followed by IIR-FIR Separable Convolutions with residual additions; each IIR-FIR Separable Convolution is a Vertical IIR (3x1), a Horizontal Conv (1x5), and a Pointwise Conv (1x1).]
Figure 2. Architecture of the network that we use to implement our machine learning imaging core. Our model uses compressed skip lines
and hybrid FIR-IIR blocks to reduce the amount of data that needs to be line-buffered in fully-pipelined hardware.
3 MODEL ARCHITECTURE
In this section, we describe the topology of the network
that we use to implement our machine learning imaging
core. We explain the design choices we made to build a
hardware-efficient model.
3.1 Skip-connection Compression
The macro architecture of our model is a typical, U-Net
style (Ronneberger et al., 2015), encoder-decoder network
that has skip connections between layers at the same spatial
resolution (Figure 2). One challenge in implementing
a U-Net-like model is the memory cost associated with the
long skip connections. Indeed, it is possible to remove
the skip lines altogether to save hardware area. However,
removing even only the longest skip connection results in a
drastic drop in output image quality as those skip lines help
recover the spatial granularity that is lost after downscaling
layers.
Instead of removing the skip connections, we compress the
data carried over them to reduce the memory requirement
of the model. First, we reduce the number of channels on a
skip line buffer by using point-wise convolutions, acting as
trainable linear projection layers. After this channel-wise
compression, we also use differential pulse code modulation
(DPCM) (Cutler, 1952) to reduce the number of bits needed
to store each pixel on the buffer. We use DPCM compression
only on the longest skip line, where the silicon area cost of
the DPCM encoder and decoder is negligible as compared
to the cost of the skip line buffers.
Overall, compressing the skip lines allows us to use only
internal memories and no external memory for the entire
inference operation.
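The DPCM step can be illustrated with a minimal sketch. The paper does not specify the predictor, the quantizer, or the step size, so the code below assumes a generic previous-sample predictor with a uniform quantizer on 8-bit samples; the 5-bit code width follows the label in Figure 2, and the function names and the step of 8 are hypothetical.

```python
def dpcm_encode(samples, bits=5, step=8):
    """Encode 8-bit samples as quantized differences.

    Each difference is taken against the decoder-side reconstruction of
    the previous sample, so quantization error does not accumulate.
    """
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    codes, recon = [], 0
    for s in samples:
        # Quantize the prediction error and clamp it to the signed code range.
        code = max(lo, min(hi, round((s - recon) / step)))
        codes.append(code)
        recon = max(0, min(255, recon + code * step))
    return codes


def dpcm_decode(codes, step=8):
    """Rebuild samples by accumulating the dequantized differences."""
    out, recon = [], 0
    for code in codes:
        recon = max(0, min(255, recon + code * step))
        out.append(recon)
    return out


row = [100, 104, 110, 108, 90, 95]
codes = dpcm_encode(row)
decoded = dpcm_decode(codes)
print(codes, decoded)
```

In this sketch each stored value needs 5 bits instead of 8, at the cost of a bounded reconstruction error for slowly varying signals; large jumps that exceed the code range take several samples to track.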
3.2 Separable FIR-IIR Filters
The concept of separable convolutions is commonly used
to design efficient neural network architectures, particularly
in the form of depthwise-separable convolutions (Howard
et al., 2017; Chollet, 2017; Sandler et al., 2018). Depthwise-separable
convolutions replace a K × K × C_in × C_out convolution with
a K × K convolution for each of the C_in input channels,
followed by a point-wise 1 × 1 × C_in × C_out convolution.
Depthwise separation usually leads to significant savings
in the number of parameters since, per input channel, K × K + C_out
is typically much smaller than K × K × C_out. It is possible to take this
kernel separability one step further and separate a K × K
filter spatially as K × 1 and 1 × K filters. This type of spatial
separation is not commonly used in modern convolutional
neural network architectures, since spatial separation does
not reduce the number of parameters significantly enough
for small kernel sizes. However, spatial separability would
still provide benefits when the kernels are large and the costs
of horizontal and vertical convolutions differ.
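The parameter counts discussed above can be compared with a short calculation. Biases are ignored, and the kernel size and channel counts in the example are hypothetical values chosen for illustration.

```python
def standard_params(k, c_in, c_out):
    """Weights in a full K x K x C_in x C_out convolution."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """One K x K filter per input channel, then a 1 x 1 pointwise layer."""
    return k * k * c_in + c_in * c_out


def spatially_separable_params(k, c_in, c_out):
    """K x 1 and 1 x K filters per input channel, then a pointwise layer."""
    return 2 * k * c_in + c_in * c_out


# Hypothetical layer: 5 x 5 kernels, 72 input and 72 output channels.
k, c_in, c_out = 5, 72, 72
print(standard_params(k, c_in, c_out))
print(depthwise_separable_params(k, c_in, c_out))
print(spatially_separable_params(k, c_in, c_out))
```

With these numbers, the step from standard to depthwise-separable convolutions removes most of the weights, while the additional spatial separation saves comparatively little, which matches the observation that its main benefit lies elsewhere.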
In a line-based system, the cost of vertical convolutions
can be disproportionally high due to the number of lines
that need to be buffered before the convolution for a given
window can be computed. For example, a 1 × 5 convolution
would need only 4 elements to be buffered, whereas a 5 × 1
convolution would need 4 lines of data to be buffered. We
address this problem by replacing the vertical convolutions
with IIR filters (Figure 1).
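The asymmetry between horizontal and vertical kernels on raster-scan input can be made concrete with a small count of buffered values. The helper name and the 1920-pixel line width are assumptions for illustration.

```python
def buffered_values(kernel_h, kernel_w, width, channels=1):
    """Values held before a kernel_h x kernel_w window can be computed
    on input that arrives pixel by pixel in raster-scan order."""
    lines = (kernel_h - 1) * width * channels  # full lines above the current one
    taps = (kernel_w - 1) * channels           # earlier pixels on the current line
    return lines + taps


width = 1920  # assumed line width
horizontal = buffered_values(1, 5, width)  # 1 x 5 kernel
vertical = buffered_values(5, 1, width)    # 5 x 1 kernel
print(horizontal, vertical)
```

The horizontal kernel needs a handful of registers, whereas the vertical kernel of the same length needs four complete lines of storage.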
Using an IIR filter in the vertical direction can approximate
a convolution without producing vertical delay lines. We use
a first-order IIR to approximate a vertical (Nx1) convolution.
We implement this operator as a spatially-recurrent neural
network cell as:
h[t] = h[t−1] · w_1 + x[t−1] · w_2 + x[t] · w_3    (1)
where x is the input, h is the output, w stands for the train-
able weights, and t indicates the spatial position in the verti-
cal axis rather than time.
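A minimal software sketch of Equation 1 is given below for a single channel. It processes one line at a time and keeps only the previous input line and the previous output line as state. The zero initial state and the example weights are assumptions made for illustration; in the model the weights are learned.

```python
def vertical_iir(lines, w1, w2, w3):
    """Apply h[t] = h[t-1]*w1 + x[t-1]*w2 + x[t]*w3 down the rows.

    `lines` is a list of rows (lists of floats) in raster-scan order.
    Only one previous input row and one previous output row are kept,
    so each output row is available as soon as its input row arrives.
    """
    width = len(lines[0])
    prev_h = [0.0] * width  # zero initial state is an assumption
    prev_x = [0.0] * width
    out = []
    for x in lines:
        h = [ph * w1 + px * w2 + xi * w3
             for ph, px, xi in zip(prev_h, prev_x, x)]
        out.append(h)
        prev_h, prev_x = h, x
    return out


# Impulse response of a one-pixel-wide column with example weights.
impulse = [[1.0], [0.0], [0.0], [0.0]]
response = [row[0] for row in vertical_iir(impulse, w1=0.5, w2=0.25, w3=1.0)]
print(response)
```

The response to a single nonzero line decays geometrically instead of stopping after a fixed number of taps, which is how a first-order filter can cover a tall vertical extent with constant state.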
Recurrent modules are typically used to train machine learn-
ing models on time series data. In our case, they are used
to summarize pixels in the vertical direction. Unlike fixed-
window convolutions, a recurrent module can start process-
ing its input as the pixels arrive line by line without having
to buffer the lines that are spanned by the fixed-sized win-
dow. Therefore, using a recurrent module in the vertical
direction reduces the latency between the input and
the output of the model.
The recurrent module we use approximates a simple column-
convolution and is not expected to remember long term de-
pendencies. Therefore, it does not use sophisticated gating
mechanisms as long short-term memory (LSTM) or gated
recurrent unit (GRU) modules do.
We use the spatially recurrent modules to define hybrid FIR-
IIR blocks that replace more expensive convolution blocks.
The hybrid FIR-IIR blocks use 3-way (horizontal, vertical,
and depthwise) separable convolutions, where IIR filters
approximate the vertical components.
We use the hybrid FIR-IIR blocks only in the coarsest-scale
(bottleneck) layers, where the impact of convolution on the
overall vertical delay in the system is the largest. For exam-
ple, a 3x3 filter in the bottleneck would cause a 16-line delay
as compared to a 1-line delay in the first layer. Furthermore,
IIR filters are known to handle low frequencies well. This
characteristic makes hybrid FIR-IIR filters well suited for
the bottleneck layer, which processes low-frequency fea-
tures. Since the height of the feature maps at the bottleneck
layer is reasonably small, our model does not suffer from
exploding or vanishing gradient problems.
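The delay figures quoted above follow from expressing the delay of a centered kernel in full-resolution lines. The sketch below assumes that a kernel of height K waits for (K − 1)/2 lines at its own scale and that the model has two stride-4 pooling stages, as described in the paper; the helper name is hypothetical.

```python
def delay_in_input_lines(kernel_h, downscale):
    """Vertical delay of a centered kernel in full-resolution lines.

    A kernel of height `kernel_h` waits for (kernel_h - 1) // 2 lines at
    its own scale; each of those lines spans `downscale` input lines.
    """
    return (kernel_h - 1) // 2 * downscale


scales = [1, 4, 16]  # full resolution, after one and after two 4x poolings
delays = [delay_in_input_lines(3, s) for s in scales]
print(delays)
```

The same 3x3 window is therefore sixteen times more expensive in vertical delay at the bottleneck than in the first layer.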
In the other scales, we use convolution blocks that consist
of group convolutions followed by pointwise convolutions,
where the number of groups is tuned to find a balance be-
tween hardware cost and the quality of the produced images.
The convolution blocks in the first scale have 3 groups of
convolutions. The second scale convolutions have their num-
ber of groups equal to the number of their input channels,
which makes them depthwise separable convolutions.
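The effect of the number of groups on the weight count of a convolution can be sketched as follows. The 36-channel layer is a hypothetical example and biases are ignored.

```python
def group_conv_params(k, c_in, c_out, groups):
    """Weights in a K x K group convolution.

    Each output channel only sees the c_in // groups input channels
    that belong to its own group.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return k * k * (c_in // groups) * c_out


# Hypothetical 3 x 3 layer with 36 input and 36 output channels.
for groups in (1, 3, 36):
    print(groups, group_conv_params(3, 36, 36, groups))
```

When the number of groups equals the number of input channels, each filter sees a single channel and the layer becomes a depthwise convolution.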
3.3 Other Design Choices
Our model inputs and outputs up to 6-channel images.
Those six channels can be used to process two RGB im-
ages captured by a pair of cameras, two consecutive frames
captured by a single camera, or a 6-band multispectral image
acquired by single or multiple sensors. This design choice
opens up possibilities for a variety of applications that involve
stereo depth, sensor fusion, multispectral imagery, and
temporal processing.
                          Hybrid FIR-IIR    Depthwise Separable
                          Blocks            Convolutions
Silicon area (mm²)
  internal memory         1.42              3.73
  logic                   1.05              1.05
  registers               0.62              0.64
  total                   3.09              7.14
Receptive field (H×W)     H×184             184×184
Table 1. Silicon area cost and receptive field of using Hybrid FIR-IIR
blocks at the bottleneck layer as compared to using depthwise
separable convolutions. Hybrid FIR-IIR blocks lead to significant
savings in the hardware footprint associated with internal memory.
Model                                          Mean PSNR   Mean SSIM   Fully-pipelined hardware area
MagIC with Hybrid FIR-IIR Blocks               27.36       0.80        3.09 mm²
MagIC with Depthwise Separable Convolutions    27.39       0.80        7.14 mm²
U-Net                                          28.38       0.83        ~500 mm²
Table 2. Comparison of variants of MagIC and a full-blown U-Net
model in terms of PSNR and SSIM on the test set as well
as silicon area when the network topologies are modeled as
line-buffered hardware using TSMC 16 nm technology.
The encoder part of our model uses ResNet-like (He et al.,
2016) residual connections. Unlike the long skip lines, those
residual connections have a minimal cost in hardware. Using
residual connections made the model easier to train,
stabilized the training, and improved the consistency of the
results.
The max-pooling layers in our model use a stride of 4 to be
able to cover a broader range of scales using fewer layers.
Our empirical results showed that given the same number
of max-pooling layers, using a stride of 4 produces higher-
quality outputs than using a stride of 2. Using twice as many
encoder blocks followed by max-pooling layers having a
stride of 2 does further improve the results but at the
cost of increased hardware area. Reducing the depth of the
model using 4x4 max-pooling layers helped us reduce the
hardware footprint and design a neural network topology
for very small silicon area budgets (∼3 mm²). For smaller
area budgets, we also provide the option of approximating
the multiplications in the model with low-cost shift-sum
multipliers (Oron & Michael, 2017), which further reduces
the hardware cost.
Figure 3. Qualitative comparison of MagIC and U-Net. Given distorted Red/Clear images, the models denoise and deblur the images
while restoring the missing color information to produce clean, sharp, full-color RGB images. From top to bottom: input images, MagIC
using traditional depthwise separable convolutions, MagIC using our proposed hybrid FIR-IIR blocks, U-Net, and ground truth. Outputs
of the two variants of MagIC are virtually the same although the one using the hybrid FIR-IIR blocks has a much smaller hardware
footprint.
In our area calculations, we assumed the input to be
1920 × 1080 pixels at 30 frames per second. Both the
input resolution and the frame rate impact the total silicon
area needed to implement our proposed hardware. Specifi-
cally, the input image size affects the internal memory area,
whereas the frame rate impacts the logic area (Table 1). The
area for the memory scales linearly with the image width
since the images are kept on line buffers. Wider inputs re-
quire larger line buffers to accommodate the input, output,
and intermediate feature maps throughout the pipeline. The
logic area also scales approximately linearly with the frame
rate except for frame rates lower than 30 fps. For example,
doubling the frame rate would require twice as many pixels
to be processed in one cycle, resulting in approximately
twice the logic area. However, reducing the frame rate
by half would not lead to proportional savings in the logic
area due to overheads and hardware inefficiencies.
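The scaling behavior described above can be turned into a rough first-order estimate. The baseline values are the hybrid FIR-IIR column of Table 1 (1920 pixels wide, 30 fps). The sketch assumes that register area stays constant and that frame rates below 30 fps bring no logic savings at all; both are simplifying assumptions, and the function name is hypothetical.

```python
# Baseline areas in mm^2 from Table 1 (hybrid FIR-IIR variant).
BASE = {"memory": 1.42, "logic": 1.05, "registers": 0.62}


def estimate_area_mm2(width, fps, base_width=1920, base_fps=30):
    """First-order area estimate for a different input width and frame rate."""
    # Line buffers scale linearly with the image width.
    memory = BASE["memory"] * (width / base_width)
    # Logic scales with frame rate above the baseline; no savings are
    # assumed below it (a conservative simplification).
    logic = BASE["logic"] * (max(fps, base_fps) / base_fps)
    # Register area is held constant (assumption).
    registers = BASE["registers"]
    return memory + logic + registers


for width, fps in [(1920, 30), (3840, 30), (1920, 60), (1920, 15)]:
    print(width, fps, round(estimate_area_mm2(width, fps), 2))
```

Such an estimate is only indicative; actual area depends on synthesis results for the target configuration.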
We calculated the area by modeling our proposed hardware
using the TSMC 16nm process technology. However, the
actual silicon area of the hardware can be made smaller. The
area would be further reduced using more contemporary
manufacturing process technologies, such as the Intel 10nm
technology.
4 RESULTS
We evaluated our proposed model on a combined denoising,
deblurring, and coloring task on the KITTI dataset (Geiger
et al., 2012), emulating an imperfect front-view camera
on a vehicle. This exemplary task encompassed both low-level
image processing operations, such as denoising and
sharpening, and higher-level tasks, such as inferring
the color of an object given texture and context.
[Figure 4 plot: SSIM (0.6 to 1) versus hardware area in mm² on a logarithmic scale (1 to 256) for MagIC with Hybrid FIR-IIR Blocks, MagIC with Depthwise Separable Convolutions, Fully-Pipelined U-Net, and U-Net on CNN accelerator]
Figure 4. Comparison of MagIC with and without the hybrid FIR-IIR
blocks, U-Net implemented as fully-pipelined hardware, and
U-Net running on a CNN accelerator.
We used images in the KITTI dataset as reference images
and generated distorted versions of those images. First, we
applied an approximate color space conversion to the input
images to emulate RCCC (Red/Clear) sensors. RCCC sen-
sors use clear filters instead of the blue and green filters and
are typically designed for automotive use. We converted
RGB images to RCC images by replacing the green and
blue channels by grayscale image intensity (Y) channels.
The Y-channel approximated the clear (C) channels in an
RCCC image. Then, we blurred the images using a random
Gaussian blur operator to simulate a point spread function.
Finally, we added Gaussian noise to each channel indepen-
dently, in the linear domain, at two different scales. We
used a noise variance that correlated with signal intensity,
simulating the noise profile of a real sensor. We separated
a portion of the resulting image pairs for test and used the
rest for training.
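Two of the distortion steps can be sketched per pixel as follows. The paper does not give the luma weights or the noise parameters, so the sketch assumes BT.601 luma coefficients for the Y channel and a noise variance that is an affine function of the signal; the function names and the `gain` and `floor` values are hypothetical. The Gaussian blur step is omitted for brevity.

```python
import random


def rgb_to_rcc(pixel):
    """Replace green and blue with luma to emulate a Red/Clear pixel."""
    r, g, b = pixel
    y = 0.299 * r + 0.587 * g + 0.114 * b  # BT.601 weights (assumed)
    return (r, y, y)


def add_signal_dependent_noise(value, rng, gain=0.01, floor=0.0001):
    """Add Gaussian noise whose variance grows with the signal.

    Assumed model: variance = gain * value + floor, for value >= 0
    in the linear domain.
    """
    sigma = (gain * value + floor) ** 0.5
    return value + rng.gauss(0.0, sigma)


rng = random.Random(0)  # fixed seed for repeatability
clean = rgb_to_rcc((0.8, 0.4, 0.2))
noisy = tuple(add_signal_dependent_noise(v, rng) for v in clean)
print(clean, noisy)
```

Noise is drawn independently for each channel, in line with the description above.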
We trained MagIC to reconstruct the reference images given
the distorted images as input, learning to denoise and deblur
the distorted images while restoring the missing color infor-
mation in the RCCC input. To measure the impact of the
hybrid FIR-IIR blocks, we also trained a variant of MagIC
that used depthwise separable convolutions instead of sep-
arable FIR-IIR filters. Using the hybrid FIR-IIR blocks
significantly reduced the hardware footprint (Table 1) with-
out having a negative impact on qualitative (Figure 3) and
quantitative results (Table 2).
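For reference, PSNR can be computed from the mean squared error between a reference image and a reconstruction. The sketch below uses the standard definition on flattened pixel lists and assumes a peak value of 1.0; the exact evaluation settings used for Table 2 are not stated in the paper.

```python
import math


def psnr(reference, test, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


reference = [0.0, 0.5, 1.0, 0.25]
reconstruction = [0.1, 0.4, 0.9, 0.35]
print(psnr(reference, reconstruction))
```

Higher values indicate a closer match to the reference, so small PSNR differences between variants indicate comparable reconstruction quality.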
We also compared both variants to a full-blown U-Net
(Ronneberger et al., 2015) model trained using the same
setup. As expected, the U-Net model had higher PSNR and
SSIM scores than MagIC, given its orders of magnitude
larger model capacity. However, MagIC was able to achieve
an image quality close to the reference U-Net model within
an area budget of only ∼3 mm², modeled with TSMC 16 nm
technology. If we were to implement U-Net as-is using
the same fully-pipelined hardware architecture, the overall
footprint of the resulting hardware would be over 500 mm².
It would indeed be more feasible to implement U-Net using
a generic CNN accelerator within an area budget of
∼10 mm² rather than as a fully-pipelined hardware block (Figure 4).
However, using a generic accelerator would result
in a lower utilization rate, a lower frame rate, and over 8× the
power consumption, while still being over 3× larger than
our solution.
5 CONCLUSION
We described a low-cost machine learning imaging core
that used a fixed-topology neural network to process images
in a multitude of ways. We used a bag of tricks to mini-
mize the silicon area needed to implement our model. We
used 3-way separable convolutions and approximated the
convolutions in the vertical direction using infinite impulse
response (IIR) filters. This approximation significantly re-
duced the silicon area needed to implement the underlying
model in hardware. Our proposed hybrid FIR-IIR blocks
not only reduced the latency but also increased the receptive
field of the model, improving the contextual coherence of
the results. We further reduced the cost of our proposed
system by compressing skip lines and carefully designing
the topology of our model. Finally, we showed that our
proposed hardware was able to perform both low-level and
high-level imaging tasks concurrently, within a silicon area
budget of only ∼3 mm².
In this paper, we focused on optimizing a U-Net-like neural
network architecture to perform pixel-to-pixel image pro-
cessing. We believe that our proposed methods have the
potential to be useful in numerous applications beyond im-
age processing. For example, the separable FIR-IIR filters
can help shallow image classification models capture the
context information better. As future work, it would be
interesting to study how the concepts presented in this paper
would generalize to a broader range of applications.
REFERENCES
Chen, Q., Xu, J., and Koltun, V. Fast image processing
with fully-convolutional networks. In Proceedings of the
IEEE International Conference on Computer Vision, pp.
2497–2506, 2017.
Chollet, F. Xception: Deep learning with depthwise separa-
ble convolutions. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 1251–
1258, 2017.
Cutler, C. C. Differential quantization of communication
signals, 1952. US Patent 2,605,361.
Geiger, A., Lenz, P., and Urtasun, R. Are we ready for
autonomous driving? The KITTI vision benchmark suite.
In Proceedings of IEEE Conference on Computer Vision
and Pattern Recognition, pp. 3354–3361, 2012.
Gharbi, M., Chen, J., Barron, J. T., Hasinoff, S. W., and
Durand, F. Deep bilateral learning for real-time image
enhancement. ACM Transactions on Graphics, 36(4):118,
2017.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual
learning for image recognition. In Proceedings of IEEE
Conference on Computer Vision and Pattern Recognition,
pp. 770–778, 2016.
Hegarty, J., Brunhaver, J., DeVito, Z., Ragan-Kelley, J.,
Cohen, N., Bell, S., Vasilyev, A., Horowitz, M., and
Hanrahan, P. Darkroom: compiling high-level image pro-
cessing code into hardware pipelines. ACM Transactions
on Graphics, 33(4):144–1, 2014.
Heide, F., Steinberger, M., Tsai, Y.-T., Rouf, M., Pajak, D.,
Reddy, D., Gallo, O., Liu, J., Heidrich, W., Egiazarian,
K., et al. FlexISP: A flexible camera image processing
framework. ACM Transactions on Graphics, 33(6):231,
2014.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang,
W., Weyand, T., Andreetto, M., and Adam, H. MobileNets:
Efficient convolutional neural networks for mobile vision
applications. arXiv preprint arXiv:1704.04861, 2017.
Long, J., Shelhamer, E., and Darrell, T. Fully convolutional
networks for semantic segmentation. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 3431–3440, 2015.
Nah, S., Hyun Kim, T., and Mu Lee, K. Deep multi-scale
convolutional neural network for dynamic scene deblur-
ring. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pp. 3883–3891,
2017.
Oron, S. and Michael, G. Instruction and logic for shift-sum
multiplier, June 13 2017. US Patent 9,678,749.
Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional
networks for biomedical image segmentation.
In International Conference on Medical Image Comput-
ing and Computer-assisted Intervention, pp. 234–241.
Springer, 2015.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and
Chen, L.-C. MobileNetV2: Inverted residuals and linear
bottlenecks. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 4510–
4520, 2018.
Whatmough, P. N., Zhou, C., Hansen, P., Venkataramanaiah,
S. K., Seo, J.-s., and Mattina, M. FixyNN: Efficient hardware
for mobile computer vision via transfer learning.
arXiv preprint arXiv:1902.11128, 2019.
Wu, C.-T., Ain-Kedem, L., Gandra, C. R., Isikdogan, F., and
Michael, G. Trainable vision scaler, 2019a. US Patent
App. 16/232,336.
Wu, C.-T., Isikdogan, L. F., Rao, S., Nayak, B., Gerasimow,
T., Sutic, A., Ain-Kedem, L., and Michael, G. VisionISP:
Repurposing the image signal processor for computer
vision applications. In Proceedings of IEEE International
Conference on Image Processing, 2019b.
Zhang, R., Isola, P., and Efros, A. A. Colorful image col-
orization. In European Conference on Computer Vision,
pp. 649–666. Springer, 2016.
DISCLAIMER
No license (express or implied, by estoppel or otherwise) to
any intellectual property rights is granted by this document.
This document contains information on products, services
and/or processes in development. All information provided
here is subject to change without notice. Intel and the Intel
logo are trademarks of Intel Corporation in the U.S. and/or
other countries.
© Intel Corporation.
