Convolutional Neural Networks (CNNs) have become indispensable for solving machine learning tasks in speech recognition, computer vision, and other areas that involve high-dimensional data. A CNN filters the input features using a network containing spatial convolution operators with compactly supported stencils. In practice, the input data and the hidden features consist of a large number of channels, which in most CNNs are fully coupled by the convolution operators. This coupling leads to immense computational cost in the training and prediction phases. In this paper, we introduce LeanConvNets, which are derived by sparsifying fully-coupled operators in existing CNNs. Our goal is to improve the efficiency of CNNs by reducing the number of weights, floating point operations, and latency times, with minimal loss of accuracy. Our lean convolution operators involve tuning parameters that control the trade-off between the network's accuracy and computational costs. These convolutions can be used in a wide range of existing networks, and we exemplify their use in residual networks (ResNets) and U-Nets. Using a range of benchmark problems from image classification and semantic segmentation, we demonstrate that the resulting LeanConvNets' accuracy is close to that of state-of-the-art networks while being computationally less expensive. In our tests, the lean versions of ResNet and U-Net slightly outperform comparable reduced architectures such as MobileNets and ShuffleNets.
I. INTRODUCTION
CONVOLUTIONAL neural networks (CNNs) [1] are among the most effective machine learning approaches for processing structured, high-dimensional data such as voice recordings, images, and videos, and have become indispensable in, e.g., speech recognition [2], [3], audio processing [4], and image classification [5].
In the forward propagation, a CNN filters the input features through a sequence of layers, which are composed of convolution operators, biases, normalization layers, nonlinear activation functions, and pooling operators. In imaging tasks, the input features and the hidden features at each layer can be grouped into several channels, each of which can be interpreted as an image. The stencils that parameterize the convolution operators are typically chosen to have a very small support around the origin. Hence, each feature in an image interacts with features from a small neighborhood in its channel and, in the standard fully-coupled approach, with the features from the same neighborhood in the remaining channels [6], [7]. A drawback of the fully-coupled approach is that the number of convolution operators in a layer is proportional to the product of the number of input and output channels. This scaling can render wide architectures (i.e., architectures whose layers contain a large number of channels) prohibitively expensive in training and inference. It also complicates the deployment of such CNNs, especially on devices with limited memory and computing resources such as autonomous vehicles, drones, and smartphones.
In recent years there has been an effort to improve the efficiency of CNNs. Common approaches to reduce the number of weights in CNNs are pruning [8]-[13], sparsity [14]-[16], and quantization [17]-[19]. Pruning reduces the number of weights in the network after training. The fact that in many cases large portions of a network's weights can be removed with only a minimal reduction of its accuracy indicates a considerable redundancy and over-parameterization of standard CNNs [20]. While pruning is effective in reducing the number of weights and floating point operations (FLOPs), it generally leads to an unstructured sparsity pattern of the weights, which increases memory access costs. The lack of structure also complicates the efficient deployment of the CNN on hardware.
Another approach to improve the efficiency of CNNs is to replace fully-coupled convolution operators with sparser convolution operators (i.e., operators with fewer non-zero elements) before training. One typical building block is the grouped convolution operator, which partitions the channels into groups and only allows coupling within each group; see, e.g., [5]. When the number of groups equals the number of channels, one obtains a depth-wise convolution operator, which is a block diagonal matrix whose blocks are spatial convolution operators. The depth-wise convolution operator filters each channel of the image data separately and thus restricts the interaction of each feature to its nearby features in the same channel. It is common to use the depth-wise operator in conjunction with fully-connected point-wise 1 × 1 convolutions to introduce coupling across the channels.
A few CNN architectures have been derived using depth-wise and 1 × 1 convolution operators, often augmented with bottleneck or shuffling techniques; see, e.g., [21]-[25]. These works use the depth-wise and 1 × 1 convolutions separately, with activation and batch normalization layers in between them. This typically requires a redesign of existing CNN architectures. To reduce the ratio between FLOPs and memory access in depth-wise convolution operators, replacing convolutions with shifts has been proposed in [26]. It is known, however, that memory access, and not necessarily FLOPs, is often the real bottleneck on modern parallel hardware. In fact, state-of-the-art implementations of depth-wise convolution operators on GPUs involve more FLOPs than necessary in order to achieve lower runtimes; see, e.g., [27]. Nevertheless, whether the dominant cost is the storage of the parameters, the FLOPs, or the memory access is highly dependent on the hardware at hand. Therefore, it is desirable to build convolution operators and architectures that are flexible in their definition, so that they can be configured as necessary on any specific hardware.
In this paper, we introduce LeanConvNets, a new family of CNNs built as lean versions of known networks, using lean convolution operators. These operators reduce the number of weights, computation time, and FLOPs while achieving competitive results. The lean operators preserve the overall network structure and can thus be applied to a variety of networks, e.g., residual networks (ResNets) [28], [29] and U-Net encoder-decoder networks [30], which have been two of the most reliable architectures in the literature. The following two aspects set our work apart from other approaches:
1) We obtain a new operator as the sum of grouped and 1 × 1 convolution operators. Using a prototype implementation, we show that handling both operations simultaneously reduces the computation time required to apply the operator. This design also introduces several opportunities for optimization in hardware through its parallelism and its minimal number of memory accesses. Adding the operators also reduces the number of weights slightly. Using grouped instead of depth-wise convolution operators allows one to gradually enlarge the portion of spatial convolutions in order to improve the performance of the lean networks.
2) We present two ways to reduce the spatial kernel size that further decrease the number of weights and FLOPs and are easy to implement efficiently. In the first method, we replace the standard 3 × 3 stencil by a 5-point stencil. In the second method, we filter two-dimensional images using a one-dimensional convolution operator (3 × 1 or 1 × 3, depending on the memory layout) and its transpose applied at the memory write. This operator can be implemented with the same number of memory accesses as the 1 × 1 convolution, since the memory of the feature maps is sequential in one dimension.
The remainder of the paper is organized as follows. In Sec. II, we discuss existing convolution operators and their computational costs in the context of residual neural networks. In Sec. III, we introduce a family of lean convolution operators, analyze their costs, and outline their implementation. In Sec. IV, we provide extensive numerical evidence for the efficacy of the resulting LeanConvNets for image classification and semantic segmentation. In Sec. V, we summarize the paper and discuss directions for future research.
II. PRELIMINARIES AND NOTATION
We now introduce our main notation and define the supervised classification and semantic segmentation problems that we use to validate our methods; for more details see [7]. For brevity, we restrict the discussion to images, although the techniques derived here can also be used for other structured data types such as audio or video data. In supervised learning, we are given a set of training data consisting of pairs

\{ (y^{(k)}, c^{(k)}) \}_{k=1}^{s},

where y^{(k)} is the k-th input image and c^{(k)} either represents the probabilities for the entire image (in classification) or each pixel (in segmentation) to belong to one of the pre-defined classes. Our goal is to define a neural network architecture and train its weights θ ∈ R^p and the weights of a linear classifier, denoted by W ∈ R^{n_c × n_{out}} and μ ∈ R^{n_c}, such that

S( W y^{(k)}(θ) + μ ) ≈ c^{(k)}, for all k = 1, 2, . . . , s.
Here, S is a softmax hypothesis function and y^{(k)}(θ) ∈ R^{n_{out}} denotes the output features of the neural network applied to the k-th data sample. The learning problem can be phrased as a minimization problem of a regularized empirical loss function,

\min_{θ, W, μ} \; \frac{1}{s} \sum_{k=1}^{s} L\big( S( W y^{(k)}(θ) + μ ), c^{(k)} \big) + R(θ, W, μ),

where L is the cross entropy loss and R is a regularization function. The optimization problem is usually solved using variants of stochastic gradient descent (SGD); see the original work [31] and the survey [32].
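As a minimal illustration of this objective, the following PyTorch sketch (our own, assuming a feature-extractor module net, integer class labels c, and a simple squared-norm choice for R, which is not necessarily the regularizer used in our experiments) evaluates the regularized empirical loss for one mini-batch:

    import torch
    import torch.nn.functional as F

    def empirical_loss(net, W, mu, y0, c, weight_decay=1e-4):
        # y(theta): output features of the network for a mini-batch y0
        y = net(y0)                            # shape (batch, n_out)
        logits = y @ W.t() + mu                # linear classifier W, mu
        data_fit = F.cross_entropy(logits, c)  # includes the softmax S
        # a simple squared-norm stand-in for the regularizer R:
        reg = sum((p ** 2).sum() for p in net.parameters())
        return data_fit + 0.5 * weight_decay * reg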
As a baseline architecture, we consider residual networks (ResNets) [28], [29], which have been very successful in many imaging tasks. Given a data sample y_0, the forward propagation through an N-layer ResNet is defined as

y_{l+1} = y_l + F(θ_l, y_l), \quad \text{for } l = 0, \ldots, N-1,    (1)

where θ_l is the set of weights associated with the l-th layer, and the network output is y(θ) = y_N. There are different choices for the nonlinear term in (1), e.g.,

F(θ_l, y_l) = K_2(θ_{l,2}) \, σ\big( N( K_1(θ_{l,1}) \, σ( N(y_l) ) ) \big).    (2)

Here, σ(x) = max{x, 0} denotes an element-wise rectified linear unit (ReLU) activation function, and the weights are partitioned into θ_{l,1} and θ_{l,2} that parameterize the two linear operators K_1 and K_2, respectively. For brevity, we omit the weights of the normalization layer N.
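For concreteness, here is a minimal PyTorch sketch of one residual step (1) with the nonlinear term (2); the module name, channel count, and the use of BatchNorm2d for N are illustrative choices, not the paper's exact configuration:

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        # One step y_{l+1} = y_l + K2 sigma(N(K1 sigma(N(y_l)))), cf. (1)-(2).
        def __init__(self, channels):
            super().__init__()
            self.norm1 = nn.BatchNorm2d(channels)
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.norm2 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.relu = nn.ReLU()

        def forward(self, y):
            z = self.conv1(self.relu(self.norm1(y)))
            z = self.conv2(self.relu(self.norm2(z)))
            return y + z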
In convolutional ResNets, the operators K_i in (2) are composed of spatial convolution operators. If the input y has c_in channels and the output Ky has c_out channels, then the common choice for an operator K is a c_out × c_in block matrix of convolutions, introducing full coupling across the channels. For example, if c_in = c_out = 4, then the convolution operators in (2) can be written in matrix form as

K_{full} = \begin{bmatrix} C^{(1,1)} & C^{(1,2)} & C^{(1,3)} & C^{(1,4)} \\ C^{(2,1)} & C^{(2,2)} & C^{(2,3)} & C^{(2,4)} \\ C^{(3,1)} & C^{(3,2)} & C^{(3,3)} & C^{(3,4)} \\ C^{(4,1)} & C^{(4,2)} & C^{(4,3)} & C^{(4,4)} \end{bmatrix},    (3)

where C^{(i,j)} = C(θ^{(i,j)}) denotes the sparse matrix associated with the spatial convolution kernel parameterized by the 3 × 3 filter θ^{(i,j)} ∈ R^9. For ease of notation, we do not explicitly denote the dependency on θ in the following. The sparsity pattern of this operator is visualized in the leftmost subplot of Fig. 1 for an image size of 6 × 6. Applying K_full requires O(c_in · c_out) FLOPs per pixel, and K_full has 9 · c_in · c_out weights. In practice, each K_full can have millions of weights.

Fig. 1: Example of sparsity patterns of different convolution operators for 6 × 6 images with four input and output channels, respectively. The leftmost subplot shows the sparsity pattern of a 3 × 3 fully-coupled convolution operator. The next two subplots depict the grouped convolution operators for g = 2 and g = 4, respectively. The remaining two subplots show the proposed lean grouped and depth-wise operators that are based on a sum of a fully-coupled 1 × 1 convolution and a grouped (or depth-wise) spatial convolution operator.
Grouped convolutions are popular alternatives to K_full as they reduce the number of weights and computations. In our example, we can restrict the interaction of the channels to g = 2 groups, which leads to the block diagonal matrix

K_{g=2} = \begin{bmatrix} C^{(1,1)} & C^{(1,2)} & 0 & 0 \\ C^{(2,1)} & C^{(2,2)} & 0 & 0 \\ 0 & 0 & C^{(3,3)} & C^{(3,4)} \\ 0 & 0 & C^{(4,3)} & C^{(4,4)} \end{bmatrix}.    (4)

This reduces the number of weights and FLOPs by a factor of g compared to the full convolution. Clearly, K_full = K_{g=1}, and for g = c_in we obtain the depth-wise convolution; the sparsity patterns of K_{g=2} and K_{g=4} are shown in Fig. 1. Grouped convolutions can be extended to rectangular operators when g divides both c_in and c_out.
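Grouped operators map directly to the groups argument of PyTorch's Conv2d; a small sketch (with illustrative channel counts) confirms the factor-g reduction in weights:

    import torch.nn as nn

    c_in, c_out = 64, 64
    full      = nn.Conv2d(c_in, c_out, 3, padding=1, bias=False)               # g = 1
    grouped   = nn.Conv2d(c_in, c_out, 3, padding=1, groups=2, bias=False)     # g = 2
    depthwise = nn.Conv2d(c_in, c_out, 3, padding=1, groups=c_in, bias=False)  # g = c_in

    for k in (full, grouped, depthwise):
        print(k.weight.numel())   # 36864, 18432, 576: i.e. 9*c_in*c_out / g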
III. LEAN CONVOLUTIONAL OPERATORS
We now introduce a family of lean convolutional operators that achieve competitive performance and reduce the number of weights, memory accesses, and FLOPs. It has been shown that 1 × 1 convolutions can be very effective if complemented with a relatively small number of spatial convolutions [21]-[25]. In these settings, the computational cost of the 1 × 1 convolutions (in terms of FLOPs and number of weights) dominates the cost of the spatial convolutions as the number of channels grows. It has also been observed that the accuracy of the network suffers from the relative shortage of spatial convolutions, which is often explained by a relatively small number of weights. To increase the accuracy, our lean convolution operators aim to allocate the weights more efficiently between grouped and 1 × 1 convolutions. To this end, we reduce the kernel size on the one hand and add spatial convolutions on the other, which slightly increases the computational costs. The group size of the spatial convolution is a hyperparameter that trades off accuracy against computational efficiency. This also allows us to accommodate different computational devices without changing the high-level structure of the network layers.
We obtain lean convolution operators in three steps. First, lean operators are a sum of grouped and 1 × 1 convolutions, which, if implemented efficiently, allows one to reuse memory accesses, increase parallelism, and further reduce the number of weights. Second, the group size parameter g allows the user to balance between spatial filtering (by a grouped operator) and coupling (by 1 × 1 operators) and to control the performance of the network. Third, we use convolutional filters with only five or three elements instead of the nine elements of common 3 × 3 filters, which reduces the number of weights and FLOPs.
Continuing our example from above, we define the lean analogue of (4) as

K_{lean,g=2} = \begin{bmatrix} C^{(1,1)} & C^{(1,2)} & α_{1,3} I & α_{1,4} I \\ C^{(2,1)} & C^{(2,2)} & α_{2,3} I & α_{2,4} I \\ α_{3,1} I & α_{3,2} I & C^{(3,3)} & C^{(3,4)} \\ α_{4,1} I & α_{4,2} I & C^{(4,3)} & C^{(4,4)} \end{bmatrix},    (5)

where I is a scaled identity matrix and α_{i,j} ∈ R are weights. The identity operators represent the 1 × 1 convolution, and the convolution operators C^{(i,j)} enable spatial filtering. Our lean convolution operators are a linear combination of grouped and 1 × 1 convolutions. In this setting, the lean operator with g = 4 groups is

K_{lean,g=4} = \begin{bmatrix} C^{(1,1)} & α_{1,2} I & α_{1,3} I & α_{1,4} I \\ α_{2,1} I & C^{(2,2)} & α_{2,3} I & α_{2,4} I \\ α_{3,1} I & α_{3,2} I & C^{(3,3)} & α_{3,4} I \\ α_{4,1} I & α_{4,2} I & α_{4,3} I & C^{(4,4)} \end{bmatrix}.    (6)

The sparsity patterns of these operators are shown in the fourth and fifth subplots of Fig. 1.
The specific setting in (6) with g = 4 can be seen as the sum of depth-wise and 1 × 1 convolutions, which are also used in [21], [22], [25]. These works perform the depth-wise and 1 × 1 convolutions separately: the depth-wise convolution is applied between two 1 × 1 convolutions, with batch norm and ReLU operations between them. Since we sum both operators, we can apply them simultaneously, which allows us to optimize memory access, improve parallelism (more work can be done at once), and reduce the number of weights.
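As a functional (if unfused) PyTorch sketch, the lean operator can be written as the sum of a fully-coupled 1 × 1 convolution and a grouped spatial convolution; the module name is ours, and this version issues two separate kernels rather than the simultaneous application advocated above:

    import torch.nn as nn

    class LeanConv2d(nn.Module):
        # K_lean y = (1x1 convolution) y + (grouped 3x3 convolution) y, cf. (5)-(6).
        def __init__(self, c_in, c_out, groups):
            super().__init__()
            self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)
            self.spatial = nn.Conv2d(c_in, c_out, 3, padding=1,
                                     groups=groups, bias=False)

        def forward(self, y):
            return self.pointwise(y) + self.spatial(y)

In the fused implementation of Sec. III-D, the center weight of the spatial stencil is folded into the 1 × 1 term, so the sum does not store it twice.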
A. The argument for groups in compact networks
Our experiments suggest that if we take a given compact network that utilizes depth-wise and 1 × 1 operations and define its full version by placing a 3 × 3 convolution instead of each 1 × 1 convolution, we get a network that is significantly more expensive but performs better (e.g., in terms of classification or segmentation accuracy). Conversely, the compact scheme may have a spatial component that is too small, especially when the number of channels is large and the 1 × 1 operators dominate the spatial convolutions. This motivates us to add a small number (e.g., c_in · c_out / g) of spatial convolutions to improve the performance of the network compared to depth-wise operators.
The motivation for using such an operator is as follows. First, the implementation of the grouped convolution works best for groups of intermediate size, and it is often even more efficient to zero-pad the groups (artificially enlarging them) to get better computational performance on GPUs [27]. Second, in the standard combination of depth-wise and 1 × 1 convolutions, the former becomes negligible compared to the latter as the number of channels grows, hurting accuracy without providing considerable savings. Our proposal in this context is to use the groups mechanism to keep a constant ratio of operations between the two types of convolutions, such that the 1 × 1 convolution, which has c_in · c_out weights, remains more dominant than the grouped convolution, which has (r − 1) · c_in · c_out / g weights (the cost in operations is proportional to the number of weights), where r is the stencil size (e.g., r = 9 for a 3 × 3 stencil). For example, if we choose a ratio of 1/8, then we set g ≈ 8(r − 1) and make sure that the number of channels is divisible by g. We subtract 1 from r since the middle weight is included in the 1 × 1 convolution. We note that enhancing the lean convolution with the grouping mechanism can also be applied to enhance the depth-wise convolution in networks such as MobileNetV2. This would result in a network that is similar to the successful ResNeXt networks [33], which apply a grouped convolution instead of the full 3 × 3 convolution in the bottleneck version of the original ResNet. The grouping mechanism helps enlarge the bottleneck expansion while keeping the additional cost low and without adding many weights. Although the two works were proposed independently, maximizing the number of groups in ResNeXt leads to a network that is very similar to MobileNetV2.
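The ratio rule can be turned into a concrete group count with a small helper (our sketch; the function name is hypothetical):

    def pick_groups(c_in, ratio=1/8, r=9):
        # Choose g so that grouped-conv weights are about ratio times the
        # 1x1 weights: (r - 1) / g = ratio, i.e. g = (r - 1) / ratio.
        g = round((r - 1) / ratio)      # r = 9, ratio = 1/8  ->  g = 64 = 8(r - 1)
        while c_in % g != 0:            # shrink until g divides the channel count
            g -= 1
        return g

    print(pick_groups(256))             # 64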
B. LeanConv 5-pt: lean convolutions with 5-point stencils.
The first version of LeanConvNets is based on 5-point convolution stencils. The idea is to replace the stencils of C^{(i,j)} in (5) and (6) by the 5-point stencil

\begin{bmatrix} 0 & c_{i,1} & 0 \\ c_{i,2} & α_{i,i} & c_{i,3} \\ 0 & c_{i,4} & 0 \end{bmatrix},    (8)

where α_{i,i} is the (i, i)-th entry of the 1 × 1 convolution, and c_{i,1}, . . . , c_{i,4} are four additional weights per input channel i. An example of the sparsity pattern of the resulting lean operator with 5-point convolutions is shown in Fig. 1. The operator K_{lean,g} with 5-point stencils and g groups has (1 + 4/g) · c_in · c_out weights. We note that if the number of channels and g are large, then the 1 × 1 convolution is the dominating operator both in terms of weights and FLOPs.
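One simple way to prototype the 5-point restriction in PyTorch is to mask a grouped 3 × 3 kernel (our sketch; the center entry is zeroed because it lives in the accompanying 1 × 1 convolution):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FivePointConv2d(nn.Module):
        # Grouped 3x3 convolution restricted to the 5-point stencil (8).
        def __init__(self, c_in, c_out, groups):
            super().__init__()
            self.groups = groups
            self.weight = nn.Parameter(
                0.1 * torch.randn(c_out, c_in // groups, 3, 3))
            mask = torch.tensor([[0., 1., 0.],
                                 [1., 0., 1.],   # center weight is in the 1x1 part
                                 [0., 1., 0.]])
            self.register_buffer('mask', mask)

        def forward(self, y):
            return F.conv2d(y, self.weight * self.mask,
                            padding=1, groups=self.groups)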
The lean convolution can replace fully-coupled convolution operators in many existing CNNs without any structural changes to the architecture. A straightforward way to implement a grouped lean convolution like in (5) is to use the package cudnn to perform the 1 × 1 and spatial convolutions separately. As we show later, our custom implementation, which simultaneously applies both operations, outperforms the cudnn approach for g = c in .
C. Interpretation of the 5-point stencil in ResNets
ResNets have recently been interpreted as discretizations of time-dependent nonlinear ordinary differential equations (ODEs); see, e.g., [34]-[39]. This allows the community to analyze and extend ResNets using theoretical and practical ideas from the world of ODEs and, in the case of CNNs, also PDEs [40]. In this point of view, the elements of the five-point stencil are able to express a mass term and discretizations of the first and second spatial derivatives in each of the two spatial dimensions. That is, the first and second partial derivatives in the x dimension can be approximated by the stencils

\frac{\partial}{\partial x} \approx \frac{1}{2 h_x} \begin{bmatrix} -1 & 0 & 1 \end{bmatrix}, \qquad \frac{\partial^2}{\partial x^2} \approx \frac{1}{h_x^2} \begin{bmatrix} 1 & -2 & 1 \end{bmatrix},    (9)

where h_x is the edge length of a pixel. These, together with ∂/∂y, ∂²/∂y², and the mass term (or 5-point low-pass filter), are included in the span of the 5-point stencil (8). The remaining four entries of a full 3 × 3 stencil correspond to mixed partial derivatives, which rarely occur in PDE models and are thus good candidates for reducing computations and weights.
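As a worked illustration (our decomposition, assuming a uniform mesh size h_x = h_y = h and coefficients β_0, . . . , β_4 for brevity), the five free entries of (8) correspond one-to-one to these five operators:

\begin{bmatrix} 0 & c_{i,1} & 0 \\ c_{i,2} & \alpha_{i,i} & c_{i,3} \\ 0 & c_{i,4} & 0 \end{bmatrix}
= \beta_0 \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}
+ \frac{\beta_1}{2h} \begin{bmatrix} 0 & 0 & 0 \\ -1 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}
+ \frac{\beta_2}{h^2} \begin{bmatrix} 0 & 0 & 0 \\ 1 & -2 & 1 \\ 0 & 0 & 0 \end{bmatrix}
+ \frac{\beta_3}{2h} \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 0 \\ 0 & -1 & 0 \end{bmatrix}
+ \frac{\beta_4}{h^2} \begin{bmatrix} 0 & 1 & 0 \\ 0 & -2 & 0 \\ 0 & 1 & 0 \end{bmatrix}.

Since this linear map between (β_0, . . . , β_4) and the five stencil weights is invertible, learning the 5-point stencil is equivalent to learning a combination of the mass term and the four non-mixed derivative terms.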
D. Implementation of grouped LeanConvNets
The standard full 3 × 3 convolution is implemented using a shift per stencil parameter (known as the shiftIm2col operation) and a matrix-matrix multiplication using the function gemm. Multiplying by a 5-point convolution operator can be done using the same mechanism, trivially saving 4/9 of the operations. For small group sizes, and in particular for the depth-wise setting where each group contains a single channel, we found that a direct implementation of the convolutions is faster than the standard implementation with shiftIm2col. As the groups get larger, the shiftIm2col approach is preferable, due to the efficiency of the highly optimized gemm. Even more savings can be realized in 3D CNNs, where the standard 27-point convolutions are replaced with a 7-point stencil only (the 3D version of (8)), saving 20/27 of the operations and weights.
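To illustrate the shift-per-stencil-entry idea at the tensor level, here is a sketch (ours) of the depth-wise spatial part of a 5-point operator as four shifted, weighted copies; note that torch.roll wraps around, so a faithful zero-padded version would additionally zero the wrapped border row and column:

    import torch

    def five_point_shift_conv(y, w):
        # y: (batch, c, H, W); w: (c, 4) off-center weights per channel.
        up    = torch.roll(y, shifts=-1, dims=2)
        down  = torch.roll(y, shifts=+1, dims=2)
        left  = torch.roll(y, shifts=-1, dims=3)
        right = torch.roll(y, shifts=+1, dims=3)
        w = w.view(1, -1, 4, 1, 1)
        return (w[:, :, 0] * up + w[:, :, 1] * down
                + w[:, :, 2] * left + w[:, :, 3] * right)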
E. LeanConv 3-pt: lean convolutions with 1D 3-point stencils
In this section, we present a more sophisticated lean convolution operator that can be applied almost at the same cost as a 1 × 1 convolution as both operators use the same memory accesses. This convolution is based on 1D convolution operators, either 1 × 3 or 3 × 1, which can be applied efficiently if the memory is continuous in the direction of the 1D kernel.
In addition to the benefits from an implementation perspective, the use of 1D kernels can also be motivated as follows. It is known that in 2D, a large portion of the 3 × 3 kernels can be parameterized by a product of 1 × 3 and 3 × 1 kernels, also called separable kernels. Separable kernels include many of the important operators, such as low-pass filters and the spatial derivatives in (9). Our idea is to use pairs of consecutive convolutions, such as K_1 and K_2 in (2), to effectively apply separable operators: K_1 applies a 1 × 3 kernel in the horizontal direction, and K_2 applies a 3 × 1 kernel in the vertical direction. We note that 1D stencils were also used in a small part of the InceptionV4 network [41]; there, 1 × 7 and 7 × 1 kernels were used together with 3 × 3 convolutions to increase the field of view of the network. Here, we show that using only 1 × 3 and 3 × 1 convolutions to reduce the number of weights and FLOPs can result in a very effective network. Another nice feature of this approach is that memory access times can be saved if the memory of the feature maps is aligned with the direction of the kernel. The key to maintaining the memory alignment is to apply the convolutions together with a transposition of the feature maps. That is, if the 1D convolution operator K_1 is aligned with the memory, then the feature maps are transposed during the operation to prepare the output for K_2, which is aligned in the other direction, and vice versa.
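A minimal PyTorch sketch of such a pair (module and parameter names are ours; the 1 × 1 terms and normalization layers of the full lean operator are omitted for clarity):

    import torch.nn as nn

    class SeparablePair(nn.Module):
        # K1 filters horizontally (1x3), K2 vertically (3x1), so the pair
        # can realize separable 3x3 operators.
        def __init__(self, channels, groups):
            super().__init__()
            self.k1 = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1),
                                groups=groups, bias=False)
            self.k2 = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0),
                                groups=groups, bias=False)
            self.relu = nn.ReLU()

        def forward(self, y):
            return self.k2(self.relu(self.k1(y)))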
1) GPU Implementation: To explain and motivate the 3-pt lean convolution, we first briefly describe one of the approaches for multiplying matrices on a GPU, which is essentially the 1 × 1 convolution operator. We follow the description of the cutlass library [42] and the implementation of the MAGMA open source project [43]. Given two matrices K ∈ R^{c_out × c_in} and Y ∈ R^{c_in × n}, we first divide their product KY ∈ R^{c_out × n} into tiles of size t_n × t_o. Each of these tiles is computed by a multiplication of a block of t_o columns of K and t_n columns of Y. These two sub-matrices are also divided into sub-blocks of size t_i. Each group of available physical cores gets the task of computing a tile of t_n × t_o output values. To apply this, we first fetch the relevant tiles into shared memory and then multiply them in parallel. Algorithm 1 summarizes the procedure; see [42], [43] for more details.

Algorithm 1 Tiled matrix multiplication X = KY on a GPU.
1: procedure MATMUL(K, Y)
2:    # t_n, t_o, t_i: tile sizes. (i, j): thread id.
3:    for k = 1, . . . , c_in / t_i do
4:       # Each thread fetches two memory blocks:
5:       Fetch tile i, k from K to shared A ∈ R^{t_n × t_i}.
6:       Fetch tile k, j from Y to shared B ∈ R^{t_i × t_o}.
7:       # Multiply AB by t_i outer products:
8:       Multiply AB in parallel and add into X.
9:    end for
10: end procedure

Now we explain how to apply the 3-point lean operator, assuming that the number of convolutions is relatively small compared to c_in × c_out. An important consideration for GPU implementations is that fetching memory from global memory into shared memory is slow, while accessing the shared memory is fast. Our idea is to add a small memory fetch (per thread) to the procedure above and to apply the 1D convolution to the already-fetched tile of Y, assuming that the memory of Y is continuous in the same direction as the 1D kernel. To this end, and to have better load balancing in the spatial convolutions, we consider the permuted version of the operator (5),

P^T K_{lean,g=2} P = \begin{bmatrix} C^{(1,1)} & α_{1,3} I & C^{(1,2)} & α_{1,4} I \\ α_{3,1} I & C^{(3,3)} & α_{3,2} I & C^{(3,4)} \\ C^{(2,1)} & α_{2,3} I & C^{(2,2)} & α_{2,4} I \\ α_{4,1} I & C^{(4,3)} & α_{4,2} I & C^{(4,4)} \end{bmatrix},    (10)
where P denotes a permutation matrix that exchanges the second and third columns and rows. After the permutation, the operator divides into four blocks, where each consists of a single convolution on each channel. To apply the operator in (10) on a t_n × t_o tile of Y, the threads have to fetch the neighboring values of the tile (2t_o variables) and the convolution kernels, which amount to 2t_o values per convolution in a row of (10), to have the t_o 3 × 1 stencils. Once the memory is fetched, we apply the convolution, which adds only 2 FLOPs per convolution in a row of (10) for each entry of KY. If the number of convolutions is small (equivalent to a large number of groups g), then fetching the extra variables and applying the convolution have negligible cost.
To finish the operation, we now wish the direction of the next kernel to be aligned with the direction of the data. Obviously, this will be true only for one direction, and we handle that by transposing the feature maps during the write phase at the end of the convolution. Thus, when multiplying by K_1 in (2), the maps are aligned in one direction, but during the multiplication we transpose the data in shared memory and write it already transposed. After the ReLU operation (for which the direction does not matter), the input to K_2 is ready to be multiplied and is aligned in the other direction. At the end of the multiplication by K_2, the result is again transposed during the write phase and is brought back to the original alignment. Algorithm 2 summarizes the 3-pt lean convolution procedure. Compared to the 5-point version of the lean convolution, the 3-pt convolution requires that at least two kernels are applied one after the other before a skip connection, so that the maps can be transposed back to their original form. We note that this algorithm is only beneficial with a large number of groups, and in particular with the depth-wise configuration. If the number of groups is small and we have a large portion of convolutions, it is better to use the shiftIm2col approach.

Algorithm 2 The 3-pt lean convolution X = K_lean Y on a GPU.
1: procedure LEANCONV3PT(K, Y)
2:    # t_n, t_o, t_i: tile sizes. (i, j): thread id.
3:    Fetch boundary values from Y and spatial
4:       convolution parameters into shared memory C.
5:    for k = 1, . . . , c_in / t_i do
6:       # Each thread in the tile fetches two values:
7:       Fetch tile i, k from K to shared A ∈ R^{t_n × t_i}.
8:       Fetch tile k, j from Y to shared B ∈ R^{t_i × t_o}.
9:       # Multiply AB by t_i outer products:
10:      Multiply AB in parallel and transpose maps.
11:      If relevant to the output tile:
12:         Apply convolution to B and transpose maps.
13:      Write transposed results to X.
14:   end for
15: end procedure
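The transpose trick can be checked at the tensor level; the following sketch (ours, single channel, zero padding) verifies that filtering along the fast dimension, transposing, filtering again, and transposing back equals a horizontal 1 × 3 pass followed by a vertical 3 × 1 pass:

    import torch
    import torch.nn.functional as F

    y  = torch.randn(1, 1, 8, 8)
    k1 = torch.randn(1, 1, 1, 3)    # horizontal 1x3 kernel
    k2 = torch.randn(1, 1, 1, 3)    # applied to the transposed maps

    z = F.conv2d(y, k1, padding=(0, 1)).transpose(2, 3)    # filter, transpose
    z = F.conv2d(z, k2, padding=(0, 1)).transpose(2, 3)    # filter, transpose back

    ref = F.conv2d(F.conv2d(y, k1, padding=(0, 1)),        # 1x3 then 3x1,
                   k2.transpose(2, 3), padding=(1, 0))     # without transposes
    print(torch.allclose(z, ref, atol=1e-6))               # True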
IV. EXPERIMENTS
We demonstrate the proposed LeanConvNet approach and compare a lean version of ResNet, called "LeanResNet", to a fully-coupled ResNet and to other recent state-of-the-art compact architectures: ShuffleNetV2 [25], MobileNetV2 [22], and ShiftResNet [26]. We consider image classification and semantic segmentation tasks using five data sets in total. Our primary focus is to compare how the different architectures perform using a relatively small number of parameters and FLOPs (we count floating point multiplications). Our experiments are performed with the PyTorch software [44]. In a third experiment, we also show that our lean operators can be implemented efficiently and, for g = c_in, outperform the separate application of depth-wise and 1 × 1 operators using the highly optimized package cudnn.
As our focus is on the performance of the lean convolution operators, we use established ResNet architectures as the baseline for comparison and use the same structure for our ResNets, only with lean convolutions. Our ResNet networks consist of several blocks preceded by an opening convolutional 3 × 3 layer, which initially increases the number of channels. Each block consists of several ResNet steps as in Eq. (1), their number depending on the experiment. Each convolution is followed by a ReLU activation and batch normalization as described in (2). To increase the number of channels and to down-sample the image, we concatenate the feature maps with a depth-wise convolution applied to the same channels, thus doubling the number of channels; this is followed by an average pooling layer. The last block consists of a pooling layer that averages each channel's feature map to a single pixel, and we use a fully-connected linear classifier with softmax and cross-entropy loss.
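One possible reading of the channel-doubling and down-sampling step above, as a PyTorch sketch (the module name is ours, and the placement of normalization and activation is omitted):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DoubleChannelsDown(nn.Module):
        # Concatenate the feature maps with a depth-wise convolution of the
        # same channels (c -> 2c), then average-pool to halve the resolution.
        def __init__(self, channels):
            super().__init__()
            self.dw = nn.Conv2d(channels, channels, 3, padding=1,
                                groups=channels, bias=False)

        def forward(self, y):
            y = torch.cat([y, self.dw(y)], dim=1)
            return F.avg_pool2d(y, 2)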
Although the architectures of LeanResNet and ResNet are very similar, the former employs more efficiently parameterized convolutions such as (6), and hence requires fewer parameters. The convolution sizes of MobileNetV2 and ShiftResNet were chosen such that the size of the expanded (by a factor of 6) 1 × 1 convolution in a layer is equivalent to the size of a square 1 × 1 convolution of LeanResNet. The architecture of ShuffleNetV2 is evaluated with the configurations (0.5x, 1.0x, 1.5x, 2.0x) introduced in [25].
A. Image Classification
We consider the CIFAR10, CIFAR100, STL10, and TinyImageNet200 datasets. The CIFAR-10/100 datasets [45] consist of 60k natural images of size 32 × 32 with labels assigning each image to one of ten categories (for CIFAR10) or 100 categories (for CIFAR100). The data are split into 50k training and 10k test images. The STL-10 dataset [46] contains larger natural images of size 96 × 96 from ten classes, with 5k labeled training images and 8k test images. The TinyImageNet200 dataset consists of 100k training images of size 64 × 64 from 200 categories. For each of the data sets we used a different configuration, according to the difficulty of that data set. Table I summarizes the network configurations that we use, which differ in the number of channels and in the number of repetitions of each layer.
As the optimization strategy for TinyImageNet200, we use SGD with momentum, a mini-batch size of 64, and 300 epochs. The learning rate starts at 0.05 and is reduced to 0.01, 0.005, and 0.001 after epochs 75, 150, and 225, respectively. The weight decay is 0.0001 and the momentum is 0.9. The strategy for the other data sets is similar, with slight changes in the number of epochs, the batch sizes, and the timing of the learning rate reductions. We use standard data augmentation, i.e., random resizing, cropping, and horizontal flipping.
To make a fair comparison between the different architectures, we seek to match the number of parameters and FLOPs for each method. The number of channels and the expansion in MobileNetV2 and ShiftResNet were defined so that the 1 × 1 operation in both methods has the same number of parameters as in LeanResNet, where c_in = c_out in the basic block. For an expansion factor of 6, we choose the width of MobileNetV2 and ShiftResNet to be approximately √6 times smaller than the width of LeanResNet, so that their numbers of parameters and FLOPs are comparable. We denote this by *.
Our classification results are given in Table II, where we chose three representative configurations of groups for the lean convolutions. The results show that our architecture is on par with, and in some cases better than, the other networks. There is no preferred architecture among all options, but ours has the advantage of simplicity and resemblance to a standard and reliable ResNet network, which, as expected, yields better accuracy than all the other networks at the expense of more parameters and cost. In Fig. 3 we show the training and validation convergence plots of the architectures for the TinyImageNet200 data set. The plots show that the convergence of LeanResNet is similar to that of ResNet. This is expected because the architectures are similar (in length, width, and number of non-linear activations). The other compact networks we compare with have different structures; hence their convergence is different, leading to higher training errors.
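For reference, the TinyImageNet200 learning rate schedule described above as a small helper (our sketch; the function name is hypothetical):

    def learning_rate(epoch):
        # 0.05 initially; 0.01 / 0.005 / 0.001 after epochs 75 / 150 / 225.
        for milestone, lr in ((225, 0.001), (150, 0.005), (75, 0.01)):
            if epoch >= milestone:
                return lr
        return 0.05

    # e.g., at the start of each epoch of momentum-SGD training:
    # for group in optimizer.param_groups:
    #     group['lr'] = learning_rate(epoch)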
The influence of groups and stencil size: In this set of experiments we demonstrate the classification accuracy of LeanResNet with different configurations of grouping and stencil sizes on the CIFAR10 and CIFAR100 data sets. We use small networks so that the differences in performance are more obvious. Table III presents the classification results. The configuration g = c_in/q keeps the group sizes equal throughout the layers and leads to more FLOPs but fewer weights than a constant number of groups g = q. Since g increases linearly as a function of the number of channels, we get relatively dense convolutions at the first layers of the network (large maps, a small number of channels) and sparser convolutions at the last layers (small maps, a large number of channels). In these examples, having more parameters at the beginning of the network increases the accuracy, at the expense of more FLOPs. This configuration is advantageous when a low number of parameters is more crucial than FLOPs. On the other hand, keeping the number of groups constant adds a fixed proportion of parameters and FLOPs to the 1 × 1 convolution and should be chosen in cases where FLOPs cost as much as the number of parameters. As a result, the optimal configuration for an application can be chosen based on the limitations of the target device: if there is a constraint on the number of FLOPs, then a constant number of groups can be beneficial, but if the emphasis is on a lower number of parameters, then the configuration g = c_in/q is more suitable. In addition, the table shows that adding a small number of parameters to the lean network yields higher accuracy, which gets closer to that of the considerably larger fully-coupled network.
B. Semantic Segmentation
We demonstrate the efficiency of our proposed network for the semantic segmentation task. Examining the efficiency of networks for this task is interesting because it is needed, e.g., for autonomous vehicles, which require real-time predictions and by design have less computational power. We use the general U-Net architecture [30], built on top of ResNet as a backbone. That is, we adopt an encoder-decoder scheme, where the encoder is a standard ResNet architecture and the decoder is based on upscaling operations and transposed convolutions within a ResNet block. Similarly to the classification task, the U-Net based on ResNet is used as a baseline. With these settings, we compare the baseline with similar networks incorporating various backbones as encoders: MobileNetV2, ShuffleNetV2, ShiftResNet, and ours. As part of the decoders, we perform convolutions to decrease the number of channels and then perform upsampling, such that the last layer yields an image of the same size as the labeled image. We use the fine-annotated CityScapes dataset [49], which contains 5000 finely-annotated images with 19 categories ranging from road, vehicles, and trees to humans. We use the standard train-validation split as in [49], i.e., 2975 images for training and 500 for validation. We resize the images from 1024 × 2048 to 512 × 1024 due to memory and computational limitations; the work [50] recently showed that the performance reduction is marginal when down-sampling the images by a factor of 2. Also, we use standard augmentations such as random horizontal flips and random rotations of up to 10 degrees to enlarge our training data.
In the training process, we use the ADAM optimizer [51]. The initial learning rate is 1e-4, and we employ an adaptive learning rate reduction: upon stagnation of the mIoU metric for more than 5 epochs, the learning rate is decreased by a factor of 10. We use the focal loss [52], as it penalizes wrong segmentations more than correct ones relative to the cross-entropy loss. In Table IV we summarize the configurations used for the segmentation experiments, where, again, we tried to configure the sizes of all the compact architectures to have a similar number of parameters and FLOPs. Table V shows the segmentation results. Similarly to the classification results, the lean networks yield performance that is comparable to the other compact architectures. In particular, the grouped lean versions again yield the best accuracy among the compact networks, with a slight increase in parameters and FLOPs. Table VI shows the segmentation accuracy per class, and Fig. 4 shows two example images from the data set and their segmentation results with the different methods. All of our experiments are done without pre-trained models to make a fair comparison to our model. In any case, the results with pre-trained models that we checked are only slightly better than those shown above and are still comparable to the results with our lean networks.
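For completeness, a common form of the focal loss [52] used above, as a sketch (γ = 2 and ignore_index = 255 are typical defaults, not necessarily the exact values used in our experiments; pixels with the ignore label contribute zero but are still counted in the mean):

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, target, gamma=2.0, ignore_index=255):
        # logits: (batch, n_classes, H, W); target: (batch, H, W) labels.
        log_p = F.log_softmax(logits, dim=1)
        ce = F.nll_loss(log_p, target, reduction='none',
                        ignore_index=ignore_index)
        p_t = torch.exp(-ce)                  # probability of the true class
        return ((1.0 - p_t) ** gamma * ce).mean()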
C. Computational Performance
We compare the latency of our CUDA implementation of the lean convolution with two other combinations of layers, each comprised of a 1 × 1 convolution followed by a depth-wise convolution. In one combination we use c_in = c_out, and in the other c_in ≈ 6 · c_out, but with the same number of weights; such layers are applied in [22]. We compare the runtime for a typical network: the first layer consists of 16 channels of 512 × 512 maps, and the maps are coarsened by a factor of 2 whenever the number of channels increases by a factor of 2 (i.e., for 512 channels the maps are of size 16 × 16). We use a batch size of 64 and measure the runtime on an NVIDIA GeForce 1080Ti GPU. The implementation of the other convolutions is based on PyTorch's 1 × 1 and grouped convolutions using CUDA 9.2. Figure 5 summarizes the results. The depth-wise convolutions dominate the layers with few channels, while all combinations converge to the cost of the 1 × 1 convolution as the number of channels increases (and the depth-wise layer becomes negligible). Our implementation of (6) is comprised of a standard 4-point convolution for each channel (the 5-point stencil (8) without its center, which is part of the 1 × 1 term) followed by a matrix multiplication using cublas for the 1 × 1 part, to exploit the highly optimized gemm kernel. Our implementation is faster because the shiftIm2col approach is not efficient for small group sizes (1 in this case). The clear advantage of the lean operator over the expanded combination is the smaller number of feature maps that undergo the spatial convolution. Although our implementation applies the 1 × 1 and depth-wise convolutions separately inside the same CUDA function, our experiments show that the simultaneous multiplication for the samples in the mini-batch yields a performance gain compared to a completely separate multiplication of the whole mini-batch.
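The timings above follow the standard CUDA-event pattern; a sketch of such a measurement loop (ours, not the exact benchmarking harness used for Fig. 5):

    import torch

    def latency_ms(module, x, warmup=10, iters=100):
        # Average GPU latency of module(x) in milliseconds.
        with torch.no_grad():
            for _ in range(warmup):
                module(x)
            torch.cuda.synchronize()
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            for _ in range(iters):
                module(x)
            end.record()
            torch.cuda.synchronize()
        return start.elapsed_time(end) / iters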
V. CONCLUSION
We present LeanConvNets, a family of efficient CNNs that reduce the number of weights and floating point operations with minimal loss of accuracy. LeanConvNets can be obtained from existing CNNs by replacing fully-coupled convolution operators with lean operators that are the sum of grouped and 1 × 1 convolutions. The group size serves as a hyperparameter that allows the user to trade off computational cost and accuracy. Additional savings can be realized by the proposed five-point and three-point stencils; those savings will be more pronounced for 3D and 4D imaging data.

Fig. 4: Visualization of the semantic segmentation results of different networks for two images from the Cityscapes dataset.
In our experiments, we apply various configurations of LeanConvNets to image classification and segmentation tasks. In our tests, the LeanConvNets perform slightly better than other reduced architectures and are almost as effective as their fully-coupled counterparts. We also demonstrate in a direct comparison that summing the depth-wise and 1 × 1 convolutions reduces the computation time.
Our future research aims to further optimize the implementation of the lean convolutions on GPUs, as well as to investigate the optimization of such implementations on other devices. In addition, it is worthwhile to investigate and characterize the hyperparameter choices of the lean convolution (groups, stencil size, multiplication algorithm), as those choices should be guided by the hardware [53]. We also plan to examine the efficiency of the lean operators in challenging 3D applications such as video analysis on limited devices [54], where the small stencil size is more beneficial.
