Deep Learning (DL) algorithms are the central focus of modern machine learning systems. As data volumes keep growing, it has become customary to train large neural networks with hundreds of millions of parameters, with enough capacity to memorize these volumes and obtain state-of-the-art accuracy. To get around the costly computations associated with large models and data, the community is increasingly investing in specialized hardware for model training. However, with the end of Moore's law, there is a limit to such scaling. Progress on the algorithmic front has so far failed to demonstrate a direct advantage over powerful hardware such as NVIDIA-V100 GPUs. This paper provides an exception. We propose SLIDE (Sub-LInear Deep learning Engine), which uniquely blends smart randomized algorithms, which drastically reduce the computation during both training and inference, with simple multi-core parallelism on a modest CPU. SLIDE is an auspicious illustration of the power of smart randomized algorithms over CPUs in outperforming the best available GPU with an optimized implementation. Our evaluations on large industry-scale datasets, with some large fully connected architectures, show that training with SLIDE on a 44-core CPU is more than 2.7 times faster (2 hours vs. 5.5 hours) than the same network trained using Tensorflow on Tesla V100 at any given accuracy level. We provide code and benchmark scripts for reproducibility.
shows the first possibility of an algorithmically efficient solution by employing Locality Sensitive Hash (LSH) tables to identify a sparse set of neurons efficiently during each update. The proposed algorithm has an added advantage of making the gradient update HOGWILD style [28] parallel. Such parallelism does not hurt convergence because extremely sparse and independent updates are unlikely to overlap and cause conflicts of considerable magnitude.
Despite these appealing properties, current implementations of [33] fail to demonstrate that the computational advantage can be translated into a faster implementation when directly compared with hardware acceleration of matrix multiplication. In particular, it is not clear whether we can design a system that effectively leverages the computational advantage and, at the same time, compensates for the hash table overheads using limited parallelism (only a few cores). In this paper, we provide the first such implementation for large fully connected neural networks.
Current State of Things: Recently NVIDIA released Tesla V100 which is the state-of-the-art advanced data center GPU built to accelerate DL. Powered by NVIDIA Volta, the latest GPU architecture, Tesla V100 offers the performance of up to 100 CPUs in a single GPU. Recent benchmarks have shown that deep learning with V100 GPUs achieves performance comparable with TPUs (Tensor Processing Units). TPUs are dedicated specialized hardware designed by Google [14] . This puts V100 as one of the topmost choices for training DL architectures [29] .
Our Contributions
Our main contributions are as follows:
• We show the first C++ OpenMP based system, SLIDE, which with modest multi-core parallelism on a standard CPU can outperform the massive parallelism of a powerful V100 GPU on a head-to-head time-vs-accuracy comparison. The most exciting part is that we do not require any specialized CPU-level parallel instructions (such as SIMD) to achieve this. This is possible because the parallelism in SLIDE is naturally asynchronous by design. SLIDE is a promising illustration of the power of smart algorithms in scaling up deep learning without specialized hardware support. We have released the code and benchmark scripts in the public domain for reproducing the numbers in this paper.
• We made several novel algorithmic and data-structural choices in designing the LSH based sparsification to minimize the computational overheads. In particular, our randomized algorithm in expectation leads to a very efficient adaptive dropouts mechanism during every gradient update. This mechanism minimizes the retrieval overhead to a few memory lookups only (truly O(1)). At the same time it does not affect the convergence of the DL algorithm. The implementation further takes advantage of the sparse gradient updates to achieve negligible update conflicts, which creates ideal settings for Asynchronous SGD (Stochastic Gradient Descent) [28] convergence. These contributions could be of independent interest in both the LSH and DL literature.
Figure caption (panels: Neuron Network, Hash Table 1, ..., Hash Table L): For an input, we obtain multiple hash codes and retrieve candidates from the respective buckets.
• We design and build a prototype of the proposed system SLIDE in C++. Building SLIDE involves coding up neural networks and the Adam optimizer [16] from scratch, replacing standard dense vector multiplication with sparse hash table based lookups. We further need additional design choices to minimize read-write and write-write conflicts in asynchronous parallelism, which can hurt convergence.
• We provide a rigorous evaluation of our system on two large benchmarks involving fully connected networks and show the benefit of SLIDE compared to the most optimized implementations over the best available hardware tailored for the baselines. Our results show that SLIDE on a modest CPU can be orders of magnitude faster, in wall clock time, than the best possible alternative with the best possible choice of hardware, at any accuracy. Furthermore, our evaluations clearly show the need for and importance of the design choices made.
Background
Our paper is based on several recent and old ideas in Locality Sensitive Hashing and adaptive dropouts in neural networks. We first briefly review important concepts.
Locality Sensitive Hashing
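A popular technique for approximate near-neighbor search uses the underlying theory of Locality Sensitive Hashing [12]. LSH is a family of functions with the property that similar input objects in the domain of these functions have a higher probability of colliding in the range space than non-similar ones. In formal terms, consider H to be a family of hash functions mapping $\mathbb{R}^D$ to some set S.
Definition 2.1 (LSH family). A family H is called $(S_0, cS_0, p_1, p_2)$-sensitive if, for any two points $x, y \in \mathbb{R}^D$ and h chosen uniformly from H, the following hold: if $\mathrm{Sim}(x, y) \ge S_0$ then $Pr_H(h(x) = h(y)) \ge p_1$; if $\mathrm{Sim}(x, y) \le cS_0$ then $Pr_H(h(x) = h(y)) \le p_2$.
Typically, for approximate nearest neighbor search, $p_1 > p_2$ and $c < 1$ are needed. An LSH allows us to construct data structures that give provably efficient query time algorithms for the approximate near-neighbor problem with the associated similarity measure.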
One sufficient condition for a hash family H to be an LSH family is that the collision probability $Pr_H(h(x) = h(y))$ is a monotonically increasing function of the similarity, i.e.,
$Pr_H(h(x) = h(y)) = f(\mathrm{Sim}(x, y))$, (1)
where f is a monotonically increasing function. In fact, most of the popular known LSH families, such as Simhash [8] and WTA hash [39, 4], satisfy this strong property. It can be noted that Equation 1 automatically guarantees the two required conditions in Definition 2.1 for any $S_0$ and $c < 1$.
It was shown in [12] that having an LSH family for a given similarity measure is sufficient for efficiently solving nearest-neighbor search in sub-linear time.
The Algorithm: The LSH algorithm uses two parameters, (K, L). We construct L independent hash tables from the collection C. Each hash table has a meta-hash function H that is formed by concatenating K random independent hash functions from F. Given a query, we collect one bucket from each hash table and return the union of L buckets. Intuitively, the meta-hash function makes the buckets sparse and reduces the number of false positives, because only valid nearest-neighbor items are likely to match all K hash values for a given query. The union of the L buckets decreases the number of false negatives by increasing the number of potential buckets that could hold valid nearest-neighbor items.
The candidate generation algorithm works in two phases [See [32] for details]:
1. Pre-processing Phase: We construct L hash tables from the data by storing all elements x ∈ C. We only store pointers to the vectors in the hash tables because storing whole data vectors is very memory inefficient.
2. Query Phase: Given a query Q, we search for its nearest neighbors. We report the union of all the buckets collected from the L hash tables. Note that we do not scan all the elements in C. Instead, we only probe L different buckets, one bucket for each hash table.
After generating the set of potential candidates, the nearest-neighbor is computed by comparing the distance between each item in the candidate set and the query.
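The two phases above can be sketched in a few lines of Python. This is only an illustrative sketch, not the SLIDE implementation: the class name `LSHIndex`, the helper `make_simhash`, and the choice of dense Gaussian signed random projections as the hash family are assumptions made for the example.

```python
import random

def make_simhash(dim, rng):
    """One signed-random-projection hash: the sign of a random Gaussian projection."""
    r = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return lambda x: 1 if sum(a * b for a, b in zip(r, x)) >= 0 else 0

class LSHIndex:
    """(K, L)-parameterized LSH: L tables, each keyed by K concatenated hash bits."""

    def __init__(self, dim, K, L, seed=0):
        rng = random.Random(seed)
        self.funcs = [[make_simhash(dim, rng) for _ in range(K)] for _ in range(L)]
        self.tables = [dict() for _ in range(L)]

    def _key(self, t, x):
        # Meta-hash of table t: concatenate its K hash values.
        return tuple(h(x) for h in self.funcs[t])

    def insert(self, item_id, x):
        # Pre-processing phase: store only the id (a "pointer"), not the vector.
        for t, table in enumerate(self.tables):
            table.setdefault(self._key(t, x), []).append(item_id)

    def query(self, q):
        # Query phase: probe one bucket per table and return the union.
        candidates = set()
        for t, table in enumerate(self.tables):
            candidates.update(table.get(self._key(t, q), []))
        return candidates
```

A query vector identical to a stored vector always lands in that vector's buckets, while a vector pointing in the opposite direction flips every sign bit and is therefore never retrieved under this hash family.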
LSH for Estimations and Sampling
Search with LSH is generally slow: Although LSH provides provably fast retrieval in sub-linear time, it is known to be very slow for accurate search because it requires a very large number of tables, i.e., a large L. Reducing the overheads of bucket aggregation and candidate filtering is also a problem on its own.
In contrast, the sampling view of LSH has come to light recently [33, 32, 5, 6, 20]. This idea replaces costly searching with efficient sampling. It turns out that merely probing a few hash buckets (as few as 1) is sufficient for adaptive sampling. Observe that an item returned as a candidate from a (K, L)-parameterized LSH algorithm is sampled with probability $1 - (1 - p^K)^L$, where p is the collision probability of the LSH function. The LSH family defines the precise form of p used to build the hash tables. It should be noted that this sampling probability is a monotonic function of the collision probability p for any values of K and L. In theory, even a single hash table works. The collision probability in turn is a monotonic function of similarity. Thus, with the LSH algorithm, the candidate set is an adaptively sampled set where the sampling probability changes with K and L.
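As a small numerical illustration of how K and L shape the sampling probability, the sketch below evaluates the standard (K, L) candidate probability $1 - (1 - p^K)^L$. The function name `sampling_probability` is hypothetical and introduced only for this example.

```python
def sampling_probability(p, K, L):
    """Probability that an item with per-hash collision probability p
    appears in at least one of the L probed buckets (K hashes per table)."""
    return 1.0 - (1.0 - p ** K) ** L

# Larger K sharpens the curve (fewer dissimilar items); larger L lifts it.
for p in (0.2, 0.5, 0.8):
    print(p, round(sampling_probability(p, K=4, L=10), 4))
```

For fixed K and L the value grows monotonically with p, which is what makes the retrieved candidate set an adaptive sample.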
This sampling view of LSH was the key ingredient of the algorithm proposed in [33], which shows the first possibility of adaptive dropouts in near-constant time, leading to an efficient backpropagation algorithm.
MIPS Sampling
Recent advances in maximum inner product search (MIPS) using asymmetric locality sensitive hashing have made it possible to sample large inner products.
Figure caption: Given an input, we first get the hash code H1 for the input, query the hash table for the first hidden layer, and obtain the active neurons. We get the activations for only this set of active neurons. We do the same for the subsequent layers and obtain a final sparse output. Please note that the representative picture shows only one hash table per layer, but we use multiple hash tables in practice.
For the sake of brevity, it is safe to assume that given a collection C of vectors and a query vector Q, using the (K, L)-parameterized LSH algorithm with MIPS hashing [30], we get a candidate set S. Every element $x_i \in C$ gets sampled into S with probability $p_i$, where $p_i$ is a monotonically increasing function of $Q \cdot x_i$. Thus, we can pay a one-time linear cost of preprocessing C into hash tables, and any further adaptive sampling for a query Q only requires a few hash lookups.
Motivating Algorithm
Our proposal SLIDE builds on a recent line of observations, which show that while training, for every training data point, it is sufficient to sample very few neurons and perform the feedforward and backpropagation operations only on the sampled neurons [1, 22]. As a consequence, we can bypass a substantial number of multiplications if the sampling process is efficient. A good example of the presence of sparsity is the popular activation function ReLU (Rectified Linear Unit) [26], which automatically sparsifies half of the neurons with zero activation. However, no current implementation takes advantage of this sparsity, as the utility of GPUs diminishes with sparsity [40].
However, the activation of every neuron depends on the training data. To the best of our knowledge, without computing the activations of all neurons in one layer, there is no way to sample active neurons with higher probability. Computing the activation followed by sampling in proportion to the activation value is more costly than the original backpropagation itself.
[33] first shows that the LSH algorithm naturally provides a unique form of adaptive sampling. Given any unseen input, it is possible to sample neurons in proportion to weights without computing the activations. We have described this theoretical advancement in LSH in the previous section. Overall, [33] presents the first possibility of a significantly cheaper algorithm for training and testing with any neural network. Preliminary experiments demonstrate that we could reduce the algebraic computations involved by around 20 times without any loss in accuracy on small networks, with a promise of even more savings for larger networks.
However, [33] only provides a proof of concept. This remarkable algorithm has several non-trivial overheads. Moreover, it is not clear if we can design a system that can outperform optimized Tensorflow-GPU implementations over powerful V100s which are several orders of magnitude faster in real practice than traditional CPUs.
In the next section, we introduce the design and implementation details of our system SLIDE (Sub-LInear Deep learning Engine). Note that SLIDE is implemented for CPUs, because it is not clear how to take advantage of extreme sparsity over GPUs. 
Introduction to the overall system
Before introducing SLIDE in detail, we define some important notation that we use in this section in Table 3.1. In addition, Figure 1.1 illustrates the complete workflow of SLIDE for a toy example of a fully connected neural network with two hidden layers.
Initialization: Figure 1.1 shows the modular structure of SLIDE. Every layer object contains a list of neurons and a set of LSH sampling hash tables. Each hash table contains the ids of the neurons that are hashed into its buckets. During network initialization, the weights of the network are initialized randomly. After weight initialization, K × L LSH hash functions are initialized along with L hash tables for each of the layers. For instance, the example network in Figure 1.1 maintains hash tables in two hidden layers as well as the output layer. We will get into the details of using various hash functions in Section 3.1.1. The LSH hash codes $h_l(w_l^a)$ of the weight vectors of the neurons in a given layer are computed according to the hash functions. The id a of the neuron is saved into the hash buckets mapped by the LSH function $h_l(w_l^a)$. This construction of LSH hash tables in each layer is a one-time operation which can easily be parallelized with multiple threads over different neurons in the layer independently.
Sparse Feed-Forward Pass with Hash Table Sampling: In the feed-forward phase, given a single training instance, we compute the network activation until the final layer, which gives us the output. In SLIDE, instead of calculating all the activations in each layer, the input to each layer $x_l$ is fed into the hash functions to compute $h_l(x_l)$. The hash codes are used to query the hash tables of that layer and retrieve the ids of the active neurons, and only the activations of these active neurons are computed. The activations of all other neurons, such as $N_1^3$, are directly treated as 0 and never computed. We describe in Section 3.1.2 the design choices that reduce the sampling overheads significantly.
The above-described operations are performed sequentially in every layer, starting from the very first layer where the input is the data itself. Even in the output layer, which has softmax activation, only neurons sampled from the hash tables are treated as active neurons. For softmax, the output of every active neuron is computed by exponentiating its pre-activation and dividing by a normalizing constant. Note that the normalizing constant for softmax is no longer the sum over all neurons but only over the active ones.
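A minimal sketch of a softmax restricted to the active set is given below. The function name `sparse_softmax` and the dictionary-based representation of pre-activations are assumptions for illustration; subtracting the maximum is a standard numerical-stability step and is not taken from the paper.

```python
import math

def sparse_softmax(logits, active_ids):
    """Softmax normalized over the active neurons only.

    logits: dict mapping neuron id -> pre-activation.
    active_ids: ids of the neurons retrieved from the hash tables.
    """
    m = max(logits[i] for i in active_ids)      # subtract max for stability
    exps = {i: math.exp(logits[i] - m) for i in active_ids}
    z = sum(exps.values())                      # sum over active neurons only
    return {i: e / z for i, e in exps.items()}

# Neuron 9 is inactive, so it gets no probability mass and is never touched.
print(sparse_softmax({3: 0.0, 7: 0.0, 9: 5.0}, [3, 7]))
```

Inactive neurons do not appear in the result at all, which corresponds to treating their outputs as zero.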
Sparse Backpropagation or Gradient Update:
The backpropagation step follows the feed-forward step. After computing the output of the network, we compare it with the known label of the input and backpropagate the errors layer by layer to calculate the gradient and update the weights. Here we use the classical message-passing implementation of backpropagation rather than one based on vector multiplication.
For every training data instance, after updating the weights of any given neuron, the neuron propagates the partial gradients (using error propagation) back to only active neurons in previous layers via the connected weights. As a result, we never access any non-active neuron or any non-active weight which is not part of the feed-forward process on a given input. The process ensures that we take full advantage of sparsity. Our computation over each input is only of the order of active neurons rather than the total number of neurons.
Update Hash Tables after Weight Updates: Also, after the weights are updated, we need to modify the positions of neurons in the hash tables accordingly. Updating neurons typically involves deletion from old bucket followed by an addition to the new bucket which can be significantly expensive. We introduce several design tricks that we use to overcome this overhead of updating hash tables in Section 3.1.3.
OpenMP Parallelization across Training Instances in a Batch: For any given training instance, both the feedforward and backpropagation operation are sequential as they need to be performed layer by layer. The clear advantage of SLIDE is that the total arithmetic operation due to extreme sparsity of neurons is notably less than the matrix multiplication operation. All operations are performed in a sparse fashion, where weights, layers, and neurons are accessed by their ids. Values of zeros are never involved in any memory accesses or computations. SLIDE uses usual Batch Gradient Descent with ADAM optimizer, where the batch size is generally in the order of hundreds. Each data instance in the batch runs in a separate thread and its gradients are computed in parallel. To ensure the independence of computation across different threads, every neuron stores two additional arrays, each of whose length is equal to the batchsize. These arrays keep track of the input specific neuron activations and error gradients. Every input is assigned an id, which can be used as an index to locate its activation (or error gradient) on any neuron. Besides, we also have a bit array at each neuron to determine whether the particular input activates a neuron or not. This small memory overhead is negligible for CPUs as they have abundant memory. But it ensures that the gradient computation is completely independent across different instances in the batch.
The extreme sparsity and randomness in gradient updates allow us to asynchronously parallelize the accumulation step of the gradient across different training data without leading to a considerable amount of overlapping updates. The theory of HOGWILD [28] shows that a small amount of overlap is tolerable. It does not hurt the convergence even if we resolve the concurrent updates randomly. SLIDE heavily capitalizes on this theory. Thus, after independently computing the gradients, each thread pushes the updates directly to the weights asynchronously. This asynchronous update avoids costly synchronization during batch accumulation which is otherwise sequential over different data in the batch.
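The lock-free accumulation described above can be illustrated with the following Python sketch. It is only a structural illustration under stated assumptions: SLIDE itself uses OpenMP threads in C++, whereas CPython threads do not run bytecode in parallel, and the names `train_batch` and `grad_fn` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def train_batch(weights, batch, grad_fn, lr=0.1, workers=4):
    """HOGWILD-style batch step: one task per training instance, no locks.

    weights: shared list of floats, updated in place.
    grad_fn(weights, example) -> dict {weight_index: gradient}, expected sparse.
    """
    def work(example):
        sparse_grad = grad_fn(weights, example)   # touches only active weights
        for idx, g in sparse_grad.items():
            weights[idx] -= lr * g                # unsynchronized write
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, batch))               # wait for all instances
    return weights

# Toy gradient: each example touches a single, distinct weight.
w = train_batch([0.0] * 8, [0, 2, 5], lambda w, ex: {ex: 1.0})
print(w)
```

When the sparse index sets of different instances rarely overlap, as in this toy example where they are disjoint, the unsynchronized writes give the same result as a sequential accumulation.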
In section 4.3, we observe that due to this asynchronous choice, we obtain near-perfect scaling of our implementation with an increasing number of cores. Such perfect scaling is particularly exciting because even highly optimized implementation of Tensorflow on CPUs shows poor scaling behavior with increasing cores beyond 16.
Details of Hash Functions and Hash Tables in Each Layer
SLIDE provides a natural trade-off between the efficiency of retrieving active neurons and the quality of the retrieved ones. To facilitate this, we have three tunable parameters K, L, and B. As mentioned in Section 2, L is the number of hash tables. To determine which bucket to choose, we use K hash codes for each hash table. Hence, SLIDE generates K × L randomized hash functions, all belonging to one hash family, for each layer. In every bucket of a hash table, the number of entries is limited to the bucket size B. This limit helps with memory usage and also balances the load on threads during parallel aggregation of neurons.
In our implementation of SLIDE, we support four types of hash functions from the LSH family: 1) Simhash, 2) WTA hash, 3) DWTA hash, and 4) Minhash. Each of these hash families preserves a different similarity and hence is useful in different scenarios. We discuss the implementation details of these hash families in the subsequent paragraphs. In addition, SLIDE also provides an interface to add customized hash functions based on need.
Signed Random Projection (Simhash) : Refer to [8] for an explanation of the theory behind Simhash. We use K × L pre-generated random vectors with components taking only three values {+1, 0, −1}. We restrict the nonzero components to +1 and −1 for a fast implementation: a projection then requires only additions rather than multiplications, thereby reducing the computation and speeding up the hashing process. To further optimize the cost of Simhash in practice, we can adopt the sparse random projection idea [19]. A simple implementation is to treat the random vectors as sparse vectors and store their nonzero indices in addition to the signs. For instance, let the input vector for Simhash be in $\mathbb{R}^d$. Suppose we want to maintain 1/3 sparsity; we may uniformly generate K × L sets of d/3 indices from [0, d − 1]. In this way, the number of operations for one inner product during the generation of the hash codes reduces from d to d/3. Since the random indices are generated only once, their cost can be safely ignored.
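The sparse sign-vector idea can be sketched as follows. This is an illustrative Python sketch rather than the SLIDE code; the function names, the `(indices, signs)` storage layout, and the packing of K sign bits into one integer key are assumptions made for the example.

```python
import random

def make_sparse_simhash(d, K, L, nnz=None, seed=0):
    """Generate K*L sparse sign vectors as (indices, signs) pairs.

    nnz is the number of nonzero entries per vector (default d // 3).
    """
    rng = random.Random(seed)
    nnz = nnz if nnz is not None else max(1, d // 3)
    funcs = []
    for _ in range(K * L):
        idx = rng.sample(range(d), nnz)                  # nonzero positions
        sgn = [rng.choice((1, -1)) for _ in range(nnz)]  # +1 / -1 entries
        funcs.append((idx, sgn))
    return funcs

def simhash_codes(x, funcs, K, L):
    """Return L integer bucket keys, each packing K sign bits."""
    codes = []
    for t in range(L):
        key = 0
        for k in range(K):
            idx, sgn = funcs[t * K + k]
            s = 0.0
            for i, sg in zip(idx, sgn):
                s += x[i] if sg > 0 else -x[i]   # additions/subtractions only
            key = (key << 1) | (1 if s >= 0 else 0)
        codes.append(key)
    return codes
```

Each hash bit costs d/3 additions instead of d multiply-adds, and the codes depend only on the direction of the input, not on its scale.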
Winner Takes All Hashing (WTA hash) : In SLIDE, we slightly modify the WTA hash algorithm from [39]

Densified Winner Takes All Hashing (DWTA hash) : As argued in [4], when the input vector is very sparse, WTA hashing no longer produces representative hash codes. Therefore, we use DWTA hashing, the solution proposed in [4]. Similar to WTA hash, we generate

It should be noted that the number of comparisons and memory look-ups in this step is $O(\mathrm{NNZ} \cdot \frac{KLm}{d})$, which is significantly more efficient than simply applying WTA hash to sparse input. For empty bins, the densification scheme proposed in [4] is applied.
Densified One Permutation Minwise Hashing (DOPH) :
The implementation mostly follows the description of DOPH in [31]. DOPH is mainly designed for binary inputs. However, the weights of the inputs for each layer are unlikely to be binary. We use a thresholding heuristic for transforming the input vector to a binary representation before applying DOPH. The k highest values among all d dimensions of the input vector are converted to 1s and the rest of them become 0s. Define $idx_k$ as the indices of the top k values of the input vector x. Formally, $\mathrm{threshold}(x_i) = 1$ if $i \in idx_k$, and $\mathrm{threshold}(x_i) = 0$ otherwise.
We could use sorting algorithms to get the top k indices, but this induces at least $O(d \log d)$ overhead. Therefore, we keep a priority queue with indices as keys and the corresponding data values as values. This requires $O(d \log k)$ operations.
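The top-k binarization step can be sketched with a bounded heap. The function name `topk_binarize` is hypothetical; the sketch relies on Python's `heapq.nlargest`, which avoids a full sort of all d entries when k is small.

```python
import heapq

def topk_binarize(x, k):
    """Set the k largest entries of x to 1 and all other entries to 0."""
    # nlargest selects the k indices with the largest values without
    # sorting the whole vector.
    top = heapq.nlargest(k, range(len(x)), key=lambda i: x[i])
    out = [0] * len(x)
    for i in top:
        out[i] = 1
    return out

print(topk_binarize([0.1, 0.9, -0.3, 0.5], 2))  # prints [0, 1, 0, 1]
```

The resulting binary vector is what a minwise hashing scheme such as DOPH would then consume.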
Reducing the Sampling Overhead
The key idea of using LSH for adaptive sampling of neurons with large activation is sketched in Section 3.1. We have designed three strategies to sample large inner products: 1) Vanilla Sampling, 2) TopK Sampling, and 3) Hard Thresholding. We first introduce them one after the other and then discuss their utility and efficiency. Further experiments are reported in Section 4.
Vanilla Sampling: Denote $\beta_l$ as the number of active neurons we target to retrieve in layer l. After computing the hash codes of the input, we randomly choose a table and retrieve only the neurons in that table. We continue retrieving neurons from another random table until $\beta_l$ neurons are selected or all the tables have been looked up. Let us assume we retrieve from τ tables in total. Formally, the probability that a neuron $N_l^j$ gets chosen is a function of τ and p, where p is the collision probability of the LSH function that SLIDE uses. For instance, if Simhash is used, $p = 1 - \frac{1}{\pi} \cos^{-1}\left( \frac{(w_l^j)^\top x_l}{\|w_l^j\|_2 \, \|x_l\|_2} \right)$.
From the previous process, we can see that the time complexity of vanilla sampling is $O(\beta_l)$.
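The vanilla sampling loop can be sketched as below. This is an illustrative sketch: the function name `vanilla_sample` and the representation of the retrieved buckets as a list of L lists of neuron ids are assumptions, and the sketch takes each visited bucket whole, so it may return slightly more than the target number of neurons.

```python
import random

def vanilla_sample(buckets, beta, rng=random):
    """buckets: list of L lists, the bucket matched by the input in each table.

    Visit tables in random order and stop once beta neurons are collected
    or every table has been looked up.
    """
    order = list(range(len(buckets)))
    rng.shuffle(order)
    active = set()
    for t in order:
        active.update(buckets[t])
        if len(active) >= beta:
            break
    return active

buckets = [[1, 2], [2, 3], [4]]
print(vanilla_sample(buckets, beta=2, rng=random.Random(0)))
```

The work done is proportional to the number of retrieved ids, which matches the $O(\beta_l)$ cost stated above when buckets have bounded size.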
TopK Sampling: In this strategy, the basic idea is to obtain those neurons that occur more frequently among all L hash tables. After querying with the input, we first retrieve all the neurons from the corresponding bucket in each hash table.

Hard Thresholding: The TopK Sampling could be expensive due to the sorting step. To overcome this, we propose a simple variant that collects all neurons that occur more than a certain frequency. This bypasses the sorting step and also provides a guarantee on the quality of the sampled neurons. Suppose we only select neurons that appear at least m times in the retrieved buckets. Then the probability that a neuron $N_l^j$ gets chosen is $Pr(N_l^j) = \sum_{i=m}^{L} \binom{L}{i} (p^K)^i (1 - p^K)^{L-i}$. Figure 4 shows a sweep of curves that present the relation between the collision probability of $h_l(w_l^j)$ and $h_l(x_l)$ and the probability that neuron $N_l^j$ is selected under various values of m when L = 10. We can visualize the trade-off between collecting more good neurons and omitting bad neurons by tweaking m. For a high threshold like m = 9, only the neurons with p > 0.8 have more than a 0.5 chance of retrieval. This ensures that bad neurons are eliminated, but the retrieved set might be insufficient. However, for a low threshold like m = 1, all good neurons are collected, but bad neurons with p < 0.2 are also collected with probability above 0.8. Therefore, depending on the tolerance for bad neurons, we choose an intermediate m in practice.
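The frequency-threshold variant and its selection probability can be sketched as follows. The function names are hypothetical, and the probability is computed as a binomial tail under the assumption that the L tables collide independently, each with probability $p^K$.

```python
from collections import Counter
from math import comb

def threshold_sample(buckets, m):
    """Keep neurons that appear in at least m of the retrieved buckets."""
    counts = Counter(n for bucket in buckets for n in set(bucket))
    return {n for n, c in counts.items() if c >= m}

def selection_probability(p, K, L, m):
    """Chance of colliding in at least m of L independent tables,
    where a single table collides with probability p**K."""
    q = p ** K
    return sum(comb(L, i) * q ** i * (1 - q) ** (L - i) for i in range(m, L + 1))

print(threshold_sample([[1, 2], [2, 3], [2, 1]], m=2))
print(round(selection_probability(0.2, K=1, L=10, m=1), 4))
print(round(selection_probability(0.8, K=1, L=10, m=9), 4))
```

Counting occurrences needs only one pass over the retrieved ids, so no sorting is involved, and raising m trades recall of good neurons for stronger rejection of bad ones.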
Reducing the Cost of Updating Hash Tables
We introduce several heuristics for addressing the expensive costs of updating the hash tables:
• As mentioned in Section 3.1, due to the gradient updates in backpropagation, the weights of neurons change over iterations. In theory, we should recompute the hash code representations of the neurons and update the hash tables accordingly every time the weights change. However, such updates are computationally expensive. Therefore, we dynamically change the update frequency of the hash tables to reduce the overhead. Assume $N_0$ is the initial update frequency and $t - 1$ is the number of times the hash tables have already been updated. We apply exponential decay on the update frequency, so that the iteration on which the $t$-th hash table update happens is determined by $N_0$, $t$, and a tunable decay constant $\lambda$. The intuition behind this scheme is that the gradient updates in the initial stage of training are larger than those in the later stage, especially close to convergence.
• Besides the overhead in time for hash table updates, hash tables with skewed buckets, caused by a variable number of neurons per bucket, create additional memory, computation, and parallelization overheads. To get around this, we fix the bucket size B for all hash tables. However, SLIDE then needs a policy for adding a new neuron to a bucket when it is full. To solve this problem, we use the same solution as in [38], which makes use of Vitter's reservoir sampling algorithm [36] as the replacement strategy. It was shown that reservoir sampling retains the adaptive sampling property of LSH tables, making the process sound. In addition, for further speed-up, we implement a simpler alternative policy that is based on FIFO (First In First Out).
• For Simhash, the hash codes are computed by h

Figure caption: The x-axis is plotted in log scale to accommodate the otherwise slow Tensorflow-CPU curve. We notice that the time required for convergence is 2.7x lower than that of Tensorflow-GPU. When compared against iterations, the convergence behavior is identical, which confirms that the superiority of SLIDE is due to algorithm and implementation and not due to any optimization bells and whistles.

On the Amazon-670K dataset, we notice that Sampled Softmax starts to grow faster than SLIDE in the beginning stages of training but saturates quickly to a lower accuracy. SLIDE starts to grow slowly but attains much higher accuracy than Sampled Softmax. SLIDE has the context for choosing the most informative neurons at each layer. Sampled Softmax always chooses a random subset of neurons in the final layer. This is reflected in the superior performance of SLIDE over Sampled Softmax.
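The first two heuristics in the list above can be sketched as follows. All names here are hypothetical. In particular, the schedule function ASSUMES that the gap before the i-th rebuild grows as `n0 * exp(lam * i)`, which is one plausible reading of "exponential decay on the update frequency" and is not confirmed by the text. The reservoir bucket follows the classical Algorithm R.

```python
import math
import random
from collections import deque

def rebuild_iterations(n0, lam, count):
    """Iterations of the first `count` hash-table rebuilds, ASSUMING the gap
    before rebuild i (i = 0, 1, ...) is n0 * exp(lam * i)."""
    its, total = [], 0.0
    for i in range(count):
        total += n0 * math.exp(lam * i)
        its.append(int(round(total)))
    return its

class ReservoirBucket:
    """Fixed-size bucket; overflow handled by reservoir sampling (Algorithm R)."""
    def __init__(self, size, rng=None):
        self.size, self.items, self.seen = size, [], 0
        self.rng = rng or random.Random(0)

    def add(self, neuron_id):
        self.seen += 1
        if len(self.items) < self.size:
            self.items.append(neuron_id)
        else:
            j = self.rng.randrange(self.seen)   # uniform in [0, seen)
            if j < self.size:
                self.items[j] = neuron_id       # replace a random slot

class FIFOBucket:
    """Fixed-size bucket; the oldest entry is evicted when the bucket is full."""
    def __init__(self, size):
        self.items = deque(maxlen=size)

    def add(self, neuron_id):
        self.items.append(neuron_id)

print(rebuild_iterations(50, 0.5, 4))
```

With a positive decay constant the rebuilds become progressively rarer, and both bucket policies keep memory per bucket bounded by the chosen size regardless of how many neurons hash to it.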
Experiments
Our goal is to answer the following questions empirically:
1. How do the performance and accuracy of SLIDE on a modest CPU with few cores compare with the popular Tensorflow implementation of backpropagation on state-of-the-art massively parallel hardware such as V100s? We want to observe the complete spectrum for a thorough comparison.
2. It is known that a large batch size is the biggest driver of efficiency on existing GPU implementations. Thus, it is imperative to know whether a change in batch size affects the conclusions.
3. How do the performance and accuracy of SLIDE on a modest CPU with few cores compare with the popular Tensorflow implementation of backpropagation on the same CPU?
4. How does SLIDE scale with an increasing number of cores on CPUs? Is the scaling comparable to, or even better than, the scaling of popular implementations on the same hardware?
5. Is there any advantage of LSH based adaptive sampling for sparsifying neurons, which is our main proposal? It is highly possible that, for the datasets at hand, plain random sampling can achieve the same accuracy at much lower cost.
6. What are the benefits and tradeoffs of the different design choices mentioned in Section 3.1?
Fully-Connected Large Architecture: Fully connected networks are common in practice and dominate most applications except vision, where the use of convolutional neural networks (CNNs) is more pronounced. Thus, our evaluation is limited to fully connected architectures. We also choose large networks where even a slight decrease in performance is noticeable. The extreme classification datasets [17], which are publicly available and require more than 100 million parameters to train due to their extremely wide last layer, fit this setting appropriately. For these tasks, most of the computation (more than 99%) is in the final layer.
Datasets:
We employ two large real datasets: Delicious-200K and Amazon-670K. Both datasets are obtained from the Extreme Classification Repository [17]. Descriptions of the datasets are listed below, and detailed statistics about the dimensions and sample sizes are included in Table 4:
• Delicious-200K dataset is a sub-sampled dataset generated from a vast corpus of almost 150 million bookmarks from Social Bookmarking Systems, del.icio.us. The corpus records all the bookmarks along with a description, provided by users (default as the title of the website), an extended description and tags they consider related.
• Amazon-670K dataset is a product recommendation dataset with 670K labels. Here, each input is a vector representation of a product, and the corresponding labels are other products (among 670K choices) that a user might be interested in purchasing. This is anonymized and aggregated behavior data from Amazon and poses a significant challenge owing to the large number of classes.
Infrastructure: All the experiments are conducted on a server equipped with 44-core processors (Intel Xeon E5-2699A v4 2.40GHz) and one NVIDIA Tesla V100 Volta 32GB GPU. The server has an Ubuntu 16.04.5 LTS system with the installation of Tensorflow-GPU 1.12 from Python's pip package manager. Since CPU is at a natural disadvantage against GPU, we compiled Tensorflow-CPU 1.12 from source with GCC5.4 in order to support FMA, AVX, AVX2, SSE4.1, and SSE4.2 instructions. This boosts the performance of Tensorflow-CPU by about 35%. SLIDE is written in C++ and compiled under GCC5.4 with OpenMP flag. SLIDE currently does not exploit any advantage of any kind of parallel instructions on CPUs. Thus, FMA, AVX, AVX2, SSE4.1, and SSE4.2 instructions do not affect the performance of SLIDE. The most exciting part is that SLIDE only uses vanilla CPU thread parallelism and yet outperforms Tensorflow-GPU (V100) by a large margin in performance.
Baselines:
We benchmark the tasks with our system SLIDE (CPU only), and compare the performance to the popular, highly optimized Tensorflow framework on both CPU and GPU. Specifically, the comparison is between the same tasks, with the exact same architecture, running on Tensorflow-CPU and Tensorflow-GPU. The optimizer and the learning hyperparameters (details later) were also the same to avoid unfair comparisons.
Most of the computations in our architecture are in the softmax layer. We therefore also compare against the popular sampled softmax algorithm [13], which is a fast proxy for full softmax. We use the optimized Sampled Softmax functionality provided in Tensorflow-GPU. In principle, both SLIDE and Sampled Softmax accelerate training in the same way, i.e., by selecting a few neurons and passing gradients only through those neurons. While Sampled Softmax performs a naive static sampling of neurons, SLIDE uses adaptive sampling, which is known to be superior in the deep learning literature [41]. The comparison of Sampled Softmax with SLIDE sheds light on the necessity of LSH-based, input-dependent adaptive sampling compared to a static sampling scheme, which is the only other sampling alternative in the literature.
Hyper Parameters: For both datasets, we adopt the same model architecture as in [41]. More specifically, we choose the standard fully connected neural network with one hidden layer of size 128. We choose a batch size of 128 for the Delicious-200K dataset and 256 for the Amazon-670K dataset. We chose a smaller batch size for Delicious-200K because its input dimension is much larger than that of Amazon-670K, as shown in Table 4. We run all algorithms until convergence. To quantify the superiority of SLIDE over the other baselines, we also use the same optimizer, Adam [16], varying the initial step size from 1e-5 to 1e-3, which leads to better convergence in all experiments. For SLIDE, we maintain hash tables for active neuron retrieval only in the last layer, which is the computational bottleneck of the models (owing to the large number of classes). For the specific LSH setting, we choose Simhash with K = 9, L = 50 for the Delicious-200K dataset and WTA hash with K = 8, L = 50 for the Amazon-670K dataset. We update the hash tables with an initial update period of 50 iterations and then with exponentially decaying frequency, as mentioned in Section 3.1.3.
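To make the LSH setting above concrete, the following is a minimal Python sketch of L hash tables keyed by K-bit Simhash codes (signed random projections). It is an illustration only: SLIDE itself is written in C++, the names `make_simhash` and `LSHTables` are hypothetical, dense Gaussian projections are an assumption made for brevity, and the demo sizes are arbitrary.

```python
import random
from collections import defaultdict


def make_simhash(dim, k_bits, rng):
    """Return a function mapping a vector to a k_bits-bit signed-random-projection code."""
    # Assumption: dense Gaussian hyperplanes; a real system may use sparse projections.
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k_bits)]

    def code(vec):
        bits = 0
        for plane in planes:
            dot = sum(p * v for p, v in zip(plane, vec))
            bits = (bits << 1) | (1 if dot >= 0.0 else 0)
        return bits

    return code


class LSHTables:
    """n_tables independent hash tables, each keyed by a k_bits-bit Simhash code."""

    def __init__(self, dim, k_bits, n_tables, seed=0):
        rng = random.Random(seed)
        self.hashers = [make_simhash(dim, k_bits, rng) for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def insert(self, neuron_id, weights):
        """Hash the neuron's weight vector into every table."""
        for hasher, table in zip(self.hashers, self.tables):
            table[hasher(weights)].append(neuron_id)

    def query(self, vec):
        """Union of the neuron ids that collide with vec in at least one table."""
        found = set()
        for hasher, table in zip(self.hashers, self.tables):
            found.update(table.get(hasher(vec), ()))
        return found


if __name__ == "__main__":
    rng = random.Random(1)
    dim, n_neurons = 32, 200  # small, arbitrary demo sizes
    weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_neurons)]
    index = LSHTables(dim, k_bits=9, n_tables=50)  # K = 9, L = 50 as for Delicious
    for neuron_id, w in enumerate(weights):
        index.insert(neuron_id, w)
    active = index.query(weights[0])
    print(len(active), "of", n_neurons, "neurons retrieved as candidates")
```

Larger K makes each bucket more selective, while larger L raises the chance that a neuron with high inner product collides with the input in at least one table.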
To characterize the complete spectrum, we plot the whole learning curve with both wall clock time and the number of iterations. We compare the full plot for all the baselines on both of the two datasets.
Results:
We show the time-wise and iteration-wise comparisons for SLIDE vs Tensorflow GPU/CPU in Figure 5. Note that the x-axis is in log-scale, and all the curves have a long flat converged portion when plotted on a linear scale, indicating clear convergence behavior. Red, blue and black lines represent the performance of SLIDE, Tensorflow-GPU, and Tensorflow-CPU respectively. We can see from the plots that SLIDE on CPU achieves any accuracy faster than Tensorflow on V100, demonstrating the superiority of SLIDE. Tensorflow-GPU is always faster than Tensorflow-CPU, which is expected. It should be noted that these datasets are very sparse, e.g., the Delicious dataset has only 75 non-zeros on average for input features, and hence the advantage of GPU over CPU is not always noticeable. But V100 is a powerful GPU and, despite high sparsity in the data features, can still outperform the CPU variant. SLIDE is around 1.8 times faster than Tensorflow-GPU on Delicious-200K. On the larger Amazon-670K dataset, where we need more computations, the gains are substantially larger: SLIDE is around 2.7 times faster than Tensorflow-GPU (2 hrs vs. 5.5 hrs).
Most of the computational benefits of SLIDE come from sampling a small subset of active neurons in the output layer. After a few iterations into the training process, the average number of neurons sampled in the output layer for Delicious-200K is ≈ 1000. Similarly, for Amazon-670K, we sample ≈ 3000 neurons. With fewer than 0.5% of neurons active, SLIDE outperforms Tensorflow-GPU on time by a huge margin on either dataset.
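The percentages and the speedup quoted above follow from simple arithmetic, sketched below. The Delicious class count (205,443) is the one reported later in this section; the Amazon class count is rounded to 670,000 from "670K", which is an approximation, and the helper name `active_fraction` is illustrative.

```python
def active_fraction(sampled_neurons, total_classes):
    """Fraction of output-layer neurons that are active per input."""
    return sampled_neurons / total_classes


delicious = active_fraction(1000, 205_443)
amazon = active_fraction(3000, 670_000)  # class count rounded from "670K"

print(f"Delicious-200K: {delicious:.2%}")  # prints "Delicious-200K: 0.49%"
print(f"Amazon-670K:    {amazon:.2%}")     # prints "Amazon-670K:    0.45%"

# Wall-clock speedup on Amazon-670K: 5.5 hours on V100 vs. 2 hours with SLIDE.
print(5.5 / 2)  # prints 2.75
```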
It is interesting to note that even after compiling Tensorflow-CPU with AVX2 instructions, it is nowhere close to the performance of SLIDE or Tensorflow-GPU. Therefore, it is exciting to note that without any rigorous optimization in our prototype, SLIDE outperforms both baselines using smart randomized algorithms with OpenMP parallelism.
For Iteration vs. Accuracy plots in Figure 5 , we can observe that SLIDE achieves the same accuracy per iteration even though it adaptively selects neurons in some layers. This observation also confirms that adaptively selecting neurons and performing asynchronous SGD does not hurt the convergence from an optimization perspective. The plot also confirms that the advantage of SLIDE is not due to any bells and whistles in the optimization process as the convergence with iteration has very similar behavior. For this plot, we only show Tensorflow-GPU as Tensorflow-CPU would also lead to the same plot as the optimization algorithm is the same.
Since SLIDE performs far fewer computations and memory accesses on the last layer, each iteration is faster than in the baselines. This is the critical reason why SLIDE outperforms the other baselines when compared on wall-clock time.
Comparisons over other Heuristics
With full softmax, training on Tensorflow needs to compute, for every training example, the logits (the output of the last layer before applying the softmax function) for all classes. This step is followed by computing the softmax (normalized exponential) of the logits. In extreme classification tasks (with a large number of classes), computing these logits gets expensive. Therefore, there has been a line of research on reducing this cost [24, 2, 10]. The most common methods are sampling-based (static sampling weights) methods, which shortlist a candidate set of classes for every batch of training data. By doing this, the number of computed logits is reduced significantly. Due to its popularity, Tensorflow supports an optimized implementation of sampled softmax [13].
We explore how sampled softmax on Tensorflow-GPU performs compared with SLIDE on the extreme classification tasks. As mentioned earlier, the LSH sampling process in SLIDE is, in principle, very similar to the process of sampled softmax, but with sampling probabilities changing dynamically with the inputs. We adopt the exact same settings as in the previous section for the experiments. Recall that the average number of sampled classes for SLIDE on both datasets is ≈ 0.5% of the total. For sampled softmax, we try various numbers of samples for the sampling process. However, with a comparable number of samples, sampled softmax leads to poor accuracy. We empirically observe that we have to sample 20% of the total number of classes to obtain any decent accuracy.
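The cost difference between full softmax and a statically sampled variant can be sketched as follows. This is a simplified illustration rather than Tensorflow's implementation: the function names are hypothetical, negatives are drawn uniformly (an assumption), and the logit correction that a production sampled softmax applies is omitted.

```python
import math
import random


def logits_for(hidden, weights, class_ids):
    """Dot product of the hidden vector with the weight row of each listed class."""
    return {c: sum(h * w for h, w in zip(hidden, weights[c])) for c in class_ids}


def softmax(logits):
    """Numerically stable softmax over a dict of {class_id: logit}."""
    peak = max(logits.values())
    exps = {c: math.exp(z - peak) for c, z in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def sampled_candidates(true_label, n_classes, n_samples, rng):
    """Static (input-independent) uniform sample of negatives, plus the true label."""
    negatives = rng.sample([c for c in range(n_classes) if c != true_label], n_samples)
    return [true_label] + negatives


if __name__ == "__main__":
    rng = random.Random(0)
    n_classes, dim = 1000, 8  # arbitrary demo sizes
    weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_classes)]
    hidden = [rng.gauss(0.0, 1.0) for _ in range(dim)]

    full = softmax(logits_for(hidden, weights, range(n_classes)))
    subset = sampled_candidates(true_label=3, n_classes=n_classes, n_samples=50, rng=rng)
    approx = softmax(logits_for(hidden, weights, subset))
    print(len(full), "logits for full softmax vs", len(approx), "for sampled softmax")
```

The candidate set here does not depend on `hidden`, which is what distinguishes static sampling from the input-dependent LSH sampling used by SLIDE.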
The results are shown in Figure 6 . The red lines represent SLIDE, and the green lines represent sampled softmax on Tensorflow-GPU. We can see that both time and iteration wise, the red lines outperform the green lines significantly. Sampled softmax uses static sampling strategies which are fast compared to SLIDE which in contrast uses adaptively changing hash tables for input specific dynamic sampling. Unfortunately, the uninformative static sampling of softmax leads to poor accuracy as shown by the plot. It should be noted that in these plots, Sampled softmax uses significantly more neurons than SLIDE and still shows poor convergence behavior. Figure 6 clearly confirms the need for adaptive sampling of neurons (in proportion to input dependent activation) for sparsifying neural networks in order to retain good convergence. Without adaptive sampling we get very poor convergence. This phenomenon supports our choice of LSH based adaptive sampling.
Effect of Batch Size
Batch size is a crucial parameter that can affect the training speed and model quality in Machine Learning. In general, a large batch size may help in reducing the training time per epoch as we process more gradient updates at a time [9] . But large batches are known to be bad from optimization perspective as they reduce the generalization capability [15] . In the case of extreme classification datasets, the number of computations performed is huge owing to large input dimension and a large number of classes. Hence, a larger batch size may not necessarily translate into faster training per epoch. To clarify this, we study the effect of varying batch size on the results. We choose the larger Amazon-670k dataset for this task.
Irrespective of the batch size, we observe that SLIDE outperforms Tensorflow-GPU by a significant margin, as shown in Figure 7. This observation could be attributed to the fact that SLIDE performs very few computations per instance. Our data structures allow us to process all samples in a batch in parallel, and the gradient updates are made asynchronously among threads as described in Section 3.1. This enables effective use of parallel threads and is reflected in superior performance over Tensorflow. It is interesting to note that the gap between SLIDE and Tensorflow widens as the batch size grows from 64 to 256.
Scalability Tests
In previous sections, we have demonstrated the superiority of SLIDE over Tensorflow-GPU, CPU and sampled softmax. In this section, we try to understand the effect of increasing CPU cores on the scalability of SLIDE and Tensorflow-CPU. Besides, we intend to know the number of cores SLIDE needs to outperform Tensorflow. As mentioned before, the machine has 44 cores, and each core has 2 threads. To avoid the overhead and complication of using both threads in the same core, we enforce using one thread per core. Hence, the effective number of threads and cores is the same. We interchangeably use the words "threads" and "cores" from here on. We benchmark both frameworks with 2, 4, 8, 16, 32, 44 threads.
We replicate the setting of the experiments described in Section 4. For each number of threads, we run the same classification experiments on SLIDE and Tensorflow-CPU for both datasets and clock the corresponding convergence time. Figure 8 presents the results. The red, blue, and black lines represent SLIDE, Tensorflow-GPU, and Tensorflow-CPU respectively. It should be noted that the blue line is flat because GPU computations were done on V100 with thousands of cores and are mostly oblivious to the number of CPU cores. When the number of cores increases, the convergence time for both SLIDE and Tensorflow-CPU starts to decrease. This decrease is expected due to the benefits brought by more parallelism on each training batch. For the Delicious dataset, the red line and the black line cross each other around 8 cores, which means that with more than roughly 8 cores, SLIDE can beat Tensorflow-CPU. The red and blue lines intersect between 16 and 32 cores. Hence, SLIDE needs fewer than 32 cores to outperform Tensorflow-GPU on the Delicious dataset. Similarly, for the larger Amazon dataset, the red and black lines never intersect, and the red and blue lines intersect at 8 cores. This means that SLIDE beats Tensorflow-GPU with as few as 8 CPU cores and Tensorflow-CPU with as few as 2 CPU cores.
Moreover, based on the statistics collected through the experiments mentioned above, we show the ratio of the convergence time with each number of cores to the minimum convergence time (using 44 cores). The results are exhibited in Figure 8. Again, the red line represents SLIDE, and the black line represents Tensorflow-CPU. When the number of cores increases, that ratio decreases for both SLIDE and Tensorflow-CPU. However, the ratio clearly drops more drastically for the red line than for the black line. This behavior shows that the scalability of SLIDE is much better than that of Tensorflow-CPU. Moreover, in the plot, we observe that the benefits of using more cores are not obvious after 16 cores for Tensorflow-CPU. Coincidentally, a very recent work [11] discusses the hardness of finding the optimal parameter settings of Tensorflow's threading model for CPU backends. It argues that getting the best performance from a CPU needs manual, tedious and time-consuming tuning and that even this may not guarantee the best performance. While analyzing the scalability and core utilization of Tensorflow-CPU can be an independent research interest, we explore a small aspect of it in the following paragraphs.
Figure 10: Inefficiencies in CPU Usage: We observe that memory bound inefficiencies (orange bars) are the most significant ones for either algorithm. For Tensorflow-CPU, memory bound inefficiency rises with increasing number of cores. This corroborates our previous observation (in Figure 8) that performance of TF-CPU stalls after 16 cores. For SLIDE, the memory bottleneck reduces with increasing number of cores. Hence, SLIDE takes better advantage of higher CPU cores.
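The scalability metrics used in this section, namely the ratio of each convergence time to the minimum convergence time and the smallest core count at which one system beats a flat baseline, can be computed as in the sketch below. The function names are hypothetical, and the timings in the demo are made-up placeholders chosen only to exercise the code; they are not measurements from our experiments.

```python
def scaling_ratio(times_by_cores):
    """Convergence time at each core count divided by the minimum convergence time."""
    best = min(times_by_cores.values())
    return {cores: t / best for cores, t in sorted(times_by_cores.items())}


def cores_needed_to_beat(times_by_cores, baseline_time):
    """Smallest core count whose convergence time is below a flat baseline, else None."""
    for cores, t in sorted(times_by_cores.items()):
        if t < baseline_time:
            return cores
    return None


if __name__ == "__main__":
    # PLACEHOLDER timings in hours, invented for illustration only.
    placeholder_times = {2: 12.0, 4: 6.5, 8: 3.6, 16: 2.1, 32: 1.3, 44: 1.0}
    placeholder_gpu_time = 4.0  # flat line: independent of the CPU core count

    print(scaling_ratio(placeholder_times))
    print(cores_needed_to_beat(placeholder_times, placeholder_gpu_time))
```

A ratio curve that keeps falling toward 1 as cores are added indicates that extra cores are still being put to use, whereas a curve that flattens early indicates stalled scaling.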
Inefficiency Diagnosis:
We profile and analyze Tensorflow-CPU and SLIDE with a state-of-the-art parallel performance analyzer tool, the Intel VTune Performance Analyzer [23]. Table 4.3 exhibits the results of the core utilization comparison between both frameworks using 8, 16, and 32 threads for the above tasks. We can see that for Tensorflow-CPU, the utilization is generally low (< 50%). It further decreases with more threads. For SLIDE, the core utilization is stable (around 80%) across all thread counts presented in the table.
Moreover, Figure 4.3 presents the distribution of inefficiencies in CPU usage for Tensorflow-CPU and SLIDE. It should be noted that according to 4.3, the overall inefficiencies of Tensorflow-CPU are much greater than those of SLIDE in general. Thus the distribution in plot 4.3 is based on those inefficiencies. It is obvious that being memory bound is a major issue for all numbers of threads in the histogram. The biggest bottleneck is that a significant fraction of execution pipeline slots is stalled due to demand memory loads and stores. An interesting observation is that the higher the number of cores Tensorflow-CPU uses, the more memory bound it becomes. On the other hand, the higher the number of cores SLIDE uses, the less memory bound it becomes. Recall that the critical advantage of SLIDE is that it has far fewer active neurons and sparse gradient updates. Naturally, SLIDE makes far fewer memory accesses than Tensorflow-CPU because the memory accesses within each thread are very sparse.
In SLIDE, our choice of using extra arrays to separate the computations of each thread and asynchronous accumulation of gradients (section 3.1) across all the threads ensures that simple OpenMP parallelism is sufficient to get near-peak utilization.
Design Choice Comparisons
In Section 3.1, we present several design choices in SLIDE which have different trade-offs and performance behavior, e.g., executing MIPS efficiently to select active neurons, adopting the optimal policies for neuron insertion into hash tables, etc. In this section, we substantiate those design choices with key metrics and insights. In order to analyze them in more practical settings, we benchmark them on real classification tasks on the Delicious-200K dataset. See Section 4 for detailed settings.
Evaluating Sampling Strategies
Sampling is a crucial step in SLIDE. The quality and quantity of selected neurons and the overhead of the selection strategy significantly affect SLIDE's performance. We profile the running time of these strategies, including Vanilla sampling, TopK thresholding, and Hard thresholding, for selecting different numbers of neurons from the hash tables during the first epoch of the classification task. Figure 9 presents the results. The blue, red and green dots represent Vanilla sampling, TopK thresholding, and Hard thresholding respectively. It shows that the TopK thresholding strategy consistently takes orders of magnitude more time than Vanilla sampling and Hard thresholding across all numbers of samples. Also, we can see that the green dots are just slightly higher than the blue dots, meaning that the running time of Hard thresholding is slightly higher than that of Vanilla sampling. Note that the y-axis is in log scale. Therefore, when the number of samples increases, the rate of change for the red dots is much larger than that of the others. This is not surprising because the TopK thresholding strategy is based on sorting, which has O(n log n) running time. Therefore, in practice, we suggest choosing either Vanilla sampling or Hard thresholding for efficiency. For instance, we use Vanilla sampling in our extreme classification experiments because it is the most efficient one. Furthermore, the difference in iteration-wise convergence between the tasks with TopK thresholding and Vanilla sampling is negligible.
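The relative costs above can be read off a minimal sketch of the three strategies. The sketch assumes that each strategy starts from the buckets retrieved from the L hash tables for one input, that Vanilla sampling scans tables in random order until a budget is met, that TopK thresholding ranks neurons by how many buckets contain them, and that Hard thresholding keeps neurons that appear at least a fixed number of times. The function names and these exact formulations are assumptions for illustration and may differ from the implementation described in Section 3.1.

```python
import random
from collections import Counter


def vanilla_sampling(buckets, budget, rng):
    """Visit tables in random order; stop once `budget` distinct ids are collected."""
    order = list(range(len(buckets)))
    rng.shuffle(order)
    chosen = set()
    for table in order:
        for neuron_id in buckets[table]:
            chosen.add(neuron_id)
            if len(chosen) >= budget:
                return chosen
    return chosen


def topk_thresholding(buckets, k):
    """Count occurrences across all buckets, sort by count, keep the k most frequent."""
    counts = Counter(nid for bucket in buckets for nid in bucket)
    ranked = sorted(counts, key=lambda nid: (-counts[nid], nid))  # the O(n log n) step
    return set(ranked[:k])


def hard_thresholding(buckets, min_count):
    """Keep every id that occurs in at least `min_count` bucket entries; no sorting."""
    counts = Counter(nid for bucket in buckets for nid in bucket)
    return {nid for nid, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    # Buckets retrieved for one input from three hash tables (toy values).
    buckets = [[1, 2, 3], [2, 3], [3, 4]]
    print(vanilla_sampling(buckets, budget=2, rng=random.Random(0)))
    print(topk_thresholding(buckets, k=2))
    print(hard_thresholding(buckets, min_count=2))
```

Vanilla sampling can stop before touching every table, Hard thresholding needs one counting pass over all buckets, and TopK thresholding adds a sort on top of that pass, which is consistent with the ordering of running times reported above.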
Addition to Hashtables
SLIDE supports two implementations of insertion policies for hash tables described in Section 3.1. We profile the running time of the two strategies, Reservoir Sampling and FIFO. After the weights and hash tables initialization, we clock the time of both strategies for insertions of all 205,443 neurons in the last layer of the network, where 205,443 is the number of classes for Delicious dataset. Then we also benchmark the time of whole insertion process including generating the hash codes for each neuron before inserting them into hash tables.
The results are shown in Table 4.4.1. The column "Full Insertion" represents the overall time for the process of adding all neurons to the hash tables. The column "Insertion to HT" represents the time for adding all the neurons to the hash tables, excluding the time for computing the hash codes. The Reservoir Sampling strategy is more efficient than FIFO. From an algorithmic view, Reservoir Sampling inserts with some probability, whereas FIFO guarantees successful insertions. We observe that there are more memory accesses with FIFO. However, compared to the full insertion time, the benefits of Reservoir Sampling are still negligible. Therefore, we can choose either strategy based on practical utility. For instance, we use FIFO in our experiments in Section 4.
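The difference between the two insertion policies can be illustrated with fixed-capacity buckets, as in the following sketch. The class names are hypothetical, the Reservoir variant is the textbook reservoir sampling algorithm, and the assumption that each hash bucket has a fixed capacity is made here for illustration.

```python
import random
from collections import deque


class FIFOBucket:
    """Fixed-capacity bucket: every insertion succeeds and evicts the oldest entry if full."""

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)

    def insert(self, neuron_id):
        self.items.append(neuron_id)  # always writes to memory
        return True


class ReservoirBucket:
    """Fixed-capacity bucket holding a uniform random sample of everything offered."""

    def __init__(self, capacity, rng):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = rng

    def insert(self, neuron_id):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(neuron_id)
            return True
        slot = self.rng.randrange(self.seen)  # uniform in [0, seen)
        if slot < self.capacity:
            self.items[slot] = neuron_id  # replace with probability capacity / seen
            return True
        return False  # insertion skipped: no write to the bucket


if __name__ == "__main__":
    fifo = FIFOBucket(capacity=3)
    reservoir = ReservoirBucket(capacity=3, rng=random.Random(0))
    writes = 0
    for neuron_id in range(1000):
        fifo.insert(neuron_id)
        writes += reservoir.insert(neuron_id)
    print("FIFO bucket writes: 1000, reservoir bucket writes:", writes)
    print(list(fifo.items), reservoir.items)
```

Because a full reservoir bucket rejects most later insertions, it performs fewer writes than FIFO, which matches the observation above that FIFO incurs more memory accesses.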
Future Work
SLIDE currently only supports fully connected architecture and multi-core parallelism. Naturally, our next steps are to extend SLIDE to include convolutional layers. SLIDE has unique benefits when it comes to random memory access and parallelism. We anticipate that a distributed implementation of SLIDE would be very appealing in many ways especially given our current results. Because our gradient updates are sparse, the communication costs are minimized in distributed setting. Finally, SLIDE did not take advantage of any parallel optimized CPU instructions. It is an exciting direction to explore whether we can leverage parallel CPU instructions to speed up SLIDE even further.
Conclusion
In this paper, we provide the first evidence that a smart algorithm with modest CPU OpenMP parallelism can outperform the best available hardware, the NVIDIA V100, for training large deep learning architectures. Our system SLIDE is a combination of carefully tailored randomized hashing algorithms with the right data structures that allow asynchronous parallelism.
Currently, there is a prevailing wisdom that hardware acceleration is the future of large-scale deep learning. We hope this paper will compel the community to rethink algorithmic alternatives for back-propagation before making their significant investment in the hardware.
