
    DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables

    A lot of recent progress has been made in ultra low-bit quantization, promising significant improvements in latency, memory footprint, and energy consumption on edge devices. Quantization methods such as Learned Step Size Quantization can achieve model accuracy comparable to full-precision floating-point baselines even with sub-byte quantization. However, it is extremely challenging to deploy these ultra low-bit quantized models on mainstream CPU devices because commodity SIMD (Single Instruction, Multiple Data) hardware typically supports no less than 8-bit precision. To overcome this limitation, we propose DeepGEMM, a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware. The proposed method precomputes all possible products of weights and activations, stores them in a lookup table, and efficiently accesses them at inference time to avoid costly multiply-accumulate operations. Our 2-bit implementation outperforms corresponding 8-bit integer kernels in the QNNPACK framework by up to 1.74x on x86 platforms.
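
    A minimal Python/NumPy sketch of the lookup-table idea described above, assuming illustrative 2-bit codebooks; the level values, function names, and NumPy formulation are not from the paper, whose kernels target SIMD registers rather than Python:

```python
import numpy as np

# With 2-bit weights and 2-bit activations there are only 4 x 4 possible
# products, so a dot product can be computed by table lookups plus adds.

WEIGHT_LEVELS = np.array([-2, -1, 0, 1], dtype=np.int32)  # assumed 2-bit weight codebook
ACT_LEVELS    = np.array([0, 1, 2, 3], dtype=np.int32)    # assumed 2-bit activation codebook

# Precompute every possible weight x activation product once.
LUT = WEIGHT_LEVELS[:, None] * ACT_LEVELS[None, :]        # shape (4, 4)

def lut_dot(w_codes: np.ndarray, a_codes: np.ndarray) -> int:
    """Dot product of 2-bit weight codes and 2-bit activation codes
    using only table lookups and additions (no multiplications)."""
    return int(LUT[w_codes, a_codes].sum())

# Tiny usage example.
w = np.array([0, 3, 1, 2], dtype=np.intp)   # weight code indices
a = np.array([1, 2, 3, 0], dtype=np.intp)   # activation code indices
assert lut_dot(w, a) == int((WEIGHT_LEVELS[w] * ACT_LEVELS[a]).sum())
```

    Because only 16 distinct products exist, the whole table stays resident in registers or L1 cache, which is what makes replacing multiply-accumulate operations with lookups attractive on commodity CPUs.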

    Four Soviets Walk the Dog: Improved Bounds for Computing the Fréchet Distance

    Given two polygonal curves in the plane, there are many ways to define a notion of similarity between them. One popular measure is the Fréchet distance. Since it was proposed by Alt and Godau in 1992, many variants and extensions have been studied. Nonetheless, even more than 20 years later, the original $O(n^2 \log n)$ algorithm by Alt and Godau for computing the Fréchet distance remains the state of the art (here, $n$ denotes the number of edges on each curve). This has led Helmut Alt to conjecture that the associated decision problem is 3SUM-hard. In recent work, Agarwal et al. show how to break the quadratic barrier for the discrete version of the Fréchet distance, where one considers sequences of points instead of polygonal curves. Building on their work, we give a randomized algorithm to compute the Fréchet distance between two polygonal curves in time $O(n^2 \sqrt{\log n}\,(\log\log n)^{3/2})$ on a pointer machine and in time $O(n^2 (\log\log n)^2)$ on a word RAM. Furthermore, we show that there exists an algebraic decision tree for the decision problem of depth $O(n^{2-\varepsilon})$, for some $\varepsilon > 0$. We believe that this reveals an intriguing new aspect of this well-studied problem. Finally, we show how to obtain the first subquadratic algorithm for computing the weak Fréchet distance on a word RAM. Comment: 34 pages, 15 figures. A preliminary version appeared in SODA 201
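
    For readers unfamiliar with the problem, the sketch below is the textbook quadratic dynamic program for the discrete Fréchet distance mentioned above (point sequences rather than curves). It is only a baseline illustration, not the paper's randomized subquadratic-in-log-factors algorithm, and the function name is ours:

```python
from math import dist, inf

def discrete_frechet(P, Q):
    """Classical O(nm) dynamic program for the discrete Frechet distance
    between point sequences P and Q: D[i][j] is the smallest bottleneck
    cost of a coupling of the prefixes P[:i+1] and Q[:j+1]."""
    n, m = len(P), len(Q)
    D = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(P[i], Q[j])
            if i == 0 and j == 0:
                best_prev = 0.0
            else:
                best_prev = min(
                    D[i - 1][j] if i > 0 else inf,
                    D[i][j - 1] if j > 0 else inf,
                    D[i - 1][j - 1] if i > 0 and j > 0 else inf,
                )
            D[i][j] = max(best_prev, d)
    return D[n - 1][m - 1]

# Usage example on two short parallel polylines.
print(discrete_frechet([(0, 0), (1, 0), (2, 0)], [(0, 1), (1, 1), (2, 1)]))  # 1.0
```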

    A subquadratic algorithm for 3XOR

    Given a set $X$ of $n$ binary words of equal length $w$, the 3XOR problem asks for three elements $a, b, c \in X$ such that $a \oplus b = c$, where $\oplus$ denotes the bitwise XOR operation. The problem can be easily solved on a word RAM with word length $w$ in time $O(n^2 \log n)$. Using Han's fast integer sorting algorithm (2002/2004) this can be reduced to $O(n^2 \log\log n)$. With randomization or a sophisticated deterministic dictionary construction, creating a hash table for $X$ with constant lookup time leads to an algorithm with (expected) running time $O(n^2)$. At present, seemingly no faster algorithms are known. We present a surprisingly simple deterministic, quadratic time algorithm for 3XOR. Its core is a version of the Patricia trie for $X$, which makes it possible to traverse the set $a \oplus X$ in ascending order for arbitrary $a \in \{0, 1\}^w$ in linear time. Furthermore, we describe a randomized algorithm for 3XOR with expected running time $O(n^2 \cdot \min\{\log^3 w / w, (\log\log n)^2 / \log^2 n\})$. The algorithm transfers techniques to our setting that were used by Baran, Demaine, and Pătrașcu (2005/2008) for solving the related int3SUM problem (the same problem with integer addition in place of binary XOR) in expected time $o(n^2)$. As suggested by Jafargholi and Viola (2016), linear hash functions are employed. The latter authors also showed that assuming 3XOR needs expected running time $n^{2-o(1)}$ one can prove conditional lower bounds for triangle enumeration just as with 3SUM. We demonstrate that 3XOR can be reduced to other problems as well, treating the examples offline SetDisjointness and offline SetIntersection, which were studied for 3SUM by Kopelowitz, Pettie, and Porat (2016).
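
    The quadratic hash-table baseline mentioned in the abstract is easy to state; a small Python sketch (with an illustrative function name) might look like this:

```python
from itertools import combinations

def three_xor(X):
    """Quadratic hash-set baseline for 3XOR: store X, then test every
    pair (a, b) for a ^ b in X. Returns a triple (a, b, c) with
    a ^ b == c, or None if no such triple exists."""
    table = set(X)
    for a, b in combinations(X, 2):
        c = a ^ b
        if c in table:
            return a, b, c
    return None

# Usage example: 0b0011 ^ 0b0101 == 0b0110, which is in the set.
print(three_xor([0b0011, 0b0101, 0b0110, 0b1000]))  # (3, 5, 6)
```

    The results above keep this pair-enumeration structure but replace the hash table with a Patricia-trie traversal (deterministic, still quadratic) or shave polylogarithmic factors using randomization and linear hash functions.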

    Optimal Packed String Matching

    In the packed string matching problem, each machine word accommodates α characters, thus an n-character text occupies n/α memory words. We extend the Crochemore-Perrin constant-space O(n)-time string matching algorithm to run in optimal O(n/α) time and even in real-time, achieving a factor α speedup over traditional algorithms that examine each character individually. Our solution can be efficiently implemented, unlike prior theoretical packed string matching work. We adapt the standard RAM model and only use its AC^0 instructions (i.e., no multiplication) plus two specialized AC^0 packed string instructions. The main string-matching instruction is available in commodity processors (i.e., Intel's SSE4.2 and AVX Advanced String Operations); the other maximal-suffix instruction is only required during pattern preprocessing. In the absence of these two specialized instructions, we propose a theoretically efficient emulation using integer multiplication (not AC^0) and table lookup.
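
    As a rough illustration of the packing idea only (not the Crochemore-Perrin extension or the SSE4.2 kernels), the sketch below packs an assumed α = 8 one-byte characters per 64-bit word and verifies candidate occurrences word-by-word; the packing layout and function names are assumptions for the example:

```python
ALPHA = 8  # assumed: 8 one-byte characters per 64-bit machine word

def pack_words(s: bytes):
    """Pack a byte string into ALPHA-character machine words (little-endian)."""
    padded = s + b"\0" * (-len(s) % ALPHA)
    return [int.from_bytes(padded[i:i + ALPHA], "little")
            for i in range(0, len(padded), ALPHA)]

def packed_match(text: bytes, pattern: bytes):
    """Return all occurrences of pattern in text, verifying each candidate
    ALPHA characters at a time via word comparisons instead of one character
    at a time (the factor-alpha idea; real kernels use SIMD string ops)."""
    m = len(pattern)
    pat_words = pack_words(pattern)
    hits = []
    for pos in range(len(text) - m + 1):
        if pack_words(text[pos:pos + m]) == pat_words:
            hits.append(pos)
    return hits

# Usage example.
print(packed_match(b"xxabracadabrayyabra", b"abracadabra"))  # [2]
```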

    Towards optimal packed string matching

    Dedicated to Professor Gad M. Landau, on the occasion of his 60th birthday. Keywords: string matching; word-RAM; packed strings. In the packed string matching problem, it is assumed that each machine word can accommodate up to α characters, thus an n-character string occupies n/α memory words. The main word-size string-matching instruction wssm is available in contemporary commodity processors. The other word-size maximum-suffix instruction wslm is only required during the pattern pre-processing. Benchmarks show that our solution can be efficiently implemented, unlike some prior theoretical packed string matching work. (b) We also consider the complexity of the packed string matching problem in the classical word-RAM model in the absence of the specialized micro-level instructions wssm and wslm. We propose micro-level algorithms for a theoretically efficient emulation, using parallel-algorithms techniques to emulate wssm and the Four-Russians technique to emulate wslm. Surprisingly, our bit-parallel emulation of wssm also leads to a new, simplified parallel random access machine string-matching algorithm. As a byproduct, to facilitate our results we develop a new algorithm for finding the leftmost (most significant) 1 bits in consecutive non-overlapping blocks of uniform size inside a word. This latter problem is not known to be reducible to finding the rightmost 1, which can be easily solved, since we do not know how to reverse the bits of a word in O(1) time.
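
    The word-level subproblem mentioned at the end, finding the leftmost 1 in each fixed-size block of a word, is easy to specify with a naive per-block loop. The sketch below is only a reference implementation of that specification, not the constant-time word-level trick the paper develops:

```python
def leftmost_one_per_block(word: int, w: int = 64, b: int = 8):
    """For each non-overlapping b-bit block of a w-bit word, report the
    position (within the block, LSB = 0) of its leftmost, i.e. most
    significant, set bit, or None if the block is all zeros.
    This loops over blocks; the goal in the paper is to avoid the loop."""
    out = []
    for blk in range(w // b):
        bits = (word >> (blk * b)) & ((1 << b) - 1)
        out.append(bits.bit_length() - 1 if bits else None)
    return out

# Usage: a 16-bit word split into 4-bit blocks (block 0 is least significant).
print(leftmost_one_per_block(0b0001_0000_1010_0110, w=16, b=4))
# -> [2, 3, None, 0]
```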

    Weighted Random Sampling - Alias Tables on the GPU


    Image Display and Manipulation System (IDAMS) program documentation, Appendixes A-D

    The IDAMS Processor is a package of task routines and support software that performs convolution filtering, image expansion, fast Fourier transformation, and other operations on a digital image tape. A unique task control card for that program, together with any necessary parameter cards, selects each processing technique to be applied to the input image. A variable number of tasks can be selected for execution by including the proper task and parameter cards in the input deck. An executive maintains control of the run; it initiates execution of each task in turn and handles any necessary error processing.
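
    A minimal sketch of the card-driven control flow described above, with hypothetical card syntax and task names (IDAMS predates Python and used its own control-card format, so everything here is illustrative):

```python
def convolution_filter(params):  # placeholder task routines
    print("filtering with", params)

def image_expansion(params):
    print("expanding with", params)

TASKS = {"FILTER": convolution_filter, "EXPAND": image_expansion}

def run_deck(deck):
    """Executive loop: each 'TASK <name>' card selects a routine, the
    parameter cards that follow are passed to it, and errors in one task
    are reported without stopping the whole run."""
    name, params = None, []
    for card in deck + ["TASK END"]:          # sentinel flushes the last task
        if card.startswith("TASK"):
            if name is not None:
                try:
                    TASKS[name](params)
                except Exception as err:      # simple error processing
                    print(f"error in task {name}: {err}")
            name, params = card.split()[1], []
            if name == "END":
                break
        else:
            params.append(card)

# Usage example with a two-task input deck.
run_deck(["TASK FILTER", "KERNEL=3x3", "TASK EXPAND", "FACTOR=2"])
```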
