DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables
A lot of recent progress has been made in ultra low-bit quantization,
promising significant improvements in latency, memory footprint and energy
consumption on edge devices. Quantization methods such as Learned Step Size
Quantization can achieve model accuracy that is comparable to full-precision
floating-point baselines even with sub-byte quantization. However, it is
extremely challenging to deploy these ultra low-bit quantized models on
mainstream CPU devices because commodity SIMD (Single Instruction, Multiple
Data) hardware typically supports no less than 8-bit precision. To overcome
this limitation, we propose DeepGEMM, a lookup table based approach for the
execution of ultra low-precision convolutional neural networks on SIMD
hardware. The proposed method precomputes all possible products of weights and
activations, stores them in a lookup table, and efficiently accesses them at
inference time to avoid costly multiply-accumulate operations. Our 2-bit
implementation outperforms corresponding 8-bit integer kernels in the QNNPACK
framework by up to 1.74x on x86 platforms.
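The lookup-table idea described above can be sketched in a few lines. This is an illustrative toy, not DeepGEMM's actual kernels: the function names, the 2-bit level sets, and the table layout are all assumptions. It precomputes every weight-level x activation-level product once, then replaces each multiply-accumulate with a table lookup and an add.

```python
def build_lut(weight_levels, activation_levels):
    """Precompute every product of a quantized weight level and activation level."""
    return [[w * a for a in activation_levels] for w in weight_levels]

def lut_dot(w_codes, a_codes, lut):
    """Dot product using only table lookups and additions (no multiplies)."""
    acc = 0.0
    for wc, ac in zip(w_codes, a_codes):
        acc += lut[wc][ac]
    return acc

# 2-bit weights -> 4 levels; 2-bit activations -> 4 levels (values assumed).
weight_levels = [-1.0, -0.5, 0.5, 1.0]
activation_levels = [0.0, 1.0, 2.0, 3.0]
lut = build_lut(weight_levels, activation_levels)

# Operand vectors are stored as 2-bit codes, i.e. indices into the level tables.
w_codes = [0, 3, 2, 1]
a_codes = [1, 2, 0, 3]
print(lut_dot(w_codes, a_codes, lut))  # -0.5, same as multiplying decoded values
```

A real SIMD implementation would keep the (small) table in vector registers and gather many entries per instruction; the sketch only shows why no multiplier narrower than 8 bits is needed.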
Four Soviets Walk the Dog: Improved Bounds for Computing the Fréchet Distance
Given two polygonal curves in the plane, there are many ways to define a
notion of similarity between them. One popular measure is the Fréchet
distance. Since it was proposed by Alt and Godau in 1992, many variants and
extensions have been studied. Nonetheless, even more than 20 years later, the
original O(n² log n) algorithm by Alt and Godau for computing the Fréchet
distance remains the state of the art (here, n denotes the number of edges on
each curve). This has led Helmut Alt to conjecture that the associated decision
problem is 3SUM-hard.
In recent work, Agarwal et al. show how to break the quadratic barrier for
the discrete version of the Fréchet distance, where one considers sequences
of points instead of polygonal curves. Building on their work, we give a
randomized algorithm to compute the Fréchet distance between two polygonal
curves in time O(n² √(log n) (log log n)^{3/2}) on a pointer machine
and in time O(n² (log log n)²) on a word RAM. Furthermore, we show that
there exists an algebraic decision tree for the decision problem of depth
O(n^{2−ε}), for some ε > 0. We believe that this
reveals an intriguing new aspect of this well-studied problem. Finally, we show
how to obtain the first subquadratic algorithm for computing the weak Fréchet
distance on a word RAM.
Comment: 34 pages, 15 figures. A preliminary version appeared in SODA 2014.
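The discrete variant mentioned above admits a simple quadratic dynamic program, the classic Eiter-Mannila formulation; this is the well-known baseline, not the paper's subquadratic algorithm. A minimal sketch:

```python
from functools import lru_cache

def discrete_frechet(P, Q):
    """Quadratic DP for the discrete Frechet distance between point sequences.

    c(i, j) is the cheapest 'leash length' needed when the two walkers stand
    at P[i] and Q[j]; each step advances one or both walkers by one point.
    """
    def d(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

    @lru_cache(maxsize=None)
    def c(i, j):
        if i == 0 and j == 0:
            return d(P[0], Q[0])
        if i == 0:
            return max(c(0, j - 1), d(P[0], Q[j]))
        if j == 0:
            return max(c(i - 1, 0), d(P[i], Q[0]))
        return max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)),
                   d(P[i], Q[j]))

    return c(len(P) - 1, len(Q) - 1)

# Two parallel horizontal chains at vertical distance 1: the leash never
# needs to exceed 1.
P = [(0, 0), (1, 0), (2, 0)]
Q = [(0, 1), (1, 1), (2, 1)]
print(discrete_frechet(P, Q))  # 1.0
```

The table has one entry per pair of points, which is exactly the quadratic barrier that Agarwal et al. (and this paper, for the continuous case) break.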
A subquadratic algorithm for 3XOR
Given a set X of n binary words of equal length w, the 3XOR problem
asks for three elements a, b, c ∈ X such that a ⊕ b ⊕ c = 0, where ⊕
denotes the bitwise XOR operation. The problem can be easily solved on
a word RAM with word length w in time O(n² log n). Using Han's fast
integer sorting algorithm (2002/2004) this can be reduced to
O(n² log log n). With randomization or a sophisticated deterministic dictionary
construction, creating a hash table for X with constant lookup time leads to
an algorithm with (expected) running time O(n²). At present, seemingly no
faster algorithms are known. We present a surprisingly simple deterministic,
quadratic time algorithm for 3XOR. Its core is a version of the Patricia trie
for X, which makes it possible to traverse the set X ⊕ z in ascending
order for arbitrary z ∈ {0,1}^w in linear time.
Furthermore, we describe a randomized algorithm for 3XOR with subquadratic
expected running time. The
algorithm transfers techniques to our setting that were used by Baran, Demaine,
and Pătrașcu (2005/2008) for solving the related int3SUM problem (the
same problem with integer addition in place of binary XOR) in expected time
O(n² (log log n)²/log² n). As suggested by Jafargholi and Viola (2016), linear hash functions
are employed. The latter authors also showed that, assuming 3XOR needs expected
running time n^{2−o(1)}, one can prove conditional lower bounds for triangle
enumeration just as with 3SUM. We demonstrate that 3XOR can be reduced to other
problems as well, treating the examples offline SetDisjointness and offline
SetIntersection, which were studied for 3SUM by Kopelowitz, Pettie, and Porat
(2016).
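The hash-table baseline described above is easy to sketch. This toy version (the function name is assumed) builds a constant-lookup set and scans all pairs, giving expected quadratic time; Python's built-in set stands in for the hash table or deterministic dictionary:

```python
def three_xor(X):
    """Return (a, b, c) from X with a ^ b ^ c == 0, or None if none exists.

    Expected O(n^2): one set lookup per pair, since a ^ b ^ c == 0 is
    equivalent to c == a ^ b.
    """
    S = set(X)
    xs = list(X)
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            c = xs[i] ^ xs[j]
            # Require three distinct set elements.
            if c in S and c != xs[i] and c != xs[j]:
                return xs[i], xs[j], c
    return None

words = [0b1011, 0b0110, 0b1101, 0b0001]
print(three_xor(words))  # (11, 6, 13): 1011 ^ 0110 ^ 1101 == 0
```

The paper's deterministic algorithm removes the hashing by traversing X ⊕ z in sorted order via a Patricia-trie-like structure; the pair loop itself stays quadratic in both versions.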
Optimal Packed String Matching
In the packed string matching problem, each machine word accommodates α characters, thus an n-character text occupies n/α memory words. We extend the Crochemore-Perrin constant-space O(n)-time string matching algorithm to run in optimal O(n/α) time, and even in real-time, achieving a factor-α speedup over traditional algorithms that examine each character individually. Our solution can be efficiently implemented, unlike prior theoretical packed string matching work. We adapt the standard RAM model and only use its AC^0 instructions (i.e., no multiplication) plus two specialized AC^0 packed string instructions. The main string-matching instruction is available in commodity processors (i.e., Intel's SSE4.2 and AVX Advanced String Operations); the other maximal-suffix instruction is only required during pattern preprocessing. In the absence of these two specialized instructions, we propose a theoretically efficient emulation using integer multiplication (not AC^0) and table lookup.
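To make the factor-α speedup concrete, here is a toy sketch (not the Crochemore-Perrin adaptation, and no specialized instruction): with α = 8 one-byte characters packed into a 64-bit word, a single integer comparison decides equality of eight characters at once. The `pack` helper is an assumption for illustration.

```python
def pack(s, alpha=8):
    """Pack alpha bytes per machine word (big-endian within each word,
    zero-padded at the end), mimicking a packed text layout."""
    words = []
    for i in range(0, len(s), alpha):
        block = s[i:i + alpha].ljust(alpha, b'\x00')
        words.append(int.from_bytes(block, 'big'))
    return words

text = b"abcdefghabcdifgh"
w = pack(text)
# w[0] encodes "abcdefgh" and w[1] encodes "abcdifgh"; one integer
# comparison checks all eight characters of each block at once.
print(w[0] == w[1])                     # False ('e' vs 'i' differ)
print(pack(b"abcdefgh")[0] == w[0])     # True
```

Real packed matchers must also handle patterns that straddle word boundaries, which is where the specialized string-matching instruction (or its emulation) does the work.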
Towards optimal packed string matching
Dedicated to Professor Gad M. Landau, on the occasion of his 60th birthday.
Keywords: string matching; word-RAM; packed strings
In the packed string matching problem, it is assumed that each machine word can accommodate up to α characters, thus an n-character string occupies n/α memory words. The main word-size string-matching instruction wssm is available in contemporary commodity processors. The other word-size maximum-suffix instruction wslm is only required during the pattern pre-processing. Benchmarks show that our solution can be efficiently implemented, unlike some prior theoretical packed string matching work. (b) We also consider the complexity of the packed string matching problem in the classical word-RAM model in the absence of the specialized micro-level instructions wssm and wslm. We propose micro-level algorithms for the theoretically efficient emulation, using parallel algorithms techniques to emulate wssm and the Four-Russians technique to emulate wslm. Surprisingly, our bit-parallel emulation of wssm also leads to a new simplified parallel random access machine string-matching algorithm. As a byproduct to facilitate our results, we develop a new algorithm for finding the leftmost (most significant) 1 bits in consecutive non-overlapping blocks of uniform size inside a word. This latter problem is not known to be reducible to finding the rightmost 1, which can be easily solved, since we do not know how to reverse the bits of a word in O(1) time.
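The closing remark contrasts the easy rightmost-1 problem with the harder per-block leftmost-1 problem. A small sketch makes the asymmetry concrete: isolating the least-significant set bit is a one-instruction trick, while the per-block most-significant-bit routine below is a naive reference loop, not the paper's word-parallel construction.

```python
def rightmost_one(x):
    """Isolate the least-significant set bit in O(1): the classic x & -x trick."""
    return x & -x

def leftmost_ones_per_block(x, word_bits=16, block_bits=4):
    """Reference (non-O(1)) routine: in each non-overlapping block of
    block_bits bits, keep only the most significant 1 bit."""
    out = 0
    for b in range(0, word_bits, block_bits):
        block = (x >> b) & ((1 << block_bits) - 1)
        if block:
            out |= (1 << (block.bit_length() - 1)) << b
    return out

x = 0b0110_0001_0000_1010
print(bin(rightmost_one(x)))            # 0b10
print(bin(leftmost_ones_per_block(x)))  # 0b100000100001000
```

The word-RAM challenge is doing what the loop does for all blocks simultaneously in O(1) operations, which is exactly what an O(1) bit-reversal primitive would make trivial via the rightmost-1 trick.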
Image Display and Manipulation System (IDAMS) program documentation, Appendixes A-D
The IDAMS Processor is a package of task routines and support software that performs convolution filtering, image expansion, fast Fourier transformation, and other operations on a digital image tape. A unique task control card for that program, together with any necessary parameter cards, selects each processing technique to be applied to the input image. A variable number of tasks can be selected for execution by including the proper task and parameter cards in the input deck. An executive maintains control of the run; it initiates execution of each task in turn and handles any necessary error processing.
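As a rough illustration of the convolution filtering task such a routine performs (purely a modern sketch, entirely unrelated to the original IDAMS code):

```python
def convolve2d(image, kernel):
    """Naive 2D convolution in 'valid' mode: slide the kernel over the image
    and sum elementwise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
box = [[1, 1],
       [1, 1]]  # 2x2 box filter (unnormalized smoothing kernel)
print(convolve2d(image, box))  # [[12, 16], [24, 28]]
```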