
    DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables

    A lot of recent progress has been made in ultra low-bit quantization, promising significant improvements in latency, memory footprint, and energy consumption on edge devices. Quantization methods such as Learned Step Size Quantization can achieve model accuracy comparable to full-precision floating-point baselines even with sub-byte quantization. However, it is extremely challenging to deploy these ultra low-bit quantized models on mainstream CPU devices because commodity SIMD (Single Instruction, Multiple Data) hardware typically supports no less than 8-bit precision. To overcome this limitation, we propose DeepGEMM, a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware. The proposed method precomputes all possible products of weights and activations, stores them in a lookup table, and efficiently accesses them at inference time to avoid costly multiply-accumulate operations. Our 2-bit implementation outperforms corresponding 8-bit integer kernels in the QNNPACK framework by up to 1.74x on x86 platforms.
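
    A minimal Python/NumPy sketch of the lookup-table idea described above, assuming illustrative 2-bit codebooks; the level values, function names, and NumPy formulation are not from the paper, whose kernels target SIMD registers rather than Python:

```python
import numpy as np

# With 2-bit weights and 2-bit activations there are only 4 x 4 possible
# products, so a dot product can be computed by table lookups plus adds.

WEIGHT_LEVELS = np.array([-2, -1, 0, 1], dtype=np.int32)  # assumed 2-bit weight codebook
ACT_LEVELS    = np.array([0, 1, 2, 3], dtype=np.int32)    # assumed 2-bit activation codebook

# Precompute every possible weight x activation product once.
LUT = WEIGHT_LEVELS[:, None] * ACT_LEVELS[None, :]        # shape (4, 4)

def lut_dot(w_codes: np.ndarray, a_codes: np.ndarray) -> int:
    """Dot product of 2-bit weight codes and 2-bit activation codes
    using only table lookups and additions (no multiplications)."""
    return int(LUT[w_codes, a_codes].sum())

# Tiny usage example.
w = np.array([0, 3, 1, 2], dtype=np.intp)   # weight code indices
a = np.array([1, 2, 3, 0], dtype=np.intp)   # activation code indices
assert lut_dot(w, a) == int((WEIGHT_LEVELS[w] * ACT_LEVELS[a]).sum())
```

    Because only 16 distinct products exist, the whole table stays resident in registers or L1 cache, which is what makes replacing multiply-accumulate operations with lookups attractive on commodity CPUs.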

    Four Soviets Walk the Dog: Improved Bounds for Computing the Fréchet Distance

    Given two polygonal curves in the plane, there are many ways to define a notion of similarity between them. One popular measure is the Fréchet distance. Since it was proposed by Alt and Godau in 1992, many variants and extensions have been studied. Nonetheless, even more than 20 years later, the original $O(n^2 \log n)$ algorithm by Alt and Godau for computing the Fréchet distance remains the state of the art (here, $n$ denotes the number of edges on each curve). This has led Helmut Alt to conjecture that the associated decision problem is 3SUM-hard. In recent work, Agarwal et al. show how to break the quadratic barrier for the discrete version of the Fréchet distance, where one considers sequences of points instead of polygonal curves. Building on their work, we give a randomized algorithm to compute the Fréchet distance between two polygonal curves in time $O(n^2 \sqrt{\log n}\,(\log\log n)^{3/2})$ on a pointer machine and in time $O(n^2 (\log\log n)^2)$ on a word RAM. Furthermore, we show that there exists an algebraic decision tree for the decision problem of depth $O(n^{2-\varepsilon})$, for some $\varepsilon > 0$. We believe that this reveals an intriguing new aspect of this well-studied problem. Finally, we show how to obtain the first subquadratic algorithm for computing the weak Fréchet distance on a word RAM. Comment: 34 pages, 15 figures. A preliminary version appeared in SODA 201
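
    For readers unfamiliar with the problem, the sketch below is the textbook quadratic dynamic program for the discrete Fréchet distance mentioned above (point sequences rather than curves). It is only a baseline illustration, not the paper's randomized subquadratic-in-log-factors algorithm, and the function name is ours:

```python
from math import dist, inf

def discrete_frechet(P, Q):
    """Classical O(nm) dynamic program for the discrete Frechet distance
    between point sequences P and Q: D[i][j] is the smallest bottleneck
    cost of a coupling of the prefixes P[:i+1] and Q[:j+1]."""
    n, m = len(P), len(Q)
    D = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(P[i], Q[j])
            if i == 0 and j == 0:
                best_prev = 0.0
            else:
                best_prev = min(
                    D[i - 1][j] if i > 0 else inf,
                    D[i][j - 1] if j > 0 else inf,
                    D[i - 1][j - 1] if i > 0 and j > 0 else inf,
                )
            D[i][j] = max(best_prev, d)
    return D[n - 1][m - 1]

# Usage example on two short parallel polylines.
print(discrete_frechet([(0, 0), (1, 0), (2, 0)], [(0, 1), (1, 1), (2, 1)]))  # 1.0
```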

    A subquadratic algorithm for 3XOR

    Given a set $X$ of $n$ binary words of equal length $w$, the 3XOR problem asks for three elements $a, b, c \in X$ such that $a \oplus b = c$, where $\oplus$ denotes the bitwise XOR operation. The problem can be easily solved on a word RAM with word length $w$ in time $O(n^2 \log n)$. Using Han's fast integer sorting algorithm (2002/2004) this can be reduced to $O(n^2 \log\log n)$. With randomization or a sophisticated deterministic dictionary construction, creating a hash table for $X$ with constant lookup time leads to an algorithm with (expected) running time $O(n^2)$. At present, seemingly no faster algorithms are known. We present a surprisingly simple deterministic, quadratic time algorithm for 3XOR. Its core is a version of the Patricia trie for $X$, which makes it possible to traverse the set $a \oplus X$ in ascending order for arbitrary $a \in \{0, 1\}^w$ in linear time. Furthermore, we describe a randomized algorithm for 3XOR with expected running time $O(n^2 \cdot \min\{\log^3 w / w, (\log\log n)^2 / \log^2 n\})$. The algorithm transfers techniques to our setting that were used by Baran, Demaine, and Pătrașcu (2005/2008) for solving the related int3SUM problem (the same problem with integer addition in place of binary XOR) in expected time $o(n^2)$. As suggested by Jafargholi and Viola (2016), linear hash functions are employed. The latter authors also showed that assuming 3XOR needs expected running time $n^{2-o(1)}$ one can prove conditional lower bounds for triangle enumeration just as with 3SUM. We demonstrate that 3XOR can be reduced to other problems as well, treating the examples offline SetDisjointness and offline SetIntersection, which were studied for 3SUM by Kopelowitz, Pettie, and Porat (2016).
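
    The quadratic hash-table baseline mentioned in the abstract is easy to state; a small Python sketch (with an illustrative function name) might look like this:

```python
from itertools import combinations

def three_xor(X):
    """Quadratic hash-set baseline for 3XOR: store X, then test every
    pair (a, b) for a ^ b in X. Returns a triple (a, b, c) with
    a ^ b == c, or None if no such triple exists."""
    table = set(X)
    for a, b in combinations(X, 2):
        c = a ^ b
        if c in table:
            return a, b, c
    return None

# Usage example: 0b0011 ^ 0b0101 == 0b0110, which is in the set.
print(three_xor([0b0011, 0b0101, 0b0110, 0b1000]))  # (3, 5, 6)
```

    The results above keep this pair-enumeration structure but replace the hash table with a Patricia-trie traversal (deterministic, still quadratic) or shave polylogarithmic factors using randomization and linear hash functions.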

    Optimal Packed String Matching

    In the packed string matching problem, each machine word accommodates α characters, thus an n-character text occupies n/α memory words. We extend the Crochemore-Perrin constant-space O(n)-time string matching algorithm to run in optimal O(n/α) time and even in real-time, achieving a factor α speedup over traditional algorithms that examine each character individually. Our solution can be efficiently implemented, unlike prior theoretical packed string matching work. We adapt the standard RAM model and only use its AC^0 instructions (i.e., no multiplication) plus two specialized AC^0 packed string instructions. The main string-matching instruction is available in commodity processors (i.e., Intel's SSE4.2 and AVX Advanced String Operations); the other maximal-suffix instruction is only required during pattern preprocessing. In the absence of these two specialized instructions, we propose a theoretically efficient emulation using integer multiplication (not AC^0) and table lookup.
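
    As a rough illustration of the packing idea only (not the Crochemore-Perrin extension or the SSE4.2 kernels), the sketch below packs an assumed α = 8 one-byte characters per 64-bit word and verifies candidate occurrences word-by-word; the packing layout and function names are assumptions for the example:

```python
ALPHA = 8  # assumed: 8 one-byte characters per 64-bit machine word

def pack_words(s: bytes):
    """Pack a byte string into ALPHA-character machine words (little-endian)."""
    padded = s + b"\0" * (-len(s) % ALPHA)
    return [int.from_bytes(padded[i:i + ALPHA], "little")
            for i in range(0, len(padded), ALPHA)]

def packed_match(text: bytes, pattern: bytes):
    """Return all occurrences of pattern in text, verifying each candidate
    ALPHA characters at a time via word comparisons instead of one character
    at a time (the factor-alpha idea; real kernels use SIMD string ops)."""
    m = len(pattern)
    pat_words = pack_words(pattern)
    hits = []
    for pos in range(len(text) - m + 1):
        if pack_words(text[pos:pos + m]) == pat_words:
            hits.append(pos)
    return hits

# Usage example.
print(packed_match(b"xxabracadabrayyabra", b"abracadabra"))  # [2]
```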

    Towards optimal packed string matching

    Dedicated to Professor Gad M. Landau, on the occasion of his 60th birthday. Keywords: string matching; word-RAM; packed strings. In the packed string matching problem, it is assumed that each machine word can accommodate up to α characters, thus an n-character string occupies n/α memory words. The main word-size string-matching instruction wssm is available in contemporary commodity processors. The other word-size maximum-suffix instruction wslm is only required during the pattern pre-processing. Benchmarks show that our solution can be efficiently implemented, unlike some prior theoretical packed string matching work. (b) We also consider the complexity of the packed string matching problem in the classical word-RAM model in the absence of the specialized micro-level instructions wssm and wslm. We propose micro-level algorithms for a theoretically efficient emulation, using parallel-algorithms techniques to emulate wssm and the Four-Russians technique to emulate wslm. Surprisingly, our bit-parallel emulation of wssm also leads to a new, simplified parallel random access machine string-matching algorithm. As a byproduct, to facilitate our results we develop a new algorithm for finding the leftmost (most significant) 1 bits in consecutive non-overlapping blocks of uniform size inside a word. This latter problem is not known to be reducible to finding the rightmost 1, which can be easily solved, since we do not know how to reverse the bits of a word in O(1) time.
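
    The word-level subproblem mentioned at the end, finding the leftmost 1 in each fixed-size block of a word, is easy to specify with a naive per-block loop. The sketch below is only a reference implementation of that specification, not the constant-time word-level trick the paper develops:

```python
def leftmost_one_per_block(word: int, w: int = 64, b: int = 8):
    """For each non-overlapping b-bit block of a w-bit word, report the
    position (within the block, LSB = 0) of its leftmost, i.e. most
    significant, set bit, or None if the block is all zeros.
    This loops over blocks; the goal in the paper is to avoid the loop."""
    out = []
    for blk in range(w // b):
        bits = (word >> (blk * b)) & ((1 << b) - 1)
        out.append(bits.bit_length() - 1 if bits else None)
    return out

# Usage: a 16-bit word split into 4-bit blocks (block 0 is least significant).
print(leftmost_one_per_block(0b0001_0000_1010_0110, w=16, b=4))
# -> [2, 3, None, 0]
```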

    Weighted Random Sampling - Alias Tables on the GPU


    Image Display and Manipulation System (IDAMS) program documentation, Appendixes A-D

    The IDAMS Processor is a package of task routines and support software that performs convolution filtering, image expansion, fast Fourier transformation, and other operations on a digital image tape. A unique task control card for that program, together with any necessary parameter cards, selects each processing technique to be applied to the input image. A variable number of tasks can be selected for execution by including the proper task and parameter cards in the input deck. An executive maintains control of the run; it initiates execution of each task in turn and handles any necessary error processing.
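
    A minimal sketch of the card-driven control flow described above, with hypothetical card syntax and task names (IDAMS predates Python and used its own control-card format, so everything here is illustrative):

```python
def convolution_filter(params):  # placeholder task routines
    print("filtering with", params)

def image_expansion(params):
    print("expanding with", params)

TASKS = {"FILTER": convolution_filter, "EXPAND": image_expansion}

def run_deck(deck):
    """Executive loop: each 'TASK <name>' card selects a routine, the
    parameter cards that follow are passed to it, and errors in one task
    are reported without stopping the whole run."""
    name, params = None, []
    for card in deck + ["TASK END"]:          # sentinel flushes the last task
        if card.startswith("TASK"):
            if name is not None:
                try:
                    TASKS[name](params)
                except Exception as err:      # simple error processing
                    print(f"error in task {name}: {err}")
            name, params = card.split()[1], []
            if name == "END":
                break
        else:
            params.append(card)

# Usage example with a two-task input deck.
run_deck(["TASK FILTER", "KERNEL=3x3", "TASK EXPAND", "FACTOR=2"])
```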
