Fast Exact Bayesian Inference for Sparse Signals in the Normal Sequence Model
We consider exact algorithms for Bayesian inference with model selection
priors (including spike-and-slab priors) in the sparse normal sequence model.
Because the best existing exact algorithm becomes numerically unstable for
sample sizes over n=500, much attention has turned to alternative
approaches such as approximate algorithms (Gibbs sampling, variational Bayes,
etc.), shrinkage priors (e.g. the Horseshoe prior and the Spike-and-Slab LASSO),
and empirical Bayesian methods. However, by introducing algorithmic ideas from
online sequential prediction, we show that exact calculations are feasible for
much larger sample sizes: for general model selection priors we reach n=25000,
and for certain spike-and-slab priors we can easily reach n=100000. We further
prove a de Finetti-like result for finite sample sizes that characterizes
exactly which model selection priors can be expressed as spike-and-slab priors.
The computational speed and numerical accuracy of the proposed methods are
demonstrated in experiments on simulated data, on a differential gene
expression data set, and in a comparison of multiple hyper-parameter
settings for the beta-binomial prior. In our experimental evaluation we compute
guaranteed bounds on the numerical accuracy of all new algorithms, which show
that the proposed methods are numerically reliable, whereas an alternative based
on long division is not.
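To make the setting concrete, here is a minimal sketch of exact Bayesian computation in the normal sequence model under an i.i.d. spike-and-slab prior. The hyper-parameter values (`w`, `tau2`) are illustrative assumptions, and the log-sum-exp evaluation shown is a standard stability device, not the paper's algorithm for general model selection priors:

```python
import math

def log_norm_pdf(x, var):
    # log density of N(0, var) at x
    return -0.5 * (math.log(2 * math.pi * var) + x * x / var)

def inclusion_prob(x, w=0.1, tau2=4.0):
    """Posterior probability that theta_i != 0 under the i.i.d.
    spike-and-slab prior in the normal sequence model X_i = theta_i + N(0,1):
    theta_i = 0 with prob. 1-w, theta_i ~ N(0, tau2) with prob. w.
    Marginally, X_i ~ N(0, 1+tau2) under the slab and N(0, 1) under the
    spike. Computed in log space so large |x| stays numerically stable."""
    log_slab = math.log(w) + log_norm_pdf(x, 1.0 + tau2)
    log_spike = math.log(1.0 - w) + log_norm_pdf(x, 1.0)
    m = max(log_slab, log_spike)  # log-sum-exp trick
    log_total = m + math.log(math.exp(log_slab - m) + math.exp(log_spike - m))
    return math.exp(log_slab - log_total)

# A large observation is almost surely signal, a small one almost surely noise.
print(inclusion_prob(8.0) > 0.99)  # → True
print(inclusion_prob(0.1) < 0.2)   # → True
```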
Efficient algorithms for pairing-based cryptosystems
We describe fast new algorithms for implementing recent cryptosystems based on the Tate pairing. In particular, our techniques improve pairing evaluation speed by a factor of about 55 compared to previously known methods in characteristic 3, and attain performance comparable
to that of RSA in larger characteristics. We also propose faster algorithms for scalar multiplication in characteristic 3 and for square root extraction
over F_{p^m}, the latter technique also being useful in contexts other than pairing-based cryptography.
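For intuition on square root extraction in finite fields, here is the textbook special case over a prime field F_p with p ≡ 3 (mod 4); it is a sketch only, and not the paper's F_{p^m} technique. The toy modulus 23 is an assumption for illustration:

```python
def sqrt_mod_p(a, p):
    """Square root in F_p for primes p = 3 (mod 4). If a is a quadratic
    residue then a^((p+1)/4) is a root, because squaring it gives
    a^((p+1)/2) = a * a^((p-1)/2) = a by Euler's criterion."""
    assert p % 4 == 3
    r = pow(a, (p + 1) // 4, p)       # single modular exponentiation
    if (r * r) % p != a % p:
        return None                   # a is a non-residue: no root exists
    return r

print(sqrt_mod_p(4, 23))   # → 2, since 2^2 = 4 (mod 23)
print(sqrt_mod_p(5, 23))   # → None, 5 is a non-residue mod 23
```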
Wavemoth -- Fast spherical harmonic transforms by butterfly matrix compression
We present Wavemoth, an experimental open source code for computing scalar
spherical harmonic transforms (SHTs). Such transforms are ubiquitous in
astronomical data analysis. Our code performs substantially better than
existing publicly available codes due to improvements on two fronts. First, the
computational core is made more efficient by using small amounts of precomputed
data, as well as paying attention to CPU instruction pipelining and cache
usage. Second, Wavemoth makes use of a fast and numerically stable algorithm
based on compressing a set of linear operators in a precomputation step. The
resulting SHT scales as O(L^2 (log L)^2) for the resolution range of practical
interest, where L denotes the spherical harmonic truncation degree. For low and
medium-range resolutions, Wavemoth tends to be twice as fast as libpsht, which
is the current state of the art implementation for the HEALPix grid. At the
resolution of the Planck experiment, L ~ 4000, Wavemoth is between three and
six times faster than libpsht, depending on the computer architecture and the
required precision. Due to the experimental nature of the project, only
spherical harmonic synthesis is currently supported, although adding support for
spherical harmonic analysis should be trivial.
Comment: 13 pages, 6 figures, accepted by ApJ
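The core idea behind compressing a set of linear operators in a precomputation step can be sketched with plain truncated-SVD compression of a numerically low-rank block: factor once, then every application is cheap. This is only the building block of butterfly-style schemes, under an assumed Gaussian test kernel, not Wavemoth's actual compression:

```python
import numpy as np

def compress(block, tol=1e-10):
    """Replace a numerically low-rank matrix block by truncated SVD factors
    (U * s, Vt). Precompute the factors once; each later apply then costs
    O(r(m+n)) instead of O(mn), where r is the numerical rank."""
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))          # numerical rank at tolerance tol
    return U[:, :r] * s[:r], Vt[:r, :]

def apply_compressed(factors, x):
    Us, Vt = factors
    return Us @ (Vt @ x)

# A smooth kernel is numerically low-rank, so it compresses well.
t = np.linspace(0.0, 1.0, 200)
A = np.exp(-np.subtract.outer(t, t) ** 2)    # 200x200 Gaussian kernel matrix
factors = compress(A)
x = np.random.default_rng(0).standard_normal(200)
err = np.linalg.norm(A @ x - apply_compressed(factors, x)) / np.linalg.norm(A @ x)
print(factors[0].shape[1], err)  # rank far below 200, tiny relative error
```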
Multiple point compression on curves
Multiple point compression is an important feature for improving implementations of
elliptic curve cryptography. It can be extended to other curves, in particular hyperelliptic curves, with
divisors represented in Mumford form.
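For background, single-point compression stores only the x-coordinate plus one parity bit of y, and decompression solves the curve equation for y. The following sketch uses a toy curve and prime of my own choosing (with p ≡ 3 mod 4 so the square root is a single exponentiation); it is not a construction from the paper:

```python
# Toy curve y^2 = x^3 + a*x + b over F_p (assumed parameters, illustration only)
p, a, b = 23, 1, 4

def compress_point(x, y):
    # keep x plus the parity of y: one bit instead of a full coordinate
    return x, y & 1

def decompress_point(x, parity):
    rhs = (x * x * x + a * x + b) % p
    y = pow(rhs, (p + 1) // 4, p)    # modular sqrt, valid since p = 3 (mod 4)
    assert (y * y) % p == rhs, "x is not the abscissa of a curve point"
    if y & 1 != parity:
        y = p - y                    # the other root has the other parity
    return x, y

pt = (0, 2)                          # on the curve: 2^2 = 0 + 0 + 4 (mod 23)
print(decompress_point(*compress_point(*pt)))  # → (0, 2)
```

Multiple-point compression generalizes this: several points are encoded together so that even the per-point parity overhead can be reduced.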
High-Performance Computing Unit Design for an On-Device Convolutional Neural Network Accelerator
Thesis (Ph.D.) -- Graduate School of Seoul National University: College of Engineering, Department of Electrical and Computer Engineering, August 2020. Advisor: Taewhan Kim.
Optimizing computing units for an on-device neural network accelerator can bring lower energy and latency, higher throughput, and might enable unprecedented new applications. This dissertation studies two specific optimization opportunities of the multiply-accumulate (MAC) unit for on-device neural network accelerators that stem from precision quantization methodology.
Firstly, we propose an enhanced MAC processing unit structure that efficiently processes mixed-precision models whose operations are mostly in low precision. Precisely, the two essential contributions are: (1) a MAC unit structure supporting two precision modes is designed to fully utilize its computation logic when processing lower-precision data, which brings more computation efficiency for mixed-precision models whose major operations are in lower precision; (2) for a set of input CNNs, we formulate the exploration of the size of the single internal multiplier in the MAC unit to derive an economical instance, in terms of computation and energy cost, of the MAC unit structure across all network layers. Experimental results with two well-known CNN models, AlexNet and VGG-16, and two experimental precision settings showed that the proposed units can reduce computational cost per multiplication by 4.68~30.3% and save energy cost by 43.3% on average over conventional units.
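The dual-mode idea can be sketched in software: a narrow internal multiplier serves low-precision operands directly, and is reused on operand slices (recombined with shifts) for high-precision operands. The 8-bit multiplier width and the slicing scheme below are my own illustrative assumptions, not the dissertation's circuit:

```python
MUL_BITS = 8                      # assumed width of the single internal multiplier
MASK = (1 << MUL_BITS) - 1

def narrow_mul(a, b):
    """Stand-in for the hardware's internal 8x8 multiplier."""
    assert 0 <= a <= MASK and 0 <= b <= MASK
    return a * b

def mac_mixed(acc, x, w, bits):
    """Dual-mode MAC sketch: an 8-bit multiply uses the internal multiplier
    once; a 16-bit multiply reuses it four times on 8-bit operand slices and
    recombines the partial products with shifts."""
    if bits <= MUL_BITS:
        return acc + narrow_mul(x, w)
    xl, xh = x & MASK, x >> MUL_BITS      # low/high slices of each operand
    wl, wh = w & MASK, w >> MUL_BITS
    prod = (narrow_mul(xh, wh) << (2 * MUL_BITS)) \
         + ((narrow_mul(xh, wl) + narrow_mul(xl, wh)) << MUL_BITS) \
         + narrow_mul(xl, wl)
    return acc + prod

print(mac_mixed(0, 300, 500, 16) == 300 * 500)  # → True
```

In low-precision mode such a unit issues one multiply per cycle on the same logic that the high-precision mode occupies for four, which is the source of the efficiency gain for mixed-precision models.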
Secondly, we propose an acceleration technique for processing multiplication operations using stochastic computing (SC). MUX-FSM based SC, which employs a MUX controlled by an FSM to generate a bit sequence of a binary number to count up for a MAC operation, considerably reduces the hardware cost of implementing MAC operations compared with traditional stochastic number generator (SNG) based SC. Nevertheless, even though it offers a very economical hardware implementation, the existing MUX-FSM based SC still does not meet the multiplication processing time required for wide adoption of on-device neural networks in practice. Also, conventional enhancements have limitations such as sub-maximal cycle reduction and parameter conversion cost. This work proposes a solution to the problem of further speeding up conventional MUX-FSM based SC. Precisely, we analyze the bit counting pattern produced by the MUX-FSM and replace the counting redundancy with a shift operation, significantly reducing the length of the required bit sequence and theoretically speeding up the worst-case multiplication processing time by 2X or more. Through experiments, it is shown that our enhanced SC technique shortens the average processing time by 38.8% over the conventional MUX-FSM based SC.
1 INTRODUCTION 1
1.1 Neural network accelerator and its optimizations 1
1.2 Necessity of optimizing computational block of neural network accelerator 5
1.3 Contributions of This Dissertation 7
2 MAC Design Considering Mixed Precision 9
2.1 Motivation 9
2.2 Internal Multiplier Size Determination 14
2.3 Proposed hardware structure 16
2.4 Experiments 21
2.4.1 Implementation of Reference MAC units 23
2.4.2 Area, Wirelength, Power, Energy, and Performance of MAC units for AlexNet 24
2.4.3 Area, Wirelength, Power, Energy, and Performance of MAC units for VGG-16 31
2.4.4 Power Saving by Clock Gating 35
3 Speeding up MUX-FSM based Stochastic Computing Unit Design 37
3.1 Motivations 37
3.1.1 MUX-FSM based SC and previous enhancements 42
3.2 The Proposed MUX-FSM based SC 48
3.2.1 Refined Algorithm for Stochastic Computing 48
3.3 The Supporting Hardware Architecture 55
3.3.1 Bit Counter with shift operation 55
3.3.2 Controller 57
3.3.3 Combining with preceding architectures 58
3.4 Experiments 59
3.4.1 Experiments Setup 59
3.4.2 Generating input bit selection pattern 60
3.4.3 Performance Comparison 61
3.4.4 Hardware Area and Energy Comparison 63
4 CONCLUSIONS 67
4.1 MAC Design Considering Mixed Precision 67
4.2 Speeding up MUX-FSM based Stochastic Computing Unit Design 68
Abstract (In Korean) 73
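The counting-versus-shifting trade-off behind the second contribution of the dissertation above can be illustrated with a toy cycle-count model. Both functions below compute the same product x*w; the cycle accounting is my own simplification (one cycle per counted '1' versus one cycle per shift), not the dissertation's FSM:

```python
def unary_count_cycles(x, w, bits=8):
    """Toy model of pure bit-counting: every '1' contributing to the
    product costs one counting cycle, so identical bit patterns are
    re-counted at every binary weight of w."""
    cycles, acc = 0, 0
    for i in range(bits):
        if (w >> i) & 1:
            acc += x << i
            cycles += x << i          # one cycle per counted '1'
    return acc, cycles

def shift_count_cycles(x, w, bits=8):
    """Counting redundancy replaced by shifts (Horner evaluation over the
    bits of w, MSB first): each weight bit costs one shift cycle, and x is
    counted once per set bit regardless of its binary weight."""
    cycles, acc = 0, 0
    for i in range(bits):
        acc <<= 1
        cycles += 1                   # a shift is a single cycle
        if (w >> (bits - 1 - i)) & 1:
            acc += x
            cycles += x               # count x once for this set bit
    return acc, cycles

x, w = 200, 170                       # 8-bit operands, w = 0b10101010
prod_u, cyc_u = unary_count_cycles(x, w)
prod_s, cyc_s = shift_count_cycles(x, w)
print(prod_u == x * w, prod_s == x * w)  # → True True
print(cyc_u, cyc_s)                      # → 34000 808
```

The shift-based variant bounds the cycle count by bits + popcount(w)*x rather than x*w, which mirrors the dissertation's claim of a large worst-case speedup from removing counting redundancy.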
- โฆ