6 research outputs found

    Fast Inner Product Computation on Short Buses

    We propose a VLSI inner product processor architecture that broadcasts only over short buses (each containing fewer than 64 switches). The architecture leads to an efficient algorithm for inner product computation. Specifically, it takes 13 broadcasts, each over fewer than 64 switches, plus 2 carry-save additions (t_csa) and 2 carry-lookahead additions (t_cla) to compute the inner product of two arrays of N = 2^9 elements, each consisting of m = 64 bits. Using the same order of VLSI area, our algorithm runs faster than the best known fast inner product algorithm of Smith and Torng [Design of a fast inner product processor, Proceedings of IEEE 7th Symposium on Computer Arithmetic (1985)], which takes about 28 t_csa + t_cla for the computation.
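The cost figures in this abstract count carry-save and carry-lookahead additions separately, which reflects a standard arithmetic idea: partial products are reduced with carry-save adders, which have no carry propagation, and only the final pair of words goes through a carry-propagating adder. The Python sketch below illustrates that idea only. It is not the short-bus architecture of the paper, and the names `carry_save_add`, `partial_products`, and `inner_product_csa` are hypothetical.

```python
def carry_save_add(a, b, c):
    """3:2 compressor: reduce three non-negative integers to a (sum, carry) pair.

    Each bit position is handled independently, so no carry ripples across
    the word; sum + carry equals a + b + c.
    """
    partial_sum = a ^ b ^ c                       # per-bit sum, carries ignored
    carry = ((a & b) | (a & c) | (b & c)) << 1    # per-bit majority, shifted left
    return partial_sum, carry


def partial_products(x, y):
    """Yield the shifted copies of x selected by the set bits of y."""
    shift = 0
    while y:
        if y & 1:
            yield x << shift
        y >>= 1
        shift += 1


def inner_product_csa(xs, ys):
    """Inner product of two equal-length sequences of non-negative integers.

    Every partial product is folded into a redundant (sum, carry) pair with
    carry-save steps. The single ordinary addition at the end stands in for
    the carry-propagating (for example carry-lookahead) adder.
    """
    if len(xs) != len(ys):
        raise ValueError("arrays must have the same length")
    acc_sum, acc_carry = 0, 0
    for x, y in zip(xs, ys):
        for pp in partial_products(x, y):
            acc_sum, acc_carry = carry_save_add(acc_sum, acc_carry, pp)
    return acc_sum + acc_carry    # the only carry-propagating addition


if __name__ == "__main__":
    xs = [3, 5, 7, 11]
    ys = [2, 4, 6, 8]
    print(inner_product_csa(xs, ys) == sum(x * y for x, y in zip(xs, ys)))
```

The loop here applies the carry-save steps one after another for readability; hardware would typically arrange them as a tree so that the depth, not the count, determines the delay.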

    Xeon Phi๋ฅผ ํ™œ์šฉํ•œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ

    Thesis (Master's) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, 2015. 8. Sungroh Yoon (윤성로).
    This thesis presents the main issues in designing and implementing accelerated algorithms on Xeon Phi, Intel's coprocessor for parallel computation, together with the parallelization models that can be implemented on it. Effective parallel programming requires the developer to understand in detail the architectural characteristics and the programming model of the target accelerator, to identify which parallelization models the accelerator can implement, and to apply them effectively to the target application. The thesis therefore first describes the architectural characteristics and programming model of Xeon Phi that a programmer needs to know. It also points out, from the programmer's perspective, the differences from GPGPU (General-Purpose computing on Graphics Processing Units), so that developers already familiar with GPGPU can adapt easily to Xeon Phi. It then defines the parallelization models that can be implemented on Xeon Phi and presents them with concrete cases. The Strassen-Winograd algorithm and the SHRINK (SHaRed-memory SLINK) algorithm, both previously implemented for multicore CPUs (Central Processing Units), were implemented on Xeon Phi, and experiments demonstrated the resulting speedup. These results confirm that Xeon Phi can implement not only the same forms of parallelization model as GPGPU but also models that are difficult to implement effectively on GPGPU. Along with these concrete cases, the thesis defines two parallelization models that can also be applied to accelerate applications not covered in it, so that developers can apply Xeon Phi effectively to improve the performance of a given application.
    Contents: Chapter 1. Introduction; Chapter 2. Introduction to Xeon Phi (Section 1. Overview and background; Section 2. Architectural characteristics; Section 3. Programming model; Section 4. Characteristics compared with GPGPU); Chapter 3. Acceleration with Xeon Phi (Section 1. Accelerating the Simple model: 1. Introduction to the Strassen-Winograd algorithm, 2. Accelerating the algorithm, 3. Applying Xeon Phi; Section 2. Accelerating the Complex model: 1. Introduction to the SHRINK (SHaRed-memory SLINK) algorithm, 2. Applying Xeon Phi, 3. Hybrid SHRINK); Chapter 4. Experiments (Section 1. Accelerating the Strassen-Winograd algorithm: 1. Experimental environment, 2. Experimental results; Section 2. Accelerating the SHRINK algorithm: 1. Experimental environment, 2. Experimental results); Chapter 4. Conclusion; References; Abstract.
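For readers unfamiliar with the Strassen-Winograd algorithm named in this entry, the following is a minimal sequential sketch of the Winograd variant of Strassen's matrix multiplication (7 block products and 15 block additions per recursion level). It assumes square matrices whose size is a power of two, stored as plain Python lists. It does not reproduce the thesis's Xeon Phi implementation, and all function names are hypothetical.

```python
def add(X, Y):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(X, Y)]


def sub(X, Y):
    """Element-wise difference of two equally sized matrices."""
    return [[a - b for a, b in zip(r, s)] for r, s in zip(X, Y)]


def split(M):
    """Split a 2h x 2h matrix into its four h x h quadrants."""
    h = len(M) // 2
    return ([row[:h] for row in M[:h]], [row[h:] for row in M[:h]],
            [row[:h] for row in M[h:]], [row[h:] for row in M[h:]])


def join(C11, C12, C21, C22):
    """Reassemble four quadrants into one matrix."""
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom


def strassen_winograd(A, B):
    """Multiply two n x n matrices, n a power of two, with 7 recursive products."""
    if len(A) == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)

    # 8 pre-additions
    S1 = add(A21, A22); S2 = sub(S1, A11); S3 = sub(A11, A21); S4 = sub(A12, S2)
    T1 = sub(B12, B11); T2 = sub(B22, T1); T3 = sub(B22, B12); T4 = sub(T2, B21)

    # 7 block products (mutually independent)
    P1 = strassen_winograd(A11, B11)
    P2 = strassen_winograd(A12, B21)
    P3 = strassen_winograd(S4, B22)
    P4 = strassen_winograd(A22, T4)
    P5 = strassen_winograd(S1, T1)
    P6 = strassen_winograd(S2, T2)
    P7 = strassen_winograd(S3, T3)

    # 7 post-additions
    U2 = add(P1, P6); U3 = add(U2, P7); U4 = add(U2, P5)
    C11 = add(P1, P2)
    C12 = add(U4, P3)
    C21 = sub(U3, P4)
    C22 = add(U3, P5)
    return join(C11, C12, C21, C22)


if __name__ == "__main__":
    print(strassen_winograd([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The seven products at each level do not depend on one another, which is one reason the algorithm is a common candidate for parallel execution. Practical implementations usually switch to a conventional multiplication below some block size instead of recursing down to 1 x 1.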

    Ultra high-speed transaxial image reconstruction of the heart, lungs, and circulation via numerical approximation methods and optimized processor architecture

    A high temporal resolution scanning multiaxial tomography unit, the Dynamic Spatial Reconstructor (DSR), presently under development, will be capable of recording multiangular X-ray projection data of sufficient axial range to reconstruct a cylindrical volume consisting of up to 240 contiguous 1-mm thick cross sections encompassing the intact thorax. At repetition rates of up to 60 sets of cross sections per second, the DSR will thus record projection data sufficient to reconstruct as many as 14,400 cross-sectional images during each second of operation. Use of this system in a clinical setting will depend upon the development of software and hardware techniques for carrying out X-ray reconstructions at the rate of hundreds of cross sections per second. A conceptual design, with several variations, is proposed for a special purpose hardware reconstruction processor capable of completing a single cross section reconstruction within 1 to 2 msec. In addition, it is suggested that the amount of computation required to execute the filtered back-projection algorithm may be decreased significantly by the utilization of approximation equations, formulated as recursions, for the generation of internal constants required by the algorithm. The effects on reconstructed image quality of several different approximation methods are investigated by reconstruction of density projections generated from a mathematically simulated model of the human thorax, assuming the same source-detector geometry and X-ray flux density as will be employed by the DSR. These studies have indicated that the prudent application of numerical approximations for the generation of internal constants will not cause significant degradation in reconstructed image quality and will in fact require substantially less auxiliary memory and computational capacity than required by direct execution of mathematically exact formulations of the reconstruction algorithm.
    Peer Reviewed
    http://deepblue.lib.umich.edu/bitstream/2027.42/23631/1/0000595.pd
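The abstract does not state which internal constants are generated recursively, so the sketch below is only an illustration of the general idea under a loud assumption: a simple parallel-beam geometry in which back-projection needs, for each pixel (x, y), the detector coordinate t = x cos(theta) + y sin(theta). Along a row of evenly spaced pixels this value can be produced by a recursion that costs one addition per pixel. The geometry, the function names, and the parameters are all hypothetical and are not taken from the paper, whose DSR geometry differs.

```python
import math


def detector_coords_direct(xs, y, theta):
    """Evaluate t = x*cos(theta) + y*sin(theta) independently for each x."""
    c, s = math.cos(theta), math.sin(theta)
    return [x * c + y * s for x in xs]


def detector_coords_recursive(x0, dx, count, y, theta):
    """Same coordinates for x = x0 + i*dx, using t[i+1] = t[i] + dx*cos(theta)."""
    step = dx * math.cos(theta)                        # constant increment per pixel
    t = x0 * math.cos(theta) + y * math.sin(theta)     # starting value for the row
    coords = []
    for _ in range(count):
        coords.append(t)
        t += step    # one addition replaces a multiply-add per pixel
    return coords


if __name__ == "__main__":
    xs = [-2.0 + 0.5 * i for i in range(9)]
    direct = detector_coords_direct(xs, 1.5, 0.3)
    recursive = detector_coords_recursive(-2.0, 0.5, 9, 1.5, 0.3)
    print(max(abs(a - b) for a, b in zip(direct, recursive)))
```

In floating point the recursive form accumulates rounding error along the row, so it trades a small loss of exactness for less arithmetic, which is the kind of trade-off the abstract evaluates against image quality.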