




















Uniﬁed Device Architecture)??? GPU(Graphic Processing Unit)??????????????










??????????????? 1970??? Illyac IV????????????????????
??????????????????1990??? CM-5? CRAY T3D????????????
?? [2]???? IBM? BlueGene [3]???????????? TOP500????????????
???Google???????????????????????????????????????
MapReduce????????????????????? [4]?????????????? LAN?










? (Heterogeneous)????Intel Core 2? Sparc Niagara????????????????????
Cell B.E. (Cell Broadband Engine) [7]?????????????GPU (Graphic Processing Unit)??






































































































POSIX Thread?????????????????????? pthread_create??? pthread_join?
?????
#include <pthread.h>
int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);
?????????arg ?? 1 ????? start_routine ??????????????????





int pthread_join(pthread_t th, void **thread_return);
pthread_join????????????????????th?????????? pthread_exit??























???????? i?????????? id??? id[i]?????????? sub????????
printf??????????
???????????? pthread_create?????????












int x[100];  // shared data array
int ps[4];   // partial sum computed by each thread
typedef struct _parg_t{
int id;      // thread id












































int x[100];  // shared data array
int ps[4];   // partial sum computed by each thread
int total=0;
typedef struct _parg_t{
int id;      // thread id








































2. Thread 1 reads total (total=0)
3. Thread 1 adds ps[1] (=100) (total=100)
4. Thread 1 writes total (total=100)
5. Thread 2 reads total (total=100)
6. Thread 2 adds ps[2] (=200) (total=300)
7. Thread 2 writes total (total=300)
However, the operations may also be interleaved as follows:
1. total=0 initially
2. Thread 1 reads total (total=0)
3. Thread 1 adds ps[1] (=100) (total=100)
4. Thread 2 reads total (total=0)
5. Thread 1 writes total (total=100)
6. Thread 2 adds ps[2] (=200) (total=200)











int x[100];  // shared data array
int ps[4];   // partial sum computed by each thread
int total=0;
typedef struct _parg_t{
int id;      // thread id































Intel Xeon? 2????????????????? 1??????
Table 1: Experimental environment
CPU                   Processors   Cores per processor   Memory
Xeon E5335 (2.0GHz)   2            4                     4GB
????????????? 8????????????????? 8????????????





























? Intel Harpertown (Xeon)? Linpack Peak? 102 GFLOPS????????NVIDIA GeForce GTX 200
GPU????????? 1TFLOPS (=1000 GFLOPS)?????????? GPU????????
??????????????????????????????????????????????
???????????????????????NVIDIA?GeForce8??????????GPU











? API??????????????????? GeForce 8??? NVIDIA? GPU????
GPU????????????global memory?constant memory?texture memory?shared memory?
local memory? 5????????8?? SP (Streaming Processor)???MP (Multi Processor)??
??????????????????????????????? local memory????????
??????????????????? shared memory???????????????????
????????????CPU?GPU? global memory?????????????? local memory




1. CPU?????? GPU? global memory???????
2. GPU?????
3. GPU???????





__global__ void cuda002Kernel( float* g_idata, float* g_odata)
{
    const unsigned int tid = threadIdx.x; // get the thread ID
    g_odata[tid] = g_idata[tid];          // copy one input element to the output
}
int main( int argc, char** argv)
{
    CUT_DEVICE_INIT(); // initialize the device
    // allocate two arrays of 100 floats in host memory
    float* h_idata = (float*) malloc(sizeof( float) * 100);
    float* h_odata = (float*) malloc(sizeof( float) * 100);





    float *d_idata, *d_odata; // device (GPU) pointers
    cudaMalloc( (void**) &d_idata, sizeof( float) * 100);
    cudaMalloc( (void**) &d_odata, sizeof( float) * 100);
    // copy the input data from host memory to GPU global memory
    cudaMemcpy( d_idata, h_idata, sizeof( float) * 100,
                cudaMemcpyHostToDevice);
    // set up and launch the kernel on the GPU
    dim3 grid( 1, 1, 1);     // a (1,1,1) grid
    dim3 threads(100, 1, 1); // 100 threads
    cuda002Kernel<<< grid, threads>>>( d_idata, d_odata);
    // copy the result from the GPU back to host memory





??? CPU???h_idata? h_odata????malloc??h_idata?????????????GPU
????????? d_idata? d_odata?????????????1???????? 100?????
??????????????????????cuda002Kernel????????????????






Table 2: NVIDIA 8600 GT specifications
no. of MP   Shader clk.   Core clk.
4           1180MHz       540MHz
????????????? (VQ)????????????????PNN (Pairwise Nearest Neigh-






????????? GPU? NVIDIA 8600 GT???????? 2?????????8600 GT?
???????? GPU? 8600 GT?????????????????????
8600 GT ???MP (Multi Processor) ? 4 ?????MP ?? 8 ?? SP ???????? 32 ?
?????????????? 3 ??? 1 ?????????????????????? 51.8










































[Figure 5: 8600 GT. Legend: 256 threads, 4 grids; 128 threads, 4 grids; 64 threads, 4 grids; 16 threads, 4 grids]
? 5?? 1?????????????grid???? CUDA?????????????????
8600 GT?? 4??MP??????grid? 4???????????? (? 5?? grid? 4???
???)???????????????????? (T)??????? CPU??????????






????? distance????????????? global memory????????????????
?????????????????????????global memory?????????????
????????????????????????

















[Figure 6: shared memory. Legend: 256 threads, 4 grids; 128 threads, 4 grids; 64 threads, 4 grids; 16 threads, 4 grids]
??????????????????????????MP?????????????????









[24]??????????????????????????Cell B.E. (Cell Broadband Engine)??
PLAYSTATION 3 (PS3)???? SONY?IBM????????????????????????
????????????????????????????? [7]?Cell B.E.?? Power6????

















??????256 Mbyte?XDR DRAM??????????????????Cell B.E.??Power6
??? PPE? 8?? SPE????????????????????????????SPE????


















? DMA????????????????????????? 7????? DMA????????
spu_mfcdma64(&x[0], mfc_ea2h(ea),
mfc_ea2l(ea),size, tag, MFC_GET_CMD); // (1)




MFC_GET_CMD????? LS??????MFC_PUT_CMD????? LS??????????
?????????(2)??????“tag”??????????????????????????
(3)?????????????????(3)?????DMA??????????????????
??(1), (2), (3)?????????????(1)? (2)????? SPE??????????DMA?
?? SPE??????????????tag???????????????DMA???????
???????????










ret = spe_context_run(arg->spe, &entry, 0,





prog = spe_image_open(spefile); //(a)
spe = spe_context_create(0, NULL); //(b)
spe_program_load(spe, prog); //(c)










????? (a)? spe_image_open??? SPE?????????????????????????
??????(b)? spe_context_create??? SPE????????????(c)? spe_program_load?
?? SPE?????????????????????????????????????????
????? (e)? spe_context_run??????????????????????????????
????????????????????? SPE??????????(d)???? pthread_create
?????????????????????????(f)? SPE?????????
???SPE????????????????????
int main(unsigned long long spe, unsigned long long argp)
{
/* ???????????????? */
/* DMA Transfer 1 : GET input parameters */
spu_mfcdma64(&saxpy_params, mfc_ea2h(argp), mfc_ea2l(argp),
sizeof(saxpy_params_t), tag, MFC_GET_CMD); // (a)








/* DMA Transfer 2 : PUT results */
spu_mfcdma64(&ly[0], mfc_ea2h(ea), mfc_ea2l(ea),
sizeof(float)*BSIZE, tag, MFC_PUT_CMD); // (d)
242 ????





(a), (b), (c)???????? LS????????????(a)?MFC_GET_CMD????????
?????DMA???????(b)??? (c)??????????????????SPE????
??????????? (d), (e), (f)???? LS???? PPE?????????????????
??MFC_PUT_CMD????????????????
3.2.5 ???
???? BLAS level 1 ? SAXPY?????????Cell B.E.????????????DMA
??????????????????? [26]????????????? 3??????
Table 3: Experimental environment
platform   PlayStation 3
CPU        Cell B.E. (3.02GHz)
memory     256 Mbyte
HDD        60 Gbyte




level 1, 2, 3????????level 1????? SAXPY?????
\[ y_i = \alpha \cdot x_i + y_i \quad (0 \le i < n) \tag{1} \]
??????????????????????????????????????????????
?????????????????????????????????Cell B.E.?? PPE?? SPE












? 830 MFLOPS??????????????? 105????VMX????? 355 MFLOPS??















































































[Figure 8 (DMA size = 16000 bytes). Panels: (c) size = 10^6, (d) size = 10^7. Legend: SPE with full overlap, PPE, PPE with VMX]
?????? 10^6???SPE1???? 2?????PPE??????????????????













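\[
\begin{aligned}
y_i ={} & \alpha_1 \cdot x_i + \alpha_1 \cdot y_i + \alpha_2 \cdot x_i + \alpha_2 \cdot y_i \\
        & + \alpha_3 \cdot x_i + \alpha_3 \cdot y_i + \alpha_4 \cdot x_i + \alpha_4 \cdot y_i \\
        & + x_i \cdot y_i \qquad (0 \le i < n)
\end{aligned}
\tag{2}
\]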
??? SAXPY??????????????????? iteration???????????? 2?



















[Figure 9: size = 10^7, DMA size = 16000 bytes. Legend: SPE with full overlap, PPE, PPE with VMX]
??? SAXPY????????SPE??????????????????????????
SPE??????SPE? 6?? 9.7 GFLOPS?????????? DMA????????????





DMA????????????????SPE1?? 10^7???? SAXPY????????? 551
MFLOPS???????????????DMA???????????? 782 MFLOPS?????
????????????? 998 MFLOPS??????? 2?????????????????
????? 9???????????????? 2.4 GFLOPS???????????????DMA




















[1] David A. Patterson, John L. Hennessy, Peter J. Ashenden and James R. Larus, Computer Organization
and Design: The Hardware/Software Interface. Morgan Kaufmann Pub., 2004.
[2] David E. Culler, Jaswinder Pal Singh and Anoop Gupta, Parallel Computer Architecture: A Hardware/Software
Approach. Morgan Kaufmann Pub., 1998.
[3] Blue Gene,
http://www.research.ibm.com/bluegene/
[4] MapReduce: Simplified Data Processing on Large Clusters,
http://labs.google.com/papers/mapreduce.html
[5] Message Passing Interface Forum,
http://www.mpi-forum.org/
[6] David A. Patterson and John L. Hennessy, Computer Architecture: A Quantitative Approach. Morgan
Kaufmann Pub., 2006.
[7] Cell Broadband Engine,
http://cell.scei.co.jp/index_e.html
[8] Hubert Nguyen, GPU Gems 3, Addison-Wesley Pub., 2007.
[9] Geraint Jones and Michael Goldsmith, Programming in Occam 2. Prentice Hall, 1988.
[10] Yuri Dotsenko, Cristian Coarfa and John Mellor-Crummey, “A Multi-platform Co-Array Fortran Compiler,”
in Proc. of the 13th International Conference of Parallel Architectures and Compilation Techniques,
pp. 29-40, 2004.
[11] Michael Joseph Wolfe, High Performance Compilers for Parallel Computing. Addison-Wesley, 1995.
[12] B. Nichols, D. Buttlar and J. Farrell, Pthreads Programming: A POSIX Standard for Better Multiprocessing.
O’Reilly Media, 1996.
[13] Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, Ramesh Menon and Rohit Chandra,
Parallel Programming in OpenMP. Morgan Kaufmann Pub., 2000.
[14] MPICH-A Portable Implementation of MPI,
http://www-unix.mcs.anl.gov/mpi/mpich1/






[18] Peter Messmer, Paul J. Mullowney and Brian E. Granger, “GPULib: GPU Computing in High-Level
Languages,” IEEE Computing in Science & Engineering, vol. 10, no. 5, pp. 70-73, 2008.
[19] W. Equitz, “A New Vector Quantization Clustering Algorithm,” IEEE Trans. on Acoustics, Speech and
Signal Processing, vol. 37, no. 10, pp. 1568-1575, 1989.
[20] A. Gersho and R. Gray, Vector Quantization and Signal Compression. Kluwer Academic Pub., 1992.
[21] Jeng-Shyang Pan, Zhe-Ming Lu, and Sheng-He Sun, “An Efficient Encoding Algorithm for Vector
Quantization Based on Subvector Technique,” IEEE Trans. on Image Processing, vol. 12, no. 3, pp.
265-270, 2003.
[22] R.M. Gray, “Vector Quantization,” IEEE ASSP Magazine, vol. 1, pp. 4-29, 1984.
[23] Y. Linde, A. Buzo, and R. M. Gray, “An Algorithm for Vector Quantizer Design,” IEEE Trans.
Commun., vol. 28, no. 1, pp. 84-95, 1980.
[24] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob,
S. Jain, S. Venkataraman, Y. Hoskote and N. Borkar, “An 80-tile 1.28TFLOPS Network-on-chip in
65nm CMOS,” in Proc. ISSCC2007, p. 98, 2007.
[25] Jakub Kurzak, Alfredo Buttari, Piotr Luszczek and Jack Dongarra, “The PlayStation 3 for High-Performance
Scientific Computing,” IEEE Computing in Science & Engineering, vol. 10, no. 3, pp.
84-87, 2008.
[26] BLAS (Basic Linear Algebra Subprograms),
http://www.netlib.org/blas/
[27] A. Wakatani, “A Parallel and Scalable Algorithm for ADI Method with Pre-propagation and Message
Vectorization,” Parallel Computing, vol. 30, no. 12, pp. 1345-1359, 2004.
