Research on inter prediction algorithms and architectures for high-performance video codec VLSI by Zhou Jinjia
  
 
Graduate School of Information, Production and Systems, Waseda University
 
 
 
 
Summary of Doctoral Dissertation
 
 
 
 
 
 
 
 
Dissertation Title
 
Research on Inter Prediction Algorithms and 
Architectures for High-Performance Video Codec VLSI 
 
 
 
 
 
 
 
 
 
 
 
 
Applicant
 
Jinjia ZHOU 
 
Major in Information, Production and Systems Engineering
Multimedia Systems Research
 
 
 
December 2012
    While 1080 HD is the current standard of mainstream video
applications such as TV broadcasting, even higher specifications such as
4K Ultra HD have been targeted by next-generation applications. To store
and transmit this massive video content, video compression is
indispensable. From MPEG-1/2/4 to H.264/AVC and HEVC, continuous
innovation in this area has been a significant stimulus for the
popularization of multimedia in modern life. In almost all of these
compression standards, inter prediction is one of the most important
coding tools; it contributes significantly to coding efficiency by
exploiting the temporal data redundancy between neighboring frames. At
the same time, inter prediction also involves high complexity and huge
memory bandwidth. In the video encoder and decoder, inter prediction is
realized by motion estimation (ME) and motion compensation (MC),
respectively.
    ME searches neighboring reference frames to find the pixel blocks
that best match the blocks in the current frame. As a result, only the
blocks' differences along with a set of displacements called motion
vectors (MVs) are required to encode the current frame. Most hardware ME
architectures, especially those implemented in recently published video
encoder chips (Y.K. Lin; ISSCC2008, L.F. Dong; ISSCC2009), are based on
full search or modified versions of full search. In order to find the best
matching block, full search ME checks all points in the search area, which
leads to huge computational complexity. In this dissertation, we present
the alternating asymmetric search range assignment (AASRA) schemes,
including AASRA-B, AASRA-P and AASRA-PB, to reduce the complexity of
full search ME while maintaining coding performance.
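    The cost of full search described above can be illustrated with a
minimal sketch. It assumes a square search window of +/-search_range
pixels and the sum of absolute differences (SAD) as the matching
criterion; the function names and the frame layout (lists of rows) are
hypothetical and are not taken from the dissertation.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the n x n current block at (bx, by) and the
    reference block displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, search_range):
    """Check every displacement within +/-search_range that keeps the
    candidate block inside the reference frame.
    Returns (best_mv, best_sad, points_checked)."""
    h, w = len(ref), len(ref[0])
    best_mv, best_cost, checked = None, None, 0
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            inside = (0 <= bx + dx and bx + dx + n <= w and
                      0 <= by + dy and by + dy + n <= h)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            checked += 1
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost, checked
```

    The number of checked points grows as (2 * search_range + 1)^2, which
is why shrinking the search range in one reference direction reduces the
work so strongly.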
    MC utilizes MVs to locate the matching blocks, and then compensates
the blocks' differences to decode the current blocks. VLSI design for MC is
challenged by the computational complexity of fractional pixel
interpolation, the high memory bandwidth of reference frame retrieval
and the long latency of external memory systems. Many works (D. Zhou;
ISCAS2007, S. Wang; ISCAS2005, V. Sze; JSSC2009) have been done to
reduce the complexity of parallelized interpolation, and many
architectures have been designed (X. Chen; IEICE2009, T. Chuang;
ICASSP2009) to reduce the memory bandwidth. In this dissertation, three
algorithms are proposed to obtain more efficient interpolation and less
on-chip memory bandwidth than the previous works.
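    To give a sense of what fractional pixel interpolation computes, the
sketch below evaluates one horizontal half-sample position with the 6-tap
luma filter of H.264/AVC. This summary does not state which filter the
proposed interpolator implements, so the choice of the H.264/AVC filter,
the edge padding by index clamping and the function names are all
assumptions made for illustration.

```python
TAPS = (1, -5, 20, 20, -5, 1)  # H.264/AVC luma half-sample taps (sum = 32)


def clip8(v):
    """Clip to the 8-bit sample range."""
    return max(0, min(255, v))


def half_pel_row(row, x):
    """Half-sample value between row[x] and row[x + 1].
    Samples outside the row are padded by clamping the index
    (an assumption for this sketch)."""
    acc = 0
    for k, tap in enumerate(TAPS):
        idx = min(max(x - 2 + k, 0), len(row) - 1)
        acc += tap * row[idx]
    # Round, divide by 32, then clip to 8 bits.
    return clip8((acc + 16) >> 5)
```

    Each output sample needs six input samples, and two-dimensional
positions need both a horizontal and a vertical pass, which is the
workload a parallel interpolator has to sustain.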
    In the latest standards, MVs are also predicted and compressed for bit
rate saving. Hence MV decoding becomes an important component of the
video decoder, restoring the current MVs from adjacent MVs and MV
differences. This also involves considerable memory bandwidth. Moreover,
due to the flexible block size of inter prediction, the control complexity
of MV decoding is critical. Most of the previous works on MV calculation
architectures take a variable processing time for each macroblock (MB),
which results in high control complexity (K. Yoo; ICIP2008, H. Yin;
ICALIP2008). In order to decrease the control complexity, a dual-mode
based stable 64 cycles/MB pipeline is proposed. Moreover, a strategy for
reducing the memory bandwidth is also proposed in Chapter 3.
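    Restoring an MV from adjacent MVs and an MV difference can be
sketched as follows. The sketch assumes the component-wise median
predictor over the left, top and top-right neighbors, which is the default
case in H.264/AVC; the special cases for particular partition shapes and
unavailable neighbors are omitted, and the function names are
hypothetical.

```python
def median3(a, b, c):
    """Median of three integers."""
    return sorted((a, b, c))[1]


def reconstruct_mv(left, top, top_right, mvd):
    """Rebuild the current MV from three neighboring MVs and the decoded
    MV difference. All MVs are (x, y) tuples."""
    pred = (median3(left[0], top[0], top_right[0]),
            median3(left[1], top[1], top_right[1]))
    return (pred[0] + mvd[0], pred[1] + mvd[1])
```

    Because the neighbors depend on the partition size of both the
current and the surrounding blocks, the neighbor selection logic, rather
than the arithmetic, dominates the control complexity.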
    In this dissertation, efficient algorithms and architectures are
proposed to reduce the computational complexity and memory bandwidth
of inter prediction, and consequently decrease the area cost and memory
power consumption of video codec VLSI.
    This dissertation consists of the following 6 chapters.
    Chapter 1 [Introduction] introduces the background knowledge of
video compression, and the main challenges for inter prediction. The main
contributions and an overview of this dissertation are also presented in
this chapter.
    Chapter 2 [High-Throughput Motion Compensation] presents a 16-65
cycles/MB high-throughput motion compensation (MC) architecture.
    1) A high-performance interpolator parallelizes the horizontal and
vertical filtering to efficiently increase the processing throughput to
over 4 times that of the previous designs of S. Wang (ISCAS2005) and D.
Zhou (ISCAS2007). Compared with the vertical expansion architecture
of V. Sze (JSSC2009), this work also increases the throughput to 2 times.
    2) An efficient cache memory organization scheme (4Sx4) splits the
memory width and stores the data in an interlaced way to improve the
on-chip memory utilization. As a result, it contributes a memory area
saving of 25% and a memory power saving of 39%~49%, compared with X.
Chen's work (IEICE2009).
    3) A Split Task Queue (STQ) architecture separates the task storing
queues into two stages of the pipeline to hide the memory latency and
reduce pipeline stalls. Consequently, the cache idle time is reduced by
90%, which contributes to reducing the overall processing time by
24%~40%.
    Experimental results show this design is capable of real-time
4kx2k@60fps decoding at 166MHz, with 108.8k logic gates and 3.1kB
on-chip memory. Compared with the current 4kx2k@24fps decoder chip (T.
Chuang; ISSCC2010), this work increases the throughput to 2.5 times.
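    The relation between frame size, frame rate, clock frequency and the
cycles/MB figures quoted above can be checked with a short calculation.
The sketch assumes that "4kx2k" means 4096x2160 luma samples and that an
MB is 16x16; neither is stated explicitly in this summary, and the function
name is hypothetical.

```python
def cycles_per_mb_budget(width, height, fps, clock_hz, mb_size=16):
    """Average number of clock cycles available per macroblock
    for real-time decoding."""
    mbs_wide = -(-width // mb_size)   # ceiling division
    mbs_high = -(-height // mb_size)
    return clock_hz / (mbs_wide * mbs_high * fps)


# Assumed 4096x2160 at 60 fps with a 166 MHz clock.
budget = cycles_per_mb_budget(4096, 2160, 60, 166e6)
print(round(budget, 2))

# Frame-rate ratio behind the 2.5x throughput comparison.
print(60 / 24)
```

    Under these assumptions the budget is roughly 80 cycles per MB, so an
architecture that needs at most 65 cycles/MB leaves some margin at
166MHz.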
    Chapter 3 [Efficient Joint Parameter Decoder] presents a joint
parameter decoder to realize the calculation of motion vector, intra
prediction mode and boundary strength.
    1) A dual-mode pipeline scheme categorizes the various partition sizes
and prediction modes into two control modes. The pipeline is then
derived from the two simplified modes to increase system throughput and
reduce the control complexity. As a result, a constant throughput of
64 cycles/MB is obtained. The number of clock cycles required for
processing one MB is reduced by 75% from state-of-the-art works (K. Yoo;
ICIP2008, H. Yin; ICALIP2008).
    2) A three-step bandwidth reduction strategy is proposed to condense
the data. In step 1, a partition based storage format is applied to
condense the MB level data. In step 2, a variable length coding based
compression method is utilized to reduce the data size in each partition.
Finally, the total bandwidth is further reduced by combining the
co-located and last-line information. Consequently, 85%~98% bandwidth
saving is achieved.
    Experimental results show this design is capable of real-time
4kx2k@60fps decoding at 166MHz, with 37.2k logic gates. Compared with
K. Yoo (ICIP2008) and H. Yin (ICALIP2008), the processing time of our
design is reduced from 260 cycles/MB to 64 cycles/MB, with fewer logic
gates.
    Chapter 4 [Alternating Asymmetric Search Range Assignment for
Motion Estimation] presents alternating asymmetric search range
assignment (AASRA) schemes for motion estimation.
    1) AASRA-B uses a large and a small search range, respectively, for
the two reference directions of bidirectional ME. The assignment of these
two search ranges alternates between past and future references for each
MB/CTB, enabling ME in both directions to track high motion. AASRA-P
extends the application of AASRA to P frames, and AASRA-PB combines the
features of AASRA-B and AASRA-P. The three schemes can reduce ME
complexity by 46%, 46% and 70%, respectively, with a small coding
performance drop.
    2) The proposed AASRA schemes can also be combined with many
existing fast algorithms and architectures to achieve an incremental
reduction of complexity. Necessary adaptations of AASRA are proposed for
combination with the existing works. When combining Y. Lin's
(TCSVT2008) design with AASRA, the hardware cost can be reduced by
33%. When the architecture of C. Kao (TVLSI2010) is implemented with
AASRA, the throughput can be increased by 43%~64%.
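    The alternating assignment used by AASRA-B in item 1) can be sketched
as below. The summary only says that the large and small ranges alternate
between past and future references for each MB/CTB; switching on the
parity of the block index, the square search window and all function names
are assumptions made for illustration and may differ from the scheme in
the dissertation.

```python
def aasra_b_ranges(block_index, large, small):
    """Return (past_range, future_range) for one MB/CTB.
    Even-indexed blocks give the large range to the past reference,
    odd-indexed blocks give it to the future reference (assumed rule)."""
    if block_index % 2 == 0:
        return large, small
    return small, large


def search_points(search_range):
    """Candidate positions in a square +/-search_range window."""
    return (2 * search_range + 1) ** 2


def complexity_ratio(large, small):
    """Full-search points with asymmetric ranges, relative to using the
    large range in both directions."""
    asymmetric = search_points(large) + search_points(small)
    symmetric = 2 * search_points(large)
    return asymmetric / symmetric
```

    For example, aasra_b_ranges(0, 32, 8) gives the past reference the
large range and aasra_b_ranges(1, 32, 8) gives it to the future reference,
so each direction is searched widely on every other block. The range
values 32 and 8 are hypothetical.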
    Chapter 5 [Architectures Implemented in Video Decoder Chip]
summarizes the features of the inter prediction related components in the
implemented video decoder chips. The proposals in Chapter 2 and
Chapter 3 have been silicon proven in two decoder chips.
    Chapter 6 [Conclusion] concludes the contributions of this
dissertation.
