Audiovisual speech recognition by utilizing methods for automatic lipreading
Automatic lip reading has been developing at the intersection of automatic speech recognition, machine learning, and computer vision for more than 20 years. Despite significant advances since its introduction, however, audiovisual speech recognition systems have not seen widespread practical adoption, for several reasons. One of the key prerequisites, the design of a robust parametrization, here additionally exploiting information about the three-dimensional shape of the mouth surface, is the subject of this dissertation.

The text is divided into 12 chapters. Chapters 2–5 review the current state of the art, broken down into several subproblems. Chapter 2 surveys algorithms for face alignment and detection of the region of interest. The greatest attention is devoted to the parametrization of the visual signal in chapter 3. The following chapters 4 and 5 describe classification methods and ways of integrating visual information into acoustic speech decoders. An overview of the most commonly used audiovisual databases is given in chapter 6. The survey part of the thesis concludes with chapter 7, which compares the best results reported in the available literature to date. Visual and audiovisual systems are assessed separately, and the problem is further broken down by the type of recognized utterances and by speaker dependency. The influence of visual preprocessing is also taken into account.

Three new visual speech parametrizations are proposed in the thesis: the three-dimensional block-based discrete cosine transform (DCT3), the spatiotemporally modified histogram of oriented gradients (HOGTOP), and the depth-extended active appearance model (DAAM). Their design, described in chapter 8, aims primarily at exploiting speech dynamics and at making the classical AAM more robust by integrating depth data as a simplified form of information about the three-dimensional shape of the lips.

To evaluate both the proposed and currently existing parametrizations, the audiovisual database TULAVD containing 54 speakers was created; see chapter 9. The database is also designed with large-vocabulary continuous speech recognition (LVCSR) in mind. A separate section is devoted to the design of the evaluation protocol, which prevents tuning the models on the test data, so that the results in the experimental part are not affected by a positive bias.

The experimental part in chapter 10 focuses primarily on evaluating the proposed parametrizations and comparing them with existing ones in the task of isolated word recognition. Besides TULAVD, the performance of the proposed parametrizations is demonstrated on two other well-known databases to allow a direct comparison with the state of the art. The positive contribution of depth data reconstructed using the MS Kinect is also demonstrated separately. The second part of the experiments in chapter 11 then evaluates the influence of visual information in the LVCSR task with vocabularies ranging from several hundred to five hundred thousand words.

Automatic lip reading is a research field closely related to automatic speech recognition, machine learning, and computer vision. Despite being developed for more than two decades, systems for audiovisual speech recognition are still not widely used in practice, for several reasons. One critical component, namely the design of a robust and discriminative visual parametrization, here also utilizing depth information, is the main topic of this dissertation.

The text of the dissertation consists of 12 chapters. Chapters 2–5 present the current state of the art, each focusing on one specific subproblem of visual and audiovisual speech recognition.
Chapter 2 investigates methods for face alignment and detection of the region of interest. Commonly used features and the algorithms for their extraction are examined in chapter 3, followed by an overview of classification methods in chapter 4, fusion of multiple sources of information in chapter 5, and existing audiovisual datasets in chapter 6. The first part of the thesis, examining the state of the art, is summarized in chapter 7, which compares the best results achieved so far on various commonly used datasets with respect to recognition grammar, vocabulary size, speaker dependency, and visual preprocessing.

Three robust visual parametrizations are proposed and explained in chapter 8: the block-based three-dimensional discrete cosine transform (DCT3), the spatiotemporal histogram of oriented gradients (HOGTOP), and the depth-extended active appearance model (DAAM). While the former two are ROI-based, source-agnostic parametrizations designed mainly to exploit speech dynamics, DAAM directly integrates depth data obtained via Kinect in order to achieve greater robustness against lighting variations and better phone discrimination.

In order to evaluate the existing and proposed features on both video and depth data, a new database called TULAVD was recorded. As described in chapter 9, each of the 54 speakers uttered 50 isolated words and 100 grammatically unrestricted sentences in the Czech language. A special section is devoted to the design of the evaluation protocol, which minimizes the risk of overfitting when tuning the decoder.

Experiments in chapter 10 evaluate selected popular and proposed features in the task of isolated unit recognition. In order to compare the achieved results with the state of the art, two other commonly used datasets besides TULAVD are included: OuluVS and CUAVE.
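The core idea behind a DCT-based visual parametrization can be illustrated with a minimal sketch: a spatiotemporal block of the mouth ROI (a few consecutive grayscale frames stacked into a volume) is transformed by a separable 3-D DCT-II, and only the low-frequency corner of the coefficient cube is kept as the feature vector, so that both appearance and short-term speech dynamics are captured compactly. This is an illustrative reconstruction of the general technique, not the thesis's exact DCT3 pipeline; the block size and the number of retained coefficients are assumptions.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II transform matrix of size n x n."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0] *= np.sqrt(1.0 / n)   # DC row scaling for orthonormality
    c[1:] *= np.sqrt(2.0 / n)
    return c

def dct3_features(volume, n_coeffs=4):
    """Separable 3-D DCT of a (T, H, W) spatiotemporal ROI block.

    The DCT is applied along the time, row, and column axes in turn;
    the low-frequency corner of size n_coeffs^3 is returned as the
    feature vector (n_coeffs=4 is an illustrative choice).
    """
    t, h, w = volume.shape
    x = np.tensordot(dct_matrix(t), volume, axes=(1, 0))                 # time axis
    x = np.tensordot(dct_matrix(h), x, axes=(1, 1)).transpose(1, 0, 2)   # rows
    x = np.tensordot(dct_matrix(w), x, axes=(1, 2)).transpose(1, 2, 0)   # columns
    return x[:n_coeffs, :n_coeffs, :n_coeffs].ravel()
```

Because the transform concentrates energy in the low-frequency corner, truncating the cube acts as both dimensionality reduction and smoothing of pixel noise.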
Experiments on multiple-modality fusion show the benefit of adding the Kinect depth data to the recognition process, both for feature fusion and for integration via a multistream hidden Markov model. In contrast to the vast majority of recent work on lipreading, the above evaluation is also performed in the task of large-vocabulary continuous speech recognition, with the vocabulary size gradually increasing from several hundred to half a million words; see chapter 11.
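The two fusion strategies mentioned above differ mainly in where the streams are combined. A minimal sketch of the standard formulations (not the thesis's exact decoder; the weight value is an assumption): early fusion concatenates the per-frame feature vectors, while a multistream HMM scores each state by combining the per-stream log-likelihoods with exponent weights.

```python
def feature_fusion(audio_feats, video_feats):
    """Early (feature-level) fusion: concatenate the per-frame vectors
    so a single-stream model sees one joint observation."""
    return list(audio_feats) + list(video_feats)

def multistream_log_likelihood(ll_audio, ll_video, lam=0.7):
    """Multistream HMM state score: streams are treated as conditionally
    independent, and each stream's log-likelihood is scaled by an
    exponent weight (lam for audio, 1 - lam for video).  lam is a
    hypothetical value; in practice it is tuned on held-out data,
    typically lowered when the acoustic channel is noisy."""
    return lam * ll_audio + (1.0 - lam) * ll_video
```

The stream weights are what make decision-level fusion attractive: when the audio degrades, shifting weight toward the visual stream lets the same models adapt without retraining, which concatenated features cannot do.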