Evaluating a text-based linguistic analyzer on error-laden machine speech recognition output
The investigation outlined in our paper focuses on determining to what extent the "magyarlánc" linguistic analyzer can perform syntactic parsing on the error-laden texts emitted by a speech recognizer, how closely this analysis "resembles" the one run on the error-free reference text, and whether a level or partial result of the analysis can be identified that correlates strongly with that of the error-free text. For this task we used a 535-sentence subset of a broadcast-news database. On this subset we performed syntactic analysis with the "magyarlánc" analyzer, which assigned part-of-speech and dependency labels to the sentences. We then identified and decomposed the syntactic/semantic analyses into elementary units (words) and examined a bag-of-words representation built over this set, which we used to measure correlation and similarity. A further comparison computed distances between the extracted part-of-speech and dependency tags, analogously to the computation of word error rate. Based on the results, the analysis performed on texts obtained by speech-to-text conversion correlates strongly with the analysis performed on the error-free reference transcript.
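The bag-of-words comparison over extracted tags can be sketched as a cosine similarity between tag-count vectors. The tag sequences below are invented for illustration, not taken from the paper's data:

```python
from collections import Counter
from math import sqrt

def bow_cosine(tags_a, tags_b):
    """Cosine similarity between two bag-of-tags count vectors."""
    ca, cb = Counter(tags_a), Counter(tags_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (sqrt(sum(v * v for v in ca.values()))
            * sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

# Invented POS tag sequences for a reference transcript and an ASR hypothesis.
ref_tags = ["NOUN", "VERB", "DET", "NOUN", "ADJ"]
hyp_tags = ["NOUN", "VERB", "DET", "ADJ", "ADJ"]
print(round(bow_cosine(ref_tags, hyp_tags), 3))  # → 0.857
```

Because the representation discards word order, it stays comparable even when insertions and deletions in the ASR output shift the alignment of the two tag sequences.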
Large-vocabulary speech recognition using a morpheme-based recurrent language model
Agglutinative languages pose an enormous challenge to classical speech recognition systems, since accurate results require huge vocabularies owing to inflection and compounding. The problem mainly affects the language model component of the recognizer: with too large a vocabulary, the training phase becomes extremely difficult, which can lead to a suboptimal model. One solution is to use units smaller than words, namely morphemes, for language modelling. The paper presents a speech recognizer employing a morpheme-based recurrent neural network language model, with which we achieved significantly better results on a Hungarian speech corpus than with the traditional word-level approach.
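The vocabulary reduction gained by modelling morphemes instead of full word forms can be illustrated with a toy segmentation. The splits below are hand-written assumptions; a real system would derive them with a statistical or rule-based segmenter:

```python
# Hand-written morpheme splits for a few Hungarian word forms; these are
# illustrative assumptions, not linguistic ground truth.
SEGMENTS = {
    "ház": ["ház"],                    # "house"
    "házak": ["ház", "ak"],            # "houses"
    "házban": ["ház", "ban"],          # "in a house"
    "házakban": ["ház", "ak", "ban"],  # "in houses"
}

def morpheme_vocab(words):
    """Vocabulary of morphemes covering the given word forms."""
    vocab = set()
    for w in words:
        vocab.update(SEGMENTS.get(w, [w]))  # unknown words stay unsplit
    return vocab

words = ["ház", "házak", "házban", "házakban"]
# Four distinct word forms are covered by only three morphemes, and the gap
# widens rapidly as more inflected forms of the same stems appear.
print(len(set(words)), len(morpheme_vocab(words)))  # → 4 3
```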
Automatic captioning of live Hungarian public-affairs and news broadcasts
In our paper we present a real-time, low-resource machine speech-to-text system, developed primarily for captioning conversational public-affairs speech on television. We also compare our solution with the decoder of Kaldi, the open-source framework most widely used in this field. In addition, we run recognition experiments with various database sizes and with the use of respeaking. With our experimental system, trained on a text corpus of more than 70 million words and a speech database of nearly 500 hours, we achieved the lowest word error rate published so far for Hungarian television news and conversational public-affairs speech.
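Word error rate, the metric reported above, is the word-level Levenshtein distance between the reference and hypothesis transcripts normalized by the reference length. A minimal sketch (the example sentences are invented):

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(r)

# One substituted word out of three reference words.
print(round(wer("jó estét kívánok", "jó estét kívánunk"), 3))  # → 0.333
```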
Morphologically motivated word classes for very large vocabulary speech recognition of Finnish and Estonian
We study class-based n-gram and neural network language models for very large vocabulary speech recognition of two morphologically rich languages: Finnish and Estonian. Due to morphological processes such as derivation, inflection and compounding, the models need to be trained with vocabulary sizes of several millions of word types. Class-based language modelling is in this case a powerful approach to alleviate the data sparsity and reduce the computational load. For a very large vocabulary, bigram statistics may not be an optimal way to derive the classes. We thus study utilizing the output of a morphological analyzer to achieve efficient word classes. We show that efficient classes can be learned by refining the morphological classes to smaller equivalence classes using merging, splitting and exchange procedures with suitable constraints. This type of classification can improve the results, particularly when language model training data is not very large. We also extend the previous analyses by rescoring the hypotheses obtained from a very large vocabulary recognizer using class-based neural network language models. We show that despite the fixed vocabulary, carefully constructed classes for word-based language models can in some cases result in lower error rates than subword-based unlimited vocabulary language models.
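The exchange step mentioned above can be sketched as a greedy procedure that moves a word to another class whenever this raises the log-likelihood of a class bigram model. This is a minimal illustration on a toy corpus, not the authors' constrained merging/splitting/exchange algorithm:

```python
from collections import Counter
from math import log

def class_bigram_ll(tokens, assign):
    """Log-likelihood of a class bigram model, dropping the word-emission
    term, which does not depend on the class assignment."""
    cls = [assign[w] for w in tokens]
    bigrams = Counter(zip(cls, cls[1:]))
    unigrams = Counter(cls)
    return (sum(n * log(n) for n in bigrams.values())
            - 2 * sum(n * log(n) for n in unigrams.values()))

def exchange(tokens, assign, classes, sweeps=5):
    """Greedy exchange: move each word to the class that maximizes LL."""
    for _ in range(sweeps):
        for w in sorted(set(tokens)):
            best_c = assign[w]
            best_ll = class_bigram_ll(tokens, assign)
            for c in classes:
                assign[w] = c
                ll = class_bigram_ll(tokens, assign)
                if ll > best_ll:
                    best_c, best_ll = c, ll
            assign[w] = best_c  # keep the best class found (possibly unmoved)
    return assign

tokens = "a cat sat a dog sat a cat ran".split()
assign = {w: 0 for w in set(tokens)}  # start with all words in one class
before = class_bigram_ll(tokens, assign)
exchange(tokens, assign, classes=[0, 1])
after = class_bigram_ll(tokens, assign)
print(after >= before)  # accepted moves never decrease the likelihood → True
```

Recomputing the full likelihood after every tentative move keeps the sketch short; practical implementations update the affected count statistics incrementally so that millions of word types remain tractable.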