Oesophageal speech: enrichment and evaluations
167 p.
After a laryngectomy (i.e. removal of the larynx), a patient can no longer speak with a healthy laryngeal voice and must adopt an alternative method of speaking, such as oesophageal speech. In this method, speech is produced using swallowed air and the vibrations of the pharyngo-oesophageal segment, which introduces several undesired artefacts and an abnormal fundamental frequency. This makes oesophageal speech harder to process than healthy speech, in terms of both auditory processing and signal processing. The aim of this thesis is to find solutions that make oesophageal speech signals easier to process, and to evaluate these solutions using a wide range of evaluation metrics.
First, some preliminary studies were performed to compare oesophageal speech and healthy speech. These revealed significantly lower intelligibility and higher listening effort for oesophageal speech compared to healthy speech. Intelligibility scores were comparable for listeners familiar and unfamiliar with oesophageal speech. However, familiar listeners reported less effort than unfamiliar listeners. In another experiment, oesophageal speech was reported to require more listening effort than healthy speech even though its intelligibility was comparable to that of healthy speech. When neural correlates of listening effort (i.e. alpha power) were investigated using electroencephalography, higher alpha power was observed for oesophageal speech than for healthy speech, indicating higher listening effort. Additionally, participants with poorer cognitive abilities (i.e. working memory capacity) showed higher alpha power.
Next, using several algorithms (pre-existing as well as novel approaches), oesophageal speech was transformed with the aim of making it more intelligible and less effortful to listen to. The novel approach consisted of a deep neural network based voice conversion system in which the source was oesophageal speech and the target was synthetic speech matched in duration with the source oesophageal speech. This eliminated the source-target alignment process, which is particularly prone to errors for disordered speech such as oesophageal speech. Both speaker-dependent and speaker-independent versions of this system were implemented. The outputs of the speaker-dependent system had better short-term objective intelligibility scores, automatic speech recognition performance and listener preference scores than unprocessed oesophageal speech. The speaker-independent system improved short-term objective intelligibility scores but not automatic speech recognition performance. Some other signal transformations were also performed to enhance oesophageal speech. These included removal of undesired artefacts and methods to improve the fundamental frequency. Of these methods, only the removal of undesired silences had some success (an improvement of 1.44 percentage points in automatic speech recognition performance), and only for low-intelligibility oesophageal speech.
Lastly, the outputs of these transformations were evaluated and compared with previous systems using an ensemble of evaluation metrics: short-term objective intelligibility, automatic speech recognition, subjective listening tests and neural measures obtained using electroencephalography. The results reveal that the proposed neural network based system outperformed previous systems in improving the objective intelligibility and automatic speech recognition performance of oesophageal speech. In the subjective evaluations the results were mixed: some improvement in preference scores, but no improvement in speech intelligibility or listening effort scores.
Overall, the results demonstrate several possibilities and new paths for enriching oesophageal speech using modern machine learning algorithms. The outcomes would be beneficial to the disordered speech community.
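The abstract does not say how undesired silences were detected, so the following is only a minimal sketch of one common approach: dropping frames whose mean energy falls below a threshold. The function name `remove_silences` and the `frame_len` and `threshold` values are hypothetical and chosen purely for illustration.

```python
def remove_silences(samples, frame_len=4, threshold=0.01):
    """Drop fixed-length frames whose mean energy is below `threshold`.

    Assumption: energy thresholding stands in for whatever silence
    detector the thesis actually used; the parameters are illustrative.
    """
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        # Mean squared amplitude of the frame
        energy = sum(x * x for x in frame) / len(frame)
        if energy >= threshold:
            kept.extend(frame)
    return kept


# Toy signal: a voiced frame, a silent frame, then another voiced frame
signal = [0.5, -0.5, 0.5, -0.5, 0.0, 0.0, 0.0, 0.0, 0.3, 0.3, -0.3, -0.3]
trimmed = remove_silences(signal)
```

In this toy signal the middle frame has zero energy and is dropped, so `trimmed` holds only the two voiced frames. Real speech would need much longer frames and a threshold tuned to the recording conditions.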
ARTICULATORY INFORMATION FOR ROBUST SPEECH RECOGNITION
Current Automatic Speech Recognition (ASR) systems fail to perform nearly as well as human speech recognition because they lack robustness against speech variability and noise contamination. The goal of this dissertation is to investigate these critical robustness issues, put forth different ways to address them and finally present an ASR architecture based upon these robustness criteria.
Acoustic variations adversely affect the performance of current phone-based ASR systems, in which speech is modeled as "beads-on-a-string", where the beads are the individual phone units. While phone units are distinctive in the cognitive domain, they vary in the physical domain, and their variation arises from a combination of factors including speech style and speaking rate, a phenomenon commonly known as "coarticulation". Traditional ASR systems address such coarticulatory variations by using contextualized phone units such as triphones. Articulatory phonology accounts for coarticulatory variations by modeling speech as a constellation of constricting actions known as articulatory gestures. In such a framework, speech variations such as coarticulation and lenition are accounted for by gestural overlap in time and gestural reduction in space. To realize a gesture-based ASR system, articulatory gestures have to be inferred from the acoustic signal. At the initial stage of this research, a study was performed using synthetically generated speech to obtain a proof of concept that articulatory gestures can indeed be recognized from the speech signal. It was observed that using vocal tract constriction trajectories (TVs) as an intermediate representation facilitated the task of recognizing gestures from the speech signal.
Presently no natural speech database contains articulatory gesture annotation; hence an automated iterative time-warping architecture is proposed that can annotate any natural speech database with articulatory gestures and TVs. Two natural speech databases, X-ray microbeam and Aurora-2, were annotated; the former was used to train a TV estimator and the latter was used to train a Dynamic Bayesian Network (DBN) based ASR architecture. The DBN architecture used two sets of observations: (a) acoustic features in the form of mel-frequency cepstral coefficients (MFCCs) and (b) TVs (estimated from the acoustic speech signal). In this setup the articulatory gestures were modeled as hidden random variables, eliminating the need for explicit gesture recognition. Word recognition results using the DBN architecture indicate that articulatory representations not only help to account for coarticulatory variations but can also significantly improve the noise robustness of ASR systems.
Models and Analysis of Vocal Emissions for Biomedical Applications
The International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications (MAVEBA) came into being in 1999 from the strongly felt need to share know-how, objectives and results between areas that until then had seemed quite distinct, such as bioengineering, medicine and singing. MAVEBA deals with all aspects of the study of the human voice, with applications ranging from the newborn to the adult and elderly. Over the years the initial topics have grown and spread into other fields of research such as occupational voice disorders, neurology, rehabilitation, and image and video analysis. MAVEBA takes place every two years in Firenze, Italy. This edition celebrates twenty-two years of uninterrupted and successful research in the field of voice analysis.
Recent Advances in Indoor Localization Systems and Technologies
Despite the enormous technical progress seen in the past few years, the maturity of indoor localization technologies has not yet reached the level of GNSS solutions. The 23 selected papers in this book present the recent advances and new developments in indoor localization systems and technologies, propose novel or improved methods with increased performance, provide insight into various aspects of quality control, and also introduce some unorthodox positioning methods
Proceedings of the Eleventh Annual Precise Time and Time Interval (PTTI) Application and Planning Meeting
Thirty eight papers are presented addressing various aspects of precise time and time interval applications. Areas discussed include: past accomplishments; state of the art systems; new and useful applications, procedures, and techniques; and fruitful directions for research efforts
Efficient, end-to-end and self-supervised methods for speech processing and generation
Deep learning has affected the speech processing and generation fields in many directions. First, end-to-end architectures allow the direct injection and synthesis of waveform samples. Secondly, the exploration of efficient solutions allows these systems to be implemented in computationally restricted environments, like smartphones. Finally, the latest trends exploit audio-visual data with minimal supervision. In this thesis these three directions are explored.
Firstly, we propose the use of recent pseudo-recurrent structures, like self-attention models and quasi-recurrent networks, to build acoustic models for text-to-speech. The proposed system, QLAD, synthesizes faster on CPU and GPU than its recurrent counterpart whilst preserving a good level of synthesis quality, which is competitive with state-of-the-art vocoder-based models.
Then, a generative adversarial network for speech enhancement, named SEGAN, is proposed. This model works as a speech-to-speech conversion system in the time domain, in which a single inference operation through a fully convolutional structure processes all samples. This makes it more efficient than other existing time-domain models, which are auto-regressive. SEGAN achieves strong results in noise suppression and in the preservation of speech naturalness and intelligibility when compared to other classic and deep regression based systems. We also show that SEGAN is efficient in transferring its operations to new languages and noises: a SEGAN trained for English achieves comparable performance on Catalan and Korean with only 24 seconds of adaptation data. Finally, we unveil the generative capacity of the model to recover signals from several distortions, and hence propose the concept of generalized speech enhancement. First, the model proves to be effective at recovering voiced speech from whispered speech. Then the model is scaled up to handle other distortions that require a recomposition of damaged parts of the signal, like extending the bandwidth or recovering lost temporal sections, among others. The model improves when additional acoustic losses are included in a multi-task setup to impose a relevant perceptual weighting on the generated result. Moreover, a two-step training schedule is proposed to stabilize the adversarial training after the addition of such losses, and both components boost SEGAN's performance across distortions.
Finally, we propose a problem-agnostic speech encoder, named PASE, together with the framework to train it. PASE is a fully convolutional network that yields compact representations from speech waveforms. These representations contain abstract information such as the speaker identity, the prosodic features or the spoken contents. A self-supervised framework is also proposed to train this encoder, which represents a new step towards unsupervised learning for speech processing. Once the encoder is trained, it can be exported to solve different tasks that require speech as input. We first explore the performance of PASE codes in speaker recognition, emotion recognition and speech recognition. PASE performs competitively with well-designed classic features in these tasks, especially after some supervised adaptation. Finally, PASE also provides good descriptors of identity for multi-speaker modeling in text-to-speech, which is advantageous for modeling novel identities without retraining the model.
Postprint (published version)
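To illustrate why a fully convolutional, time-domain model needs only one inference pass, here is a minimal sketch of a single 1-D convolution over a waveform. It is not SEGAN's architecture; the function `conv1d_same` and the toy kernels are assumptions made for illustration. The point is that every output sample depends only on the input, never on previously generated outputs, so all samples can be computed together, unlike in an auto-regressive model.

```python
def conv1d_same(signal, kernel):
    """Zero-padded 1-D cross-correlation with output length == input length.

    Hypothetical helper: each output sample is a weighted sum of nearby
    *input* samples only, so no output depends on another output.
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - half  # input index covered by this kernel tap
            if 0 <= j < len(signal):
                acc += weight * signal[j]
        out.append(acc)
    return out


noisy = [1.0, 2.0, 3.0, 4.0]
identity = conv1d_same(noisy, [0.0, 1.0, 0.0])  # passes the signal through
summed = conv1d_same(noisy, [1.0, 1.0, 1.0])    # sums each 3-sample window
```

A real enhancement network would stack many such layers with learned kernels and nonlinearities; the sketch only shows the data dependency that makes single-pass inference possible.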
Proceedings of the Fifth International Mobile Satellite Conference 1997
Satellite-based mobile communications systems provide voice and data communications to users over a vast geographic area. The users may communicate via mobile or hand-held terminals, which may also provide access to terrestrial communications services. While previous International Mobile Satellite Conferences have concentrated on technical advances and the increasing worldwide commercial activities, this conference focuses on the next generation of mobile satellite services. The approximately 80 papers included here cover sessions in the following areas: networking and protocols; code division multiple access technologies; demand, economics and technology issues; current and planned systems; propagation; terminal technology; modulation and coding advances; spacecraft technology; advanced systems; and applications and experiments
Abstracts on Radio Direction Finding (1899 - 1995)
The files on this record represent the various databases that originally composed the CD-ROM issue of "Abstracts on Radio Direction Finding" database, which is now part of the Dudley Knox Library's Abstracts and Selected Full Text Documents on Radio Direction Finding (1899 - 1995) Collection. (See Calhoun record https://calhoun.nps.edu/handle/10945/57364 for further information on this collection and the bibliography).
Due to issues of technological obsolescence preventing current and future audiences from accessing the bibliography, DKL exported and converted into the three files on this record the various databases contained in the CD-ROM.
The contents of these files are:
1) RDFA_CompleteBibliography_xls.zip [RDFA_CompleteBibliography.xls: Metadata for the complete bibliography, in Excel 97-2003 Workbook format; RDFA_Glossary.xls: Glossary of terms, in Excel 97-2003 Workbook format; RDFA_Biographies.xls: Biographies of leading figures, in Excel 97-2003 Workbook format];
2) RDFA_CompleteBibliography_csv.zip [RDFA_CompleteBibliography.TXT: Metadata for the complete bibliography, in CSV format; RDFA_Glossary.TXT: Glossary of terms, in CSV format; RDFA_Biographies.TXT: Biographies of leading figures, in CSV format];
3) RDFA_CompleteBibliography.pdf: A human readable display of the bibliographic data, as a means of double-checking any possible deviations due to conversion
Deep Learning Methods for Dialogue Act Recognition using Visual Information
Defended.
Dialogue act (DA) recognition is an important step in dialogue management and understanding. The task is to automatically assign a label to an utterance (or part of one) based on its function in a dialogue (e.g. statement, question, backchannel, etc.). Such utterance-level classification thus helps to model and identify the structure of spontaneous dialogues. Even though DA recognition is usually performed on audio data using an automatic speech recognition engine, dialogues also exist in the form of images (e.g. comic books).
This thesis deals with automatic dialogue act recognition from image documents.
To the best of our knowledge, this is the first attempt to propose DA recognition approaches that use images as input.
For this task, it is necessary to extract the text from the images.
Therefore, we employ algorithms from the field of computer vision and image processing such as image thresholding, text segmentation, and optical character recognition (OCR). The main contribution in this field is the design and implementation of a custom OCR model based on convolutional and recurrent neural networks. We also explore different strategies for training such a model, including synthetic data generation and data augmentation techniques. We achieve new state-of-the-art OCR results in the setting where only a small amount of training data is available. In summary, our contribution also includes an overview of how to create an efficient OCR system with minimal annotation costs.
We further deal with multilinguality in the DA recognition field. We successfully employ one general model trained on data from all available languages, as well as several models trained on a single language, where cross-linguality is achieved by using semantic space transformations. Moreover, we explore transfer learning for DA recognition where only a small amount of annotated data is available. We use word-level and utterance-level features, and our models contain deep neural network architectures, including Transformers. We obtain new state-of-the-art results in multi- and cross-lingual DA recognition.
For DA recognition from image documents, we propose and implement a novel multimodal model based on convolutional and recurrent neural networks. This model combines text and image inputs. The text part is fed with text tokens from OCR, while the visual part extracts image features that serve as an auxiliary input. Text extracted from dialogues is often erroneous and contains typos or other lexical errors. We show that the multimodal model copes with the erroneous text and that the visual information partially compensates for this loss of information.
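The abstract names image thresholding as a preprocessing step before OCR but does not say which method was used. As a minimal sketch, the following assumes Otsu's method, a standard choice that picks the grey level maximising the between-class variance. The function names and the toy pixel values are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that best separates pixels <= t from pixels > t.

    Assumption: Otsu's method is used here for illustration only; the
    thesis may have used a different thresholding algorithm.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0, sum0 = 0, 0  # pixel count and intensity sum of the dark class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break  # no bright pixels left
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor
        score = w0 * w1 * (mean0 - mean1) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def binarize(pixels, t):
    """Map each pixel to 1 if brighter than t, else 0."""
    return [1 if p > t else 0 for p in pixels]


# Toy "image": five dark pixels (ink) and five bright pixels (paper)
page = [10] * 5 + [200] * 5
t = otsu_threshold(page)
mask = binarize(page, t)
```

On this toy input the threshold falls at the dark level, so the mask separates the five ink pixels from the five paper pixels. A real pipeline would operate on 2-D images and would typically follow thresholding with text segmentation before passing regions to the OCR model.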