138 research outputs found

    Optimisation of the Largest Annotated Tibetan Corpus Combining Rule-based, Memory-based, and Deep-learning Methods

    This article presents a pipeline that converts collections of Tibetan documents in plain text or XML into a fully segmented and POS-tagged corpus. We apply the pipeline to the large extant collection of the Buddhist Digital Resource Center. The semi-supervised methods presented here not only result in a new and improved version of the largest annotated Tibetan corpus to date; the integration of rule-based, memory-based, and neural-network methods also serves as a good example of how to overcome the challenges of under-researched languages. The end-to-end accuracy of our entire automatic pipeline, 91.99%, is high enough to make the resulting corpus a useful resource for both linguists and scholars of Tibetan studies.
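    The abstract does not include the pipeline itself; the following is a minimal sketch, under assumptions, of the cascaded design it describes: a rule-based segmenter, a memory-based lookup of tags seen in hand-annotated data, and a learned tagger as fallback. All function names and the toy lexicon are illustrative, not the authors' actual components.

```python
# Hedged sketch of a cascaded annotation pipeline in the spirit described above.
from typing import Callable, Dict, List, Tuple

def rule_based_segment(text: str) -> List[str]:
    # Placeholder: real Tibetan segmentation would split on tsheg marks and apply
    # morphological rules; here we simply split on whitespace.
    return text.split()

def tag_corpus(
    text: str,
    memory: Dict[str, str],                 # token -> POS tag seen in annotated data
    fallback_tagger: Callable[[str], str],  # e.g. a trained neural tagger
) -> List[Tuple[str, str]]:
    tagged = []
    for token in rule_based_segment(text):
        # Memory-based step: reuse the tag observed in hand-annotated data if available.
        tag = memory.get(token)
        if tag is None:
            # Model-based step: back off to a learned tagger for unseen tokens.
            tag = fallback_tagger(token)
        tagged.append((token, tag))
    return tagged

# Example use with a trivial fallback that tags everything as a noun.
memory = {"bod": "NOUN", "yig": "NOUN"}
print(tag_corpus("bod yig gsar", memory, lambda tok: "NOUN"))
```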

    Automatic Scansion of Poetry

    146 p. In this work we perform the scansion of poetry automatically, that is, we extract the rhythmic structure of poems. For this we have used standard natural language processing techniques. Some of the methods are rule-based, while others are data-driven. The results suggest that the best results are obtained with the data-driven systems.

    1. Introduction

    Reading in full the poem in the first column, a constant alternating rhythm (TA-TAN-TA-TAN) can be perceived. If we read the first example in the second column aloud, we would perceive a sound like TA-RA-TAN. The second example, on the other hand, is a Spanish hendecasyllable, so it is a line of eleven sound units with the penultimate syllable stressed. But would it be possible to recognise such structures without full knowledge of the language? Or, going further, can such patterns be found without any information about the language at all? We can regard this detection of prosodic patterns in poems as a challenge for the field of Natural Language Processing. We believe that, in order to extract this prosodic structure without information about the language at hand, a typological study of different poetic traditions is necessary. To take the first steps along that path we present this study, in which we analyse the prosodic structure of poetry automatically using several natural language processing algorithms. We have applied these methods to English poems, obtaining good results, and we have applied the best models to one Spanish and one Basque corpus.

    This text is structured as follows: in the second section we define scansion and present different poetic traditions; we also list some previous work on the automatic analysis of poetry. The third section can be considered the core of the work, since there we present the corpora, the methods and the experiments carried out. Finally, in the fourth section, we give the conclusions of the experiments.

    2. Scansion

    Scanning a line of poetry means extracting its rhythmic structure, indicating stresses, feet and rhymes (Baldick, 2015). In this work, however, we infer only the stress sequence of each line.

    2.1 Poetry in English

    Many books have been written on the prosody of English poetry, for example Halle and Keyser (1971); Corn (1997); Fabb (1997) and Steele (1999). In English poetry syllables are grouped into units called feet. These groups consist of several syllables, most commonly two or three. Each foot has at least one prominent syllable, which we will consider stressed. The most common structures are the iambic (bal-loon), the trochaic (jun-gle), the dactylic (ac-ci-dent) and the anapaestic (but I'm tel-ling you Liz) (Baldick, 2015). According to traditional metrics (Fussell, 1965; Steele, 1999), every metrical line is composed of such feet. The length of a line is given by its number of feet, so a trimeter has three feet, a tetrameter four, a pentameter five, and so on (hexameter, heptameter, ...). The most common metre in English poetry is the iambic pentameter, for example,

    oh change thy thought, that I may change my mind.

    in which five clear stresses can be perceived and each TA-TAN group forms one foot.
    Although these poems are generally regular, it is common to make small changes to these structures for aesthetic or artistic purposes.

    Grant if thou wilt, thou art beloved of many

    Compared with the previous example, at the beginning of this one a sound like TAN-TA-TA-TAN is perceived. In the literature this change is called a trochaic variation. Moreover, since the line is iambic, the ending should be stressed, but another common change is to add an unstressed syllable after a stressed one at the end of the line.

    2.2 Poetry in Spanish

    Various metrical structures have been used in Spanish poetry (Quilis, 1984; Tomás, 1995; Caparrós, 1999). In this work, owing to corpus availability, we focus on one specific period only, namely the Spanish Golden Age. The metre most used in that period was the hendecasyllable, with each line made up of eleven syllables. The stress sequence of the lines is fairly regular and the tenth syllable normally carries a stress. Other syllables may also be stressed, and depending on those prominent positions these hendecasyllables can be of several types. One of the biggest challenges of Spanish poetry is the use of syllable contraction, known as synalepha, by which lines containing more than eleven syllables are fitted into eleven. The aim of this work is to assign a stress to each syllable automatically; we therefore used a semi-automatic method to assign a stress value to each syllable of the line in the cases where synalephas occur.

    2.3 Poetry in Basque

    In present-day poetry, and especially in bertsolaritza, the best-known metres are the minor and major metres. The minor metres have seven syllables in the odd lines and six in the even ones. The major metres have ten and eight syllables in the even and odd lines, respectively. These are not, however, the only metres used in poetry. In written poetry the intermediate zortziko is common, in which the odd lines have eight syllables and the even ones seven. In most metres the even lines must rhyme with one another.

    In this study we focus on stress, and it is not yet clear whether or not stress plays a significant role in Basque poetry. Several experts have written about Basque poetry and its metrics, starting in the 17th century, and opposing views can be found among them. According to some (Oihenart and Father Onaindia, for instance), rhythm matters in Basque poetry, and every poem must have some kind of rhythm.

    "All literatures have their laws and rules, particularly in the writing of poetry; Basque must necessarily have them too. We must keep at least these four things in mind: 1) movement (rhythm); 2) the break (caesura); 3) the metre, and 4) the matching ending (rhyme)." Onaindia (1961)

    Others, by contrast, say that stress has no effect in Basque. Nikolas Ormaetxea "Orixe" is one poet who holds this view.

    "To show how weakly perceptible the Basque accent is, try placing graphic accents on the syllables one believes to be stressed, entrust the task to a hundred people with a good ear, and for any page submitted to the analysis it can safely be asserted that no two of them will agree." Ormaechea (1920)

    2.4 Automatic scansion

    In recent years various works on automatic scansion have appeared. In these works, given a sequence of words as input, the task is usually to return the stress sequence that those words follow.
    This translation or transduction process can be carried out in several ways:
    - Rule-based: following rules established by experts, taking various linguistic features into account.
    - Data-driven: learning text-to-stress patterns automatically from labelled data. This is the line we follow in the work presented here.

    Among the works presented over the years, rule-based approaches include Logan (1988); Gervas (2000); Hartman (2005); Plamondon (2006); McAleese (2007); Navarro-Colorado (2015) and Agirrezabal et al. (2016b). Data-driven methods are gaining ever more attention, owing to the availability of labelled data. Among these we can highlight Hayward (1996); Greene et al. (2010); Hayes et al. (2012); Agirrezabal et al. (2016a) and Estes and Hench (2016).

    3. Corpora, methods and experiments

    3.1 Corpora

    Labelled data are essential for developing data-driven systems and for evaluating rule-based ones. For this purpose we use three corpora: one in English, one in Spanish and one in Basque. For the English experiments we used the poetry corpus produced by the 'For Better For Verse' project (Tucker, 2011), developed at the University of Virginia. This corpus contains 78 poems and a total of 1,100 lines of verse. When scanning, some lines admit several analyses, and they appear that way (with several options) in the corpus. For the Spanish experiments, as mentioned above, we used a corpus from the Spanish Golden Age (Navarro-Colorado et al., 2016). The labelled corpus consists of 135 sonnets and has roughly 2,000 lines. For the Basque experiments we compiled and hand-labelled a corpus based on Patri Urkizu's collection 'Poesía vasca: Antología bilingüe'. This corpus contains 38 poems and around 2,000 lines.

    3.2 Methods

    We carried out the first experiments on English and, building on those, extrapolated the best methods to Spanish and Basque. First of all, we developed a rule-based system for analysing English poetry. After that we moved on to data-driven techniques, applying methods that are standard in natural language processing in order to learn patterns from the data and apply them to previously unseen poems. The techniques we used fall into three groups: those that perform plain classification, those that perform structured prediction, and techniques based on neural networks. The best of the techniques used are the Perceptron (Freund and Schapire, 1999), Hidden Markov Models (Rabiner, 1989), Conditional Random Fields (Lafferty et al., 2001) and Recurrent Neural Networks with Long Short-Term Memory (Lample et al., 2016).

    Different procedures can be used to evaluate the various techniques and configurations. When the amount of data is not very large, as in our case, the most common choice is K-fold cross-validation. In cross-validation the dataset is split into k parts; k - 1 of them are used to train a model and one is kept for evaluation. This is repeated k times and the average accuracy is reported. In our case we split our dataset into 10 parts.
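    A minimal sketch of this data-driven setup (an illustration under assumptions, not the authors' code): scansion framed as per-syllable stress classification with a Perceptron, evaluated with 10-fold cross-validation. The two toy lines, labels and feature set are invented for the example; the structured and neural models mentioned above would replace the independent classifier.

```python
# Toy scansion-as-classification example with 10-fold cross-validation.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

lines = [
    "oh change thy thought that i may change my mind".split(),
    "shall i com pare thee to a sum mers day".split(),
]
X, y = [], []
for line in lines:
    for i, syl in enumerate(line):
        # One feature dict per syllable: its string, its position, word-boundary info.
        X.append({"syl": syl, "pos_in_line": i, "line_final": i == len(line) - 1})
        y.append(i % 2)  # toy gold labels: iambic pattern, 0 = unstressed, 1 = stressed

model = make_pipeline(DictVectorizer(), Perceptron(max_iter=1000))

# 10-fold cross-validation: train on 9 folds, score on the held-out fold,
# repeat 10 times and average the syllable-level accuracy.
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print("mean syllable-level accuracy:", scores.mean())
```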
    3.3 Evaluation

    The following table shows the accuracies of the best data-driven methods. These accuracies are computed at the syllable level. The next table shows the results obtained by the methods at the line level. As can be seen from the results table, the systems based on neural networks give the best results, both for English and for Spanish. Several conclusions can be drawn from that table.

    4. Conclusions

    In Agirrezabal et al. (2016a) we stated that the 10 attributes we use in the Perceptron and the CRFs were suitable for the prosodic analysis of poetry, and that they were particularly interesting because they appeared to be language-agnostic. In the present experiments, after testing on Spanish, we have seen that they give fairly good results for English considering their simplicity. On the Spanish data, however, the results were not as good, which suggests that these attributes are not sufficient for building language-independent systems. In any case, confirming this would require running these experiments on more languages.

    Analysing the results, we conclude that word boundaries are very important for inferring the prosodic structure of poems, especially in Spanish. A possible explanation is that English words have, on average, fewer syllables than Spanish ones, as can be seen in the figure below. Moreover, the models based on neural networks appear to model the phonological structure of words well, although more experiments would be needed to demonstrate this empirically.

    Pronunciation modelling in end-to-end text-to-speech synthesis

    Sequence-to-sequence (S2S) models in text-to-speech synthesis (TTS) can achieve high naturalness scores without extensive processing of the text input. Since S2S models have been proposed for multiple stages of the TTS pipeline, the field has moved toward End-to-End (E2E) TTS, where a waveform is predicted directly from a sequence of text or phone characters. Early work on E2E TTS in English, such as Char2Wav [1] and Tacotron [2], suggested that phonetisation (lexicon lookup and/or G2P modelling) could be learnt implicitly by a text encoder during training. The benefits of a learned text encoding include improved modelling of phonetic context, which makes the contextual linguistic features traditionally used in TTS pipelines redundant [3]. Subsequent work on E2E TTS has since shown similar naturalness scores with text or phone input (e.g. as in [4]). Successful modelling of phonetic context has led some to question the benefit of using phone instead of text input altogether (see [5]). The use of text input brings into question the value of the pronunciation lexicon in E2E TTS. Without phone input, a S2S encoder learns an implicit grapheme-to-phoneme (G2P) model from text-audio pairs during training. With common datasets for E2E TTS in English, I simulated implicit G2P models, finding increased error rates compared to a traditional, lexicon-based G2P model. Ultimately, successful G2P generalisation is difficult for some words (e.g. foreign words and proper names), since the knowledge needed to disambiguate their pronunciations may not be provided by the local grapheme context and may go beyond what is contained in sentence-level text-audio sequences. When test stimuli were selected according to G2P difficulty, increased mispronunciations were observed in E2E TTS with text input. Following the proposed benefits of subword decomposition in S2S modelling for other language tasks (e.g. neural machine translation), the effects of morphological decomposition on pronunciation modelling were investigated. Learning of the French post-lexical phenomenon liaison was also evaluated. With the goal of an inexpensive, large-scale evaluation of pronunciation modelling, the reliability of automatic speech recognition (ASR) for measuring TTS intelligibility was investigated. A re-evaluation of 6 years of results from the Blizzard Challenge was conducted. In controlled conditions in English, ASR reliably found significant differences between systems similar to those found by paid listeners. An analysis of transcriptions of words exhibiting difficult-to-predict G2P relations was also conducted. The E2E ASR Transformer model used was found to be unreliable in its transcription of difficult G2P relations, owing to homophonic transcriptions and incorrect transcription of words with difficult G2P relations. A further evaluation of representation mixing in Tacotron finds that pronunciation correction is possible when mixing text and phone inputs. The thesis concludes that there is still a place for the pronunciation lexicon in E2E TTS as a pronunciation guide, since it can provide assurances that G2P generalisation cannot.
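    As a minimal sketch of the front-end choice discussed above (an assumption for illustration, not the thesis' code): look a word up in a pronunciation lexicon and fall back to a G2P model only for out-of-vocabulary items such as names or loanwords. The toy lexicon entry and the naive fallback are invented.

```python
# Lexicon lookup with G2P fallback: the lexicon acts as a "pronunciation guide"
# that overrides G2P wherever the grapheme-to-phoneme mapping is irregular.
from typing import Callable, Dict, List

def phonetise(
    words: List[str],
    lexicon: Dict[str, List[str]],         # word -> phone sequence
    g2p: Callable[[str], List[str]],       # learned grapheme-to-phoneme model
) -> List[List[str]]:
    phones = []
    for w in words:
        entry = lexicon.get(w.lower())
        phones.append(entry if entry is not None else g2p(w))
    return phones

# Toy example: a letter-to-phone stand-in for a trained G2P model.
lexicon = {"colonel": ["K", "ER1", "N", "AH0", "L"]}
naive_g2p = lambda w: list(w.upper())
print(phonetise(["colonel", "zyzzyva"], lexicon, naive_g2p))
```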

    SYNTHESIZING DYSARTHRIC SPEECH USING MULTI-SPEAKER TTS FOR DYSARTHRIC SPEECH RECOGNITION

    Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of the speech production muscles. Automatic Speech Recognition (ASR) systems may help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers. In this dissertation, we investigate dysarthric speech augmentation and synthesis methods. To better understand differences in the prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels, a comparative study between typical and dysarthric speech was conducted. These characteristics are important components for dysarthric speech modeling, synthesis, and augmentation. For augmentation, prosodic transformation and time-feature masking have been proposed. For dysarthric speech synthesis, this dissertation introduces a modified neural multi-talker TTS that adds a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech at varying severity levels. In addition, we extend this work by using a label propagation technique to create more meaningful control variables, such as a continuous Respiration, Laryngeal and Tongue (RLT) parameter, even for datasets that only provide discrete dysarthria severity level information. This approach increases the controllability of the system, so we are able to generate dysarthric speech with a broader range of characteristics. To evaluate their effectiveness for synthesizing training data, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a WER improvement of 12.2% compared to the baseline, and that the addition of the severity level and pause insertion controls decreases WER by a further 6.5%, showing the effectiveness of adding these parameters. Overall results on the TORGO database demonstrate that using synthetic dysarthric speech to increase the amount of dysarthric-patterned training speech has a significant impact on dysarthric ASR systems.
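    The abstract does not detail how the severity coefficient enters the model; the sketch below is one plausible, assumed realisation (not the dissertation's actual architecture): a continuous severity scalar is embedded and broadcast over the text-encoder states of a multi-speaker TTS. All dimensions and the 0-to-1 severity scale are illustrative assumptions.

```python
# Hedged sketch: conditioning TTS encoder states on a dysarthria severity scalar.
import torch
import torch.nn as nn

class SeverityConditioner(nn.Module):
    def __init__(self, enc_dim: int = 256, cond_dim: int = 16):
        super().__init__()
        self.severity_proj = nn.Linear(1, cond_dim)     # scalar severity -> embedding
        self.merge = nn.Linear(enc_dim + cond_dim, enc_dim)

    def forward(self, enc_states: torch.Tensor, severity: torch.Tensor) -> torch.Tensor:
        # enc_states: (batch, time, enc_dim); severity: (batch, 1), assumed in [0, 1]
        cond = self.severity_proj(severity)                        # (batch, cond_dim)
        cond = cond.unsqueeze(1).expand(-1, enc_states.size(1), -1)
        return self.merge(torch.cat([enc_states, cond], dim=-1))   # (batch, time, enc_dim)

# Example: severity 0.0 ~ typical speech, 0.8 ~ more severe dysarthria (assumed scale).
x = torch.randn(2, 50, 256)
sev = torch.tensor([[0.0], [0.8]])
print(SeverityConditioner()(x, sev).shape)
```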

    A Unified Framework for Modality-Agnostic Deepfakes Detection

    As AI-generated content (AIGC) thrives, deepfakes have expanded from single-modality falsification to cross-modal fake content creation, where either the audio or the visual component can be manipulated. While two unimodal detectors can detect audio-visual deepfakes, cross-modal forgery clues may be overlooked. Existing multimodal deepfake detection methods typically establish correspondence between the audio and visual modalities for binary real/fake classification and require the co-occurrence of both modalities. However, in real-world multimodal applications, missing-modality scenarios may occur, in which either modality is unavailable. In such cases, audio-visual detection methods are less practical than two independent unimodal methods. Consequently, the detector cannot always know the number or type of manipulated modalities beforehand, necessitating a fake-modality-agnostic audio-visual detector. In this work, we introduce a comprehensive framework that is agnostic to fake modalities, which facilitates the identification of multimodal deepfakes and handles situations with missing modalities, regardless of whether the manipulations are embedded in audio, video, or even cross-modal forms. To enhance the modeling of cross-modal forgery clues, we employ audio-visual speech recognition (AVSR) as a preliminary task. This efficiently extracts speech correlations across modalities, a feature that is challenging for deepfakes to replicate. Additionally, we propose a dual-label detection approach that follows the structure of AVSR to support the independent detection of each modality. Extensive experiments on three audio-visual datasets show that our scheme outperforms state-of-the-art detection methods, with promising performance on modality-agnostic audio/video deepfakes.
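    An illustrative sketch of the dual-label idea (an assumption, not the paper's model): independent real/fake heads per modality over shared cross-modal features, so that a missing modality can simply be skipped at inference. Feature dimensions are invented for the example.

```python
# Dual-label detection heads that tolerate a missing modality.
import torch
import torch.nn as nn

class DualLabelHead(nn.Module):
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.audio_head = nn.Linear(feat_dim, 2)   # audio: real vs fake
        self.video_head = nn.Linear(feat_dim, 2)   # video: real vs fake

    def forward(self, audio_feat, video_feat):
        # Either feature may be None in missing-modality scenarios.
        audio_logits = self.audio_head(audio_feat) if audio_feat is not None else None
        video_logits = self.video_head(video_feat) if video_feat is not None else None
        return audio_logits, video_logits

# Example: video-only input (audio missing).
head = DualLabelHead()
a_logits, v_logits = head(None, torch.randn(4, 512))
print(a_logits, v_logits.shape)
```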

    Fundamental frequency modelling: an articulatory perspective with target approximation and deep learning

    Current statistical parametric speech synthesis (SPSS) approaches typically aim at state/frame-level acoustic modelling, which leads to a problem of frame-by-frame independence. Besides that, whichever learning technique is used, hidden Markov model (HMM), deep neural network (DNN) or recurrent neural network (RNN), the fundamental idea is to set up a direct mapping from linguistic to acoustic features. Although progress is frequently reported, this idea is questionable in terms of biological plausibility. This thesis aims at addressing the above issues by integrating dynamic mechanisms of human speech production as a core component of F0 generation and thus developing a more human-like F0 modelling paradigm. By introducing an articulatory F0 generation model, target approximation (TA), between text and speech that controls syllable-synchronised F0 generation, contextual F0 variations are processed in two separate yet integrated stages: linguistic to motor, and motor to acoustic. With the goal of demonstrating that human speech movement can be considered a dynamic process of target approximation and that the TA model is a valid F0 generation model to be used at the motor-to-acoustic stage, a TA-based pitch control experiment is conducted first to simulate the subtle human behaviour of online compensation for pitch-shifted auditory feedback. Then, the TA parameters are collectively controlled by linguistic features via a deep or recurrent neural network (DNN/RNN) at the linguistic-to-motor stage. We trained the systems on a Mandarin Chinese dataset consisting of both statements and questions. The TA-based systems generally outperformed the baseline systems in both objective and subjective evaluations. Furthermore, the set of required linguistic features was reduced, first to syllable-level features only (with the DNN) and then with all positional information removed (with the RNN). Fewer linguistic features as input and a limited number of TA parameters as output required less training data and lower model complexity, which in turn led to more efficient training and faster synthesis.
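    To make the target-approximation idea concrete, here is a deliberately simplified sketch (an illustration of the concept, not the thesis' exact TA formulation): within each syllable, F0 starts from the value carried over from the previous syllable and decays exponentially toward a linear pitch target m*t + b. The first-order dynamics, parameter ranges and sampling rate are assumptions.

```python
# Simplified syllable-synchronised target approximation for an F0 contour.
import numpy as np

def approximate_targets(syllables, fs=200, f0_init=120.0):
    """syllables: list of (duration_s, slope_m, intercept_b, rate_lambda)."""
    f0, current = [], f0_init
    for dur, m, b, lam in syllables:
        t = np.arange(0.0, dur, 1.0 / fs)
        target = m * t + b
        # First-order exponential approach toward the target (assumed simplification).
        track = target + (current - b) * np.exp(-lam * t)
        f0.append(track)
        current = track[-1]   # carry the final state into the next syllable
    return np.concatenate(f0)

# Two syllables: a level 120 Hz target followed by a rising target.
contour = approximate_targets([(0.20, 0.0, 120.0, 30.0), (0.25, 80.0, 110.0, 30.0)])
print(contour.shape, contour[:3])
```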

    Analysis and automatic identification of spontaneous emotions in speech from human-human and human-machine communication

    383 p. This research mainly focuses on improving our understanding of human-human and human-machine interactions by analysing participants' emotional status. For this purpose, we have developed and enhanced Speech Emotion Recognition (SER) systems for both kinds of interaction in real-life scenarios, with an explicit emphasis on the Spanish language. In this framework, we have conducted an in-depth analysis of how humans express emotions using speech when communicating with other persons or with machines in actual situations. Thus, we have analysed and studied the way in which emotional information is expressed in a variety of true-to-life environments, which is a crucial aspect for the development of SER systems. This study aimed to comprehensively understand the challenge we wanted to address: identifying emotional information in speech using machine learning technologies. Neural networks have been demonstrated to be adequate tools for identifying events in speech and language. Most of the experiments aimed to make local comparisons between specific aspects; thus, the experimental conditions were tailored to each particular analysis. The experiments across the different articles (from P1 to P19) are hardly comparable, owing to our continuous learning about the difficult task of identifying emotions in speech. In order to make a fair comparison, additional unpublished results are presented in the Appendix. These experiments were carried out under identical and rigorous conditions. This general comparison offers an overview of the advantages and disadvantages of the different methodologies for the automatic recognition of emotions in speech.