
    Impact of frame rate on automatic speech-text alignment for corpus-based phonetic studies

    Phonetic segmentation is the basis for many phonetic and linguistic studies. Because manual segmentation is a lengthy and tedious task, automatic procedures relying on acoustic Hidden Markov Models have been developed over the years. Many studies have been conducted, and refinements developed, for corpus-based speech synthesis, where the technology is mainly used in a speaker-dependent context and applied to good-quality speech signals. In a different research direction, automatic speech-text alignment is also used for phonetic and linguistic studies on large speech corpora. In that case, speaker-independent acoustic models are mandatory, and the speech quality may be poorer. The acoustic models rely on a 10 ms shift between frames, and their topology imposes strong minimum duration constraints. This paper focuses on the acoustic analysis frame rate and gives a first insight into its impact on corpus-based phonetic studies.
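The minimum duration constraint mentioned in the abstract follows directly from the model topology: a left-to-right HMM in which every state must emit at least one frame cannot represent a phone shorter than the number of states times the frame shift. A minimal sketch of that arithmetic, assuming the common 3-state topology (the state count and the alternative frame shifts are illustrative, not values taken from the paper):

```python
# Sketch: minimum phone duration implied by a left-to-right HMM,
# assuming each state must emit at least one acoustic frame.
# The 3-state topology and the frame shifts below are illustrative.

def min_phone_duration_ms(num_states: int, frame_shift_ms: float) -> float:
    """The shortest path through the model visits every state once,
    so it consumes exactly num_states frames."""
    return num_states * frame_shift_ms

# With the standard 10 ms shift, a 3-state model cannot produce a
# phone shorter than 30 ms; halving the shift halves that floor.
for shift_ms in (10.0, 5.0, 2.5):
    print(shift_ms, min_phone_duration_ms(3, shift_ms))
```

This is why the frame rate matters for phonetic studies: very short phones (e.g. flaps or reduced vowels) can be forced to an artificial minimum length by the 10 ms analysis step alone.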

    A METHOD FOR AUTOMATIC ANALYSIS OF SPEECH TEMPO

    This paper describes a method for analysing speed of speech, or tempo, using speech recordings from Croatian TV news channels with subtitles. A feed-forward neural network, trained on about 160 seconds of recorded speech, was used for phoneme classification. To determine individual word positions, a speech-to-text alignment component was created that finds approximate alignments between the subtitle text and the phonemes classified by the neural network. The alignment component exploits the fact that the neural network recognizes some groups of phonemes more accurately than others. Preliminary results showed an average alignment offset of one to about three phonemes, depending on the recording quality, speaker, and content.
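The alignment step described above can be sketched as a weighted dynamic-programming alignment, where a match contributes a score proportional to how reliably the classifier recognizes that phoneme class. Everything below (phoneme symbols, precision weights, and the exact scoring scheme) is an illustrative assumption, not the paper's actual implementation:

```python
# Sketch: precision-weighted global alignment between the expected
# phoneme sequence (derived from subtitles) and the classifier output.
# Matches score the per-class precision weight; mismatches and gaps
# cost 1. All weights and sequences here are hypothetical.

def align(expected, recognized, precision):
    """Needleman-Wunsch-style alignment score; higher is better."""
    n, m = len(expected), len(recognized)
    # score[i][j] = best score aligning expected[:i] with recognized[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -float(i)          # leading gaps in recognized
    for j in range(1, m + 1):
        score[0][j] = -float(j)          # leading gaps in expected
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if expected[i - 1] == recognized[j - 1]:
                # reliable phoneme classes anchor the alignment harder
                diag = score[i - 1][j - 1] + precision.get(expected[i - 1], 0.5)
            else:
                diag = score[i - 1][j - 1] - 1.0
            score[i][j] = max(diag,
                              score[i - 1][j] - 1.0,   # gap in recognized
                              score[i][j - 1] - 1.0)   # gap in expected
    return score[n][m]

# Hypothetical weights: vowels assumed better classified than consonants.
precision = {"a": 0.9, "e": 0.9, "s": 0.5, "t": 0.6}
print(align(list("sat"), list("sat"), precision))   # perfect match
print(align(list("sat"), list("st"), precision))    # one dropped phoneme
```

Weighting matches by class precision means that a run of well-recognized vowels, for instance, can pin down word boundaries even when weaker consonant classifications around them are noisy.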