68 research outputs found

    Polish Phoneme Statistics Obtained On Large Set Of Written Texts

    The phonetic statistics were collected from several Polish corpora. The paper summarizes the data, which are phoneme n-grams, and discusses some phenomena in the statistics. Triphone statistics apply to context-dependent speech units, which play an important role in speech recognition systems and had never been calculated for a large set of Polish written texts. The standard phonetic alphabet for Polish, SAMPA, and methods of providing phonetic transcriptions are described.
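
As an illustration of the phoneme n-gram statistics described above, the following is a minimal sketch of counting triphones in a phonetic transcription. The function name `ngram_counts` and the sample utterance are hypothetical and are not taken from the paper; the transcription is only meant to look SAMPA-like.

```python
from collections import Counter


def ngram_counts(phonemes, n):
    """Count phoneme n-grams in one transcribed utterance (a list of symbols)."""
    return Counter(
        tuple(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1)
    )


# Hypothetical SAMPA-like transcription, for illustration only.
utterance = ["k", "o", "t", "m", "a", "k", "o", "t", "a"]
triphones = ngram_counts(utterance, 3)

# Print the triphones from most to least frequent.
for triphone, count in triphones.most_common():
    print(" ".join(triphone), count)
```

Summing such counters over all utterances of a corpus would give corpus-level statistics; how the paper handles word boundaries is not stated here, so this sketch simply ignores them.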

    Unsupervised crosslingual adaptation of tokenisers for spoken language recognition

    Phone tokenisers are used in spoken language recognition (SLR) to obtain elementary phonetic information. We present a study on the use of deep neural network tokenisers. Unsupervised crosslingual adaptation was performed to adapt the baseline tokeniser trained on English conversational telephone speech data to different languages. Two training and adaptation approaches, namely cross-entropy adaptation and state-level minimum Bayes risk adaptation, were tested in a bottleneck i-vector and a phonotactic SLR system. The SLR systems using the tokenisers adapted to different languages were combined using score fusion, giving 7-18% reduction in minimum detection cost function (minDCF) compared with the baseline configurations without adapted tokenisers. Analysis of results showed that the ensemble tokenisers gave diverse representation of phonemes, thus bringing complementary effects when SLR systems with different tokenisers were combined. SLR performance was also shown to be related to the quality of the adapted tokenisers

    Evaluation of automatic transcription systems for the judicial domain

    This paper describes two different automatic transcription systems developed for judicial application domains for the Polish and Italian languages. The judicial domain requires coping with several factors that are known to be critical for automatic speech recognition, such as background noise, reverberation, spontaneous and accented speech, overlapped speech, and cross-channel effects. The two automatic speech recognition (ASR) systems were developed independently, starting from out-of-domain data, and were then adapted to the judicial domain using a certain amount of in-domain audio and text data. ASR performance was measured on audio data acquired in the courtrooms of Naples and Wroclaw. The resulting word error rates are around 40% for Italian and between 30% and 50% for Polish. This performance, similar to that reported for other comparable ASR tasks (e.g. meeting transcriptions with distant microphone), suggests that possible applications can address tasks such as indexing and/or information retrieval in multimedia documents recorded during judicial debates.
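
The word error rates quoted above follow the usual definition: the word-level edit distance (substitutions, deletions, insertions) between reference and hypothesis, divided by the number of reference words. A minimal sketch of that computation follows. The function name `word_error_rate` and the example sentences are hypothetical, and the paper's own scoring tool and text normalization are not specified here.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution and one deletion over five words.
wer = word_error_rate(
    "the witness entered the courtroom",
    "the witness enter courtroom",
)
print(f"WER: {wer:.0%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.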

    Dynamic Formant Trajectories in German Read Speech: Impact of Predictability and Prominence

    Phonetic structures expand temporally and spectrally when they are difficult to predict from their context. To some extent, effects of predictability are modulated by prosodic structure. So far, studies on the impact of contextual predictability and prosody on phonetic structures have neglected the dynamic nature of the speech signal. This study investigates the impact of predictability and prominence on the dynamic structure of the first and second formants of German vowels. We expect to find differences in the formant movements between vowels standing in different predictability contexts and a modulation of this effect by prominence. First and second formant values are extracted from a large German corpus. Formant trajectories of peripheral vowels are modeled using generalized additive mixed models, which estimate nonlinear regressions between a dependent variable and predictors. Contextual predictability is measured as biphone and triphone surprisal based on a statistical German language model. We test for the effects of the information-theoretic measures surprisal and word frequency, as well as prominence, on formant movement, while controlling for vowel phonemes and duration. Primary lexical stress and vowel phonemes are significant predictors of first and second formant trajectory shape. We replicate previous findings that vowels are more dispersed in stressed syllables than in unstressed syllables. The interaction of stress and surprisal explains formant movement: unstressed vowels show more variability in their formant trajectory shape at different surprisal levels than stressed vowels. This work shows that effects of contextual predictability on fine phonetic detail can be observed not only in pointwise measures but also in dynamic features of phonetic segments

    The RWTH Aachen German and English LVCSR systems for IWSLT-2013

    In this paper, German and English large vocabulary continuous speech recognition (LVCSR) systems developed by RWTH Aachen University for the IWSLT-2013 evaluation campaign are presented. Good improvements are obtained with state-of-the-art monolingual and multilingual bottleneck features. In addition, an open vocabulary approach using morphemic sub-lexical units is investigated along with language model adaptation for the German LVCSR system. For both languages, competitive WERs are achieved using system combination.

    Information density and phonetic structure: Explaining segmental variability

    There is growing evidence that information-theoretic principles influence linguistic structures. Regarding speech, several studies have found that phonetic structures lengthen in duration and strengthen in their spectral features when they are difficult to predict from their context, whereas easily predictable phonetic structures are shortened and reduced spectrally. Most of this evidence comes from studies on American English; only some studies have shown similar tendencies in Dutch, Finnish, or Russian. In this context, the Smooth Signal Redundancy hypothesis (Aylett and Turk 2004, Aylett and Turk 2006) emerged, claiming that the effect of information-theoretic factors on the segmental structure is moderated through the prosodic structure. In this thesis, we investigate the impact and interaction of information density and prosodic structure on segmental variability in production analyses, mainly based on German read speech, and also listeners' perception of differences in phonetic detail caused by predictability effects. Information density (ID) is defined as contextual predictability or surprisal (S(unit_i) = -log2 P(unit_i|context)) and estimated from language models based on large text corpora. In addition to surprisal, we include word frequency and prosodic factors, such as primary lexical stress, prosodic boundary, and articulation rate, as predictors of segmental variability in our statistical analysis. As acoustic-phonetic measures, we investigate segment duration and deletion, voice onset time (VOT), vowel dispersion, global spectral characteristics of vowels, dynamic formant measures, and voice quality metrics. Vowel dispersion is analyzed in the context of German learners' speech and in a cross-linguistic study. Our results replicate previous findings of reduced segment duration (and VOT), higher likelihood of deletion, and less vowel dispersion for easily predictable segments.
Easily predictable German vowels have less formant change in their vowel section length (VSL), F1 slope and velocity, are less curved in their F2, and show increased breathiness values in cepstral peak prominence (smoothed) than vowels that are difficult to predict from their context. Results for word frequency show similar tendencies: German segments in high-frequency words are shorter, more likely to be deleted, less dispersed, and show less magnitude in formant change, less F2 curvature, as well as less harmonic richness in open quotient smoothed than German segments in low-frequency words. These effects are found even though we control for the expected and much stronger effects of stress, boundary, and speech rate. In the cross-linguistic analysis of vowel dispersion, the effect of ID is robust across almost all of the six languages and the three intended speech rates. Surprisal does not affect vowel dispersion of non-native German speakers. Surprisal and prosodic factors interact in explaining segmental variability. In particular, stress and surprisal complement each other in their positive effect on segment duration, vowel dispersion, and magnitude in formant change. Regarding perception, we observe that listeners are sensitive to differences in phonetic detail stemming from high and low surprisal contexts for the same lexical target.
Information-theoretic factors influence the variability of spoken language. Phonetic structures are longer and show increased spectral distinctiveness when they are difficult to predict from their context than structures that are easy to predict. Most studies are based on data from American English. Only a few emphasize the need for more linguistic diversity. As a result of these findings, Aylett and Turk (2004, 2006) proposed the Smooth Signal Redundancy hypothesis, which states that the effect of predictability on phonetic structures is not implemented directly but only through the prosodic structure. This thesis investigates the influence and interaction of information density and prosodic structure on segmental variability in German, as well as listeners' ability to perceive differences in phonetic detail that arise from predictability. Information density (ID) is defined as contextual predictability or surprisal (S(unit_i) = -log2 P(unit_i|context)). In addition to surprisal, we use word frequency and prosodic factors, such as primary lexical stress, prosodic boundary, and speech rate, as variables in the statistical analysis. The acoustic-phonetic measures are segment duration and deletion, voice onset time (VOT), vowel dispersion, global and dynamic vowel characteristics, and voice quality. Vowel dispersion is examined not only in German but also in a cross-linguistic analysis and in the context of L2. We replicate for German previous results that were based on American English. Reduced segment duration and VOT, a higher likelihood of deletion, and lower vowel dispersion are also observed for easily predictable segments in German. These also show less formant movement, reduced F2 curvature, and increased breathiness values compared with vowels that are difficult to predict. The results for word frequency show similar tendencies: German segments in high-frequency words are shorter, more likely to be deleted, and show reduced values for vowel dispersion, formant movement, and periodicity compared with German segments in low-frequency words. Although we observe known effects of stress, boundary, and tempo on segmental variability in the models, the effects of ID are significant. The cross-linguistic analysis further shows that these effects are robust for most of the languages investigated and appear at all intended speech rates. Surprisal, however, has no influence on the vowel dispersion of language learners. Furthermore, we find interaction effects between surprisal and the prosodic factors. For lexical stress in particular, a stable positive interaction effect with surprisal can be established. In perception, listeners are able to detect differences between manipulated and non-manipulated stimuli when the manipulation consists only in the phonetic detail of the target word due to predictability.
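
The surprisal definition in this abstract, S(unit_i) = -log2 P(unit_i|context), can be illustrated with a minimal sketch that estimates biphone surprisal from raw counts. This assumes an unsmoothed maximum-likelihood estimate with the preceding phone as the only context. The thesis estimates surprisal from language models trained on large text corpora, so the function `biphone_surprisal` and the toy corpus below are hypothetical simplifications.

```python
import math
from collections import Counter


def biphone_surprisal(corpus):
    """Map each biphone (prev, unit) to -log2 P(unit | prev).

    P is an unsmoothed maximum-likelihood estimate from the corpus,
    which is a list of utterances, each a list of phone symbols.
    """
    context_counts = Counter()
    biphone_counts = Counter()
    for utterance in corpus:
        for prev, unit in zip(utterance, utterance[1:]):
            context_counts[prev] += 1
            biphone_counts[(prev, unit)] += 1
    return {
        biphone: -math.log2(count / context_counts[biphone[0]])
        for biphone, count in biphone_counts.items()
    }


# Toy corpus (hypothetical): "b" follows "a" three times, "c" once.
toy_corpus = [["a", "b"], ["a", "b"], ["a", "c"], ["a", "b"]]
surprisal = biphone_surprisal(toy_corpus)

# The rarer continuation carries more information (higher surprisal).
print(surprisal[("a", "b")], surprisal[("a", "c")])
```

With P(c|a) = 1/4 the surprisal of "c" after "a" is 2 bits, while the more predictable "b" (P = 3/4) has a surprisal below 1 bit. A real language model would add smoothing so that unseen biphones do not receive infinite surprisal.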

    Multilingual training of deep neural networks

    We investigate multilingual modeling in the context of a deep neural network (DNN) – hidden Markov model (HMM) hybrid, where the DNN outputs are used as the HMM state likelihoods. By viewing neural networks as a cascade of feature extractors followed by a logistic regression classifier, we hypothesise that the hidden layers, which act as feature extractors, will be transferable between languages. As a corollary, we propose that training the hidden layers on multiple languages makes them more suitable for such cross-lingual transfer. We experimentally confirm these hypotheses on the GlobalPhone corpus using seven languages from three different language families: Germanic, Romance, and Slavic. The experiments demonstrate substantial improvements over a monolingual DNN-HMM hybrid baseline, and hint at avenues of further exploration. Index Terms: speech recognition, deep learning, neural networks, multilingual modeling