    Pausing strategies with regard to speech style

    Speech is occasionally interrupted by silent and filled pauses of various lengths. Pauses serve many different functions in spontaneous speech (e.g. breathing, marking syntactic boundaries, signalling speech planning difficulties, and providing time for self-repair). The study had two aims: to analyze the interrelation between the temporal pattern and the syntactic position of silent pauses (SPs), and to analyze filled pauses (FPs) according to their phonetic realization, together with the combinations of SPs and FPs. The effect of speech style on pausing strategies was also analyzed. A narrative recording and a conversational recording from each of 10 speakers (aged between 20 and 35 years, 5 male, 5 female) were selected from the Hungarian Spontaneous Speech Database for the study. The material was manually annotated, silent pauses were categorized, and the durations of the pauses were then extracted. Results showed that the position of silent and filled pauses affects their duration. Speech style did not influence the frequency of pauses. However, silent and filled pauses were longer in narratives than in conversations. The results suggest that pausing strategies are similar in general; however, the timing patterns of pauses may depend on various factors, e.g. speech style.

    Kriminalisztikai alapú beszélőiprofil-alkotás (Forensic speaker profiling)

    Based on a speaker's voice, we are able not only to recognize familiar people but also to build a profile of unknown people, that is, to estimate general information such as sex, age, build, or the speaker's mood. Previous research has shown a strong relationship between vocal tract length and the speaker's physical characteristics, such as age, sex, and body height. Based on this relationship, we assume that the acoustic features of human speech encode cues to the speaker's physical build. In the present study, we examine the validity of this relationship using learning algorithms. We analyze how successfully the speaker's sex, age, body weight, and body mass can be estimated automatically from speech. To estimate the physical characteristics, we use acoustic features extracted from speech: prosodic, voice-quality-based, and spectral features. The results show that the estimation of sex, body mass, and body weight is highly accurate, whereas the estimation of age is less so.

    Effects of language mismatch in automatic forensic voice comparison using deep learning embeddings

    In forensic voice comparison, speaker embeddings have become widely popular over the last 10 years. Most pretrained speaker embedding models are trained on English corpora, because these are easily accessible. Thus, language dependency can be an important factor in automatic forensic voice comparison, especially when the target language is linguistically very different. Numerous commercial systems are available, but their models are mainly trained on a language (mostly English) other than the target language. In the case of a low-resource language, developing a corpus for forensic purposes that contains enough speakers to train deep learning models is costly. This study investigates whether a model pre-trained on an English corpus can be used on a low-resource target language (here, Hungarian) that differs from the language the model was trained on. In addition, multiple samples are often not available from the offender (unknown speaker). Therefore, samples are compared pairwise with and without speaker enrollment for suspect (known) speakers. Three corpora are used: two that were developed specifically for forensic purposes, and a third that is intended for traditional speaker verification. Two deep-learning-based speaker embedding extraction methods are used: the x-vector and ECAPA-TDNN. Speaker verification was evaluated in the likelihood-ratio (LR) framework. The language combinations used for modeling, LR calibration, and evaluation are compared. The results were evaluated with the minCllr and EER metrics. It was found that a model pre-trained on a different language, but on a corpus with a very large number of speakers, performs well on samples with language mismatch. The effects of sample duration and speaking style were also examined. It was found that the longer the sample in question, the better the performance. In addition, there is no substantial difference when various speaking styles are used.
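The two evaluation metrics named above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names `cllr` and `eer` are hypothetical, the scores are assumed to be natural-log likelihood ratios, and the cost computed here is the plain Cllr. The minCllr reported in the abstract additionally requires an optimal calibration step (typically pool-adjacent-violators), which is omitted.

```python
import math

def cllr(target_llrs, nontarget_llrs):
    """Log-likelihood-ratio cost (Cllr) from natural-log LRs.

    target_llrs: scores of same-speaker trials.
    nontarget_llrs: scores of different-speaker trials.
    """
    # Same-speaker trials are penalized by log2(1 + 1/LR).
    tar = sum(math.log2(1 + math.exp(-l)) for l in target_llrs) / len(target_llrs)
    # Different-speaker trials are penalized by log2(1 + LR).
    non = sum(math.log2(1 + math.exp(l)) for l in nontarget_llrs) / len(nontarget_llrs)
    return 0.5 * (tar + non)

def eer(target_scores, nontarget_scores):
    """Approximate equal error rate by scanning the observed scores as thresholds."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for t in thresholds:
        # False rejection: same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]
```

A system that always outputs LR = 1 (log LR = 0) carries no information and yields a Cllr of 1; lower values indicate better performance on both metrics.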

    Kisiskolás gyermekek spontán beszédének jellemzői (Characteristics of the spontaneous speech of early school-age children)

    Az f0-jellemzők felolvasásban és spontán beszédben (F0 characteristics in read and spontaneous speech)

    A large number of studies have investigated the differences in f0 characteristics between reading aloud (RA) and spontaneous speech (SpS) in various languages. Their basic assumption was that the different production strategies lead to differences in prosodic features. However, their results were not consistent as to which speech style was realized with a higher mean f0 or a larger f0 range. Hungarian data have only been analyzed for small numbers of speakers. Therefore, our goals are: (i) to provide a large-sample (82 subjects) comparison of the f0 characteristics of RA and SpS in Hungarian, and (ii) to analyze the individual differences behind the general tendencies. Mean f0 and pitch range (of the interpausal units) were higher in RA, whereas the f0 range was higher in SpS. Interspeaker differences played an important role in the mean f0 results: no speech-style-specific difference was found in women, whereas one was apparent in men.
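As an illustration of the two measures compared above, the following minimal sketch computes the mean f0 and the f0 range of a single interpausal unit. Several details are assumptions rather than the study's method: the function name `f0_stats` is hypothetical, unvoiced frames are assumed to be coded as 0 Hz, and the range is expressed in semitones, a unit the abstract does not specify.

```python
import math
from statistics import mean

def f0_stats(f0_hz):
    """Return (mean f0 in Hz, f0 range in semitones) for one interpausal unit.

    f0_hz: frame-level f0 values in Hz; 0 marks unvoiced frames
    (an assumed convention, not taken from the study).
    """
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        return None  # no voiced frames, so no f0 statistics
    # Semitone distance between the highest and lowest voiced f0 value.
    range_st = 12 * math.log2(max(voiced) / min(voiced))
    return mean(voiced), range_st
```

Expressing the range on a logarithmic scale makes male and female speakers more directly comparable than a range in Hz would.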