6 research outputs found

    Estimating Speaking Rate by Means of Rhythmicity Parameters

    In this paper we present a speech rate estimator based on so-called rhythmicity features derived from a modified version of the short-time energy envelope. To evaluate the new method, it is compared to a traditional speech rate estimator based on semi-automatic segmentation. Speech material from the Alcohol Language Corpus (ALC), covering intoxicated and sober speech in different speech styles, provides a statistically sound foundation for testing. The proposed measure clearly correlates with the semi-automatically determined speech rate and seems to be robust across speech styles and speaker states.

    Comparación de dos métodos basados en la intensidad para el cálculo automático de la velocidad de habla

    Automatic computation of speech rate is necessary in a wide range of applications that require this prosodic feature but for which manual transcriptions and time alignments are not available. Several tools have been developed to this end, but not enough research has been conducted to see to what extent they are applicable to languages other than those they were designed for. In the present work, we take two off-the-shelf tools designed for automatic speech rate computation and already tested for Dutch and English (v1, which relies on intensity peaks preceded by an intensity dip to find syllable nuclei, and v3, which relies on intensity peaks surrounded by dips) and apply them to read and spontaneous Spanish speech. We then test which of them performs best. Precision and normalized mean squared error show that v3 performs better than v1. However, recall is better for v1, which suggests that a more fine-grained analysis of sensitivity and specificity is needed to select the best option for a given application.
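
    The peak-picking idea described for v1 (an intensity peak counts as a syllable nucleus only if it is preceded by a dip) can be sketched as follows. This is a minimal illustration, not the code of either tool: the function names, the `min_dip_db` threshold, and the greedy matching used for precision and recall are all assumptions made for this example.

```python
def find_nuclei(intensity_db, min_dip_db=2.0):
    """Return indices of local intensity maxima preceded by a sufficient dip.

    intensity_db: intensity contour in dB, one value per frame.
    min_dip_db:   hypothetical threshold; a peak is accepted only if it rises
                  at least this far above the lowest value seen since the
                  previously accepted peak (or since the start).
    """
    nuclei = []
    lowest = intensity_db[0]  # lowest value since the last accepted peak
    for i in range(1, len(intensity_db) - 1):
        prev, cur, nxt = intensity_db[i - 1], intensity_db[i], intensity_db[i + 1]
        lowest = min(lowest, prev)
        is_peak = cur > prev and cur >= nxt
        if is_peak and cur - lowest >= min_dip_db:
            nuclei.append(i)
            lowest = cur  # the next peak needs its own preceding dip
    return nuclei


def speech_rate(n_nuclei, duration_s):
    """Detected syllable nuclei per second."""
    return n_nuclei / duration_s


def precision_recall(detected, reference, tol):
    """Greedy one-to-one matching of detected to reference positions within tol."""
    unmatched = list(reference)
    hits = 0
    for d in detected:
        match = next((r for r in unmatched if abs(r - d) <= tol), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall
```

    On a toy contour, raising `min_dip_db` drops the weaker peaks. That tends to raise precision and lower recall, which is the kind of trade-off the abstract points to when it says the best tool depends on the application.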

    A Study of Accommodation of Prosodic and Temporal Features in Spoken Dialogues in View of Speech Technology Applications

    Inter-speaker accommodation is a well-known property of human speech and human interaction in general. Broadly, it refers to the behavioural patterns of two (or more) interactants and the effect of the (verbal and non-verbal) behaviour of each on that of the other(s). Implementing this behaviour in spoken dialogue systems is desirable as an improvement in the naturalness of human-machine interaction. However, traditional qualitative descriptions of accommodation phenomena do not provide sufficient information for such an implementation. Therefore, a quantitative description of inter-speaker accommodation is required. This thesis proposes a methodology for monitoring accommodation during a human or human-computer dialogue, which applies a moving average filter over sequential frames for each speaker. These frames are time-aligned across the speakers, hence the name Time Aligned Moving Average (TAMA). Analysis of spontaneous human dialogue recordings by means of the TAMA methodology reveals ubiquitous accommodation of prosodic features (pitch, intensity and speech rate) across interlocutors, and allows for statistical (time series) modelling of the behaviour in a way that is meaningful for implementation in spoken dialogue system (SDS) environments. In addition, a novel dialogue representation is proposed that complements TAMA in monitoring accommodation of temporal features (inter-speaker pause length and overlap frequency). This representation is a percentage turn distribution of individual speaker contributions in a dialogue frame, which circumvents strict attribution of speaker turns by considering both interlocutors as synchronously active. Both TAMA and turn distribution metrics indicate that correlation of average pause length and overlap frequency between speakers can be attributed to accommodation (a debated issue), and point to possible improvements in SDS “turn-taking” behaviour.
Although the findings of the prosodic and temporal analyses can directly inform SDS implementations, further work is required to describe inter-speaker accommodation sufficiently, as well as to develop an adequate testing platform for evaluating the magnitude of perceived improvement in human-machine interaction. Therefore, this thesis constitutes a first step towards a convincingly useful implementation of accommodation in spoken dialogue systems.
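
    The core of the TAMA idea, as the abstract describes it, is to average a feature per speaker over frames that share one time grid. A minimal sketch under stated assumptions follows: the function name `tama`, the frame length, the step size, and the handling of empty frames are illustrative choices for this example, not values or behaviour taken from the thesis.

```python
def tama(samples, frame_len, step, total_dur):
    """Average a feature over overlapping frames on a fixed time grid.

    samples:   list of (time_s, value) pairs for ONE speaker, e.g. per-utterance
               mean pitch stamped with the utterance time.
    frame_len: frame length in seconds (hypothetical value chosen by the caller).
    step:      frame shift in seconds; step < frame_len gives overlapping frames.
    total_dur: dialogue duration in seconds.

    Returns one mean per frame, or None for frames in which the speaker
    produced no samples.
    """
    n_frames = int((total_dur - frame_len) // step) + 1
    means = []
    for k in range(n_frames):
        start = k * step
        end = start + frame_len
        vals = [v for t, v in samples if start <= t < end]
        means.append(sum(vals) / len(vals) if vals else None)
    return means


# Both speakers are framed with the same frame_len, step and total_dur, so the
# k-th value of each series covers the same stretch of the dialogue.
speaker_a = [(1, 100), (6, 110), (12, 120), (18, 130)]
speaker_b = [(2, 200), (8, 210), (11, 220), (17, 240)]
series_a = tama(speaker_a, frame_len=10, step=5, total_dur=20)
series_b = tama(speaker_b, frame_len=10, step=5, total_dur=20)
```

    Because the two series are index-aligned, they can be passed to ordinary time-series tools (for instance a correlation over frames) to look for accommodation.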

    Untersuchungen der rhythmischen Struktur von Sprache unter Alkoholeinfluss

    This thesis is concerned with the rhythmical structure of speech under the influence of alcohol. All analyses presented are based on the Alcohol Language Corpus, which is a collection of speech uttered by 77 female and 85 male sober and intoxicated speakers. Experimental research was carried out to find robust, automatically extractable features of the speech signal that indicate speaker intoxication. These features included rhythm measures, which reflect the durational variability of vocalic and consonantal elements and are normally used to classify languages into different rhythm classes. The durational variability was found to be greater in the speech of intoxicated individuals than in the speech of sober individuals, which suggests that the speech of intoxicated speakers is more irregular than that of sober speakers. Another set of features describes the dynamics of the short-time energy function of speech. To this end, different measures are derived from a sequence of energy minima and maxima. The results also reveal a greater irregularity in the speech of intoxicated individuals. A separate investigation of speaking rate included two different measures. One is based on the phonetic segmentation and is an estimate of the number of syllables per second. The other is the mean duration of the time intervals between successive maxima of the short-time energy function of speech. Both measures indicate a decreased speaking rate in the speech of intoxicated speakers compared to speech uttered in sober condition. The results of a perception experiment show that a decrease in speaking rate is also an indicator of intoxication in the perception of speech. The last experiment investigates rhythmical features based on the fundamental frequency and energy contours of speech signals. Contours are compared directly with different distance measures (root mean square error, statistical correlation and the Euclidean distance in the spectral space of the contours).
They are also compared by parameterization of the contours using the Discrete Cosine Transform and the first and second moments of the lower DCT spectrum. A Principal Components Analysis on the contour data was also carried out to find fundamental contour forms regarding the speech of intoxicated and sober individuals. Concerning the distance measures, contours of speech signals uttered by intoxicated speakers differ significantly from contours of speech signals uttered in sober condition. Parameterization of the contours showed that fundamental frequency contours of speech signals uttered by intoxicated speakers consist of faster movements, and energy contours of slower movements, than the respective contours of speech signals uttered in sober condition. Principal Components Analysis did not find any interpretable fundamental contour forms that could help distinguish contours of speech signals of intoxicated speakers from those of speech uttered in sober condition. All analyses show that the effects of alcoholic intoxication on different features of speech cannot be generalized but are to a great extent speaker-dependent.
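
    The second speaking-rate measure in the abstract, the mean interval between successive maxima of the short-time energy function, can be sketched as below. This is only an illustration: the frame length, hop size, the plain local-maximum rule, and the absence of any smoothing are assumptions of this example, and the thesis may process the energy function differently.

```python
def short_time_energy(signal, frame_len, hop):
    """Sum of squared samples in each frame of frame_len samples, shifted by hop."""
    return [
        sum(x * x for x in signal[i:i + frame_len])
        for i in range(0, len(signal) - frame_len + 1, hop)
    ]


def local_maxima(values):
    """Indices of values strictly above the left neighbour and not below the right."""
    return [
        i for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] >= values[i + 1]
    ]


def mean_peak_interval(energy, hop_s):
    """Mean time in seconds between successive energy maxima.

    hop_s is the frame shift in seconds. Returns None when fewer than two
    maxima are found. A longer mean interval corresponds to a slower
    speaking rate under this measure.
    """
    peaks = local_maxima(energy)
    if len(peaks) < 2:
        return None
    gaps = [(b - a) * hop_s for a, b in zip(peaks, peaks[1:])]
    return sum(gaps) / len(gaps)
```

    In practice the energy function would normally be smoothed before peak picking so that small ripples are not counted as maxima. That step is left out here to keep the sketch short.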

    Foreigner-directed speech and L2 speech learning in an understudied interactional setting: the case of foreign-domestic helpers in Oman

    Ph.D. (Integrated) Thesis. Set in Arabic-speaking Oman, the present study investigates whether speech directed to foreign domestic helpers (FDH-directed speech) is modified when compared with speech addressed to native Arabic speakers. It also explores the FDHs’ ability to learn the sound system of their L2 in a near-naturalistic setting. In relation to input, the study explores whether there are any adaptations in native speakers’ realizations of complex Arabic consonants, consonant clusters, and vowels in FDH-directed speech. In doing so, it compares the phonetic features of FDH-directed speech with those of other speech registers such as foreigner-directed speech (FDS), infant-directed speech (IDS) and clear speech. The study also investigates whether foreign accentedness, religion and Arabic language experience, as indexed by length of residence (LoR), play a role in the extent of adaptations present in FDH-directed speech. In relation to L2 speech learning, the study investigates the extent to which FDHs are sensitive to the phonemic contrasts of Arabic and whether their production of complex Arabic consonants and consonant clusters is target-like. It also examines the social and linguistic factors (LoR, first and second language literacy) that play a role in the learnability of these sounds. Speech recordings were collected from 22 Omani female native Arabic speakers who interacted 1) with their FDHs and 2) with a native-speaking adult (the order was reversed for half of the participants), in both instances using a spot-the-difference task. A picture naming task was then used to collect production data from the same FDHs, while perception data were collected with an AX forced-choice task. Results demonstrate the distinctiveness of FDH-directed speech from other speech registers.
Neither simplification of complex sounds nor hyperarticulation of consonant contrasts was attested in FDH-directed speech, despite both being reported in other studies on FDS and IDS. We attribute this to the familiarity of the native speakers with their FDHs and the formulaic nature of their daily interactions. Expansion of the vowel space was evident in this study, in line with other FDS studies. Results from perception and production tasks revealed that FDHs fell short of native-like performance, despite the more naturalistic setting and regardless of LoR. L1 and L2 literacy played varying roles in FDHs’ phonological sensitivity and production of certain contrasts. The study is original in terms of showing that FDS is not an automatic outcome of interactions with L2 speakers, and it links these results with the unusual social setting.