
    Spectral Characteristics of Schwa in Czech Accented English

    The English central mid lax vowel (i.e., schwa) often contributes considerably to the sound differences between native and non-native speech. Many foreign speakers of English fail to reduce certain underlying vowels to schwa, which, on the suprasegmental level of description, affects the perceived rhythm of their speech. However, the problem of capturing quantitatively the differences between native and non-native schwa poses difficulties that, to this day, have been tackled only partially. We offer a technique of measurement in the acoustic domain that has not been probed properly as yet: the distribution of acoustic energy in the vowel spectrum. Our results show that spectral slope features measured in weak vowels discriminate between Czech and British speakers of English quite reliably. Moreover, the measurements of formant bandwidths turned out to be useful for the same task, albeit less directly.
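    As one illustration of the kind of measurement discussed above, the sketch below estimates a spectral-slope (tilt) feature for a vowel segment by fitting a regression line to its log-magnitude spectrum. This is an assumed, simplified example rather than the authors' procedure; the function name, the analysis band, and the inputs `vowel_segment` and `sr` are illustrative placeholders.

```python
# A minimal sketch of one possible spectral-slope measure for a vowel segment.
# Not the authors' exact procedure; inputs and band limits are placeholders.
import numpy as np

def spectral_slope(vowel_segment: np.ndarray, sr: int) -> float:
    """Slope (dB per kHz) of a regression line fit to the log-magnitude spectrum."""
    windowed = vowel_segment * np.hanning(len(vowel_segment))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(windowed), d=1.0 / sr)
    band = (freqs > 50) & (freqs < 8000)          # speech-relevant band
    log_mag = 20.0 * np.log10(spectrum[band] + 1e-12)  # avoid log(0)
    slope, _intercept = np.polyfit(freqs[band] / 1000.0, log_mag, deg=1)
    return slope
```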

    Speaker-independent emotion recognition exploiting a psychologically-inspired binary cascade classification schema

    In this paper, a psychologically-inspired binary cascade classification schema is proposed for speech emotion recognition. Performance is enhanced because commonly confused pairs of emotions are explicitly distinguished from one another. Extracted features are related to statistics of pitch, formants, and energy contours, as well as spectrum, cepstrum, perceptual and temporal features, autocorrelation, MPEG-7 descriptors, Fujisaki's model parameters, voice quality, jitter, and shimmer. Selected features are fed as input to a k-nearest neighbor classifier and to support vector machines. Two kernels are tested for the latter: linear and Gaussian radial basis function. The recently proposed speaker-independent experimental protocol is tested on the Berlin emotional speech database for each gender separately. The best emotion recognition accuracy, achieved by support vector machines with a linear kernel, equals 87.7%, outperforming state-of-the-art approaches. Statistical analysis is first carried out with respect to the classifiers' error rates and then to evaluate the information expressed by the classifiers' confusion matrices. © Springer Science+Business Media, LLC 2011
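    A hedged sketch of the two classifier families named above (k-nearest neighbors, and SVMs with linear and RBF kernels) under a speaker-independent protocol follows. It assumes precomputed utterance-level features `X`, emotion labels `y`, and speaker identifiers `speaker_ids`; these names, and the use of scikit-learn, are illustrative stand-ins rather than the authors' implementation.

```python
# Illustrative speaker-independent evaluation of KNN and SVM classifiers.
# X, y, speaker_ids are assumed to be precomputed; not the paper's pipeline.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

def evaluate(X: np.ndarray, y: np.ndarray, speaker_ids: np.ndarray) -> dict:
    logo = LeaveOneGroupOut()  # each fold holds out one speaker entirely
    models = {
        "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
        "svm_linear": make_pipeline(StandardScaler(), SVC(kernel="linear")),
        "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    }
    return {
        name: cross_val_score(model, X, y, cv=logo, groups=speaker_ids).mean()
        for name, model in models.items()
    }
```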

    Acoustic-phonetic realisation of Polish syllable prominence: a corpus study.

    Malisz Z, Wagner P. Acoustic-phonetic realisation of Polish syllable prominence: a corpus study. In: Gibbon D, Hirst D, Campbell N, eds. Rhythm, melody and harmony in speech. Studies in honour of Wiktor Jassem. Speech and Language Technology. Vol 14/15. Poznań, Poland; 2012: 105–114.

    I hear you eat and speak: automatic recognition of eating condition and food type, use-cases, and impact on ASR performance

    We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i.e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech; the database is made publicly available for research purposes. We start by demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification based both on brute-forced low-level acoustic features and on higher-level features related to intelligibility, obtained from an automatic speech recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier employed in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of the eating condition (i.e., eating or not eating) can be solved easily, independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, reaching up to 62.3% average recall for multi-way classification of the eating condition, i.e., discriminating the six types of food as well as not eating. The early fusion of intelligibility-related features with the brute-forced acoustic feature set improves the performance on read speech, reaching 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with a determination coefficient of up to 56.2%.
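    A minimal sketch of a leave-one-speaker-out SVM evaluation of the kind described above, scored with unweighted average recall (the recall measure the abstract reports), is shown below. It assumes precomputed features `X`, eating-condition labels `y`, and speaker identifiers `speakers`; the scikit-learn pipeline is an illustrative stand-in, not the iHEARu-EAT reference implementation.

```python
# Leave-one-speaker-out SVM evaluation reporting unweighted average recall (UAR).
# X, y, speakers are assumed precomputed; this is a sketch, not the paper's code.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.metrics import recall_score

def loso_uar(X: np.ndarray, y: np.ndarray, speakers: np.ndarray) -> float:
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    y_pred = cross_val_predict(clf, X, y, cv=LeaveOneGroupOut(), groups=speakers)
    # UAR = macro-averaged recall over classes, robust to class imbalance.
    return recall_score(y, y_pred, average="macro")
```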

    Acoustic characterization and perceptual analysis of the relative importance of prosody in speech of people with Down syndrome

    There are many studies that identify important deficits in the voice production of people with Down syndrome. These deficits affect not only the spectral domain, but also intonation, accent, rhythm and speech rate. The main aim of this work is the identification of the acoustic features that characterize the speech of people with Down syndrome, taking into account the different frequency, energy, temporal and spectral domains. Comparing the relative weight of these features for characterizing the speech of people with Down syndrome is another aim of this study. The openSMILE toolkit with the GeMAPS feature set was used to extract acoustic features from a speech corpus of utterances from typically developing individuals and individuals with Down syndrome. Then, the most discriminant features were identified using statistical tests. Moreover, three binary classifiers were trained using these features. The best classification rate using only spectral features is 87.33%, and using frequency, energy and temporal features it is 91.83%. Finally, a perception test was performed using recordings created with a prosody transfer algorithm: the prosody of utterances from one group of speakers was transferred to utterances of another group. The results of this test show the importance of intonation and rhythm in the identification of a voice as non-typical. In conclusion, the results point to the training of prosody as a way to improve the quality of the speech production of those with Down syndrome.
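    One possible way to reproduce the feature-extraction and feature-ranking steps described above is sketched below using the `opensmile` Python wrapper (GeMAPS functionals) and a simple Mann-Whitney U test; the authors' exact toolchain and statistical tests may differ, and `wav_paths` and `labels` are illustrative placeholders.

```python
# Sketch: GeMAPS functionals per utterance via the opensmile Python package,
# then rank features by a two-sample statistical test. Assumed stand-in only.
import opensmile
import pandas as pd
from scipy.stats import mannwhitneyu

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.GeMAPSv01b,        # GeMAPS acoustic features
    feature_level=opensmile.FeatureLevel.Functionals,    # one vector per utterance
)

def rank_features(wav_paths: list, labels: list) -> pd.Series:
    feats = pd.concat([smile.process_file(p) for p in wav_paths], ignore_index=True)
    y = pd.Series(labels)
    pvals = {
        col: mannwhitneyu(feats.loc[y == 0, col], feats.loc[y == 1, col]).pvalue
        for col in feats.columns
    }
    return pd.Series(pvals).sort_values()  # smallest p-value = most discriminant
```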

    Daily Stress Recognition from Mobile Phone Data, Weather Conditions and Individual Traits

    Research has shown that stress reduces quality of life and causes many diseases. For this reason, several researchers have devised stress detection systems based on physiological parameters. However, these systems require obtrusive sensors to be carried by the user continuously. In this paper, we propose an alternative approach, providing evidence that daily stress can be reliably recognized from behavioral metrics derived from the user's mobile phone activity and from additional indicators such as weather conditions (data pertaining to transitory properties of the environment) and personality traits (data concerning permanent dispositions of individuals). Our multifactorial statistical model, which is person-independent, achieves an accuracy of 72.28% on a 2-class daily stress recognition problem. The model is efficient to implement in most multimedia applications owing to its highly reduced, low-dimensional feature space (32 dimensions). Moreover, we identify and discuss the indicators that have strong predictive power. (Comment: ACM Multimedia 2014, November 3-7, 2014, Orlando, Florida, US)
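    A speculative sketch of the general setup, not the authors' model, follows: three illustrative feature groups (mobile-phone activity, weather, personality traits) are fused into one low-dimensional vector and a person-independent binary classifier is evaluated. The classifier choice and all variable names are assumptions.

```python
# Sketch: early fusion of behavioral, weather, and trait features, evaluated
# person-independently for 2-class stress recognition. Illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

def daily_stress_accuracy(phone: np.ndarray, weather: np.ndarray,
                          traits: np.ndarray, stressed: np.ndarray,
                          person_ids: np.ndarray) -> float:
    X = np.hstack([phone, weather, traits])   # e.g. a ~32-dimensional fused vector
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    scores = cross_val_score(clf, X, stressed, cv=LeaveOneGroupOut(),
                             groups=person_ids)
    return scores.mean()
```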

    Perception of English /i/ and /I/ by Japanese and Spanish Listeners: Longitudinal Results

    Flege's Speech Learning Model predicts that if an L2 learner perceives an L2 speech sound as similar to an L1 speech sound, the two sounds will be combined as a diaphone category, the properties of which will eventually be intermediate between the properties of the L1 and L2 sounds. In contrast, if the L2 sound is perceived as new, then a new category will be established, with properties that may eventually match those of the L2 sound. Canadian English has two high front vowels, tense /i/ and lax /I/, differing in spectral and durational properties. Japanese has two high front vowels, long /i:/ and short /i/, differing in duration only. English /i/ and /I/ are expected to be perceived as similar to Japanese /i:/ and /i/, and Japanese learners of English are therefore predicted to establish diaphone categories. Their identification of English /i/ and /I/ is predicted to initially match their perception of Japanese /i:/ and /i/, but eventually to be intermediate between the native norms for the L1 and L2 categories. Spanish has one high front vowel. Spanish learners of English are predicted to perceive English /I/ as less similar to Spanish /i/ than English /i/ is, and are predicted to eventually establish a new /i:/ category. Their identification of English /i/ and /I/ is predicted to be poor initially but eventually to match that of English listeners. These predictions were tested using a multidimensional edited-speech continuum covering the English words /bIt bit bId bid/. Properties that varied in the continuum included vowel spectral properties and vowel duration. A longitudinal study was conducted testing Japanese- and Spanish-speaking learners of English one month and six months after their arrival in Canada. Japanese listeners were found to have a primarily duration-based categorical boundary between English /i/ and /I/ which did not change between the initial and final tests. Spanish listeners did not show a categorical identification pattern in the initial test, but they did establish duration-based or spectrally based categorical boundaries by the time of the final test. The results were therefore consistent with the theoretical predictions.
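    For readers unfamiliar with how a duration-based category boundary can be quantified, the following illustrative sketch (not taken from the study) fits a logistic psychometric function to the proportion of /i/ responses at each duration step of a continuum and reads off the 50% crossover point. The data arrays are hypothetical.

```python
# Sketch: estimate a duration-based /i/-/I/ category boundary by fitting a
# logistic psychometric function to identification data. Illustrative only.
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, x0, k):
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

def duration_boundary(durations_ms: np.ndarray, prop_i_responses: np.ndarray) -> float:
    (x0, _k), _cov = curve_fit(logistic, durations_ms, prop_i_responses,
                               p0=[durations_ms.mean(), 0.1])
    return x0  # duration (ms) at which /i/ and /I/ responses are equally likely
```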

    Tone classification of syllable -segmented Thai speech based on multilayer perceptron

    Thai is a monosyllabic and tonal language. Thai makes use of tone to convey lexical information about the meaning of a syllable. Thai has five distinctive tones, and each tone is well represented by a single F0 contour pattern. In general, a Thai syllable with a different tone has a different lexical meaning. Thus, to completely recognize a spoken Thai syllable, a speech recognition system has to not only recognize the base syllable but also correctly identify the tone. Hence, tone classification of Thai speech is an essential part of a Thai speech recognition system. In this study, a tone classifier for syllable-segmented Thai speech which incorporates the effects of tonal coarticulation, stress and intonation was developed. Automatic syllable segmentation, which segments the training and test utterances into syllable units, was also developed. Acoustic features including fundamental frequency (F0), duration, and energy, extracted from the syllable being processed and from neighboring syllables, were used as the main discriminating features. A multilayer perceptron (MLP) trained by backpropagation was employed to classify these features. The proposed system was evaluated on 920 test utterances spoken by five male and three female Thai speakers who also produced the training speech. The proposed system achieved an average accuracy of 91.36%.
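    A minimal sketch of the classification step described above is given below, assuming precomputed syllable-level features (F0 contour samples, duration, and energy, plus context from neighboring syllables). The multilayer perceptron is trained by backpropagation, here via scikit-learn as an assumed stand-in for the original implementation; all names are placeholders.

```python
# Sketch: MLP tone classifier over precomputed syllable-level acoustic features.
# Not the original system; X_train / tones_train are assumed inputs.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def train_tone_classifier(X_train: np.ndarray, tones_train: np.ndarray):
    """tones_train holds the five Thai tone labels (mid, low, falling, high, rising)."""
    mlp = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000, random_state=0),
    )
    return mlp.fit(X_train, tones_train)

# Example usage (hypothetical data):
# predicted_tones = train_tone_classifier(X_train, tones_train).predict(X_test)
```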