    On the development of an automatic voice pleasantness classification and intensity estimation system

    In the last few years, the number of systems and devices that use voice-based interaction has grown significantly. For continued use of these systems, the interface must be reliable and pleasant in order to provide an optimal user experience. However, there are currently very few studies that try to evaluate how pleasant a voice is from a perceptual point of view when the final application is a speech-based interface. In this paper we present an objective definition for voice pleasantness based on the composition of a representative feature subset and a new automatic voice pleasantness classification and intensity estimation system. Our study is based on a database composed of European Portuguese female voices, but the methodology can be extended to male voices or to other languages. In the objective performance evaluation the system achieved a 9.1% error rate for voice pleasantness classification and a 15.7% error rate for voice pleasantness intensity estimation. Work partially supported by ERDF funds, the Spanish Government (TEC2009-14094-C04-04), and Xunta de Galicia (CN2011/019, 2009/062)

    Text-Independent, Open-Set Speaker Recognition

    Speaker recognition, like other biometric personal identification techniques, depends upon a person's intrinsic characteristics. A realistically viable system must be capable of dealing with the open-set task. This effort attacks the open-set task, identifying the best features to use, and proposes the use of a fuzzy classifier followed by hypothesis testing as a model for text-independent, open-set speaker recognition. Using the TIMIT corpus and Rome Laboratory's GREENFLAG tactical communications corpus, this thesis demonstrates that the proposed system succeeded in open-set speaker recognition. Considering the fact that extremely short utterances were used to train the system (compared to other closed-set speaker identification work), this system attained reasonable open-set classification error rates as low as 23% for TIMIT and 26% for GREENFLAG. Feature analysis identified the filtered linear prediction cepstral coefficients with or without the normalized log energy or pitch appended as a robust feature set (based on the 17 feature sets considered), well suited for clean speech and speech degraded by tactical communications channels.

    Automated assessment of second language comprehensibility: Review, training, validation, and generalization studies

    Whereas many scholars have emphasized the relative importance of comprehensibility as an ecologically valid goal for L2 speech training, testing, and development, eliciting listeners’ judgments is time-consuming. Following calls for research on more efficient L2 speech rating methods in applied linguistics, and growing attention toward using machine learning on spontaneous unscripted speech in speech engineering, the current study examined the possibility of establishing quick and reliable automated comprehensibility assessments. Orchestrating a set of phonological (maximum posterior probabilities and gaps between L1 and L2 speech), prosodic (pitch and intensity variation), and temporal measures (articulation rate, pause frequency), the regression model significantly predicted how naïve listeners intuitively judged low, mid, high, and nativelike comprehensibility among 100 L1 and L2 speakers’ picture descriptions. The strength of the correlation (r = .823 for machine vs. human ratings) was comparable to naïve listeners’ interrater agreement (r = .760 for humans vs. humans). The findings were successfully replicated when the model was applied to a new dataset of 45 L1 and L2 speakers (r = .827) and tested under a more freely constructed interview task condition (r = .809)

    The Effect of Shadowing in Learning L2 Segments: A Perspective from Phonetic Convergence

    This study aimed to investigate the role that phonetic convergence plays in the acquisition of L2 segments. In particular, it examined whether phonetic convergence towards native speakers could help Arabic-speaking second-language (L2) learners of English improve their pronunciation of four problematic English segments (/p, v, ɛ, oʊ/). To do so, the study proceeded through several experimental phases. Phonetic convergence was first explored in the productions of Arabic L2 learners towards five different English native model talkers in a non-interactive setting. Five XAB perceptual similarity judgments and acoustic measurements of VOT, vowel duration, F0, and F1*F2 were used to evaluate phonetic convergence. Based mainly on perceptual measures of phonetic convergence, learners were divided evenly between two groups. The C-group (convergence group) received phonetic production training from the model talkers to whom they showed the highest degree of phonetic convergence, while the D-group (divergence group) received training from the model talkers they showed divergence from or the least convergence to. Training lasted three consecutive days with target segments (i.e., /p, v, ɛ, oʊ/) presented in nonsense words. Learners were trained using the shadowing technique with a low-variability training paradigm in which each learner received training from one native model talker. Native-speaker judgments on segmental intelligibility indicated that both groups showed significant improvement on the post-test; however, no significant differences were found between groups in terms of the overall magnitude of this change. Perceived convergence in learners’ speech failed to explain the improvement. However, some patterns of acoustic convergence towards their trainers, regardless of group, predicted the overall segmental intelligibility gains. The findings suggested that the more trainees converged their vowel duration and formants to their trainers, the more their performance improved.
At the featural level, the study examined the relationship between the preexisting phonetic distance between the Arabic L2 learners of English and the model talkers before exposure and the degree of convergence. Results indicated that there was a direct relationship between how far Arabic L2 learners were from the native model talkers and the degree of convergence in all measured acoustic features. That is, the greater the baseline distance, the greater the degree of phonetic convergence. However, such a relationship might be due to the metric used to assess phonetic convergence. The relationship between phonetic convergence measured by difference in distance (DID) and the absolute baseline distance is always biased due to the way they are calculated (Cohen Priva & Sanker, 2019; MacLeod, 2021). This study found shadowing to be an effective technique to promote segmental intelligibility among Arabic speakers learning English as an L2. However, this effectiveness might be increased by trainees converging more to their trainers in vowel duration and vowel spectra or being similar to their trainers in this regard from the beginning.
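
The bias mentioned above can be illustrated with a small simulation. This is only a sketch: it assumes the common definition of DID as the baseline distance to the model talker minus the shadowed distance, and it uses synthetic numbers, not data from the study. The function names, noise levels, and sample size are all hypothetical choices. Each simulated speaker has the same true value at baseline and after shadowing, so no real convergence takes place, yet DID still correlates positively with baseline distance because both share the same noisy baseline measurement.

```python
import random

def did(baseline, shadowed, model):
    """Difference in distance: positive values read as convergence to the model."""
    return abs(baseline - model) - abs(shadowed - model)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(42)   # fixed seed so the run is repeatable
MODEL = 0.0               # model talker's value on an arbitrary acoustic scale

baseline_distances = []
did_scores = []
for _ in range(5000):
    true_value = rng.gauss(0.0, 1.0)             # speaker's stable true value
    baseline = true_value + rng.gauss(0.0, 1.0)  # noisy baseline measurement
    shadowed = true_value + rng.gauss(0.0, 1.0)  # noisy post-exposure measurement
    baseline_distances.append(abs(baseline - MODEL))
    did_scores.append(did(baseline, shadowed, MODEL))

# No speaker actually moved toward the model, but the correlation is positive.
bias_correlation = pearson(baseline_distances, did_scores)
print(f"correlation between baseline distance and DID: {bias_correlation:.2f}")
```

Under these assumptions, a positive correlation between baseline distance and DID cannot by itself be taken as evidence that more distant learners converge more.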

    Rank-frequency relation for Chinese characters

    We show that Zipf's law for Chinese characters holds perfectly for sufficiently short texts (a few thousand different characters). The scenario of its validity is similar to that of Zipf's law for words in short English texts. For long Chinese texts (or for mixtures of short Chinese texts), rank-frequency relations for Chinese characters display a two-layer, hierarchic structure that combines a Zipfian power-law regime for frequent characters (first layer) with an exponential-like regime for less frequent characters (second layer). For these two layers we provide different (though related) theoretical descriptions that include the range of low-frequency characters (hapax legomena). The comparative analysis of rank-frequency relations for Chinese characters versus English words illustrates the extent to which the characters play for Chinese writers the same role as the words for those writing within alphabetical systems. Comment: To appear in European Physical Journal B (EPJ B), 2014 (22 pages, 7 figures)
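
A rank-frequency relation of the kind described above can be computed in a few lines. The sketch below is not the authors' procedure: it simply counts characters, sorts the counts by rank, and estimates a power-law exponent with an ordinary least-squares fit in log-log space. The function names and the toy text are hypothetical, and the toy counts are constructed to follow frequency = 120 / rank so that the fitted exponent is close to 1.

```python
import math
from collections import Counter

def rank_frequency(text):
    """Return character counts sorted from most to least frequent."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return sorted(counts.values(), reverse=True)

def zipf_exponent(frequencies):
    """Negated least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx

# Toy text (an assumption for illustration): counts are exactly 120 / rank.
toy_text = "的" * 120 + "一" * 60 + "是" * 40 + "不" * 30 + "了" * 24
freqs = rank_frequency(toy_text)
exponent = zipf_exponent(freqs)
print(freqs, round(exponent, 3))
```

On a real long text, a single straight-line fit like this would be expected to describe only the frequent characters; the abstract reports an exponential-like regime for the less frequent ones, which this sketch does not model.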

    Multinomial logistic regression probability ratio-based feature vectors for Malay vowel recognition

    Vowel recognition is a part of automatic speech recognition (ASR) systems that classifies speech signals into groups of vowels. The performance of Malay vowel recognition (MVR), like any multiclass classification problem, depends largely on Feature Vectors (FVs). FVs such as Mel-frequency Cepstral Coefficients (MFCC) have produced high error rates due to poor phoneme information. Classifier-transformed probabilistic features have proved a better alternative in conveying phoneme information. However, the high dimensionality of the probabilistic features introduces additional complexity that deteriorates ASR performance. This study aims to improve MVR performance by proposing an algorithm that transforms MFCC FVs into a new set of features using Multinomial Logistic Regression (MLR) to reduce the dimensionality of the probabilistic features. This study was carried out in four phases: pre-processing and feature extraction, best regression coefficients generation, feature transformation, and performance evaluation. The speech corpus consists of 1953 samples of the five Malay vowels /a/, /e/, /i/, /o/ and /u/ recorded from students of two public universities in Malaysia. Two algorithms were developed: DBRCs and FELT. The DBRCs algorithm (determining the best regression coefficients) obtains the best set of regression coefficients (RCs) from the extracted 39-MFCC FVs through a resampling and data swapping approach. The FELT algorithm transforms 39-MFCC FVs into FELT FVs using a logistic transformation method. Vowel recognition rates of FELT and 39-MFCC FVs were compared using four different classification techniques: Artificial Neural Network, MLR, Linear Discriminant Analysis, and k-Nearest Neighbour. Classification results showed that FELT FVs surpass the performance of 39-MFCC FVs in MVR. Depending on the classifier used, FELT improved on MFCC by 1.48% to 11.70%.
Furthermore, FELT significantly improved the recognition accuracy of vowels /o/ and /u/ by 5.13% and 8.04% respectively. This study contributes two algorithms for determining the best set of RCs and generating FELT FVs from MFCC. The FELT FVs eliminate the need for dimensionality reduction with comparable performances. Furthermore, FELT FVs improved MVR for all five vowels, especially /o/ and /u/. The improved MVR performance will spur the development of Malay speech-based systems, especially for the Malaysian community.
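
The general idea of turning an MFCC vector into a short vector of class probabilities with a multinomial logistic model can be sketched as follows. This is not the FELT or DBRCs algorithm, whose details the abstract does not give: it is a generic softmax transformation, and the coefficients and input vector are random placeholders rather than trained or measured values. All names in the code are hypothetical.

```python
import math
import random

def softmax(scores):
    """Convert per-class linear scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mlr_probabilities(features, weights, biases):
    """Map one feature vector to class probabilities with a multinomial logistic model."""
    scores = [
        bias + sum(w * x for w, x in zip(row, features))
        for row, bias in zip(weights, biases)
    ]
    return softmax(scores)

VOWELS = ["a", "e", "i", "o", "u"]
N_FEATURES = 39  # size of the MFCC vector mentioned in the abstract

# Placeholder coefficients and input: random numbers, NOT trained values.
rng = random.Random(0)
weights = [[rng.gauss(0.0, 0.1) for _ in range(N_FEATURES)] for _ in VOWELS]
biases = [0.0 for _ in VOWELS]
mfcc_vector = [rng.gauss(0.0, 1.0) for _ in range(N_FEATURES)]

# A 39-dimensional input becomes a 5-dimensional probability vector.
vowel_probs = mlr_probabilities(mfcc_vector, weights, biases)
predicted = VOWELS[vowel_probs.index(max(vowel_probs))]
print(len(vowel_probs), predicted)
```

The point of the sketch is the change in dimensionality: one probability per vowel class replaces the 39 original coefficients, and that shorter vector can then be passed to any downstream classifier.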

    Further Investigation of MDS as a Tool for Evaluation of Speech Quality of Synthesized Speech

    The dissertation investigates multi-dimensional scaling (MDS) as a tool for the evaluation of the quality of synthesized speech. More specifically, it investigates the relations between Weighted Euclidean Distance Scaling and Simple Euclidean Distance Scaling, and how aggregating data affects the MDS configuration. It also investigates to what extent a subset of experimental participants and/or experimental stimuli is representative of a larger test set. For that purpose an experiment was conducted on the basis of a subset of stimuli used in the Blizzard Challenge 2008. Issues in the evaluation of speech synthesis are discussed, and an overview of the basics of MDS is given to an extent that allows comprehension of the methods used in applying MDS to speech synthesis evaluation. Based on the experimental findings, further experiments are suggested with the goal of optimizing testing procedures to such an extent that the number of experimental participants can be drastically reduced.