
    Automation of the Spoken Poetry Rhyming Game in Persian

    This paper investigates how Mosha'ere, a Persian spoken poetry game, can be computerized using a Persian automatic speech recognition system trained on read speech. To this end, the texts and recitation speech of poems by the great poets Hafez and Sa'di were gathered, and a spoken poetry rhyming game called Chakame was developed. It uses context-dependent tri-phone HMM acoustic models, trained on Persian read speech at normal speed, to recognize beyts, i.e., lines of verse, spoken by a human user. Chakame was evaluated on two kinds of recitation speech: 100 beyts recited formally at a normal rate and another 100 beyts recited emotionally, hyperarticulated at a slow rate. A difference of about 23% in word error rate (WER) shows the impact of the intrinsic features of emotional verse recitation on recognition. Nevertheless, an overall beyt recognition rate of 98.5% was obtained for Chakame.
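
    The abstract's headline numbers are word error rates, so a minimal sketch of the standard WER computation (edit distance over word sequences) may be useful context; the example strings below are illustrative, not from the paper's data.

    def wer(reference: str, hypothesis: str) -> float:
        """WER = (substitutions + deletions + insertions) / reference word count."""
        ref, hyp = reference.split(), hypothesis.split()
        # Dynamic-programming edit distance over words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("a b c d", "a x c"))  # one substitution + one deletion -> 0.5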

    Inequity in Popular Voice Recognition Systems Regarding African Accents

    With smart speakers such as the Echo Dot and Google Home, everyone should have an equal opportunity to use them. Yet for many popular voice recognition systems, the only accents with wide support are those from Europe, Latin America, and Asia. This can be frustrating for users whose dialects or accents are poorly understood by common tools like Amazon's Alexa. As such devices become more like household appliances, researchers are becoming increasingly aware of bias and inequity in speech recognition, as well as in other subfields of artificial intelligence. Supporting African accents can diversify smart speaker customer bases worldwide, and this project can help developers include accents from the African diaspora as they build these systems. In this work, we measure recognition accuracy for under-represented dialects across a variety of speech recognition systems and analyze the results in terms of standard performance metrics. After collecting audio files from different voices across the African diaspora, we discuss key findings and propose guidelines for making current voice recognition systems fairer for all users.
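
    As a hedged sketch of the per-accent analysis described above, one might aggregate a standard metric such as WER by system and accent group and report the disparity. The system names, accent labels, and numbers below are invented placeholders, not results from this work.

    from collections import defaultdict

    results = [
        # (system, accent_group, wer) -- hypothetical measurements
        ("system_a", "nigerian_english", 0.31),
        ("system_a", "us_english", 0.09),
        ("system_b", "nigerian_english", 0.27),
        ("system_b", "us_english", 0.11),
    ]

    by_group = defaultdict(list)
    for system, accent, wer in results:
        by_group[(system, accent)].append(wer)

    # Report mean WER per (system, accent) pair; large gaps between accent
    # groups under the same system indicate inequitable support.
    for (system, accent), wers in sorted(by_group.items()):
        print(f"{system:10s} {accent:18s} mean WER = {sum(wers) / len(wers):.2f}")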

    Automatic Content Generation for Video Self Modeling

    Video self modeling (VSM) is a behavioral intervention technique in which a learner models a target behavior by watching a video of himself or herself. Its effectiveness in rehabilitation and education has been repeatedly demonstrated, but technical challenges remain in creating video content that depicts previously unseen behaviors. In this paper, we propose a novel system that re-renders new talking-head sequences suitable for VSM treatment of patients with voice disorders. After the raw footage is captured, a new speech track is either synthesized using text-to-speech or selected, based on voice similarity, from a database of clean speech. Voice conversion is then applied to match the new speech to the original voice. Time markers extracted from the original and new speech tracks are used to re-sample the video track for lip synchronization. We use an adaptive re-sampling strategy to minimize motion jitter and apply bilinear and optical-flow-based interpolation to ensure image quality. Both objective measurements and subjective evaluations demonstrate the effectiveness of the proposed techniques.
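
    The lip-synchronization step can be pictured as a piecewise-linear time warp defined by corresponding markers in the original and new speech tracks. The sketch below is a minimal illustration under that assumption, not the authors' implementation; the marker values and frame rate are invented.

    import bisect

    def warp_time(t_new, markers_new, markers_orig):
        """Map a time on the new speech track back to original-video time."""
        i = bisect.bisect_right(markers_new, t_new) - 1
        i = max(0, min(i, len(markers_new) - 2))
        # Linear interpolation within the enclosing marker interval.
        alpha = (t_new - markers_new[i]) / (markers_new[i + 1] - markers_new[i])
        return markers_orig[i] + alpha * (markers_orig[i + 1] - markers_orig[i])

    # Hypothetical marker times (seconds), e.g. phoneme boundaries.
    orig = [0.0, 0.4, 1.1, 1.8]
    new = [0.0, 0.5, 1.0, 2.0]
    fps = 30
    # Source frame index for each output frame of the 2-second new track.
    frames = [round(warp_time(n / fps, new, orig) * fps) for n in range(int(2.0 * fps))]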

    Evaluation of Intelligibility and Speaker Similarity of Voice Transformation

    Voice transformation refers to a class of techniques that modify voice characteristics either to conceal a speaker's identity or to mimic the voice characteristics of another speaker. Its applications include automatic dialogue replacement and voice generation for people with voice disorders. This diversity of applications makes evaluating voice transformation a challenging task. The objective of this research is to propose a framework for evaluating intentional voice transformation techniques. Our proposed framework is based on two fundamental qualities: intelligibility and speaker similarity. Intelligibility refers to the clarity of the speech content after voice transformation, and speaker similarity measures how well the modified output disguises the source speaker. We measure intelligibility with word error rates and speaker similarity with the likelihood of identifying the correct speaker. The novelty of our approach is that we consider whether similarly transformed training data are available to the recognizer. We have demonstrated that this factor plays a significant role in intelligibility and speaker similarity for both human testers and automated recognizers. We thoroughly test two classes of voice transformation techniques, pitch distortion and voice conversion, using the proposed framework. We apply our results to patients with voice hypertension using video self-modeling, and preliminary results are presented.
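
    Speaker similarity as "likelihood of identifying the correct speaker" can be illustrated with a simple closed-set identification sketch: fit one Gaussian mixture per enrolled speaker and pick the model that assigns the utterance the highest average log-likelihood. The features below are synthetic stand-ins for MFCCs; this is a sketch of the general idea, not the evaluation code used in the work.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Hypothetical MFCC-like training features: {speaker: (n_frames, n_dims)}.
    train = {"spk1": rng.normal(0, 1, (500, 13)),
             "spk2": rng.normal(1, 1, (500, 13))}

    models = {spk: GaussianMixture(n_components=8, covariance_type="diag",
                                   random_state=0).fit(feats)
              for spk, feats in train.items()}

    def identify(utterance_feats):
        # Average per-frame log-likelihood under each enrolled speaker model.
        scores = {spk: m.score(utterance_feats) for spk, m in models.items()}
        return max(scores, key=scores.get)

    print(identify(rng.normal(1, 1, (200, 13))))  # expected: "spk2"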

    Neural processing of poems and songs is based on melodic properties

    The neural processing of speech and music is still a matter of debate. A long tradition that assumes shared processing capacities for the two domains contrasts with views that assume domain-specific processing. We here contribute to this topic by investigating, in a functional magnetic resonance imaging (fMRI) study, ecologically valid stimuli that are identical in wording and differ only in that one group is typically spoken (or silently read), whereas the other is sung: poems and their respective musical settings. We focus on the melodic properties of spoken poems and their sung musical counterparts by looking at proportions of significant autocorrelations (PSA) based on pitch values extracted from their recordings. Following earlier studies, we assumed a bias of poem-processing towards the left hemisphere and a bias of song-processing towards the right hemisphere. Furthermore, PSA values of poems and songs were expected to explain variance in left- vs. right-temporal brain areas, while continuous liking ratings obtained in the scanner should modulate activity in the reward network. Overall, poem processing compared to song processing relied on left temporal regions, including the superior temporal gyrus, whereas song processing compared to poem processing recruited more right temporal areas, including Heschl's gyrus and the superior temporal gyrus. PSA values co-varied with activation in bilateral temporal regions for poems, and in right-dominant fronto-temporal regions for songs. Continuous liking ratings were correlated with activity in the default mode network for both poems and songs. The pattern of results suggests that the neural processing of poems and their musical settings is based on their melodic properties, supported by bilateral temporal auditory areas and an additional right fronto-temporal network known to be implicated in the processing of melodies in songs. These findings take a middle ground in providing evidence for specific processing circuits for speech and music in the left and right hemisphere, but simultaneously for shared processing of melodic aspects of both poems and their musical settings in the right temporal cortex. Thus, we demonstrate the neurobiological plausibility of assuming the importance of melodic properties in spoken and sung aesthetic language alike, along with the involvement of the default mode network in the aesthetic appreciation of these properties.
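
    The PSA measure can be sketched as the proportion of lags at which the autocorrelation of a pitch contour is statistically significant. The 1.96/sqrt(N) significance bound and the lag range below are assumptions made for illustration, not parameters taken from the study.

    import numpy as np

    def psa(pitch, max_lag=50):
        """Proportion of lags 1..max_lag with significant autocorrelation."""
        x = np.asarray(pitch, dtype=float)
        x = x - x.mean()
        denom = np.dot(x, x)
        bound = 1.96 / np.sqrt(len(x))  # approximate 95% significance threshold
        significant = 0
        for lag in range(1, max_lag + 1):
            r = np.dot(x[:-lag], x[lag:]) / denom  # sample autocorrelation at lag
            if abs(r) > bound:
                significant += 1
        return significant / max_lag

    # Hypothetical contour: a sung line should show more periodic structure.
    t = np.linspace(0, 2, 400)
    print(psa(220 + 20 * np.sin(2 * np.pi * 3 * t)))  # near 1 for periodic pitch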

    Security in Voice Authentication

    We evaluate the security of human voice password databases from an information-theoretic point of view. More specifically, we provide a theoretical estimate of the amount of entropy in human voice when processed using conventional GMM-UBM technologies with MFCCs as the acoustic features. The theoretical estimate gives rise to a methodology for analyzing the security level of a corpus of human voice. That is, given a database containing speech signals, we provide a method for estimating the relative entropy (Kullback-Leibler divergence) of the database, thereby establishing the security level of the speaker verification system. To demonstrate this, we analyze the YOHO database, a corpus of voice samples collected from 138 speakers, and show that the amount of entropy extracted is less than 14 bits. We also present a practical attack that succeeds in impersonating the voice of any speaker within the corpus with 98% success probability in as few as 9 trials. The attack still succeeds at a rate of 62.50% if only 4 attempts are permitted. Further, based on the same attack rationale, we mount an attack on the ALIZE speaker verification system. We show through experimentation that an attacker can impersonate any user in a database of 69 people with about a 25% success rate with only 5 trials; the success rate exceeds 50% when the allowed authentication attempts are increased to 20. Finally, when the practical attack is cast in terms of an entropy metric, we find that the theoretical entropy estimate almost perfectly predicts the success rate of the practical attack, lending further credence to the theoretical model and the associated entropy estimation technique.
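
    The relative-entropy estimation can be illustrated with a Monte Carlo sketch of the KL divergence between two Gaussian mixture models, e.g., a speaker model and a background model. The mixtures below are fit on synthetic data and merely stand in for GMM-UBM speaker models; this is not the paper's estimation procedure.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Placeholder "speaker" and "background" models fit on synthetic features.
    p = GaussianMixture(n_components=4, random_state=0).fit(rng.normal(0.0, 1, (1000, 13)))
    q = GaussianMixture(n_components=4, random_state=0).fit(rng.normal(0.5, 1, (1000, 13)))

    def kl_mc(p, q, n=20000):
        """KL(p || q) ~= E_p[log p(x) - log q(x)], estimated by sampling from p."""
        x, _ = p.sample(n)
        return np.mean(p.score_samples(x) - q.score_samples(x))

    print(kl_mc(p, q) / np.log(2), "bits")  # divide by ln 2 to convert nats to bits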

    The Influence of Social Context on Communication and Restricted and Repetitive Behaviors in Autism

    Two of the most salient features of autism spectrum disorder (ASD) are impairments in communication and engagement in restricted and repetitive behaviors (RRBs). The goal of this study was to identify the effects of social context on both the occurrence of RRBs and social language performance in children with ASD. We defined the social context of a situation by its primary focus (object or conversation) and by the initiator of the interaction (child or experimenter). We performed a frequency count of RRBs as well as a mean length of utterance (MLU) analysis for play tasks with variations in focus and initiator. These measurements indicated that RRBs were less frequent in object-focused and child-initiated tasks; however, these situations also yielded a lower MLU. MLUs were higher for child-initiated tasks than for experimenter-initiated tasks, and for conversation tasks than for object-focused tasks. These results imply that the types of tasks that are effective in lowering RRBs may not lend themselves to the further development of interpersonal communication skills. To develop more effective therapy options, it is important to understand the purpose of RRBs and to find ways to reduce them while also increasing communication skills.
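
    For readers unfamiliar with the metric, MLU is simply the average utterance length in a transcript. The sketch below approximates it with words rather than the morphemes typically counted in clinical practice, and the sample utterances are invented.

    def mlu(utterances):
        """Mean length of utterance, approximated in words per utterance."""
        counts = [len(u.split()) for u in utterances if u.strip()]
        return sum(counts) / len(counts)

    # Hypothetical child utterances from one play task.
    session = ["I want the red block", "look", "it goes here"]
    print(mlu(session))  # (5 + 1 + 3) / 3 = 3.0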