4 research outputs found

    Singing speaker clustering based on subspace learning in the GMM mean supervector space

    Abstract: In this study, we propose algorithms based on subspace learning in the GMM mean supervector space to improve the performance of speaker clustering on speech from both reading and singing. As a speaking style, singing introduces changes in the time-frequency structure of a speaker's voice. The purpose of this study is to introduce advancements for speech systems, such as speech indexing and retrieval, that improve robustness to intrinsic variations in speech production. Speaker clustering techniques such as k-means and hierarchical clustering are explored to analyze acoustic space differences in a corpus in which each speaker both reads and sings lyrics. Furthermore, a distance based on fuzzy c-means membership degrees is proposed to more accurately measure clustering difficulty, or speaker confusability. Two categories of subspace learning methods are studied: unsupervised, based on locality preserving projections (LPP), and supervised, based on probabilistic linear discriminant analysis (PLDA). Our proposed PLDA-based clustering method is a two-stage algorithm: first, initial clusters are obtained using full-dimension supervectors; next, each cluster is refined in a PLDA subspace, resulting in a more speaker-dependent representation that is less sensitive to speaking style. It is shown that LPP improves average clustering accuracy by 5.1% absolute versus a hierarchical baseline for a mixture of reading and singing, and that PLDA-based clustering increases accuracy by 9.6% absolute versus a k-means baseline. The advancements offer novel techniques to improve model formulation for speech applications including speaker ID, audio search, and audio content analysis.
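The abstract mentions fuzzy c-means membership degrees without giving the proposed distance itself. As a rough illustration only, the sketch below computes the standard fuzzy c-means membership degrees of one sample from its distances to the cluster centres. The function name `fcm_memberships`, the fuzzifier default `m=2.0`, and the toy distances are assumptions made for this example, not details taken from the paper.

```python
def fcm_memberships(distances, m=2.0):
    """Standard fuzzy c-means membership degrees for a single sample.

    distances: distance from the sample to each cluster centre.
    m: fuzzifier, must be greater than 1 (2.0 is a common default).
    """
    if m <= 1:
        raise ValueError("fuzzifier m must be greater than 1")
    # A sample lying exactly on one or more centres belongs fully to them.
    zeros = [i for i, d in enumerate(distances) if d == 0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(distances))]
    exponent = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1))
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
            for d_i in distances]


# Toy example (hypothetical distances): the sample is three times closer
# to the first centre than to the second.
memberships = fcm_memberships([1.0, 3.0])
```

Memberships near 0 or 1 indicate a sample that is easy to assign, while values close to each other indicate a sample that sits between clusters. That is one plausible reading of why such degrees could reflect speaker confusability.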

    Speech data analysis for semantic indexing of video of simulated medical crises.

    The Simulation for Pediatric Assessment, Resuscitation, and Communication (SPARC) group within the Department of Pediatrics at the University of Louisville was established to enhance the care of children by using simulation-based educational methodologies to improve patient safety and strengthen clinician-patient interactions. After each simulation session, the physician must manually review and annotate the recordings and then debrief the trainees. The physician responsible for the simulation has recorded hundreds of videos and is seeking solutions that can automate the process. This dissertation introduces our system for efficient segmentation and semantic indexing of videos of medical simulations using machine learning methods. It provides the physician with automated tools to review important sections of the simulation by identifying who spoke, when, and with what emotion. Only audio information is extracted and analyzed, because the quality of the image recording is low and the visual environment is static for the most part. Our proposed system includes four main components: preprocessing, speaker segmentation, speaker identification, and emotion recognition. The preprocessing first extracts the audio component from the video recording. It then extracts various low-level audio features to detect and remove silence segments. We investigate and compare two different approaches for this task: the first is threshold-based and the second is classification-based. The second main component of the proposed system detects speaker change points in order to segment the audio stream. We propose two fusion methods for this task. The speaker identification and emotion recognition components of our system are designed to let users browse the video and retrieve shots that identify "who spoke, when, and the speaker's emotion" for further analysis.
For these components, we propose two feature representation methods that map audio segments of arbitrary length to a feature vector of fixed dimension. The first is based on soft bag-of-words (BoW) feature representations. In particular, we define three types of BoW, based on crisp, fuzzy, and possibilistic voting. The second feature representation is a generalization of the BoW and is based on the Fisher Vector (FV). FV uses the Fisher Kernel principle and combines the benefits of generative and discriminative approaches. The proposed feature representations are used within two learning frameworks. The first is supervised learning and assumes that a large collection of labeled training data is available. Within this framework, we use standard classifiers including K-nearest neighbor (K-NN), support vector machine (SVM), and Naive Bayes. The second framework is based on semi-supervised learning, where only a limited number of labeled training samples is available. Here we use an approach based on label propagation. Our proposed algorithms were evaluated using 15 medical simulation sessions. The results were analyzed and compared to those obtained using state-of-the-art algorithms. We show that our proposed speech segmentation fusion algorithms and feature mappings outperform existing methods. We also integrated all proposed algorithms and developed a GUI prototype system for subjective evaluation. This prototype processes medical simulation video and provides the user with a visual summary of the different speech segments. It also allows the user to browse videos and retrieve scenes that answer semantic queries such as: who spoke and when? Who interrupted whom? What was the emotion of the speaker? The GUI prototype can also provide summary statistics for each simulation video. Examples include: How long did each person speak? What is the longest uninterrupted speech segment? Is there an unusually large number of pauses within the speech segments of a given speaker?
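To make the idea of mapping a variable-length audio segment to a fixed-dimension vector more concrete, here is a minimal sketch of a bag-of-words histogram with crisp and fuzzy voting. It is not the dissertation's implementation: the function `bow_histogram`, the two-word codebook, the toy frames, and the use of fuzzy c-means style memberships for the fuzzy vote are all assumptions made for illustration. Possibilistic voting is left out.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bow_histogram(frames, codebook, voting="crisp", m=2.0):
    """Map a list of frame vectors (any length) to a len(codebook) vector."""
    hist = [0.0] * len(codebook)
    for frame in frames:
        d2 = [squared_distance(frame, word) for word in codebook]
        if voting == "crisp":
            # Each frame casts one full vote for its nearest codeword.
            hist[d2.index(min(d2))] += 1.0
        elif voting == "fuzzy":
            hits = [i for i, d in enumerate(d2) if d == 0.0]
            if hits:
                # Frame coincides with a codeword: it gets the whole vote.
                for i in hits:
                    hist[i] += 1.0 / len(hits)
            else:
                # Fuzzy c-means style memberships on squared distances.
                exponent = 1.0 / (m - 1.0)
                for i, d_i in enumerate(d2):
                    hist[i] += 1.0 / sum((d_i / d_k) ** exponent for d_k in d2)
        else:
            raise ValueError("voting must be 'crisp' or 'fuzzy'")
    # Normalise by the number of frames so segment length does not matter.
    return [h / len(frames) for h in hist] if frames else hist


# Hypothetical 2-D codebook and two segments of different lengths.
codebook = [[0.0, 0.0], [1.0, 1.0]]
short_segment = [[0.1, 0.0], [0.9, 1.0]]
long_segment = [[0.0, 0.1], [0.1, 0.1], [1.0, 0.9]]

crisp_short = bow_histogram(short_segment, codebook)
crisp_long = bow_histogram(long_segment, codebook)
fuzzy_long = bow_histogram(long_segment, codebook, voting="fuzzy")
```

Both segments yield a vector with one entry per codeword regardless of how many frames they contain, which is what lets a standard classifier such as K-NN or an SVM consume them.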

    On-line quality monitoring and lifetime prediction of thick Al wire bonds using signals obtained from ultrasonic generator

    Abstract: The reliable performance of power electronic modules has been a concern for many years due to their increased use in applications which demand high availability and longer lifetimes. Thick Al wire bonding is a key technique for providing interconnections in power electronic modules. Today, wire bond lift-off and heel cracking, which result from cyclic thermomechanical stresses, are often considered the most lifetime-limiting factors of power electronic modules. Therefore, it is important for power electronic packaging manufacturers to address this issue at the design stage and on the manufacturing line. Techniques for the non-destructive, real-time evaluation and control of wire bond quality have been proposed to detect defects in manufacture and predict reliability prior to in-service exposure. This approach has the potential to improve the accuracy of lifetime prediction for the manufactured product. In this thesis, a non-destructive technique is presented that detects bond quality by applying a semi-supervised classification algorithm to signals obtained from an ultrasonic generator. Experimental tests verified that the classification method is capable of accurately predicting bond quality, indicated by bonded area as measured by X-ray tomography. Samples classified during bonding were subjected to both passive and active cycling, and the distribution of bond life amongst the different classes was analysed. It is demonstrated that the as-bonded quality classification is closely correlated with cycling life and can therefore be used as a non-destructive tool for monitoring bond quality and predicting useful service life.
