
    Intelligent system for spoken term detection using the belief combination

    Spoken Term Detection (STD) can be considered a sub-task of automatic speech recognition that aims to extract partial information from speech signals in the form of query utterances. Many STD techniques in the literature rely on a single source of evidence to decide whether a query utterance matches. In this manuscript, we develop an acoustic signal processing approach to STD that incorporates techniques for silence removal, dynamic noise filtration, and evidence combination using Dempster-Shafer Theory (DST). A ‘spectral-temporal features based voiced segment detection’ and an ‘energy and zero-crossing rate based unvoiced segment detection’ are built to remove silence segments from the speech signal. Comprehensive experiments on large speech datasets show satisfactory results for the proposed approach. Our approach improves on existing speaker-dependent STD approaches, in particular the reliability of query utterance spotting, by combining evidence from multiple belief sources.
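    As a minimal illustration of the evidence-combination step, the sketch below applies Dempster's rule of combination to two hypothetical belief sources over a match/mismatch frame for a query utterance. The mass values and the two-hypothesis frame are assumptions chosen for illustration, not the manuscript's actual pipeline.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule of combination.

    Each mass function is a dict mapping a frozenset of hypotheses to its
    assigned mass; masses on each source should sum to 1."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to incompatible hypotheses
    if conflict >= 1.0:
        raise ValueError("Total conflict: sources cannot be combined")
    # Normalise by the non-conflicting mass.
    return {h: w / (1.0 - conflict) for h, w in combined.items()}

# Hypothetical belief sources for a query-utterance decision (illustrative values).
MATCH, MISMATCH = frozenset({"match"}), frozenset({"mismatch"})
EITHER = MATCH | MISMATCH  # ignorance: mass not committed to either hypothesis

source_a = {MATCH: 0.6, MISMATCH: 0.1, EITHER: 0.3}
source_b = {MATCH: 0.5, MISMATCH: 0.2, EITHER: 0.3}

print(dempster_combine(source_a, source_b))
```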

    Detecting Low Rapport During Natural Interactions in Small Groups from Non-Verbal Behaviour

    Rapport, the close and harmonious relationship in which interaction partners are "in sync" with each other, has been shown to result in smoother social interactions, improved collaboration, and improved interpersonal outcomes. In this work, we are the first to investigate automatic prediction of low rapport during natural interactions within small groups. This task is challenging given that rapport manifests only in subtle non-verbal signals that are, in addition, subject to the influences of group dynamics and inter-personal idiosyncrasies. We record videos of unscripted discussions of three to four people using a multi-view camera system and microphones. We analyse a rich set of non-verbal signals for rapport detection, namely facial expressions, hand motion, gaze, speaker turns, and speech prosody. Using facial features, we can detect low rapport with an average precision of 0.7 (chance level at 0.25), while incorporating prior knowledge of participants' personalities even enables early prediction without a drop in performance. We further provide a detailed analysis of different feature sets and of the amount of information contained in different temporal segments of the interactions. Comment: 12 pages, 6 figures
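    For context on the reported metric, the following sketch shows how a low-rapport detector's scores could be summarised with average precision, using the positive-class prevalence as the chance level; the labels and scores below are hypothetical and are not the study's data.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical per-interaction labels (1 = low rapport) and detector scores;
# in the paper such scores would come from facial-feature models.
y_true = np.array([1, 0, 0, 0, 1, 0, 0, 0])
y_score = np.array([0.81, 0.40, 0.22, 0.35, 0.66, 0.15, 0.58, 0.47])

ap = average_precision_score(y_true, y_score)
chance = y_true.mean()  # prevalence of the positive class
print(f"average precision: {ap:.2f} (chance level: {chance:.2f})")
```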

    Development of a Two-Level Warping Algorithm and Its Application to Speech Signal Processing

    In many different fields there are signals that need to be aligned, or “warped”, in order to measure the similarity between them. When two time signals are compared, or when a pattern is sought in a larger stream of data, it may be necessary to warp one of the signals in a nonlinear way by compressing or stretching it to fit the other. Simple point-to-point comparison may give inadequate results, because one part of a signal might be compared against a different relative part of the other signal or pattern; such cases require some form of alignment before the comparison. Dynamic Time Warping (DTW) is a powerful and widely used technique of time series analysis that performs such nonlinear warping in the temporal domain. The work in this dissertation develops in two directions. The first direction extends dynamic time warping to produce a two-level dynamic warping algorithm, with warping in both the temporal and spectral domains. While hundreds of research efforts over the last two decades have applied the one-dimensional warping idea to time series, extending DTW to two or more dimensions poses a more involved problem. The two-dimensional dynamic warping algorithm developed here is ideally suited to a variety of speech signal processing tasks. The second direction focuses on two speech signal applications. The first application is the evaluation of dysarthric speech. Dysarthria is a neurological motor speech disorder characterized by spectral and temporal degradation in speech production. Dysarthria management has focused primarily on teaching patients to improve their ability to produce speech or on strategies to compensate for their deficits. However, many individuals with dysarthria are not well suited for traditional speaker-oriented intervention. Recent studies have shown that speech intelligibility can be improved by training the listener to better understand the degraded speech signal. A computer-based training tool was developed using the two-level dynamic warping algorithm, to eventually be incorporated into a program that trains listeners to imitate dysarthric speech by providing them with feedback about the accuracy of their imitation attempts during training. The second application is voice transformation. Voice transformation techniques aim to modify a subject’s voice characteristics to make them sound like someone else, for example transforming a male speaker’s voice into a female speaker’s. The approach taken here avoids the need to estimate acoustic parameters, as many voice transformation methods do, and instead deals directly with spectral information. Based on the two-level dynamic warping, it is straightforward to map the source speech to the target speech when both are available. The spectrally warped signal produced in this way introduces significant processing artifacts, so phase reconstruction was applied to the transformed signal to improve the quality of the final sound. Neural networks are trained to perform the voice transformation.
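    As a reference point for the temporal half of the two-level algorithm, here is a minimal sketch of classic one-dimensional dynamic time warping. It is the standard textbook formulation with unit step moves, not the dissertation's two-level extension, and the toy sequences are illustrative.

```python
import numpy as np

def dtw_distance(x, y, dist=lambda a, b: abs(a - b)):
    """Classic dynamic time warping distance between two 1-D sequences.

    Fills a cumulative-cost matrix in which each cell extends the cheapest
    of the three allowed predecessor moves (match, insertion, deletion)."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(x[i - 1], y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# Toy example: the second series is a stretched version of the first.
a = [0.0, 1.0, 2.0, 1.0, 0.0]
b = [0.0, 0.0, 1.0, 2.0, 2.0, 1.0, 0.0]
print(dtw_distance(a, b))  # small distance despite the different lengths
```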

    Time Series Classification with HIVE-COTE: The Hierarchical Vote Collective of Transformation-based Ensembles

    A recent experimental evaluation assessed 19 time series classification (TSC) algorithms and found that one was significantly more accurate than all others: the Flat Collective of Transformation-based Ensembles (Flat-COTE). Flat-COTE is an ensemble that combines 35 classifiers over four data representations. However, while comprehensive, the evaluation did not consider deep learning approaches. Convolutional neural networks (CNN) have seen a surge in popularity and are now state of the art in many fields, which raises the question of whether CNNs could be equally transformative for TSC. We implement a benchmark CNN for TSC using a common structure and use results from a TSC-specific CNN from the literature. We compare both to Flat-COTE and find that the collective is significantly more accurate than either CNN. These results are impressive, but Flat-COTE is not without deficiencies. We significantly improve the collective by proposing a new hierarchical structure with probabilistic voting, defining and including two novel ensemble classifiers built in existing feature spaces, and adding further modules to represent two additional transformation domains. The resulting classifier, the Hierarchical Vote Collective of Transformation-based Ensembles (HIVE-COTE), encapsulates classifiers built on five data representations. We demonstrate that HIVE-COTE is significantly more accurate than Flat-COTE (and all other TSC algorithms of which we are aware) over 100 resamples of 85 TSC problems and is the new state of the art for TSC. Further analysis is provided through the introduction and evaluation of three new case studies and extensive experimentation on 1000 simulated datasets of five different types.
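    To make the probabilistic voting step concrete, the sketch below combines class-probability estimates from several representation modules with a simple accuracy-weighted vote. The module names, weights, and probabilities are hypothetical, and this is only an illustrative weighting scheme, not HIVE-COTE's exact one.

```python
import numpy as np

def probabilistic_vote(module_probs, module_weights):
    """Combine per-module class-probability estimates with a weighted vote.

    module_probs: dict mapping module name -> (n_samples, n_classes) array.
    module_weights: dict mapping module name -> scalar weight (e.g. an
    estimate of that module's accuracy obtained on training data)."""
    total = None
    for name, probs in module_probs.items():
        contribution = module_weights[name] * np.asarray(probs, dtype=float)
        total = contribution if total is None else total + contribution
    total /= sum(module_weights.values())
    return total.argmax(axis=1), total

# Hypothetical probability estimates from three representation modules
# for two test series over three classes.
probs = {
    "shapelet_module": [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],
    "spectral_module": [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]],
    "elastic_module":  [[0.6, 0.1, 0.3], [0.3, 0.4, 0.3]],
}
weights = {"shapelet_module": 0.90, "spectral_module": 0.85, "elastic_module": 0.80}

labels, combined = probabilistic_vote(probs, weights)
print(labels)    # predicted class per test series
print(combined)  # combined probability estimates
```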

    Zero-resource audio-only spoken term detection based on a combination of template matching techniques

    Keywords: spoken term detection, template matching, unsupervised learning, posterior features. Spoken term detection is a well-known information retrieval task that seeks to extract contentful information from audio by locating occurrences of known query words of interest. This paper describes a zero-resource approach to this task based on pattern matching of spoken term queries at the acoustic level. The template matching module comprises the cascade of a segmental variant of dynamic time warping and a self-similarity matrix comparison to further improve robustness to speech variability. This solution notably differs from more traditional train-and-test methods that, while shown to be very accurate, rely upon the availability of large amounts of linguistic resources. We evaluate our framework on different parameterizations of the speech templates: raw MFCC features and Gaussian posteriorgrams, and French and English phonetic posteriorgrams output by two different state-of-the-art phoneme recognizers.
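    To illustrate the kind of acoustic-level matching involved, the sketch below builds a frame-wise cosine distance matrix between a query template and a longer utterance, the structure that a segmental DTW search would then scan for a low-cost diagonal band. The feature dimensionality and the random data are assumptions for illustration, not the paper's setup.

```python
import numpy as np

def cosine_distance_matrix(query, utterance):
    """Pairwise cosine distances between query frames and utterance frames.

    query: (n_q, d) feature matrix (e.g. MFCCs or posteriorgrams).
    utterance: (n_u, d) feature matrix.
    Returns an (n_q, n_u) matrix in which a putative query occurrence shows
    up as a low-cost band roughly parallel to the diagonal."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    u = utterance / np.linalg.norm(utterance, axis=1, keepdims=True)
    return 1.0 - q @ u.T

# Hypothetical 13-dimensional MFCC-like frames, with the query embedded
# inside the utterance so a match band exists by construction.
rng = np.random.default_rng(0)
query = rng.normal(size=(20, 13))
utterance = np.vstack([rng.normal(size=(50, 13)), query, rng.normal(size=(30, 13))])

dist = cosine_distance_matrix(query, utterance)
print(dist.shape)              # (20, 100)
print(dist[:, 50:70].trace())  # near zero along the embedded query's diagonal
```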