1,486 research outputs found

    Store-and-forward based methods for the signal control problem in large-scale congested urban road networks

    Get PDF
    The problem of designing network-wide traffic signal control strategies for large-scale congested urban road networks is considered. One known and two novel methodologies, all based on the store-and-forward modeling paradigm, are presented and compared. The known methodology is a linear multivariable feedback regulator derived through the formulation of a linear-quadratic optimal control problem. An alternative, novel methodology consists of an open-loop constrained quadratic optimal control problem, whose numerical solution is achieved via quadratic programming. Yet a different formulation leads to an open-loop constrained nonlinear optimal control problem, whose numerical solution is achieved by use of a feasible-direction algorithm. A preliminary simulation-based investigation of the signal control problem for a large-scale urban road network using these methodologies demonstrates the comparative efficiency and real-time feasibility of the developed signal control methods

    Text-based Editing of Talking-head Video

    No full text
    Editing talking-head video to change the speech content or to remove filler words is challenging. We propose a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e. no jump cuts). Our method automatically annotates an input talking-head video with phonemes, visemes, 3D face pose and geometry, reflectance, expression and scene illumination per frame. To edit a video, the user has to only edit the transcript, and an optimization strategy then chooses segments of the input corpus as base material. The annotated parameters corresponding to the selected segments are seamlessly stitched together and used to produce an intermediate video representation in which the lower half of the face is rendered with a parametric face model. Finally, a recurrent video generation network transforms this representation to a photorealistic video that matches the edited transcript. We demonstrate a large variety of edits, such as the addition, removal, and alteration of words, as well as convincing language translation and full sentence synthesis

    Machine Learning for Hand Gesture Classification from Surface Electromyography Signals

    Get PDF
    Classifying hand gestures from Surface Electromyography (sEMG) is a process which has applications in human-machine interaction, rehabilitation and prosthetic control. Reduction in the cost and increase in the availability of necessary hardware over recent years has made sEMG a more viable solution for hand gesture classification. The research challenge is the development of processes to robustly and accurately predict the current gesture based on incoming sEMG data. This thesis presents a set of methods, techniques and designs that improve upon evaluation of, and performance on, the classification problem as a whole. These are brought together to set a new baseline for the potential classification. Evaluation is improved by careful choice of metrics and design of cross-validation techniques that account for data bias caused by common experimental techniques. A landmark study is re-evaluated with these improved techniques, and it is shown that data augmentation can be used to significantly improve upon the performance using conventional classification methods. A novel neural network architecture and supporting improvements are presented that further improve performance and is refined such that the network can achieve similar performance with many fewer parameters than competing designs. Supporting techniques such as subject adaptation and smoothing algorithms are then explored to improve overall performance and also provide more nuanced trade-offs with various aspects of performance, such as incurred latency and prediction smoothness. A new study is presented which compares the performance potential of medical grade electrodes and a low-cost commercial alternative showing that for a modest-sized gesture set, they can compete. The data is also used to explore data labelling in experimental design and to evaluate the numerous aspects of performance that must be traded off

    A temporal-to-spatial neural network for classification of hand movements from electromyography data

    Get PDF
    Deep convolutional neural networks (CNNs) are appealing for the purpose of classification of hand movements from surface electromyography (sEMG) data because they have the ability to perform automated person-specific feature extraction from raw data. In this paper, we make the novel contribution of proposing and evaluating a design for the early processing layers in the deep CNN for multichannel sEMG. Specifically, we propose a novel temporal-to-spatial (TtS) CNN architecture, where the first layer performs convolution separately on each sEMG channel to extract temporal features. This is motivated by the idea that sEMG signals in each channel are mediated by one or a small subset of muscles, whose temporal activation patterns are associated with the signature features of a gesture. The temporal layer captures these signature features for each channel separately, which are then spatially mixed in successive layers to recognise a specific gesture. A practical advantage is that this approach also makes the CNN simple to design for different sample rates. We use NinaPro database 1 (27 subjects and 52 movements + rest), sampled at 100 Hz, and database 2 (40 subjects and 40 movements + rest), sampled at 2 kHz, to evaluate our proposed CNN design. We benchmark against a feature-based support vector machine (SVM) classifier, two CNNs from the literature, and an additional standard design of CNN. We find that our novel TtS CNN design achieves 66.6% per-class accuracy on database 1, and 67.8% on database 2, and that the TtS CNN outperforms all other compared classifiers using a statistical hypothesis test at the 2% significance level

    A Parametric Approach for Efficient Speech Storage, Flexible Synthesis and Voice Conversion

    Get PDF
    During the past decades, many areas of speech processing have benefited from the vast increases in the available memory sizes and processing power. For example, speech recognizers can be trained with enormous speech databases and high-quality speech synthesizers can generate new speech sentences by concatenating speech units retrieved from a large inventory of speech data. However, even in today's world of ever-increasing memory sizes and computational resources, there are still lots of embedded application scenarios for speech processing techniques where the memory capacities and the processor speeds are very limited. Thus, there is still a clear demand for solutions that can operate with limited resources, e.g., on low-end mobile devices. This thesis introduces a new segmental parametric speech codec referred to as the VLBR codec. The novel proprietary sinusoidal speech codec designed for efficient speech storage is capable of achieving relatively good speech quality at compression ratios beyond the ones offered by the standardized speech coding solutions, i.e., at bitrates of approximately 1 kbps and below. The efficiency of the proposed coding approach is based on model simplifications, mode-based segmental processing, and the method of adaptive downsampling and quantization. The coding efficiency is also further improved using a novel flexible multi-mode matrix quantizer structure and enhanced dynamic codebook reordering. The compression is also facilitated using a new perceptual irrelevancy removal method. The VLBR codec is also applied to text-to-speech synthesis. In particular, the codec is utilized for the compression of unit selection databases and for the parametric concatenation of speech units. It is also shown that the efficiency of the database compression can be further enhanced using speaker-specific retraining of the codec. Moreover, the computational load is significantly decreased using a new compression-motivated scheme for very fast and memory-efficient calculation of concatenation costs, based on techniques and implementations used in the VLBR codec. Finally, the VLBR codec and the related speech synthesis techniques are complemented with voice conversion methods that allow modifying the perceived speaker identity which in turn enables, e.g., cost-efficient creation of new text-to-speech voices. The VLBR-based voice conversion system combines compression with the popular Gaussian mixture model based conversion approach. Furthermore, a novel method is proposed for converting the prosodic aspects of speech. The performance of the VLBR-based voice conversion system is also enhanced using a new approach for mode selection and through explicit control of the degree of voicing. The solutions proposed in the thesis together form a complete system that can be utilized in different ways and configurations. The VLBR codec itself can be utilized, e.g., for efficient compression of audio books, and the speech synthesis related methods can be used for reducing the footprint and the computational load of concatenative text-to-speech synthesizers to levels required in some embedded applications. The VLBR-based voice conversion techniques can be used to complement the codec both in storage applications and in connection with speech synthesis. It is also possible to only utilize the voice conversion functionality, e.g., in games or other entertainment applications

    Advances in Probabilistic Deep Learning

    Get PDF
    This thesis is concerned with methodological advances in probabilistic inference and their application to core challenges in machine perception and AI. Inferring a posterior distribution over the parameters of a model given some data is a central challenge that occurs in many fields ranging from finance and artificial intelligence to physics. Exact calculation is impossible in all but the simplest cases and a rich field of approximate inference has been developed to tackle this challenge. This thesis develops both an advance in approximate inference and an application of these methods to the problem of speech synthesis. In the first section of this thesis we develop a novel framework for constructing Markov Chain Monte Carlo (MCMC) kernels that can efficiently sample from high dimensional distributions such as the posteriors, that frequently occur in machine perception. We provide a specific instance of this framework and demonstrate that it can match or exceed the performance of Hamiltonian Monte Carlo without requiring gradients of the target distribution. In the second section of the thesis we focus on the application of approximate inference techniques to the task of synthesising human speech from text. By using advances in neural variational inference we are able to construct a state of the art speech synthesis system in which it is possible to control aspects of prosody such as emotional expression from significantly less supervised data than previously existing state of the art methods

    Relating Objective and Subjective Performance Measures for AAM-based Visual Speech Synthesizers

    Get PDF
    We compare two approaches for synthesizing visual speech using Active Appearance Models (AAMs): one that utilizes acoustic features as input, and one that utilizes a phonetic transcription as input. Both synthesizers are trained using the same data and the performance is measured using both objective and subjective testing. We investigate the impact of likely sources of error in the synthesized visual speech by introducing typical errors into real visual speech sequences and subjectively measuring the perceived degradation. When only a small region (e.g. a single syllable) of ground-truth visual speech is incorrect we find that the subjective score for the entire sequence is subjectively lower than sequences generated by our synthesizers. This observation motivates further consideration of an often ignored issue, which is to what extent are subjective measures correlated with objective measures of performance? Significantly, we find that the most commonly used objective measures of performance are not necessarily the best indicator of viewer perception of quality. We empirically evaluate alternatives and show that the cost of a dynamic time warp of synthesized visual speech parameters to the respective ground-truth parameters is a better indicator of subjective quality
    corecore