
    ‘Did the speaker change?’: Temporal tracking for overlapping speaker segmentation in multi-speaker scenarios

    Diarization systems are an essential part of many speech processing applications, such as speaker indexing, improving automatic speech recognition (ASR) performance, and making single-speaker algorithms usable in multi-speaker domains. This thesis focuses on the first stage of the diarization process: speaker segmentation, which can be thought of as answering the question ‘Did the speaker change?’ in an audio recording. This thesis starts by showing that time-varying pitch properties can be used advantageously within the segmentation step of a multi-talker diarization system. It is then highlighted that an individual’s pitch varies smoothly and can therefore be predicted by means of a Kalman filter. Subsequently, it is shown that if the pitch is not predictable, this is most likely due to a change in speaker. Finally, a novel system is proposed that uses this approach of pitch prediction for speaker change detection. This thesis then demonstrates how voiced harmonics can be useful in detecting when more than one speaker is talking, such as during overlapping speaker activity. A novel system is proposed to track multiple harmonics simultaneously, allowing the onsets and end-points of a speaker’s utterance to be determined in the presence of an additional active speaker. This thesis then extends this work to explore a new multimodal approach for overlapping speaker segmentation that tracks both the fundamental frequency (F0) and direction of arrival (DoA) of each speaker simultaneously. The proposed multiple hypothesis tracking system, which tracks both features jointly, improves segmentation performance compared to tracking these features separately. Lastly, this thesis focuses on the DoA estimation part of the newly proposed multimodal approach. 
It does this by exploring a polynomial extension to the multiple signal classification (MUSIC) algorithm, spatio-spectral polynomial (SSP)-MUSIC, and evaluating its performance when using speech sound sources.
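The pitch-prediction idea in the first contribution can be sketched as a standard Kalman filter whose innovation (prediction error) flags a change point: while one speaker talks, F0 is smooth and predictable; a large innovation suggests a new speaker. The constant-velocity state model, noise levels, and threshold below are illustrative assumptions, not the parameters used in the thesis.

```python
# Sketch: flag a speaker change when pitch stops being predictable.
# Constant-velocity Kalman filter on an F0 track; all parameter
# values here are illustrative assumptions.
import numpy as np

def detect_changes(f0_track, q=1.0, r=4.0, threshold=30.0):
    """Return frame indices where the pitch is no longer predictable."""
    F = np.array([[1.0, 1.0], [0.0, 1.0]])   # state: [pitch, pitch velocity]
    H = np.array([[1.0, 0.0]])               # observe pitch only
    Q = q * np.eye(2)                         # process noise
    R = np.array([[r]])                       # measurement noise
    x = np.array([f0_track[0], 0.0])
    P = np.eye(2)
    changes = []
    for t, z in enumerate(f0_track[1:], start=1):
        # Predict the next pitch value from the smooth model.
        x = F @ x
        P = F @ P @ F.T + Q
        innovation = z - (H @ x)[0]
        if abs(innovation) > threshold:       # unpredictable -> likely new speaker
            changes.append(t)
            x = np.array([z, 0.0])            # re-initialise on the new speaker
            P = np.eye(2)
            continue
        # Standard Kalman update otherwise.
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + (K @ np.array([innovation])).ravel()
        P = (np.eye(2) - K @ H) @ P
    return changes
```

On a synthetic track that jumps from a 120 Hz speaker to a 200 Hz speaker, the large innovation at the jump is the only flagged frame.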

    Deep Cellular Recurrent Neural Architecture for Efficient Multidimensional Time-Series Data Processing

    Efficient processing of time series data is a fundamental yet challenging problem in pattern recognition. Though recent developments in machine learning and deep learning have enabled remarkable improvements in processing large scale datasets in many application domains, most models are designed to handle inputs that are static in time. Many real-world data, such as in biomedical, surveillance and security, financial, manufacturing, and engineering applications, are rarely static in time and demand models able to recognize patterns in both space and time. Current machine learning (ML) and deep learning (DL) models adapted for time series processing tend to grow in complexity and size to accommodate the additional dimensionality of time. Specifically, the biologically inspired learning models known as artificial neural networks, which have shown extraordinary success in pattern recognition, tend to grow prohibitively large and cumbersome in the presence of large scale multi-dimensional time series biomedical data such as EEG. Consequently, this work aims to develop representative ML and DL models for robust and efficient large scale time series processing. First, we design a novel ML pipeline with efficient feature engineering to process a large scale multi-channel scalp EEG dataset for automated detection of epileptic seizures. Using a sophisticated yet computationally efficient time-frequency analysis technique known as the harmonic wavelet packet transform and an efficient self-similarity computation based on fractal dimension, we achieve state-of-the-art performance for automated seizure detection in EEG data. Subsequently, we investigate the development of a novel, efficient deep recurrent learning model for large scale time series processing. For this, we first study the functionality and training of a biologically inspired neural network architecture known as the cellular simultaneous recurrent neural network (CSRN). 
We obtain a generalization of this network for multiple topological image processing tasks and investigate the learning efficacy of the complex cellular architecture using several state-of-the-art training methods. Finally, we develop a novel deep cellular recurrent neural network (DCRNN) architecture based on the biologically inspired distributed processing used in the CSRN for processing time series data. The proposed DCRNN leverages the cellular recurrent architecture to promote extensive weight sharing and efficient, individualized, synchronous processing of multi-source time series data. Experiments on a large scale multi-channel scalp EEG dataset and a machine fault detection dataset show that the proposed DCRNN offers state-of-the-art recognition performance while using substantially fewer trainable recurrent units.
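The cellular weight-sharing idea behind the DCRNN can be sketched as one small recurrent cell, with a single shared weight set, applied independently to each channel of a multi-channel series (e.g. one EEG electrode per cell). The plain tanh cell and the dimensions below are illustrative assumptions, not the architecture from the work itself.

```python
# Sketch of cellular weight sharing: one shared recurrent cell processes
# every channel of a multi-channel time series, so the trainable
# parameter count does not grow with the number of channels.
# Cell type and dimensions are illustrative assumptions.
import numpy as np

def shared_cell_forward(x, Wx, Wh, b):
    """Apply one shared tanh recurrent cell to each channel.

    x:  (channels, T) multi-channel series, e.g. EEG electrodes
    Wx: (d_h,) input weights, Wh: (d_h, d_h) recurrent weights, b: (d_h,)
    returns: (channels, d_h) final hidden state per channel
    """
    d_h = Wh.shape[0]
    out = []
    for channel in x:                 # the same weights serve every channel
        h = np.zeros(d_h)
        for sample in channel:        # scalar sample drives the shared cell
            h = np.tanh(sample * Wx + h @ Wh + b)
        out.append(h)
    return np.stack(out)
```

Note that `Wx`, `Wh`, and `b` are the only parameters regardless of how many channels the recording has, which is the weight-sharing property the abstract credits for the reduced unit count.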

    Predictive-State Decoders: Encoding the Future into Recurrent Networks

    Recurrent neural networks (RNNs) are a vital modeling technique that relies on internal states learned indirectly by optimizing a supervised, unsupervised, or reinforcement training loss. RNNs are used to model dynamic processes that are characterized by underlying latent states whose form is often unknown, precluding their analytic representation inside an RNN. In the Predictive-State Representation (PSR) literature, latent state processes are modeled by an internal state representation that directly models the distribution of future observations, and most recent work in this area has relied on explicitly representing and targeting sufficient statistics of this probability distribution. We seek to combine the advantages of RNNs and PSRs by augmenting existing state-of-the-art recurrent neural networks with Predictive-State Decoders (PSDs), which add supervision to the network's internal state representation to target predicting future observations. Predictive-State Decoders are simple to implement and easily incorporated into existing training pipelines via additional loss regularization. We demonstrate the effectiveness of PSDs with experimental results in three different domains: probabilistic filtering, imitation learning, and reinforcement learning. In each, our method improves the statistical performance of state-of-the-art recurrent baselines and does so with fewer iterations and less data. Comment: NIPS 2017
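The PSD regularization can be sketched as an auxiliary loss that decodes each hidden state into the next k observations and penalizes the prediction error alongside the task loss. The linear decoder, the loss weight, and the squared-error form below are illustrative assumptions for the sketch.

```python
# Sketch of a Predictive-State Decoder as loss regularization: the RNN's
# hidden state is decoded (here, linearly -- an illustrative assumption)
# into the next k observations, and the decoding error is added to the
# task loss with a small weight.
import numpy as np

def psd_regularizer(hidden, obs, W, k=2):
    """Mean squared error between decoded states and the next k observations.

    hidden: (T, d_h) RNN hidden states
    obs:    (T, d_o) observation sequence
    W:      (d_h, k * d_o) linear decoder
    """
    losses = []
    for t in range(len(obs) - k):
        future = obs[t + 1 : t + 1 + k].ravel()   # next k observations, flattened
        pred = hidden[t] @ W                      # decode the hidden state
        losses.append(np.mean((pred - future) ** 2))
    return float(np.mean(losses))

def total_loss(task_loss, hidden, obs, W, weight=0.1, k=2):
    # Task loss plus the weighted predictive-state supervision term.
    return task_loss + weight * psd_regularizer(hidden, obs, W, k)
```

Because the term is just an extra differentiable loss, it slots into an existing training pipeline without changing the network architecture, which matches the abstract's claim of easy incorporation.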

    Adapting the bidirectional Kalman filter for use in multi-frequency Time-of-flight range imaging

    Time-of-flight range imaging cameras obtain range by producing amplitude modulated light and measuring the time taken for the light to travel to the scene and back to the camera. Time-of-flight cameras require at least three raw measurements to calculate range. Raw frames are captured sequentially and, as such, motion in a scene during capture leads to inconsistent raw frame measurements and erroneous range calculations. Motion error constrains time-of-flight cameras to non-dynamic scenes and limits their potential applications. The time-of-flight bidirectional Kalman filter is a state-of-the-art method known to reduce error due to transverse motion in cameras operating with a single modulation frequency. The method works by treating the raw frames as a noisy time series and running a Kalman filter on it to produce a range estimation at every raw frame. The Kalman filter is then reapplied to the data in reverse to produce another set of range estimations, and a composite range is selected from the two sets of range estimations. A number of commercial time-of-flight cameras, such as the Microsoft Kinect V2, use multiple modulation frequencies. In this thesis, we adapt the bidirectional Kalman filter method to multi-frequency cameras by having the prediction component of the Kalman filter take into account the change in amplitude and phase shift caused by the change in frequency. The amplitude component of the prediction is performed linearly by multiplication, while the phase shift component of the prediction is performed using the ratio of the modulation frequencies. The phase shift prediction across modulation frequencies requires the phase to be unwrapped. The unwrapping is performed between modulation frequencies by selecting the number of phase wraps that best predicts the two following raw frames. 
Finally, to ensure correct composite phase selection, an alternative method for selecting the composite phase is proposed for the adapted bidirectional Kalman filter. We perform quantitative and qualitative experiments to test the proposed method. In the quantitative experiment, the proposed method produces less error than the classical Discrete Fourier Transform approach in 70% of tested instances. The qualitative experiment shows that the proposed method significantly reduces motion blur.
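The cross-frequency phase prediction rests on the fact that the true (unwrapped) phase scales linearly with modulation frequency, so a wrapped measurement at one frequency predicts the wrapped phase at another once the number of phase wraps is chosen. The frequencies, the candidate-wrap search, and the closest-match criterion below are illustrative assumptions, not the thesis's exact selection rule.

```python
# Sketch of phase prediction across modulation frequencies with
# unwrapping: true phase is proportional to frequency, so we search for
# the wrap count whose prediction best matches the other frequency's
# measurement. Frequencies and search range are illustrative.
import numpy as np

C = 299792458.0  # speed of light, m/s

def predict_phase(phi1, f1, f2, n_wraps):
    """Predict the wrapped phase at f2 from a wrapped phase at f1."""
    unwrapped = phi1 + 2 * np.pi * n_wraps        # candidate unwrapped phase at f1
    return (unwrapped * f2 / f1) % (2 * np.pi)    # scale by frequency ratio, re-wrap

def best_wrap(phi1, f1, phi2_measured, f2, max_wraps=5):
    """Choose the wrap count whose prediction best matches the f2 measurement."""
    errors = [abs(np.angle(np.exp(1j * (predict_phase(phi1, f1, f2, n) - phi2_measured))))
              for n in range(max_wraps + 1)]      # wrapped angular distance
    return int(np.argmin(errors))

def distance_from_phase(phi, f, n_wraps):
    # d = c * phi_unwrapped / (4 pi f) for the two-way light path
    return C * (phi + 2 * np.pi * n_wraps) / (4 * np.pi * f)
```

For a 5 m target observed at 80 MHz and 16 MHz, the 80 MHz phase has wrapped twice, and the search correctly recovers that wrap count from the 16 MHz measurement.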

    Estimation of the Temporal Response Function and Tracking Selective Auditory Attention using Deep Kalman Filter

    The cocktail party effect refers to the phenomenon that people can focus on a single sound source in a noisy environment with multiple speakers talking at the same time. This effect reflects the human brain's ability of selective auditory attention, whose decoding from non-invasive electroencephalography (EEG) or magnetoencephalography (MEG) has recently been a topic of active research. The mapping between auditory stimuli and their neural responses can be measured by the auditory temporal response function (TRF). It has been shown that TRF estimates derived from the envelopes of speech streams and auditory neural responses can be used to make predictions that discriminate between attended and unattended speakers. l_1-regularized least squares estimation has been adopted in previous research for the estimation of the linear TRF model. However, most real-world systems exhibit a degree of non-linearity, so new models are needed for complex, realistic auditory environments. In this thesis, we propose to estimate TRFs with the deep Kalman filter model, for cases where the observations are a noisy, non-linear function of the latent states. The deep Kalman filter (DKF) algorithm is developed by drawing on techniques from variational inference. Replacing all the linear transformations in the classic Kalman filter model with non-linear transformations makes the posterior distribution intractable to compute. Thus, a recognition network is introduced to approximate the intractable posterior and optimize the variational lower bound of the objective function. We implement the deep Kalman filter model with a two-layer bidirectional LSTM and an MLP. The performance is first evaluated by applying our algorithm to simulated MEG data. 
In addition, we combined the new model for TRF estimation with a previously proposed framework, replacing the dynamic encoding/decoding module in that framework with a deep Kalman filter to conduct real-time tracking of selective auditory attention. This performance is validated by applying the general framework to simulated EEG data.
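The generative side of a deep Kalman filter can be sketched as a classic state-space model whose linear transition and emission maps are replaced by small neural networks. The tanh networks, fixed random weights, and noise scales below are illustrative assumptions standing in for the trained model.

```python
# Sketch of the deep Kalman filter's generative model: latent states
# evolve through a non-linear transition and emit observations through a
# non-linear map, replacing the linear transforms of the classic Kalman
# filter. Networks and noise scales are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W1, b1, W2, b2):
    # One-hidden-layer tanh network standing in for a learned map.
    return np.tanh(x @ W1 + b1) @ W2 + b2

def simulate_dkf(T, d_z=2, d_y=3, q=0.1, r=0.05):
    """Sample a trajectory from z_t = f(z_{t-1}) + eps, y_t = g(z_t) + delta."""
    # Random fixed weights play the role of trained transition/emission parameters.
    Wf1, bf1 = rng.normal(size=(d_z, 8)), np.zeros(8)
    Wf2, bf2 = rng.normal(size=(8, d_z)), np.zeros(d_z)
    Wg1, bg1 = rng.normal(size=(d_z, 8)), np.zeros(8)
    Wg2, bg2 = rng.normal(size=(8, d_y)), np.zeros(d_y)
    z = np.zeros(d_z)
    states, obs = [], []
    for _ in range(T):
        z = mlp(z, Wf1, bf1, Wf2, bf2) + q * rng.normal(size=d_z)   # non-linear transition
        y = mlp(z, Wg1, bg1, Wg2, bg2) + r * rng.normal(size=d_y)   # non-linear emission
        states.append(z)
        obs.append(y)
    return np.array(states), np.array(obs)
```

It is exactly this non-linearity in `f` and `g` that makes the exact posterior over `z` intractable, motivating the recognition network and variational lower bound described in the abstract.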

    Advances in Intelligent Vehicle Control

    This book is a printed edition of the Special Issue Advances in Intelligent Vehicle Control, which was published in the journal Sensors. It presents a collection of eleven papers covering a range of topics, such as the development of intelligent control algorithms for active safety systems, smart sensors, and intelligent and efficient driving. The contributions presented in these papers can serve as useful tools for researchers interested in new vehicle technology and in the improvement of vehicle control systems.