406 research outputs found

    Towards Cognizant Hearing Aids: Modeling of Content, Affect and Attention

    Diffusion-Based Audio Inpainting

    Audio inpainting aims to reconstruct missing segments in corrupted recordings. Previous methods produce plausible reconstructions when the gap length is shorter than about 100 ms, but the quality decreases for longer gaps. This paper explores recent advances in deep learning, particularly diffusion models, for the task of audio inpainting. The proposed method uses an unconditionally trained generative model, which can be conditioned in a zero-shot fashion for audio inpainting, offering high flexibility to regenerate gaps of arbitrary length. An improved deep neural network architecture based on the constant-Q transform, which allows the model to exploit pitch-equivariant symmetries in audio, is also presented. The performance of the proposed algorithm is evaluated through objective and subjective metrics for the task of reconstructing short to mid-sized gaps. The results of a formal listening test show that the proposed method performs comparably to the state of the art for short gaps, while retaining good audio quality and outperforming the baselines for the longest gap lengths tested, 150 ms and 200 ms. This work helps improve the restoration of sound recordings with fairly long local disturbances or dropouts that must be reconstructed. Comment: Submitted for publication to the Journal of Audio Engineering Society on January 30th, 202

    Learning Sensory Representations with Minimal Supervision

    Feature enhancement of reverberant speech by distribution matching and non-negative matrix factorization

    This paper describes a novel two-stage dereverberation feature enhancement method for noise-robust automatic speech recognition. In the first stage, an estimate of the dereverberated speech is generated by matching the distribution of the observed reverberant speech to that of clean speech, in a decorrelated transformation domain that has a long temporal context in order to address the effects of reverberation. The second stage uses this dereverberated signal as an initial estimate within a non-negative matrix factorization framework, which jointly estimates a sparse representation of the clean speech signal and an estimate of the convolutional distortion. The proposed feature enhancement method, when used in conjunction with automatic speech recognizer back-end processing, is shown to improve the recognition performance compared to three other state-of-the-art techniques

    Survey of deep representation learning for speech emotion recognition

    Traditionally, speech emotion recognition (SER) research has relied on manually handcrafted acoustic features using feature engineering. However, the design of handcrafted features for complex SER tasks requires significant manual effort, which impedes generalisability and slows the pace of innovation. This has motivated the adoption of representation learning techniques that can automatically learn an intermediate representation of the input signal without any manual feature engineering. Representation learning has led to improved SER performance and enabled rapid innovation. Its effectiveness has further increased with advances in deep learning (DL), which has facilitated deep representation learning, where hierarchical representations are automatically learned in a data-driven manner. This paper presents the first comprehensive survey on the important topic of deep representation learning for SER. We highlight various techniques and related challenges, and identify important future areas of research. Our survey bridges the gap in the literature, since existing surveys either focus on SER with hand-engineered features or on representation learning in the general setting without focusing on SER

    Designing the next generation intelligent transportation sensor system using big data driven machine learning techniques

    Accurate traffic data collection is essential for supporting advanced traffic management system operations. This study investigated a large-scale, data-driven, sequential traffic sensor health monitoring (TSHM) module that can be used to monitor sensor health conditions over large traffic networks. Our proposed module consists of three sequential steps for detecting different types of abnormal sensor issues. The first step detects sensors with abnormally high missing data rates, while the second step uses clustering anomaly detection to detect sensors reporting abnormal records. The final step introduces a novel Bayesian changepoint modeling technique to detect sensors reporting abnormal traffic data fluctuations by assuming a constant vehicle length distribution based on average effective vehicle length (AEVL). Our proposed method is then compared with two benchmark algorithms to show its efficacy. Results obtained by applying our method to the statewide traffic sensor data of Iowa show it can successfully detect different classes of sensor issues. This demonstrates that sequential TSHM modules can help transportation agencies determine traffic sensors’ exact problems, thereby enabling them to take the required corrective steps. The second research objective focuses on traffic data imputation after the anomalous or missing data collected from failed traffic sensors are discarded. Sufficient high-quality traffic data are a crucial component of various Intelligent Transportation System (ITS) applications and of research related to congestion prediction, speed prediction, incident detection, and other traffic operation tasks. Nonetheless, missing traffic data are a common and inevitable issue in sensor data for several reasons, such as sensor malfunction, poor maintenance or calibration, and intermittent communications. Such missing data often make data analysis and decision-making complicated and challenging.
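The first screening step described above (flagging sensors with abnormally high missing data rates) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `flag_high_missing`, the NaN convention for missing records, and the 0.5 threshold are all assumptions made for illustration.

```python
import numpy as np

def flag_high_missing(readings, threshold=0.5):
    """Return per-sensor missing rates and a boolean mask of flagged sensors.

    readings:  2-D array-like, one row per sensor, NaN marking a missing record.
    threshold: hypothetical cut-off (not a value taken from the study).
    """
    data = np.asarray(readings, dtype=float)
    # Fraction of missing records in each sensor's row.
    rates = np.isnan(data).mean(axis=1)
    return rates, rates > threshold

# Two hypothetical sensors with four time slots each.
rates, flagged = flag_high_missing(
    [[1.0, np.nan, np.nan, np.nan],
     [1.0, 2.0, np.nan, 4.0]]
)
print(rates, flagged)
```

In practice the cut-off would presumably be chosen from the distribution of missing rates across the network rather than fixed in advance.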
In this study, we have developed a generative adversarial network (GAN) based traffic sensor data imputation framework (TSDIGAN) to efficiently reconstruct the missing data by generating realistic synthetic data. In recent years, GANs have shown impressive success in image data generation. However, generating traffic data with GAN-based modeling is a challenging task, since traffic data have strong time dependency. To address this problem, we propose a novel time-dependent encoding method called the Gramian Angular Summation Field (GASF) that converts the problem of traffic time-series data generation into that of image generation. We have evaluated and tested our proposed model using the benchmark dataset provided by the Caltrans Performance Measurement System (PeMS). This study shows that the proposed model can significantly improve the traffic data imputation accuracy in terms of Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) compared to state-of-the-art models on the benchmark dataset. Further, the model achieves reasonably high accuracy in imputation tasks even under a very high missing data rate (>50%), which shows the robustness and efficiency of the proposed model. Besides loop and radar sensors, traffic cameras have shown great ability to provide insightful traffic information through image and video processing techniques. Therefore, the third and final part of this work aimed to introduce an end-to-end, real-time, cloud-enabled traffic video analysis (IVA) framework to support the development of the future smart city. As artificial intelligence (AI) grows rapidly, computer vision (CV) techniques are expected to significantly improve the development of intelligent transportation systems (ITS), which are anticipated to be a key component of future Smart City (SC) frameworks.
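To make the Gramian Angular Summation Field encoding mentioned above concrete, here is a minimal sketch of the usual GASF construction: rescale the series to [-1, 1], map each value to an angle with arccos, and form the matrix of cosines of pairwise angle sums. It is not the TSDIGAN code; the function name `gasf`, the min-max rescaling, and the handling of a constant series are assumptions, and the sketch expects a series with no missing entries.

```python
import numpy as np

def gasf(series):
    """Encode a 1-D series as a Gramian Angular Summation Field image.

    Entry (i, j) of the result is cos(phi_i + phi_j), where
    phi = arccos(x) and x is the series rescaled to [-1, 1].
    """
    x = np.asarray(series, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        # Constant series: map every value to 0 (an assumption of this sketch).
        scaled = np.zeros_like(x)
    else:
        scaled = 2.0 * (x - lo) / (hi - lo) - 1.0
    # Guard against tiny floating-point overshoot before taking arccos.
    scaled = np.clip(scaled, -1.0, 1.0)
    phi = np.arccos(scaled)
    # Broadcasting builds the full n-by-n matrix of pairwise angle sums.
    return np.cos(phi[:, None] + phi[None, :])

# A hypothetical three-point series yields a symmetric 3x3 "image".
image = gasf([0.0, 1.0, 2.0])
print(image.shape)
```

The result is a symmetric n-by-n array for a series of length n, which is what lets an image-generating model operate on time-series windows.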
Powered by computer vision techniques, the conversion of existing traffic cameras into connected "smart sensors", called intelligent video analysis (IVA) systems, has shown great capability to produce insightful data to support ITS applications. However, developing such IVA systems for large-scale, real-time applications deserves further study, as current research efforts focus more on model effectiveness than on model efficiency. Therefore, we have introduced a real-time, large-scale, cloud-enabled traffic video analysis framework using NVIDIA DeepStream, which is a streaming analysis toolkit for AI-based video and image analysis. In this study, we have evaluated the technical and economic feasibility of our proposed framework to help traffic agencies build IVA systems more efficiently. Our study shows that the daily operating cost for our proposed framework on Google Cloud Platform (GCP) is less than $0.14 per camera, and that, compared with manual inspections, our framework achieves an average vehicle-counting accuracy of 83.7% on sunny days