    Learning to detect dysarthria from raw speech

    Speech classifiers for paralinguistic traits traditionally learn from diverse hand-crafted low-level features, selecting the information relevant to the task at hand. We explore an alternative to this selection by jointly learning the classifier and the feature extraction. Recent work on speech recognition has shown improved performance over standard speech features by learning from the waveform. We extend this approach to paralinguistic classification and propose a neural network that learns a filterbank, a normalization factor, and a compression power from raw speech, jointly with the rest of the architecture. We apply this model to dysarthria detection from sentence-level audio recordings. Starting from a strong attention-based baseline on which mel-filterbanks outperform standard low-level descriptors, we show that learning the filters or the normalization and compression improves over fixed features by 10% absolute accuracy. We also observe a gain over openSMILE features when the feature extraction, normalization, and compression are learned jointly with the rest of the architecture. This constitutes a first attempt at jointly learning all of these operations from raw audio for a speech classification task.
    Comment: 5 pages, 3 figures, submitted to ICASSP.
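
    The learnable normalization and compression described above can be pictured as a small trainable layer applied to filterbank energies. Below is a minimal PyTorch sketch, not the authors' implementation: the layer name, initialization values, and the (batch, channels, time) layout are assumptions for illustration.

    ```python
    import torch
    import torch.nn as nn

    class LearnableCompression(nn.Module):
        # Per-channel normalization factor mu and compression power alpha,
        # trained jointly with the rest of the network:
        #   y = (x / (|mu| + eps)) ** alpha
        def __init__(self, n_channels, init_alpha=0.3):
            super().__init__()
            self.alpha = nn.Parameter(torch.full((n_channels,), init_alpha))
            self.mu = nn.Parameter(torch.ones(n_channels))

        def forward(self, x):
            # x: (batch, channels, time) non-negative filterbank energies
            eps = 1e-6
            normed = x / (self.mu.abs().unsqueeze(-1) + eps)
            return (normed + eps) ** self.alpha.unsqueeze(-1)
    ```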

    EfficientLEAF: A Faster LEarnable Audio Frontend of Questionable Use

    In audio classification, differentiable auditory filterbanks with few parameters cover the middle ground between hard-coded spectrograms and raw audio. LEAF (arXiv:2101.08596), a Gabor-based filterbank combined with Per-Channel Energy Normalization (PCEN), has shown promising results, but is computationally expensive. With inhomogeneous convolution kernel sizes and strides, and by replacing PCEN with better-parallelizable operations, we can reach similar results more efficiently. In experiments on six audio classification tasks, our frontend matches the accuracy of LEAF at 3% of the cost, but both fail to consistently outperform a fixed mel filterbank. The quest for learnable audio frontends is not solved.
    Comment: Accepted at EUSIPCO 2022. Code at https://github.com/CPJKU/EfficientLEAF
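
    Since the abstract singles out PCEN as the poorly parallelizable component, a minimal, simplified PCEN sketch may help; the parameter values are common defaults from the PCEN literature, not those of LEAF or EfficientLEAF.

    ```python
    import torch

    def pcen(E, alpha=0.98, delta=2.0, r=0.5, s=0.025, eps=1e-6):
        # E: (batch, channels, time) non-negative filterbank energies.
        # M is an exponential moving average of E over time; this sequential
        # recursion is the step that parallelizes poorly on GPUs.
        m = [E[..., 0]]
        for t in range(1, E.shape[-1]):
            m.append((1.0 - s) * m[-1] + s * E[..., t])
        M = torch.stack(m, dim=-1)
        # PCEN(t, f) = (E / (eps + M)^alpha + delta)^r - delta^r
        return (E / (eps + M) ** alpha + delta) ** r - delta ** r
    ```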

    Curved Gabor Filters for Fingerprint Image Enhancement

    Gabor filters play an important role in many application areas for the enhancement of various types of images and the extraction of Gabor features. For the purpose of enhancing curved structures in noisy images, we introduce curved Gabor filters which locally adapt their shape to the direction of flow. These curved Gabor filters enable the choice of filter parameters which increase the smoothing power without creating artifacts in the enhanced image. In this paper, curved Gabor filters are applied to the curved ridge and valley structure of low-quality fingerprint images. First, we combine two orientation field estimation methods in order to obtain a more robust estimation for very noisy images. Next, curved regions are constructed by following the respective local orientation, and they are used for estimating the local ridge frequency. Lastly, curved Gabor filters are defined based on the curved regions and applied for the enhancement of low-quality fingerprint images. Experimental results on the FVC2004 databases show improvements over state-of-the-art enhancement methods.
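
    For orientation, a standard (straight) 2D Gabor kernel is sketched below; the curved filters of the paper bend this kernel's support along the locally estimated ridge flow rather than keeping it on a straight rotated grid. The function name and the isotropic envelope are illustrative assumptions.

    ```python
    import numpy as np

    def gabor_kernel(ksize, sigma, theta, freq):
        # Gaussian envelope times a cosine carrier oriented along theta,
        # with spatial frequency freq in cycles per pixel.
        half = ksize // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        xr = x * np.cos(theta) + y * np.sin(theta)   # rotated coordinates
        yr = -x * np.sin(theta) + y * np.cos(theta)
        envelope = np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2))
        carrier = np.cos(2 * np.pi * freq * xr)
        return envelope * carrier
    ```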

    Learning Audio Sequence Representations for Acoustic Event Classification

    Acoustic Event Classification (AEC) has become a significant task for machines perceiving the surrounding auditory scene. However, extracting effective representations that capture the underlying characteristics of acoustic events remains challenging. Previous methods mainly focused on designing audio features in a hand-crafted manner; data-learnt features have recently been reported to perform better, but so far only at the frame level. In this paper, we propose an unsupervised learning framework to learn a vector representation of an audio sequence for AEC. The framework consists of a Recurrent Neural Network (RNN) encoder and an RNN decoder, which respectively transform the variable-length audio sequence into a fixed-length vector and reconstruct the input sequence from the generated vector. After training the encoder-decoder, we feed audio sequences to the encoder and take the learnt vectors as the audio sequence representations. Compared with previous methods, the proposed approach not only handles audio streams of arbitrary length but also captures the salient information of the sequence. Extensive evaluation on a large acoustic event database demonstrates that the learnt representation yields a significant improvement over state-of-the-art hand-crafted sequence features for AEC.
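
    The encoder-decoder idea can be sketched in a few lines of PyTorch. This is not the paper's exact architecture: the choice of GRU cells, the teacher-forced decoder, and the use of the final hidden state as the representation are assumptions for illustration.

    ```python
    import torch
    import torch.nn as nn

    class SequenceAutoencoder(nn.Module):
        def __init__(self, feat_dim, hidden_dim):
            super().__init__()
            self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
            self.decoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, feat_dim)

        def forward(self, x):
            # x: (batch, time, feat_dim) frame-level audio features
            _, h = self.encoder(x)               # fixed-length summary vector
            dec, _ = self.decoder(x, h)          # teacher-forced reconstruction
            return self.out(dec), h.squeeze(0)   # reconstruction, embedding
    ```

    Training minimizes the reconstruction error (e.g. mean squared error) between the decoder output and the input sequence; afterwards only the encoder is kept, and its final hidden state serves as the sequence representation.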

    Time-frequency distributions for automatic speech recognition

    AM-FM methods for image and video processing

    This dissertation is focused on the development of robust and efficient Amplitude-Modulation Frequency-Modulation (AM-FM) demodulation methods for image and video processing (a patent covering the AM-FM methods and applications described in this dissertation is currently pending). The motivation for this research lies in the wide range of image and video processing applications that can significantly benefit from it; a number of potential applications are developed in the dissertation. First, a new, robust and efficient formulation for instantaneous frequency (IF) estimation is presented: a variable-spacing, local quadratic phase method (VS-LQP). VS-LQP produces much more accurate results than current AM-FM methods. At significant noise levels (SNR < 30 dB), for single-component images, VS-LQP produces better IF estimates than methods using a multi-scale filterbank. At low noise levels (SNR > 50 dB), VS-LQP performs better when used in combination with a multi-scale filterbank. In all cases, VS-LQP outperforms the Quasi-Eigen Approximation algorithm by significant amounts (up to 20 dB). New least-squares reconstructions using AM-FM components from the input signal (image or video) are also presented. Three reconstruction approaches are developed: (i) using AM-FM harmonics, (ii) using AM-FM components extracted from different scales, and (iii) using AM-FM harmonics with the output of a low-pass filter. The image reconstruction methods provide perceptually lossless results with image quality index values greater than 0.7 on average. The video reconstructions produced frame-by-frame image quality index values above 0.7 using AM-FM components extracted from different scales. An application of the AM-FM method to retinal image analysis is also shown. This approach uses the instantaneous frequency magnitude and the instantaneous amplitude (IA) to provide image features. The new AM-FM approach produced an ROC area of 0.984 in classifying Risk 0 versus Risk 1, 0.95 in classifying Risk 0 versus Risk 2, 0.973 in classifying Risk 0 versus Risk 3, and 0.95 in classifying Risk 0 versus all images with any sign of Diabetic Retinopathy. An extension of the 2D AM-FM demodulation methods to three dimensions is also presented. New AM-FM methods for motion estimation are developed; the new motion estimation method provides three motion estimation equations per channel filter (AM and IF motion equations and a continuity equation). Applications of the method to motion tracking, trajectory estimation, and continuous-scale video search are demonstrated. For each application, we discuss the advantages of the AM-FM methods over current approaches.
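
    As a point of reference for the demodulation problem, the textbook 1D AM-FM decomposition via the analytic signal can be written as below; the dissertation's VS-LQP estimator replaces the phase-differencing IF step with a local quadratic phase fit for noise robustness, and extends demodulation to 2D and 3D. The function name is illustrative.

    ```python
    import numpy as np
    from scipy.signal import hilbert

    def am_fm_demodulate(x, fs):
        # Analytic signal z(t) = x(t) + j * H{x}(t); its magnitude gives the
        # instantaneous amplitude (AM) and the derivative of its unwrapped
        # phase gives the instantaneous frequency (FM).
        z = hilbert(x)
        ia = np.abs(z)                                   # AM component
        phase = np.unwrap(np.angle(z))
        inst_freq = np.diff(phase) * fs / (2 * np.pi)    # FM component, in Hz
        return ia, inst_freq
    ```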