Learning to detect dysarthria from raw speech
Speech classifiers of paralinguistic traits traditionally learn from diverse
hand-crafted low-level features, by selecting the relevant information for the
task at hand. We explore an alternative to this selection: jointly learning
the classifier and the feature extraction. Recent work on speech recognition
has shown improved performance over speech features by learning from the
waveform. We extend this approach to paralinguistic classification and propose
a neural network that can learn a filterbank, a normalization factor and a
compression power from the raw speech, jointly with the rest of the
architecture. We apply this model to dysarthria detection from sentence-level
audio recordings. Starting from a strong attention-based baseline on which
mel-filterbanks outperform standard low-level descriptors, we show that
learning the filters or the normalization and compression improves over fixed
features by 10% absolute accuracy. We also observe a gain over OpenSmile
features by learning jointly the feature extraction, the normalization, and the
compression factor with the architecture. This constitutes a first attempt at
learning jointly all these operations from raw audio for a speech
classification task.
Comment: 5 pages, 3 figures, submitted to ICASS
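The fixed baseline this abstract compares against can be pictured with a minimal sketch: triangular mel filters applied to a power spectrum, followed by a compression exponent. In a learned frontend the filter shapes, the normalization and the exponent would be trainable parameters; here they are fixed constants. The function names, the HTK-style mel formula and the default `power=0.5` are illustrative assumptions, not details taken from the paper.

```python
import math

def hz_to_mel(f_hz):
    # HTK-style mel scale (an assumption; other variants exist)
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters over n_bins frequency bins spanning 0..sample_rate/2."""
    max_mel = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edge frequencies, equally spaced on the mel scale
    edges_hz = [mel_to_hz(max_mel * i / (n_filters + 1))
                for i in range(n_filters + 2)]
    bin_hz = [sample_rate / 2.0 * k / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for j in range(n_filters):
        lo, mid, hi = edges_hz[j], edges_hz[j + 1], edges_hz[j + 2]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def compressed_features(power_spectrum, bank, power=0.5):
    """Filterbank energies followed by a fixed power-law compression."""
    energies = [sum(w * p for w, p in zip(row, power_spectrum)) for row in bank]
    return [e ** power for e in energies]

bank = mel_filterbank(n_filters=4, n_bins=65, sample_rate=16000)
features = compressed_features([1.0] * 65, bank)
```

Making `power` (and the filter weights) learnable is the kind of change the abstract describes, though the paper's actual parameterization is not given here.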
EfficientLEAF: A Faster LEarnable Audio Frontend of Questionable Use
In audio classification, differentiable auditory filterbanks with few
parameters cover the middle ground between hard-coded spectrograms and raw
audio. LEAF (arXiv:2101.08596), a Gabor-based filterbank combined with
Per-Channel Energy Normalization (PCEN), has shown promising results, but is
computationally expensive. With inhomogeneous convolution kernel sizes and
strides, and by replacing PCEN with better parallelizable operations, we can
reach similar results more efficiently. In experiments on six audio
classification tasks, our frontend matches the accuracy of LEAF at 3% of the
cost, but both fail to consistently outperform a fixed mel filterbank. The
quest for learnable audio frontends is not solved.
Comment: Accepted at EUSIPCO 2022. Code at
https://github.com/CPJKU/EfficientLEA
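For readers unfamiliar with PCEN, the following is a minimal pure-Python sketch of the commonly cited formulation: each channel energy is divided by a power of its own exponentially smoothed history, then offset and root-compressed. The function name and the default parameter values are illustrative assumptions; this is not the LEAF or EfficientLEAF implementation.

```python
def pcen(energies, alpha=0.98, delta=2.0, r=0.5, s=0.025, eps=1e-6):
    """Per-Channel Energy Normalization.

    energies: list of frames, each a list of non-negative channel energies.
    Returns a list of frames with the same shape.
    """
    out = []
    smoothed = list(energies[0])  # start the smoother at the first frame
    for frame in energies:
        # M[t] = (1 - s) * M[t-1] + s * E[t], computed per channel
        smoothed = [(1.0 - s) * m + s * e for m, e in zip(smoothed, frame)]
        # (E / (eps + M)^alpha + delta)^r - delta^r
        out.append([
            (e / (eps + m) ** alpha + delta) ** r - delta ** r
            for e, m in zip(frame, smoothed)
        ])
    return out

normalized = pcen([[1.0, 0.0]] * 5)
```

In this naive form the smoother is a recursion over frames, so each time step depends on the previous one. That sequential dependency is one reason such an operation is harder to parallelize than a plain convolution.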
Curved Gabor Filters for Fingerprint Image Enhancement
Gabor filters play an important role in many application areas for the
enhancement of various types of images and the extraction of Gabor features.
For the purpose of enhancing curved structures in noisy images, we introduce
curved Gabor filters which locally adapt their shape to the direction of flow.
These curved Gabor filters enable the choice of filter parameters which
increase the smoothing power without creating artifacts in the enhanced image.
In this paper, curved Gabor filters are applied to the curved ridge and valley
structure of low-quality fingerprint images. First, we combine two orientation
field estimation methods in order to obtain a more robust estimation for very
noisy images. Next, curved regions are constructed by following the respective
local orientation and they are used for estimating the local ridge frequency.
Lastly, curved Gabor filters are defined based on curved regions and they are
applied for the enhancement of low-quality fingerprint images. Experimental
results on the FVC2004 databases show improvements of this approach in
comparison to state-of-the-art enhancement methods.
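As background for the abstract above, here is a minimal sketch of the classical straight Gabor kernel: a Gaussian envelope multiplied by a cosine carrier along a fixed orientation. It is not the curved filter the paper introduces, which bends the filter along the local orientation field. The function name, the isotropic envelope and the parameter values are illustrative assumptions.

```python
import math

def gabor_kernel(size, sigma, theta, frequency):
    """Real (even-symmetric) Gabor kernel with odd side length `size`.

    sigma: standard deviation of the isotropic Gaussian envelope (pixels)
    theta: orientation of the wave normal (radians)
    frequency: carrier frequency (cycles per pixel)
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # rotate coordinates so x_r runs across the ridges
            x_r = x * math.cos(theta) + y * math.sin(theta)
            y_r = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(x_r ** 2 + y_r ** 2) / (2.0 * sigma ** 2))
            carrier = math.cos(2.0 * math.pi * frequency * x_r)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

kernel = gabor_kernel(size=9, sigma=3.0, theta=0.0, frequency=0.1)
```

In fingerprint enhancement, `theta` and `frequency` would typically be set per pixel from the estimated ridge orientation and ridge frequency.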
Learning Audio Sequence Representations for Acoustic Event Classification
Acoustic Event Classification (AEC) has become a significant task for
machines to perceive the surrounding auditory scene. However, extracting
effective representations that capture the underlying characteristics of the
acoustic events is still challenging. Previous methods mainly focused on
designing the audio features in a 'hand-crafted' manner. Interestingly,
data-learnt features have been recently reported to show better performance. Up
to now, these were only considered on the frame-level. In this paper, we
propose an unsupervised learning framework to learn a vector representation of
an audio sequence for AEC. This framework consists of a Recurrent Neural
Network (RNN) encoder and an RNN decoder, which respectively transform the
variable-length audio sequence into a fixed-length vector and reconstruct the
input sequence from the generated vector. After training the encoder-decoder, we
feed the audio sequences to the encoder and then take the learnt vectors as the
audio sequence representations. Compared with previous methods, the proposed
method not only handles audio streams of arbitrary length but also learns the
salient information of the sequence. Extensive evaluation on a large acoustic
event database shows that the learnt audio sequence representation
outperforms other state-of-the-art hand-crafted sequence features for AEC by a
large margin.
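The encoder half of the idea can be sketched in a few lines: a recurrent cell consumes frames one at a time, and its final hidden state serves as a fixed-length vector regardless of the sequence length. This is a minimal illustration with a plain tanh cell and random, untrained weights. The cell type, dimensions and names are assumptions, and the decoder and training loop are omitted.

```python
import math
import random

def make_weights(n_rows, n_cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_cols)]
            for _ in range(n_rows)]

def rnn_encode(sequence, w_in, w_rec, bias):
    """Run a tanh RNN over `sequence` and return the final hidden state."""
    hidden = [0.0] * len(bias)
    for frame in sequence:
        # h[t] = tanh(W_in x[t] + W_rec h[t-1] + b)
        hidden = [
            math.tanh(
                sum(w * x for w, x in zip(w_in[i], frame))
                + sum(w * h for w, h in zip(w_rec[i], hidden))
                + bias[i]
            )
            for i in range(len(bias))
        ]
    return hidden

rng = random.Random(0)
feat_dim, hidden_dim = 3, 4          # hypothetical sizes
w_in = make_weights(hidden_dim, feat_dim, rng)
w_rec = make_weights(hidden_dim, hidden_dim, rng)
bias = [0.0] * hidden_dim

short_clip = [[rng.random() for _ in range(feat_dim)] for _ in range(5)]
long_clip = [[rng.random() for _ in range(feat_dim)] for _ in range(50)]
short_vec = rnn_encode(short_clip, w_in, w_rec, bias)
long_vec = rnn_encode(long_clip, w_in, w_rec, bias)
```

Both clips map to vectors of length `hidden_dim`, which is what lets a downstream classifier accept audio of any duration.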
AM-FM methods for image and video processing
This dissertation is focused on the development of robust and efficient Amplitude-Modulation Frequency-Modulation (AM-FM) demodulation methods for image and video processing (there is currently a patent pending that covers the AM-FM methods and applications described in this dissertation). The motivation for this research lies in the wide range of image and video processing applications that can significantly benefit from this research. A number of potential applications are developed in the dissertation. First, a new, robust and efficient formulation for instantaneous frequency (IF) estimation is presented: a variable spacing, local quadratic phase method (VS-LQP). VS-LQP produces much more accurate results than current AM-FM methods. At significant noise levels (SNR < 30dB), for single component images, the VS-LQP method produces better IF estimation results than methods using a multi-scale filterbank. At low noise levels (SNR > 50dB), VS-LQP performs better when used in combination with a multi-scale filterbank. In all cases, VS-LQP outperforms the Quasi-Eigen Approximation algorithm by significant amounts (up to 20dB). New least squares reconstructions using AM-FM components from the input signal (image or video) are also presented. Three different reconstruction approaches are developed: (i) using AM-FM harmonics, (ii) using AM-FM components extracted from different scales and (iii) using AM-FM harmonics with the output of a low-pass filter. The image reconstruction methods provide perceptually lossless results with image quality index values greater than 0.7 on average. The video reconstructions produced image quality index values, frame by frame, up to more than 0.7 using AM-FM components extracted from different scales. An application of the AM-FM method to retinal image analysis is also shown. This approach uses the instantaneous frequency magnitude and the instantaneous amplitude (IA) information to provide image features.
The new AM-FM approach produced an ROC area of 0.984 in classifying Risk 0 versus Risk 1, 0.95 in classifying Risk 0 versus Risk 2, 0.973 in classifying Risk 0 versus Risk 3 and 0.95 in classifying Risk 0 versus all images with any sign of Diabetic Retinopathy. An extension of the 2D AM-FM demodulation methods to three dimensions is also presented. New AM-FM methods for motion estimation are developed. The new motion estimation method provides three motion estimation equations per channel filter (AM, IF motion equations and a continuity equation). Applications of the method in motion tracking, trajectory estimation and for continuous-scale video searching are demonstrated. For each application, we discuss the advantages of the AM-FM methods over current approaches.
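For orientation, the quantities named in this abstract are usually defined through the standard multi-component AM-FM image model, sketched below. The abstract does not state the dissertation's own notation, so the symbols here are assumptions: $a_n$ is the instantaneous amplitude (IA) of component $n$, $\varphi_n$ its phase, and the instantaneous frequency (IF) is the gradient of the phase.

```latex
I(\mathbf{x}) \approx \sum_{n=1}^{N} a_n(\mathbf{x}) \cos \varphi_n(\mathbf{x}),
\qquad
\mathrm{IA}_n(\mathbf{x}) = a_n(\mathbf{x}),
\qquad
\mathrm{IF}_n(\mathbf{x}) = \nabla \varphi_n(\mathbf{x})
```

Demodulation means estimating $a_n$ and $\nabla \varphi_n$ from the observed image $I$, typically after a filterbank has separated the components.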