4,725 research outputs found
A Comprehensive Survey of Deep Learning in Remote Sensing: Theories, Tools and Challenges for the Community
In recent years, deep learning (DL), a re-branding of neural networks (NNs),
has risen to the top in numerous areas, namely computer vision (CV), speech
recognition, natural language processing, etc. Whereas remote sensing (RS)
possesses a number of unique challenges, primarily related to sensors and
applications, inevitably RS draws from many of the same theories as CV; e.g.,
statistics, fusion, and machine learning, to name a few. This means that the RS
community should be aware of, if not at the leading edge of, of advancements
like DL. Herein, we provide the most comprehensive survey of state-of-the-art
RS DL research. We also review recent new developments in the DL field that can
be used in DL for RS. Namely, we focus on theories, tools and challenges for
the RS community. Specifically, we focus on unsolved challenges and
opportunities as it relates to (i) inadequate data sets, (ii)
human-understandable solutions for modelling physical phenomena, (iii) Big
Data, (iv) non-traditional heterogeneous data sources, (v) DL architectures and
learning algorithms for spectral, spatial and temporal data, (vi) transfer
learning, (vii) an improved theoretical understanding of DL systems, (viii)
high barriers to entry, and (ix) training and optimizing the DL.Comment: 64 pages, 411 references. To appear in Journal of Applied Remote
Sensin
Blind Restoration of Real-World Audio by 1D Operational GANs
Objective: Despite numerous studies proposed for audio restoration in the
literature, most of them focus on an isolated restoration problem such as
denoising or dereverberation, ignoring other artifacts. Moreover, assuming a
noisy or reverberant environment with limited number of fixed
signal-to-distortion ratio (SDR) levels is a common practice. However,
real-world audio is often corrupted by a blend of artifacts such as
reverberation, sensor noise, and background audio mixture with varying types,
severities, and duration. In this study, we propose a novel approach for blind
restoration of real-world audio signals by Operational Generative Adversarial
Networks (Op-GANs) with temporal and spectral objective metrics to enhance the
quality of restored audio signal regardless of the type and severity of each
artifact corrupting it. Methods: 1D Operational-GANs are used with generative
neuron model optimized for blind restoration of any corrupted audio signal.
Results: The proposed approach has been evaluated extensively over the
benchmark TIMIT-RAR (speech) and GTZAN-RAR (non-speech) datasets corrupted with
a random blend of artifacts each with a random severity to mimic real-world
audio signals. Average SDR improvements of over 7.2 dB and 4.9 dB are achieved,
respectively, which are substantial when compared with the baseline methods.
Significance: This is a pioneer study in blind audio restoration with the
unique capability of direct (time-domain) restoration of real-world audio
whilst achieving an unprecedented level of performance for a wide SDR range and
artifact types. Conclusion: 1D Op-GANs can achieve robust and computationally
effective real-world audio restoration with significantly improved performance.
The source codes and the generated real-world audio datasets are shared
publicly with the research community in a dedicated GitHub repository1
End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models
Speech activity detection (SAD) plays an important role in current speech
processing systems, including automatic speech recognition (ASR). SAD is
particularly difficult in environments with acoustic noise. A practical
solution is to incorporate visual information, increasing the robustness of the
SAD approach. An audiovisual system has the advantage of being robust to
different speech modes (e.g., whisper speech) or background noise. Recent
advances in audiovisual speech processing using deep learning have opened
opportunities to capture in a principled way the temporal relationships between
acoustic and visual features. This study explores this idea proposing a
\emph{bimodal recurrent neural network} (BRNN) framework for SAD. The approach
models the temporal dynamic of the sequential audiovisual data, improving the
accuracy and robustness of the proposed SAD system. Instead of estimating
hand-crafted features, the study investigates an end-to-end training approach,
where acoustic and visual features are directly learned from the raw data
during training. The experimental evaluation considers a large audiovisual
corpus with over 60.8 hours of recordings, collected from 105 speakers. The
results demonstrate that the proposed framework leads to absolute improvements
up to 1.2% under practical scenarios over a VAD baseline using only audio
implemented with deep neural network (DNN). The proposed approach achieves
92.7% F1-score when it is evaluated using the sensors from a portable tablet
under noisy acoustic environment, which is only 1.0% lower than the performance
obtained under ideal conditions (e.g., clean speech obtained with a high
definition camera and a close-talking microphone).Comment: Submitted to Speech Communicatio
- …