Challenges in video based object detection in maritime scenario using computer vision
This paper discusses the technical challenges in maritime image processing
and machine vision problems for video streams generated by cameras. Even well
documented problems of horizon detection and registration of frames in a video
are very challenging in maritime scenarios. More advanced problems of
background subtraction and object detection in video streams are also very
challenging. The dynamic nature of the background, the unavailability of
static cues, the presence of small objects against distant backgrounds, and
illumination effects all contribute to the challenges discussed here.
Total Variation Regularized Tensor RPCA for Background Subtraction from Compressive Measurements
Background subtraction has been a fundamental and widely studied task in
video analysis, with a wide range of applications in video surveillance,
teleconferencing and 3D modeling. Recently, motivated by compressive imaging,
background subtraction from compressive measurements (BSCM) is becoming an
active research task in video surveillance. In this paper, we propose a novel
tensor-based robust PCA (TenRPCA) approach for BSCM by decomposing video frames
into backgrounds with spatial-temporal correlations and foregrounds with
spatio-temporal continuity in a tensor framework. In this approach, we use 3D
total variation (TV) to enhance the spatio-temporal continuity of foregrounds,
and Tucker decomposition to model the spatio-temporal correlations of video
background. Based on this idea, we design a basic tensor RPCA model over the
video frames, dubbed as the holistic TenRPCA model (H-TenRPCA). To characterize
the correlations among the groups of similar 3D patches of video background, we
further design a patch-group-based tensor RPCA model (PG-TenRPCA) by joint
tensor Tucker decompositions of 3D patch groups for modeling the video
background. Efficient algorithms using alternating direction method of
multipliers (ADMM) are developed to solve the proposed models. Extensive
experiments on simulated and real-world videos demonstrate the superiority of
the proposed approaches over the existing state-of-the-art approaches.
Comment: To appear in IEEE TI
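To illustrate the low-rank plus sparse idea behind this abstract, the following is a minimal sketch of ordinary matrix robust PCA solved with ADMM. It is not the tensor model proposed in the paper: there is no Tucker decomposition, no 3D total variation, and no compressive measurement operator. The function names, the weight `lam`, the penalty `mu`, and the toy video are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(x, tau):
    """Element-wise shrinkage: the proximal operator of the l1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def singular_value_threshold(x, tau):
    """Shrink the singular values: the proximal operator of the nuclear norm."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u * soft_threshold(s, tau)) @ vt

def rpca_admm(m, max_iter=1000, tol=1e-7):
    """Split m (pixels x frames) into a low-rank background and a sparse foreground."""
    lam = 1.0 / np.sqrt(max(m.shape))       # commonly used sparsity weight (assumption)
    mu = m.size / (4.0 * np.abs(m).sum())   # commonly used penalty heuristic (assumption)
    low_rank = np.zeros_like(m)
    sparse = np.zeros_like(m)
    dual = np.zeros_like(m)
    for _ in range(max_iter):
        low_rank = singular_value_threshold(m - sparse + dual / mu, 1.0 / mu)
        sparse = soft_threshold(m - low_rank + dual / mu, lam / mu)
        residual = m - low_rank - sparse
        dual = dual + mu * residual          # dual ascent on the constraint m = L + S
        if np.linalg.norm(residual) <= tol * np.linalg.norm(m):
            break
    return low_rank, sparse

# Toy "video": 100 pixels x 30 frames, a static background plus sparse outliers.
rng = np.random.default_rng(0)
background = np.outer(rng.uniform(0.2, 0.8, size=100), np.ones(30))
foreground = np.zeros_like(background)
mask = rng.random(background.shape) < 0.05
foreground[mask] = rng.choice([-1.0, 1.0], size=mask.sum())
video = background + foreground

est_background, est_foreground = rpca_admm(video)
```

Each column of `video` is one vectorised frame, so a static background yields a rank-one matrix and moving objects appear as sparse deviations from it.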
Towards An Intelligent Fuzzy Based Multimodal Two Stage Speech Enhancement System
This thesis presents a novel two stage multimodal speech enhancement system, making use of both visual and audio information to filter speech, and explores the extension of this system with the use of fuzzy logic to demonstrate proof of concept for an envisaged autonomous, adaptive, and context aware multimodal system. The design of the proposed cognitively inspired framework is scalable, meaning that it is possible for the techniques used in individual parts of the system to be upgraded and there is scope for the initial framework presented here to be expanded.
In the proposed system, the concept of single modality two stage filtering is extended to include the visual modality. Noisy speech information received by a microphone array is first pre-processed by visually derived Wiener filtering employing the novel use of the Gaussian Mixture Regression (GMR) technique, making use of associated visual speech information, extracted using a state of the art Semi Adaptive Appearance Models (SAAM) based lip tracking approach. This pre-processed speech is then enhanced further by audio only beamforming using a state of the art Transfer Function Generalised Sidelobe Canceller (TFGSC) approach. This results in a system designed to function in challenging noisy speech environments, tested using speech sentences with different speakers from the GRID corpus and a range of noise recordings. Both objective and subjective test results (employing the widely used Perceptual Evaluation of Speech Quality (PESQ) measure, a composite objective measure, and subjective listening tests) show that this initial system is capable of delivering very encouraging results with regard to filtering speech mixtures in difficult reverberant speech environments.
Some limitations of this initial framework are identified, and the extension of this multimodal system is explored, with the development of a fuzzy logic based framework and a proof of concept demonstration implemented. Results show that this proposed autonomous, adaptive, and context aware multimodal framework is capable of delivering very positive results in difficult noisy speech environments, with cognitively inspired use of audio and visual information, depending on environmental conditions. Finally, some concluding remarks are made along with proposals for future work.
Model-Based Speech Enhancement
A method of speech enhancement is developed that reconstructs clean speech from
a set of acoustic features using a harmonic plus noise model of speech. This is a significant
departure from traditional filtering-based methods of speech enhancement.
A major challenge with this approach is to estimate accurately the acoustic features
(voicing, fundamental frequency, spectral envelope and phase) from noisy speech.
This is achieved using maximum a-posteriori (MAP) estimation methods that operate
on the noisy speech. In each case a prior model of the relationship between the
noisy speech features and the estimated acoustic feature is required. These models
are approximated using speaker-independent GMMs of the clean speech features
that are adapted to speaker-dependent models using MAP adaptation and for noise
using the Unscented Transform.
Objective results are presented to optimise the proposed system and a set of subjective
tests compare the approach with traditional enhancement methods. Three-way
listening tests examining signal quality, background noise intrusiveness and
overall quality show the proposed system to be highly robust to noise, performing
significantly better than conventional methods of enhancement in terms of background
noise intrusiveness. However, the proposed method is shown to reduce signal
quality, with overall quality measured to be roughly equivalent to that of the Wiener
filter.
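The reconstruction step described in this abstract, rebuilding speech from acoustic features with a harmonic plus noise model, can be sketched roughly as follows. This is only an illustrative synthesis of one frame under simplifying assumptions: the function name and parameters are hypothetical, the spectral envelope is reduced to a list of per-harmonic amplitudes, the noise component is plain white noise, and the MAP feature estimation from noisy speech is not modelled at all.

```python
import numpy as np

def synthesize_frame(f0, harmonic_amps, harmonic_phases, noise_std, fs, n_samples, rng):
    """Build one frame as a sum of harmonics of f0 plus white noise (hypothetical helper)."""
    t = np.arange(n_samples) / fs
    frame = np.zeros(n_samples)
    for k, (amp, phase) in enumerate(zip(harmonic_amps, harmonic_phases), start=1):
        if k * f0 < fs / 2:  # skip harmonics at or above the Nyquist frequency
            frame += amp * np.cos(2 * np.pi * k * f0 * t + phase)
    # Noise part of the model; a voiced frame uses a small noise_std.
    frame += noise_std * rng.standard_normal(n_samples)
    return frame

rng = np.random.default_rng(0)
fs, n_samples = 8000, 400
voiced = synthesize_frame(200.0, [1.0, 0.5, 0.25], [0.0, 0.0, 0.0], 0.0, fs, n_samples, rng)

# The strongest spectral peak should sit at the fundamental frequency.
peak_hz = np.argmax(np.abs(np.fft.rfft(voiced))) * fs / n_samples
```

An unvoiced frame could be approximated in this sketch by passing empty harmonic lists and a non-zero `noise_std`.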
Compressive Sensing for Background Subtraction
Compressive sensing (CS) is an emerging field that provides a framework for image recovery using sub-Nyquist sampling rates. The CS theory shows that a signal can be reconstructed from a small set of random projections, provided that the signal is sparse in some basis, e.g., wavelets. In this paper, we describe a method to directly recover background subtracted images using CS and discuss its applications in some communication-constrained, multi-camera computer vision problems. We show how to apply the CS theory to recover object silhouettes (binary background subtracted images) when the objects of interest occupy a small portion of the camera view, i.e., when they are sparse in the spatial domain. We cast the background subtraction as a sparse approximation problem and provide different solutions based on convex optimization and total variation. In our method, as opposed to learning the background, we learn and adapt a low-dimensional compressed representation of it, which is sufficient to determine spatial innovations; object silhouettes are then estimated directly using the compressive samples without any auxiliary image reconstruction. We also discuss simultaneous appearance recovery of the objects using compressive measurements. In this case, we show that it may be necessary to reconstruct one auxiliary image. To demonstrate the performance of the proposed algorithm, we provide results on data captured using a compressive single-pixel camera. We also illustrate that our approach is suitable for image coding in communication-constrained problems by using data captured by multiple conventional cameras to provide 2D tracking and 3D shape reconstruction results with compressive measurements.
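The central idea, recovering a sparse silhouette directly from the difference between compressed frame and compressed background, can be sketched as below. Note the substitution: the abstract mentions solvers based on convex optimization and total variation, whereas this sketch uses orthogonal matching pursuit (OMP), a simple greedy alternative. The 1D "image", the Gaussian measurement matrix `phi`, the known sparsity level, and all names are assumptions for illustration.

```python
import numpy as np

def omp(phi, y, sparsity):
    """Greedy sparse recovery: find x with `sparsity` nonzeros such that phi @ x ~ y."""
    residual = y.copy()
    support = []
    coeffs = np.zeros(0)
    for _ in range(sparsity):
        # Pick the column most correlated with what is still unexplained.
        support.append(int(np.argmax(np.abs(phi.T @ residual))))
        # Refit all selected coefficients by least squares.
        coeffs, *_ = np.linalg.lstsq(phi[:, support], y, rcond=None)
        residual = y - phi[:, support] @ coeffs
    x = np.zeros(phi.shape[1])
    x[support] = coeffs
    return x

rng = np.random.default_rng(1)
n_pixels, n_measurements = 256, 128
phi = rng.standard_normal((n_measurements, n_pixels)) / np.sqrt(n_measurements)

background = rng.uniform(0.0, 1.0, size=n_pixels)
frame = background.copy()
object_pixels = [40, 41, 42, 43, 44, 45]
frame[object_pixels] += 1.0            # a small object enters the scene

y_background = phi @ background        # compressed background representation
y_frame = phi @ frame                  # compressed measurement of the new frame

# The measurement difference depends only on the sparse innovation.
innovation = omp(phi, y_frame - y_background, sparsity=len(object_pixels))
silhouette = np.abs(innovation) > 0.5
```

Neither the background nor the frame is reconstructed here; only the sparse difference is, which is why half as many measurements as pixels suffice in this toy case.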
RADIFUSION: A multi-radiomics deep learning based breast cancer risk prediction model using sequential mammographic images with image attention and bilateral asymmetry refinement
Breast cancer is a significant public health concern and early detection is
critical for triaging high risk patients. Sequential screening mammograms can
provide important spatiotemporal information about changes in breast tissue
over time. In this study, we propose a deep learning architecture called
RADIFUSION that utilizes sequential mammograms and incorporates a linear image
attention mechanism, radiomic features, a new gating mechanism to combine
different mammographic views, and bilateral asymmetry-based finetuning for
breast cancer risk assessment. We evaluate our model on a screening dataset,
the Cohort of Screen-Aged Women (CSAW). Based on results obtained on
the independent testing set consisting of 1,749 women, our approach achieved
superior performance compared to other state-of-the-art models with area under
the receiver operating characteristic curves (AUCs) of 0.905, 0.872 and 0.866
in the three respective metrics of 1-year AUC, 2-year AUC and > 2-year AUC. Our
study highlights the importance of incorporating various deep learning
mechanisms, such as image attention, radiomic features, gating mechanism, and
bilateral asymmetry-based fine-tuning, to improve the accuracy of breast cancer
risk assessment. We also demonstrate that our model's performance was enhanced
by leveraging spatiotemporal information from sequential mammograms. Our
findings suggest that RADIFUSION can provide clinicians with a powerful tool
for breast cancer risk assessment.
Comment: v
The removal of environmental noise in cellular communications by perceptual techniques
This thesis describes the application of a perceptually based spectral subtraction algorithm for
the enhancement of non-stationary noise corrupted speech. Through examination of speech enhancement
techniques, explanations are given for the choice of magnitude spectral subtraction
and how the human auditory system can be modelled for frequency domain speech enhancement.
It is found that the cochlea provides mechanical speech enhancement in the
auditory system through the use of masking. Frequency masking is used in spectral subtraction
to improve the algorithm execution time and to shape the enhancement process, making it
sound natural to the ear.
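For orientation, plain magnitude spectral subtraction on a single frame can be sketched as follows. This is a generic textbook form and not the perceptual algorithm of the thesis: it has no masking model, no two-microphone noise estimator, and no musical noise post-processing. The function name, the over-subtraction factor `alpha`, the spectral `floor`, and the test signals are assumptions.

```python
import numpy as np

def spectral_subtract(frame, noise_mag, alpha=1.0, floor=0.01):
    """Subtract a noise magnitude estimate from one frame, keeping the noisy phase."""
    spectrum = np.fft.rfft(frame)
    mag = np.abs(spectrum)
    # Subtract the noise estimate, but never go below a small fraction of the input.
    cleaned = np.maximum(mag - alpha * noise_mag, floor * mag)
    gain = cleaned / np.maximum(mag, 1e-12)  # guard against division by zero
    return np.fft.irfft(gain * spectrum, n=len(frame))

rng = np.random.default_rng(0)
n = 256
noise = 0.3 * rng.standard_normal(n)
speech = np.sin(2 * np.pi * 8 * np.arange(n) / n)  # stand-in for a voiced frame

# Oracle noise estimate for illustration; a real system must estimate this.
noise_mag = np.abs(np.fft.rfft(noise))

enhanced = spectral_subtract(speech + noise, noise_mag)
noise_only = spectral_subtract(noise, noise_mag)   # attenuated down to the floor
```

With an imperfect noise estimate, the isolated spectral peaks that survive the subtraction are what produce the musical noise artifacts discussed later in the abstract.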
A new technique for estimation of background noise is presented, which operates during speech
sections as well as pauses. This uses two microphones placed on opposite ends of the cellular
handset. Using these, the algorithm determines whether the signal is speech, or noise, by
examining the current and next frames presented to each microphone. This allows operation in
non-stationary conditions, as the estimation is calculated for each frame, and a speech pause is
not required for updating. A voting decision process decides the presence of speech or noise
which determines which microphone the estimation is calculated from.
The importance of an accurate noise estimate is highlighted with a new technique to reduce
the effect of musical noise artifacts in the processed speech. This is a classic drawback of
spectral subtraction techniques, and it is shown that the trade-off between noise reduction and
speech distortion can be extended by this process. A new method for dealing with musical
noise is described, which uses a combination of energy and variance examination of the spectrogram
to segregate potential musical noise from desired speech sections. By examination of
the spectrogram points surrounding musical noise sections, perceptually relevant values replace
the corruption leading to cleaner enhanced speech.
Any perceptual speech system requires accurate estimates of the clean speech masking thresholds,
to prevent noisy sections being passed through the enhancement untouched. In this thesis, a
method for the calculation of the estimated clean speech masking thresholds is derived. Classically,
this requires an estimation of the clean speech before the thresholds can be derived,
but this results in inaccuracy due to the presence of musical noise and spectral nulls. The
new algorithm examines the thresholds produced by the corrupted speech, and the background
noise, and from these determines the relationship between the two, to produce an estimate of
the clean thresholds, with no operation performed on the actual speech signal. A discrepancy is
found between the results for male and female speech, which, by examination of the perceptual
process, is shown to be due to the different formant positions in male and female speech.
Following the development of these parts, the entire enhancement algorithm is tested on a range
of noise scenarios, using male and female speech. The results show that the proposed algorithm
is able to provide adequate performance in terms of noise reduction and speech quality
- …