4,454 research outputs found
Audio-visual foreground extraction for event characterization
This paper presents a new method able to integrate audio and visual information for scene analysis in a typical surveillance scenario, using only one camera and one monaural microphone. Visual information is analyzed by a standard visual background/foreground (BG/FG) modelling module, enhanced with a novelty detection stage, and coupled with an audio BG/FG modelling scheme. The audiovisual association is performed on-line, by exploiting the concept of synchrony. Experimental tests carrying out classification and clustering of events show all the potentialities of the proposed approach, also in comparison with the results obtained by using the single modalities
Decision-Making with Heterogeneous Sensors - A Copula Based Approach
Statistical decision making has wide ranging applications, from communications and signal processing to econometrics and finance. In contrast to the classical one source-one receiver paradigm, several applications have been identified in the recent past that require acquiring data from multiple sources or sensors. Information from the multiple sensors are transmitted to a remotely located receiver known as the fusion center which makes a global decision. Past work has largely focused on fusion of information from homogeneous sensors. This dissertation extends the formulation to the case when the local sensors may possess disparate sensing modalities. Both the theoretical and practical aspects of multimodal signal processing are considered.
The first and foremost challenge is to \u27adequately\u27 model the joint statistics of such heterogeneous sensors. We propose the use of copula theory for this purpose. Copula models are general descriptors of dependence. They provide a way to characterize the nonlinear functional relationships between the multiple modalities, which are otherwise difficult to formalize. The important problem of selecting the `best\u27 copula function from a given set of valid copula densities is addressed, especially in the context of binary hypothesis testing problems. Both, the training-testing paradigm, where a training set is assumed to be available for learning the copula models prior to system deployment, as well as generalized likelihood ratio test (GLRT) based fusion rule for the online selection and estimation of copula parameters are considered. The developed theory is corroborated with extensive computer simulations as well as results on real-world data.
Sensor observations (or features extracted thereof) are most often quantized before their transmission to the fusion center for bandwidth and power conservation. A detection scheme is proposed for this problem assuming unifom scalar quantizers at each sensor. The designed rule is applicable for both binary and multibit local sensor decisions. An alternative suboptimal but computationally efficient fusion rule is also designed which involves injecting a deliberate disturbance to the local sensor decisions before fusion. The rule is based on Widrow\u27s statistical theory of quantization. Addition of controlled noise helps to \u27linearize\u27 the higly nonlinear quantization process thus resulting in computational savings. It is shown that although the introduction of external noise does cause a reduction in the received signal to noise ratio, the proposed approach can be highly accurate when the input signals have bandlimited characteristic functions, and the number of quantization levels is large.
The problem of quantifying neural synchrony using copula functions is also investigated. It has been widely accepted that multiple simultaneously recorded electroencephalographic signals exhibit nonlinear and non-Gaussian statistics. While the existing and popular measures such as correlation coefficient, corr-entropy coefficient, coh-entropy and mutual information are limited to being bivariate and hence applicable only to pairs of channels, measures such as Granger causality, even though multivariate, fail to account for any nonlinear inter-channel dependence. The application of copula theory helps alleviate both these limitations. The problem of distinguishing patients with mild cognitive impairment from the age-matched control subjects is also considered. Results show that the copula derived synchrony measures when used in conjunction with other synchrony measures improve the detection of Alzheimer\u27s disease onset
Improving acoustic vehicle classification by information fusion
We present an information fusion approach for ground vehicle classification based on the emitted acoustic signal. Many acoustic factors can contribute to the classification accuracy of working ground vehicles. Classification relying on a single feature set may lose some useful information if its underlying sound production model is not comprehensive. To improve classification accuracy, we consider an information fusion diagram, in which various aspects of an acoustic signature are taken into account and emphasized separately by two different feature extraction methods. The first set of features aims to represent internal sound production, and a number of harmonic components are extracted to characterize the factors related to the vehicle’s resonance. The second set of features is extracted based on a computationally effective discriminatory analysis, and a group of key frequency components are selected by mutual information, accounting for the sound production from the vehicle’s exterior parts. In correspondence with this structure, we further put forward a modifiedBayesian fusion algorithm, which takes advantage of matching each specific feature set with its favored classifier. To assess the proposed approach, experiments are carried out based on a data set containing acoustic signals from different types of vehicles. Results indicate that the fusion approach can effectively increase classification accuracy compared to that achieved using each individual features set alone. The Bayesian-based decision level fusion is found fusion is found to be improved than a feature level fusion approac
Multimodal Fusion Interactions: A Study of Human and Automatic Quantification
In order to perform multimodal fusion of heterogeneous signals, we need to
understand their interactions: how each modality individually provides
information useful for a task and how this information changes in the presence
of other modalities. In this paper, we perform a comparative study of how
humans annotate two categorizations of multimodal interactions: (1) partial
labels, where different annotators annotate the label given the first, second,
and both modalities, and (2) counterfactual labels, where the same annotator
annotates the label given the first modality before asking them to explicitly
reason about how their answer changes when given the second. We further propose
an alternative taxonomy based on (3) information decomposition, where
annotators annotate the degrees of redundancy: the extent to which modalities
individually and together give the same predictions, uniqueness: the extent to
which one modality enables a prediction that the other does not, and synergy:
the extent to which both modalities enable one to make a prediction that one
would not otherwise make using individual modalities. Through experiments and
annotations, we highlight several opportunities and limitations of each
approach and propose a method to automatically convert annotations of partial
and counterfactual labels to information decomposition, yielding an accurate
and efficient method for quantifying multimodal interactions.Comment: International Conference on Multimodal Interaction (ICMI '23), Code
available at: https://github.com/pliang279/PID. arXiv admin note: text
overlap with arXiv:2302.1224
Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition
This paper presents a self-supervised method for visual detection of the
active speaker in a multi-person spoken interaction scenario. Active speaker
detection is a fundamental prerequisite for any artificial cognitive system
attempting to acquire language in social settings. The proposed method is
intended to complement the acoustic detection of the active speaker, thus
improving the system robustness in noisy conditions. The method can detect an
arbitrary number of possibly overlapping active speakers based exclusively on
visual information about their face. Furthermore, the method does not rely on
external annotations, thus complying with cognitive development. Instead, the
method uses information from the auditory modality to support learning in the
visual domain. This paper reports an extensive evaluation of the proposed
method using a large multi-person face-to-face interaction dataset. The results
show good performance in a speaker dependent setting. However, in a speaker
independent setting the proposed method yields a significantly lower
performance. We believe that the proposed method represents an essential
component of any artificial cognitive system or robotic platform engaging in
social interactions.Comment: 10 pages, IEEE Transactions on Cognitive and Developmental System
Can feature information interaction help for information fusion in multimedia problems?
This article presents the information-theoretic based feature information interaction, a measure that can describe complex feature dependencies in multivariate settings. According to the theoretical development, feature interactions are more accurate than current, bivariate dependence measures due to their stable and unambiguous definition. In experiments with artificial and real data we compare first the empirical dependency estimates of correlation, mutual information and 3-way feature interaction. Then, we present feature selection and classification experiments that show superior performance of interactions over bivariate dependence measures for the artificial data, for real world data this goal is not achieved ye
Neuro-Inspired Hierarchical Multimodal Learning
Integrating and processing information from various sources or modalities are
critical for obtaining a comprehensive and accurate perception of the real
world. Drawing inspiration from neuroscience, we develop the
Information-Theoretic Hierarchical Perception (ITHP) model, which utilizes the
concept of information bottleneck. Distinct from most traditional fusion models
that aim to incorporate all modalities as input, our model designates the prime
modality as input, while the remaining modalities act as detectors in the
information pathway. Our proposed perception model focuses on constructing an
effective and compact information flow by achieving a balance between the
minimization of mutual information between the latent state and the input modal
state, and the maximization of mutual information between the latent states and
the remaining modal states. This approach leads to compact latent state
representations that retain relevant information while minimizing redundancy,
thereby substantially enhancing the performance of downstream tasks.
Experimental evaluations on both the MUStARD and CMU-MOSI datasets demonstrate
that our model consistently distills crucial information in multimodal learning
scenarios, outperforming state-of-the-art benchmarks
- …