Learning Aligned Cross-Modal Representations from Weakly Aligned Data
People can recognize scenes across many different modalities beyond natural
images. In this paper, we investigate how to learn cross-modal scene
representations that transfer across modalities. To study this problem, we
introduce a new cross-modal scene dataset. While convolutional neural networks
can categorize cross-modal scenes well, they also learn an intermediate
representation not aligned across modalities, which is undesirable for
cross-modal transfer applications. We present methods to regularize cross-modal
convolutional neural networks so that they have a shared representation that is
agnostic of the modality. Our experiments suggest that our scene representation
can help transfer representations across modalities for retrieval. Moreover,
our visualizations suggest that units emerge in the shared representation that
tend to activate on consistent concepts independently of the modality.
Comment: Conference paper at CVPR 201
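The idea of regularizing a network so that its intermediate representation becomes modality-agnostic can be illustrated with a minimal sketch. The penalty below matches the per-unit mean activations of two modalities' features; this is one hypothetical choice of alignment regularizer, not the paper's actual method, and the function name is illustrative.

```python
import numpy as np

def alignment_penalty(feats_a, feats_b):
    """Toy modality-alignment regularizer (illustrative, not the
    paper's): penalize the squared distance between the per-unit
    mean activations of two modalities' intermediate features."""
    mu_a = feats_a.mean(axis=0)  # mean activation per unit, modality A
    mu_b = feats_b.mean(axis=0)  # mean activation per unit, modality B
    return float(np.sum((mu_a - mu_b) ** 2))

# Identical feature statistics give zero penalty; a systematic shift
# between modalities (misaligned representations) is penalized.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 8))
print(alignment_penalty(x, x.copy()))        # 0.0
print(alignment_penalty(x, x + 1.0) > 0.0)   # True
```

Adding such a term to the classification loss pushes the shared layer toward statistics that do not reveal which modality produced the input.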
Effects of truncation in modal representations of thermal convection
The Galerkin (including single-mode and Lorenz) equations were examined for convection in a sphere to determine which physical processes are neglected when the equations of motion are truncated too severely. The conclusions were tested by calculating solutions to the equations of motion for different values of the Rayleigh number and for different values of the limit of the horizontal spatial resolution. It was shown that the transitions from steady state to periodic, then to aperiodic convection depend not only on the Rayleigh number but also very strongly on the horizontal resolution. One of the effects of truncation is to enhance the high-wavenumber end of the kinetic energy and thermal variance spectra. The numerical examples indicate that as long as the kinetic energy spectrum decreases with wavenumber, a truncation gives a qualitatively correct solution.
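The Lorenz equations mentioned above are the classic three-mode truncation of the convection equations, so they make a convenient concrete example of a severely truncated modal representation. A minimal sketch, using the standard parameter values (sigma = 10, r = 28, b = 8/3; these are the textbook choices, not values taken from this paper):

```python
import numpy as np

def lorenz_rhs(state, sigma=10.0, r=28.0, b=8.0 / 3.0):
    # Lorenz's three-mode truncation of convection: x is the
    # convection amplitude, y and z are two thermal modes.
    x, y, z = state
    return np.array([sigma * (y - x), x * (r - z) - y, x * y - b * z])

def rk4_step(state, dt=0.01):
    # Classical fourth-order Runge-Kutta integration step.
    k1 = lorenz_rhs(state)
    k2 = lorenz_rhs(state + 0.5 * dt * k1)
    k3 = lorenz_rhs(state + 0.5 * dt * k2)
    k4 = lorenz_rhs(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

state = np.array([1.0, 1.0, 1.0])
for _ in range(2000):
    state = rk4_step(state)
print(state)  # the trajectory stays on the bounded Lorenz attractor
```

At r = 28 this severely truncated system is already aperiodic; the paper's point is that whether such transitions occur depends strongly on how many modes the truncation retains, not on the Rayleigh number alone.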
FMMRec: Fairness-aware Multimodal Recommendation
Recently, multimodal recommendations have gained increasing attention for
effectively addressing the data sparsity problem by incorporating
modality-based representations. Although multimodal recommendations excel in
accuracy, the introduction of different modalities (e.g., images, text, and
audio) may expose more users' sensitive information (e.g., gender and age) to
recommender systems, resulting in potentially more serious unfairness issues.
Despite many efforts on fairness, existing fairness-aware methods are either
incompatible with multimodal scenarios, or lead to suboptimal fairness
performance due to neglecting sensitive information of multimodal content. To
achieve counterfactual fairness in multimodal recommendations, we propose a
novel fairness-aware multimodal recommendation approach (dubbed as FMMRec) to
disentangle the sensitive and non-sensitive information from modal
representations and leverage the disentangled modal representations to guide
fairer representation learning. Specifically, we first disentangle biased and
filtered modal representations by maximizing and minimizing their sensitive
attribute prediction ability respectively. With the disentangled modal
representations, we mine the modality-based unfair and fair (corresponding to
biased and filtered) user-user structures for enhancing explicit user
representation with the biased and filtered neighbors from the corresponding
structures, followed by adversarially filtering out sensitive information.
Experiments on two real-world public datasets demonstrate the superiority of
our FMMRec relative to the state-of-the-art baselines. Our source code is
available at https://anonymous.4open.science/r/FMMRec
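The disentanglement criterion in FMMRec, maximizing versus minimizing sensitive-attribute predictability, can be made concrete with a linear probe: a representation carries sensitive information to the extent that a simple classifier can recover the attribute from it. The sketch below is illustrative only (synthetic data, a hand-rolled logistic-regression probe); it is not the paper's architecture.

```python
import numpy as np

def probe_accuracy(feats, sensitive, steps=500, lr=0.1):
    """Fit a logistic-regression probe predicting a binary sensitive
    attribute from representations; its training accuracy measures
    how much sensitive information the representation leaks."""
    w = np.zeros(feats.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
        w -= lr * feats.T @ (p - sensitive) / len(sensitive)
        b -= lr * float(np.mean(p - sensitive))
    pred = (feats @ w + b) > 0
    return float(np.mean(pred == sensitive))

rng = np.random.default_rng(0)
gender = rng.integers(0, 2, size=200)           # synthetic attribute
# A "biased" embedding leaks the attribute; a "filtered" one does not.
biased = rng.normal(size=(200, 4))
biased[:, 0] += 3.0 * gender
filtered = rng.normal(size=(200, 4))
print(probe_accuracy(biased, gender))    # high: attribute is recoverable
print(probe_accuracy(filtered, gender))  # near chance (~0.5)
```

In the paper's terms, the biased modal representation is trained so the probe succeeds and the filtered one so it fails; the filtered representation then guides fairness-aware user-representation learning.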
ViT-Lens: Towards Omni-modal Representations
Despite the success of CLIP-based training recipes in vision-language models,
their scalability to more modalities (e.g., 3D, audio, etc.) hinges on
large-scale paired data, which is expensive or even unavailable for rare modalities.
In this paper, we present ViT-Lens that facilitates efficient omni-modal
representation learning by perceiving novel modalities with a pretrained ViT
and aligning to a pre-defined space. Specifically, the modality-specific lens
is tuned to project multimodal signals to the shared embedding space, which are
then processed by a strong ViT that carries pre-trained image knowledge. The
encoded multimodal representations are optimized toward aligning with the
modal-independent space, pre-defined by off-the-shelf foundation models. A
well-trained lens with a ViT backbone has the potential to serve as one of
these foundation models, supervising the learning of subsequent modalities.
ViT-Lens provides a unified solution for representation learning of increasing
modalities, with two appealing benefits: (i) it exploits the pretrained ViT
effectively across tasks and domains in a data-efficient regime; (ii) emergent
downstream capabilities for novel modalities arise thanks to the
modality-aligned space. We evaluate ViT-Lens in the context of 3D as an
initial verification. In zero-shot 3D classification, ViT-Lens achieves
substantial improvements over previous state-of-the-art, showing 52.0% accuracy
on Objaverse-LVIS, 87.4% on ModelNet40, and 60.6% on ScanObjectNN. Furthermore,
we enable zero-shot 3D question-answering by simply integrating the trained 3D
lens into the InstructBLIP model without any adaptation. We will release the
results of ViT-Lens on more modalities in the near future.
Comment: 19 pages, 4 figures and 9 tables
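The core ViT-Lens mechanism, a small trainable lens projecting a new modality's signal into a shared space and optimizing it toward a fixed anchor embedding, can be sketched in a toy form. Everything here is a simplification under stated assumptions: the lens is a single linear map (the real lens feeds a frozen pretrained ViT), the anchor stands in for an off-the-shelf foundation model's embedding, and the dimensions are arbitrary.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
d_signal, d_embed = 16, 8
lens = rng.normal(size=(d_signal, d_embed)) * 0.1  # toy "lens" weights

signal = rng.normal(size=d_signal)   # stand-in for a raw modality signal
anchor = rng.normal(size=d_embed)    # stand-in for a foundation-model target

# Gradient ascent on cosine similarity between the projected signal
# and the anchor, i.e. aligning the new modality to a pre-defined space.
for _ in range(500):
    z = signal @ lens
    nz, na = np.linalg.norm(z), np.linalg.norm(anchor)
    g = anchor / (nz * na) - (z @ anchor) * z / (nz**3 * na)  # d cos / dz
    lens += 0.1 * np.outer(signal, g)

print(cosine(signal @ lens, anchor))  # close to 1.0 after tuning
```

Only the lens is updated; in the full method the frozen ViT between lens and alignment loss is what lets image-pretrained knowledge transfer to the new modality.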
Towards an Indexical Model of Situated Language Comprehension for Cognitive Agents in Physical Worlds
We propose a computational model of situated language comprehension based on
the Indexical Hypothesis that generates meaning representations by translating
amodal linguistic symbols to modal representations of beliefs, knowledge, and
experience external to the linguistic system. This Indexical Model incorporates
multiple information sources, including perceptions, domain knowledge, and
short-term and long-term experiences during comprehension. We show that
exploiting diverse information sources can alleviate ambiguities that arise
from contextual use of underspecific referring expressions and unexpressed
argument alternations of verbs. The model is being used to support linguistic
interactions in Rosie, an agent implemented in Soar that learns from
instruction.
Comment: Advances in Cognitive Systems 3 (2014)
Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emotional Analysis in Videos
When designing a video affective content analysis algorithm, one of the most important steps is the selection of discriminative features for the effective representation of video segments. The majority of existing affective content analysis methods either use low-level audio-visual features or generate handcrafted higher-level representations based on these low-level features. We propose in this work to use deep learning methods, in particular convolutional neural networks (CNNs), in order to automatically learn and extract mid-level representations from raw data. To this end, we exploit the audio and visual modality of videos by employing Mel-Frequency Cepstral Coefficients (MFCC) and color values in the HSV color space. We also incorporate dense trajectory based motion features in order to further enhance the performance of the analysis. By means of multi-class support vector machines (SVMs) and fusion mechanisms, music video clips are classified into one of four affective categories representing the four quadrants of the Valence-Arousal (VA) space. Results obtained on a subset of the DEAP dataset show (1) that higher-level representations perform better than low-level features, and (2) that incorporating motion information leads to a notable performance gain, independently of the chosen representation.
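The fusion step can be sketched as weighted-sum late fusion of per-modality classifier scores followed by an argmax over the four VA quadrants. This is a generic illustration of one common fusion mechanism; the scores, weights, and quadrant ordering below are made up, not taken from the paper.

```python
import numpy as np

QUADRANTS = ["high-V/high-A", "low-V/high-A", "low-V/low-A", "high-V/low-A"]

def late_fusion(scores_per_modality, weights):
    """Weighted-sum late fusion: combine each modality's per-class
    decision scores, then pick the highest-scoring VA quadrant."""
    fused = sum(w * s for w, s in zip(weights, scores_per_modality))
    return QUADRANTS[int(np.argmax(fused))]

# Hypothetical per-class scores from three modality-specific classifiers.
audio  = np.array([0.2, 0.6, 0.1, 0.1])  # e.g., MFCC-based scores
visual = np.array([0.1, 0.5, 0.3, 0.1])  # e.g., HSV/CNN-based scores
motion = np.array([0.3, 0.4, 0.2, 0.1])  # dense-trajectory scores
print(late_fusion([audio, visual, motion], [0.4, 0.4, 0.2]))
```

Early fusion (concatenating features before one SVM) is the usual alternative; late fusion keeps each modality's classifier independent and only merges decisions.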
Gated networks: an inventory
Gated networks are networks that contain gating connections, in which the
outputs of at least two neurons are multiplied. Initially, gated networks were
used to learn relationships between two input sources, such as pixels from two
images. More recently, they have been applied to learning activity recognition
or multi-modal representations. The aims of this paper are threefold: 1) to
explain the basic computations in gated networks to the non-expert, while
adopting a standpoint that insists on their symmetric nature. 2) to serve as a
quick reference guide to the recent literature, by providing an inventory of
applications of these networks, as well as recent extensions to the basic
architecture. 3) to suggest future research directions and applications.
Comment: Unpublished manuscript, 17 pages
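The basic computation in a gated network, and the symmetry the paper insists on, fit in a few lines. The sketch below implements a factored gated unit: both inputs are projected onto shared factors, multiplied elementwise, and mapped to hidden units; swapping the two input/weight pairs leaves the output unchanged. Dimensions and weights are arbitrary illustrations.

```python
import numpy as np

def gated_layer(x, y, Wx, Wy, Wh):
    """Factored gated (multiplicative) unit: project both inputs onto
    shared factors, multiply elementwise, then map to hidden units.
    The elementwise product makes the unit symmetric in its inputs."""
    return Wh @ ((Wx @ x) * (Wy @ y))

rng = np.random.default_rng(0)
n_x, n_y, n_f, n_h = 5, 5, 7, 3   # input, factor, and hidden sizes
Wx = rng.normal(size=(n_f, n_x))
Wy = rng.normal(size=(n_f, n_y))
Wh = rng.normal(size=(n_h, n_f))
x, y = rng.normal(size=n_x), rng.normal(size=n_y)

h1 = gated_layer(x, y, Wx, Wy, Wh)
h2 = gated_layer(y, x, Wy, Wx, Wh)  # swap inputs and their weights
print(np.allclose(h1, h2))          # True: the unit is symmetric
```

This multiplicative interaction is what lets gated networks model relationships between two sources (e.g., pixel correspondences between two images) that a purely additive layer cannot capture.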