242 research outputs found
Multi-view Representation Learning for Unifying Languages, Knowledge and Vision
The growth of content on the web has raised various challenges, yet also provided numerous opportunities. Content exists in varied forms such as text appearing in different languages, entity-relationship graph represented as structured knowledge and as a visual embodiment like images/videos. They are often referred to as modalities. In many instances, the different amalgamation of modalities co-exists to complement each other or to provide consensus. Thus making the content either heterogeneous or homogeneous. Having an additional point of view for each instance in the content is beneficial for data-driven learning and intelligent content processing. However, despite having availability of such content. Most advancements made in data-driven learning (i.e., machine learning) is by solving tasks separately for the single modality. The similar endeavor was not shown for the challenges which required input either from all or subset of them.
In this dissertation, we develop models and techniques that can leverage multiple views of heterogeneous or homogeneous content and build a shared representation for aiding several applications which require a combination of modalities mentioned above. In particular, we aim to address applications such as content-based search, categorization, and generation by providing several novel contributions.
First, we develop models for heterogeneous content by jointly modeling diverse representations emerging from two views depicting text and image by learning their correlation. To be specific, modeling such correlation is helpful to retrieve cross-modal content. Second, we replace the heterogeneous content with homogeneous to learn a common space representation for content categorization across languages. Furthermore, we develop models that take input from both homogeneous and heterogeneous content to facilitate the construction of common space representation from more than two views. Specifically, representation is used to generate one view from another. Lastly, we describe a model that can handle missing views, and demonstrate that the model can generate missing views by utilizing external knowledge. We argue that techniques the models leverage internally provide many practical benefits and lot of immediate value applications.
From the modeling perspective, our contributed model design in this thesis can be summarized under the phrase Multi-view Representation Learning( MVRL ). These models are variations and extensions of shallow statistical and deep neural networks approaches that can jointly optimize and exploit all views of the input content arising from different independent representations. We show that our models advance state of the art, but not limited to tasks such as cross-modal retrieval, cross-language text classification, image-caption generation in multiple languages and caption generation for images containing unseen visual object categories
Representational similarity analysis reveals commonalities and differences in the semantic processing of words and objects.
Understanding the meanings of words and objects requires the activation of underlying conceptual representations. Semantic representations are often assumed to be coded such that meaning is evoked regardless of the input modality. However, the extent to which meaning is coded in modality-independent or amodal systems remains controversial. We address this issue in a human fMRI study investigating the neural processing of concepts, presented separately as written words and pictures. Activation maps for each individual word and picture were used as input for searchlight-based multivoxel pattern analyses. Representational similarity analysis was used to identify regions correlating with low-level visual models of the words and objects and the semantic category structure common to both. Common semantic category effects for both modalities were found in a left-lateralized network, including left posterior middle temporal gyrus (LpMTG), left angular gyrus, and left intraparietal sulcus (LIPS), in addition to object- and word-specific semantic processing in ventral temporal cortex and more anterior MTG, respectively. To explore differences in representational content across regions and modalities, we developed novel data-driven analyses, based on k-means clustering of searchlight dissimilarity matrices and seeded correlation analysis. These revealed subtle differences in the representations in semantic-sensitive regions, with representations in LIPS being relatively invariant to stimulus modality and representations in LpMTG being uncorrelated across modality. These results suggest that, although both LpMTG and LIPS are involved in semantic processing, only the functional role of LIPS is the same regardless of the visual input, whereas the functional role of LpMTG differs for words and objects.This work was supported by the European Research CouncilThis is the final version of an article originally published in the Journal of Neuroscience and available online at http://www.jneurosci.org/content/33/48/18906.abstract
Image annotation and retrieval based on multi-modal feature clustering and similarity propagation.
The performance of content-based image retrieval systems has proved to be inherently constrained by the used low level features, and cannot give satisfactory results when the user\u27s high level concepts cannot be expressed by low level features. In an attempt to bridge this semantic gap, recent approaches started integrating both low level-visual features and high-level textual keywords. Unfortunately, manual image annotation is a tedious process and may not be possible for large image databases. In this thesis we propose a system for image retrieval that has three mains components. The first component of our system consists of a novel possibilistic clustering and feature weighting algorithm based on robust modeling of the Generalized Dirichlet (GD) finite mixture. Robust estimation of the mixture model parameters is achieved by incorporating two complementary types of membership degrees. The first one is a posterior probability that indicates the degree to which a point fits the estimated distribution. The second membership represents the degree of typicality and is used to indentify and discard noise points. Robustness to noisy and irrelevant features is achieved by transforming the data to make the features independent and follow Beta distribution, and learning optimal relevance weight for each feature subset within each cluster. We extend our algorithm to find the optimal number of clusters in an unsupervised and efficient way by exploiting some properties of the possibilistic membership function. We also outline a semi-supervised version of the proposed algorithm. In the second component of our system consists of a novel approach to unsupervised image annotation. Our approach is based on: (i) the proposed semi-supervised possibilistic clustering; (ii) a greedy selection and joining algorithm (GSJ); (iii) Bayes rule; and (iv) a probabilistic model that is based on possibilistic memebership degrees to annotate an image. The third component of the proposed system consists of an image retrieval framework based on multi-modal similarity propagation. The proposed framework is designed to deal with two data modalities: low-level visual features and high-level textual keywords generated by our proposed image annotation algorithm. The multi-modal similarity propagation system exploits the mutual reinforcement of relational data and results in a nonlinear combination of the different modalities. Specifically, it is used to learn the semantic similarities between images by leveraging the relationships between features from the different modalities. The proposed image annotation and retrieval approaches are implemented and tested with a standard benchmark dataset. We show the effectiveness of our clustering algorithm to handle high dimensional and noisy data. We compare our proposed image annotation approach to three state-of-the-art methods and demonstrate the effectiveness of the proposed image retrieval system
Music emotion recognition: a multimodal machine learning approach
Music emotion recognition (MER) is an emerging domain of the Music Information Retrieval (MIR) scientific community, and besides, music searches through emotions are one of the major selection preferred by web users. As the world goes to digital, the musical contents in online databases, such as Last.fm have expanded exponentially, which require substantial manual efforts for managing them and also keeping them updated. Therefore, the demand for innovative and adaptable search mechanisms, which can be personalized according to users’ emotional state, has gained increasing consideration in recent years. This thesis concentrates on addressing music emotion recognition problem by presenting several classification models, which were fed by textual features, as well as audio attributes extracted from the music. In this study, we build both supervised and semisupervised classification designs under four research experiments, that addresses the emotional role of audio features, such as tempo, acousticness, and energy, and also the impact of textual features extracted by two different approaches, which are TF-IDF and Word2Vec. Furthermore, we proposed a multi-modal approach by using a combined feature-set consisting of the features from the audio content, as well as from context-aware data. For this purpose, we generated a ground truth dataset containing over 1500 labeled song lyrics and also unlabeled big data, which stands for more than 2.5 million Turkish documents, for achieving to generate an accurate automatic emotion classification system. The analytical models were conducted by adopting several algorithms on the crossvalidated data by using Python. As a conclusion of the experiments, the best-attained performance was 44.2% when employing only audio features, whereas, with the usage of textual features, better performances were observed with 46.3% and 51.3% accuracy scores considering supervised and semi-supervised learning paradigms, respectively. As of last, even though we created a comprehensive feature set with the combination of audio and textual features, this approach did not display any significant improvement for classification performanc
Enhancing the Performance of Text Mining
The amount of text data produced in science, finance, social media, and medicine is growing at an unprecedented pace. The raw text data typically introduces major computational and analytical obstacles (e.g., extremely high dimensionality) to data mining and machine learning algorithms. Besides, the growth in the size of text data makes the search process more difficult for information retrieval systems, making retrieving relevant results to match the users’ search queries challenging. Moreover, the availability of text data in different languages creates the need to develop new methods to analyze multilingual topics to help policymakers in governmental and health systems to make risk decisions and to create policies to respond to public health crises, natural disasters, and political or social movements. The goal of this thesis is to develop new methods that handle computational and analytical problems for complex high-dimensional text data, develop a new query expansion approach to enhance the performance of information retrieval systems, and to present new techniques for analyzing multilingual topics using a translation service.
First, in the field of dimensionality reduction, we develop a new method for detecting and eliminating domain-based words. In this method, we use three different datasets and five classifiers for testing and evaluating the performance of our new approach before and after eliminating domain-based words. We compare the performance of our approach with other feature selection methods. We find that the new approach improves the performance of the binary classifier and reduces the dimensionality of the feature space by 90%. Also, our approach reduces the execution time of the classifier and outperforms one of the feature selection methods.
Second, in the field of information retrieval, we design and implement a method that integrates words from a current stream with external data sources in order to predict the occurrence of relevant words that have not yet appeared in the primary source. This algorithm enables the construction of new queries that effectively capture emergent events that a user may not have anticipated when initiating the data collection stream. The added value of using the external data sources appears when we have a stream of data and we want to predict something that has not yet happened instead of using only the stream that is limited to the available information at a specific time. We compare the performance of our approach with two alternative approaches. The first approach (static) expands user queries with words extracted from a probabilistic topic model of the stream. The second approach (emergent) reinforces user queries with emergent words extracted from the stream. We find that our method outperforms alternative approaches, exhibiting particularly good results in identifying future emergent topics.
Third, in the field of the multilingual text, we present a strategy to analyze the similarity between multilingual topics in English and Arabic tweets surrounding the 2020 COVID-19 pandemic. We make a descriptive comparison between topics in Arabic and English tweets about COVID-19 using tweets collected in the same way and filtered using the same keywords. We analyze Twitter’s discussion to understand the evolution of topics over time and reveal topic similarity among tweets across the datasets. We use probabilistic topic modeling to identify and extract the key topics of Twitter’s discussion in Arabic and English tweets. We use two methods to analyze the similarity between multilingual topics. The first method (full-text topic modeling approach) translates all text to English and then runs topic modeling to find similar topics. The second method (term-based topic modeling approach) runs topic modeling on the text before translation then translates the top keywords in each topic to find similar topics. We find similar topics related to COVID-19 pandemic covered in English and Arabic tweets for certain time intervals. Results indicate that the term-based topic modeling approach can reduce the cost compared to the full-text topic modeling approach and still have comparable results in finding similar topics. The computational time to translate the terms is significantly lower than the translation of the full text
Audio self-supervised learning: a survey
Inspired by the humans' cognitive ability to generalise knowledge and skills,
Self-Supervised Learning (SSL) targets at discovering general representations
from large-scale data without requiring human annotations, which is an
expensive and time consuming task. Its success in the fields of computer vision
and natural language processing have prompted its recent adoption into the
field of audio and speech processing. Comprehensive reviews summarising the
knowledge in audio SSL are currently missing. To fill this gap, in the present
work, we provide an overview of the SSL methods used for audio and speech
processing applications. Herein, we also summarise the empirical works that
exploit the audio modality in multi-modal SSL frameworks, and the existing
suitable benchmarks to evaluate the power of SSL in the computer audition
domain. Finally, we discuss some open problems and point out the future
directions on the development of audio SSL
Recommended from our members
Learning Video Representation from Self-supervision
This thesis investigates the problem of learning video representations for video understanding. Previous works have explored the use of data-driven deep learning approaches, which have been shown to be effective in learning useful video representations. However, obtaining large amounts of labeled data can be costly and time-consuming. We investigate self-supervised approach as for multimodal video data to overcome this challenge. Video data typically contains multiple modalities, such as visual, audio, transcribed speech, and textual captions, which can serve as pseudo-labels for representation learning without needing manual labeling. By utilizing these modalities, we can train deep representations over large-scale video data consisting of millions of video clips collected from the internet. We demonstrate the scalability benefits of multimodal self-supervision by achieving new state-of-the-art performance in various domains, including video action recognition, text-to-video retrieval, and text-to-video grounding.
We also examine the limitations of these approaches, which often rely on the association assumption involving multiple modalities of data used in self-supervision. For example, the text transcript is often assumed to be about the video content, and two segments of the same video share similar semantics. To overcome this problem, we propose new methods for learning video representations with more intelligent sampling strategies to capture samples that share high-level semantics or consistent concepts. The proposed methods include a clustering component to address false negative pairs in multimodal paired contrastive learning, a novel sampling strategy for finding visually groundable video-text pairs, an investigation of object tracking supervision for temporal association, and a new multimodal task for demonstrating the effectiveness of the proposed model. We aim to develop more robust and generalizable video representations for real-world applications, such as human-to-robot interaction and event extraction from large-scale news sources
Learning to hash for large scale image retrieval
This thesis is concerned with improving the effectiveness of nearest neighbour search.
Nearest neighbour search is the problem of finding the most similar data-points to a
query in a database, and is a fundamental operation that has found wide applicability
in many fields. In this thesis the focus is placed on hashing-based approximate
nearest neighbour search methods that generate similar binary hashcodes for similar
data-points. These hashcodes can be used as the indices into the buckets of hashtables
for fast search. This work explores how the quality of search can be improved by
learning task specific binary hashcodes.
The generation of a binary hashcode comprises two main steps carried out sequentially:
projection of the image feature vector onto the normal vectors of a set of hyperplanes
partitioning the input feature space followed by a quantisation operation that
uses a single threshold to binarise the resulting projections to obtain the hashcodes.
The degree to which these operations preserve the relative distances between the datapoints
in the input feature space has a direct influence on the effectiveness of using
the resulting hashcodes for nearest neighbour search. In this thesis I argue that the
retrieval effectiveness of existing hashing-based nearest neighbour search methods can
be increased by learning the thresholds and hyperplanes based on the distribution of
the input data.
The first contribution is a model for learning multiple quantisation thresholds. I
demonstrate that the best threshold positioning is projection specific and introduce a
novel clustering algorithm for threshold optimisation. The second contribution extends
this algorithm by learning the optimal allocation of quantisation thresholds per hyperplane.
In doing so I argue that some hyperplanes are naturally more effective than others
at capturing the distribution of the data and should therefore attract a greater allocation
of quantisation thresholds. The third contribution focuses on the complementary
problem of learning the hashing hyperplanes. I introduce a multi-step iterative model
that, in the first step, regularises the hashcodes over a data-point adjacency graph,
which encourages similar data-points to be assigned similar hashcodes. In the second
step, binary classifiers are learnt to separate opposing bits with maximum margin. This
algorithm is extended to learn hyperplanes that can generate similar hashcodes for similar
data-points in two different feature spaces (e.g. text and images). Individually the
performance of these algorithms is often superior to competitive baselines. I unify my
contributions by demonstrating that learning hyperplanes and thresholds as part of the
same model can yield an additive increase in retrieval effectiveness
- …