Co-Regularized Deep Representations for Video Summarization
Compact keyframe-based video summaries are a popular way of generating
viewership on video sharing platforms. Yet, creating relevant and compelling
summaries for arbitrarily long videos with a small number of keyframes is a
challenging task. We propose a comprehensive keyframe-based summarization
framework combining deep convolutional neural networks and restricted Boltzmann
machines. An original co-regularization scheme is used to discover meaningful
subject-scene associations. The resulting multimodal representations are then
used to select highly-relevant keyframes. A comprehensive user study is
conducted comparing our proposed method to a variety of schemes, including the
summarization currently in use by one of the most popular video sharing
websites. The results show that our method consistently outperforms the
baseline schemes for any given amount of keyframes both in terms of
attractiveness and informativeness. The lead is even more significant for
smaller summaries.
Comment: Video summarization, deep convolutional neural networks,
co-regularized restricted Boltzmann machine
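The abstract above describes selecting highly relevant keyframes from learned multimodal representations, but does not spell out the selection criterion. A generic greedy coverage scheme over frame feature vectors — a common baseline for this step, given here purely as an illustrative assumption and not as the paper's method — can be sketched as:

```python
import numpy as np

def select_keyframes(features, k):
    """Greedy coverage selection: repeatedly pick the frame whose feature
    vector best represents the frames not yet covered by the summary."""
    n = len(features)
    # cosine similarity between all frame pairs
    norm = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = norm @ norm.T
    selected = []
    covered = np.zeros(n)  # best similarity of each frame to any selected one
    for _ in range(k):
        # total coverage achieved if each candidate frame were added
        gains = np.maximum(sim, covered).sum(axis=1)
        if selected:
            gains[selected] = -np.inf  # never pick the same frame twice
        best = int(np.argmax(gains))
        selected.append(best)
        covered = np.maximum(covered, sim[best])
    return selected
```

For smaller summaries (small `k`), such coverage-based selection concentrates the budget on the most representative frames, which matches the abstract's observation that the gap between methods widens as summaries shrink.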
MAEEG: Masked Auto-encoder for EEG Representation Learning
Decoding information from bio-signals such as EEG using machine learning has
been challenging due to small datasets and the difficulty of obtaining labels. We
propose a reconstruction-based self-supervised learning model, the masked
auto-encoder for EEG (MAEEG), for learning EEG representations by learning to
reconstruct the masked EEG features using a transformer architecture. We found
that MAEEG can learn representations that significantly improve sleep stage
classification (~5% accuracy increase) when only a small number of labels are
given. We also found that input sample lengths and different ways of masking
during reconstruction-based SSL pretraining have a huge effect on downstream
model performance. Specifically, learning to reconstruct a larger proportion
and more concentrated masked signal results in better performance on sleep
classification. Our findings provide insight into how reconstruction-based SSL
could help representation learning for EEG.
Comment: 10 pages, 5 figures; accepted as a poster presentation at the Workshop
on Learning from Time Series for Health, NeurIPS 2022
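The finding that reconstructing a larger, more concentrated masked proportion helps suggests masking contiguous spans of the signal rather than scattered samples. As an illustrative sketch (not the MAEEG implementation itself), contiguous-span masking and a loss computed only on masked steps might look like:

```python
import numpy as np

def contiguous_mask(n_steps, mask_ratio, span_len, rng):
    """Mask contiguous spans of a time series until at least `mask_ratio`
    of the steps are hidden; returns a boolean array (True = masked)."""
    mask = np.zeros(n_steps, dtype=bool)
    target = int(mask_ratio * n_steps)
    while mask.sum() < target:
        start = rng.integers(0, n_steps - span_len + 1)
        mask[start:start + span_len] = True
    return mask

def reconstruction_loss(signal, prediction, mask):
    """MSE computed only on the masked steps, as in masked autoencoding."""
    return float(np.mean((signal[mask] - prediction[mask]) ** 2))
```

Larger `span_len` at a fixed `mask_ratio` yields a more concentrated mask, which is the regime the abstract reports as working best for downstream sleep classification.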
Position Prediction as an Effective Pretraining Strategy
Transformers have gained increasing popularity in a wide range of
applications, including Natural Language Processing (NLP), Computer Vision and
Speech Recognition, because of their powerful representational capacity.
However, harnessing this representational capacity effectively requires a large
amount of data, strong regularization, or both, to mitigate overfitting.
Recently, the power of the Transformer has been unlocked by self-supervised
pretraining strategies based on masked autoencoders, which rely on
reconstructing masked inputs, directly or contrastively, from unmasked content.
This pretraining strategy, used in BERT models in NLP, Wav2Vec
models in Speech and, recently, MAE models in Vision, forces the model to
learn about relationships between the content in different parts of the input
using autoencoding-related objectives. In this paper, we propose a novel but
surprisingly simple alternative to content reconstruction: predicting
locations from content, without providing positional information for it. Doing
so requires the Transformer to understand the positional relationships between
different parts of the input, from their content alone. This amounts to an
efficient implementation where the pretext task is a classification problem
among all possible positions for each input token. We experiment on both Vision
and Speech benchmarks, where our approach brings improvements over strong
supervised training baselines and is comparable to modern
unsupervised/self-supervised pretraining methods. Our method also enables
Transformers trained without position embeddings to outperform ones trained
with full position information.
Comment: Accepted to ICML 202
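The pretext task described above — classifying each token's position among all possible positions from content alone — can be sketched as a cross-entropy loss over positions. The toy linear map `W` below is a stand-in assumption for illustration, not the paper's Transformer:

```python
import numpy as np

def position_prediction_loss(tokens, W):
    """Pretext loss: each token, given no positional information, is
    classified into one of the n positions; the target for the token that
    came from position i is simply i. Cross-entropy over all positions."""
    n, d = tokens.shape
    logits = tokens @ W                 # (n, n): one logit per candidate position
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    targets = np.arange(n)              # token i must predict position i
    return float(-log_probs[np.arange(n), targets].mean())
```

A model that recovers positions perfectly drives this loss toward zero; an uninformative model sits at `log(n)`, so the loss directly measures how much positional structure the content alone reveals.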
Apprentissage de Représentations Visuelles Profondes (Learning Deep Visual Representations)
Recent advancements in deep learning and visual information processing present an opportunity to unite both fields. These complementary fields combine to tackle the problem of classifying images into their semantic categories. Deep learning brings learning and representational capabilities to a visual processing model adapted for image classification. This thesis proposes methods for learning deep visual representations for image classification.

The problem of deep learning is tackled on two fronts. The first is the unsupervised learning of latent representations from input data. The main focus is the integration of prior knowledge into the learning of restricted Boltzmann machines (RBMs) through regularization. Regularizers are proposed to induce sparsity, selectivity, and topographic organization in the coding, improving discrimination and invariance. The second direction introduces the notion of gradually transitioning from unsupervised layer-wise learning to supervised deep learning, by integrating bottom-up information with top-down signals. Two novel implementations supporting this notion are explored. The first uses top-down regularization to train a deep network of RBMs. The second combines predictive and reconstructive loss functions to optimize a stack of encoder-decoder networks.

The proposed deep learning techniques are applied to the image classification problem. The bag-of-words model is adopted for its strengths in image modeling through local image descriptors and spatial pooling schemes. Deep learning with spatial aggregation is used to learn a hierarchical visual dictionary for encoding image descriptors into mid-level representations. This method achieves leading image classification performance for object and scene images. The learned dictionaries are diverse and non-redundant, and inference is fast.
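The sparsity regularization mentioned above can be illustrated with one common choice, assumed here purely for illustration (the thesis proposes its own regularizer variants): a KL-divergence penalty pulling each hidden unit's mean activation toward a low target rate.

```python
import numpy as np

def sparsity_penalty(hidden_probs, target=0.05):
    """Penalty pushing each hidden unit's mean activation over the batch
    toward a low target rate, encouraging sparse codes."""
    q = hidden_probs.mean(axis=0)           # mean activation per hidden unit
    q = np.clip(q, 1e-7, 1 - 1e-7)          # avoid log(0)
    # KL divergence between the target Bernoulli rate and observed rate
    return float(np.sum(target * np.log(target / q)
                        + (1 - target) * np.log((1 - target) / (1 - q))))
```

Added to the RBM objective with a weighting coefficient, this term vanishes when units already fire at the target rate and grows as they become over-active, which is how such regularizers shape the learned code.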
From this, a further optimization is performed for the subsequent pooling step, by introducing a differentiable pooling parameterization and applying the error backpropagation algorithm. This thesis represents one of the first attempts to synthesize deep learning and the bag-of-words model. This union raises many challenging research problems, leaving much room for further study in this area.
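One way to make the pooling step differentiable, as described above, is a softmax-weighted pooling that interpolates between average and max pooling via a learnable temperature. This particular parameterization is an illustrative assumption, not necessarily the thesis's exact formulation:

```python
import numpy as np

def smooth_pool(x, beta):
    """Differentiable pooling over a region: a softmax-weighted average.
    beta = 0 recovers average pooling; large beta approaches max pooling,
    so the pooling operator itself can be tuned by gradient descent."""
    w = np.exp(beta * (x - x.max()))   # stabilized softmax weights
    w /= w.sum()
    return float((w * x).sum())
```

Because `smooth_pool` is differentiable in both its inputs and `beta`, error backpropagation can adjust the pooling behavior jointly with the dictionary, which is the kind of end-to-end refinement the pooling optimization above aims for.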