
    Co-Regularized Deep Representations for Video Summarization

    Full text link
    Compact keyframe-based video summaries are a popular way of generating viewership on video sharing platforms. Yet, creating relevant and compelling summaries for arbitrarily long videos with a small number of keyframes is a challenging task. We propose a comprehensive keyframe-based summarization framework combining deep convolutional neural networks and restricted Boltzmann machines. An original co-regularization scheme is used to discover meaningful subject-scene associations. The resulting multimodal representations are then used to select highly relevant keyframes. A comprehensive user study is conducted comparing our proposed method to a variety of schemes, including the summarization currently in use by one of the most popular video sharing websites. The results show that our method consistently outperforms the baseline schemes for any given number of keyframes, both in terms of attractiveness and informativeness. The lead is even more significant for smaller summaries.
    Comment: Video summarization, deep convolutional neural networks, co-regularized restricted Boltzmann machine
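    The keyframe-selection pipeline can be illustrated with a minimal sketch: extract per-frame deep features, score each frame's relevance, and keep the top-k frames. Everything below is an illustrative assumption, not the paper's method; in particular, the co-regularized RBM that fuses subject and scene modalities is abstracted into a placeholder similarity score, and a stock ResNet stands in for the paper's convolutional network.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Pretrained CNN as the frame-level feature extractor (a stand-in for the
# paper's deep network; the multimodal subject-scene representation learned
# by co-regularized RBMs is abstracted into the placeholder score below).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # expose the 512-d pooled features
backbone.eval()

def select_keyframes(frames, k=5):
    """frames: float tensor (N, 3, 224, 224); returns sorted indices of k keyframes."""
    with torch.no_grad():
        feats = backbone(frames)  # (N, 512)
    # Placeholder relevance: similarity of each frame to the mean video
    # representation; the paper instead scores frames with learned
    # multimodal representations.
    scores = F.cosine_similarity(feats, feats.mean(dim=0, keepdim=True))
    return torch.topk(scores, k).indices.sort().values
```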

    MAEEG: Masked Auto-encoder for EEG Representation Learning

    Full text link
    Decoding information from bio-signals such as EEG using machine learning has been challenging due to small datasets and the difficulty of obtaining labels. We propose a reconstruction-based self-supervised learning model, the masked auto-encoder for EEG (MAEEG), which learns EEG representations by reconstructing masked EEG features with a transformer architecture. We found that MAEEG can learn representations that significantly improve sleep stage classification (~5% accuracy increase) when only a small number of labels are available. We also found that the input sample length and the way signals are masked during reconstruction-based SSL pretraining have a substantial effect on downstream model performance. Specifically, learning to reconstruct a larger proportion of more concentrated masked signal results in better performance on sleep classification. Our findings provide insight into how reconstruction-based SSL can help representation learning for EEG.
    Comment: 10 pages, 5 figures; accepted by the Workshop on Learning from Time Series for Health, NeurIPS 2022, as a poster presentation
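    The reconstruction-based pretraining described above can be sketched compactly: mask contiguous chunks of an EEG sequence, encode with a transformer, and regress the original values at the masked positions. The module names, sizes, and masking scheme below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MaskedEEGAutoencoder(nn.Module):
    def __init__(self, n_channels=64, d_model=128, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(n_channels, d_model)
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_channels)

    def forward(self, x, mask_ratio=0.5, chunk=10):
        # x: (batch, time, channels). Mask contiguous chunks, following the
        # abstract's finding that larger, more concentrated masks help.
        B, T, _ = x.shape
        mask = torch.zeros(B, T, dtype=torch.bool, device=x.device)
        for b in range(B):
            for s in torch.randint(0, T - chunk, (int(mask_ratio * T / chunk),)):
                mask[b, s:s + chunk] = True
        h = self.embed(x)
        h = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(h), h)
        recon = self.head(self.encoder(h))
        return ((recon - x) ** 2)[mask].mean()  # loss only on masked steps
```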

    Position Prediction as an Effective Pretraining Strategy

    Full text link
    Transformers have gained increasing popularity in a wide range of applications, including Natural Language Processing (NLP), Computer Vision and Speech Recognition, because of their powerful representational capacity. However, harnessing this representational capacity effectively requires a large amount of data, strong regularization, or both, to mitigate overfitting. Recently, the power of the Transformer has been unlocked by self-supervised pretraining strategies based on masked autoencoders, which rely on reconstructing masked inputs, directly or contrastively, from unmasked content. This pretraining strategy, used in BERT models in NLP, Wav2Vec models in Speech and, recently, MAE models in Vision, forces the model to learn about relationships between the content in different parts of the input using autoencoding-related objectives. In this paper, we propose a novel but surprisingly simple alternative to content reconstruction: predicting locations from content, without providing positional information for it. Doing so requires the Transformer to understand the positional relationships between different parts of the input from their content alone. This amounts to an efficient implementation where the pretext task is a classification problem among all possible positions for each input token. We experiment on both Vision and Speech benchmarks, where our approach brings improvements over strong supervised training baselines and is comparable to modern unsupervised/self-supervised pretraining methods. Our method also enables Transformers trained without position embeddings to outperform ones trained with full position information.
    Comment: Accepted to ICML 2022
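    The pretext task reduces to a clean sketch: feed content embeddings with no position embedding into a transformer, and classify each token's position among all possible positions with a cross-entropy loss. The class and size choices below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PositionPredictor(nn.Module):
    def __init__(self, n_tokens=196, in_dim=768, d_model=256, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)   # note: no position embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_tokens)  # one class per possible position

    def forward(self, tokens):
        # tokens: (batch, n_tokens, in_dim), e.g. flattened image patches.
        return self.head(self.encoder(self.embed(tokens)))

def pretext_loss(model, tokens):
    # Without position embeddings the encoder is permutation-equivariant,
    # so asking slot i to output class i forces the model to infer position
    # from content alone (an explicit shuffle would be equivalent).
    B, T, _ = tokens.shape
    logits = model(tokens).reshape(B * T, T)
    targets = torch.arange(T, device=tokens.device).repeat(B)
    return nn.functional.cross_entropy(logits, targets)
```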

    Apprentissage de Représentations Visuelles Profondes (Learning Deep Visual Representations)

    No full text
    Recent advancements in the areas of deep learning and visual information processing have presented an opportunity to unite both fields. These complementary fields combine to tackle the problem of classifying images into their semantic categories. Deep learning brings learning and representational capabilities to a visual processing model that is adapted for image classification. This thesis proposes methods for learning deep visual representations for image classification. The problem of deep learning is tackled on two fronts. The first is the unsupervised learning of latent representations from input data, with a focus on integrating prior knowledge into the learning of restricted Boltzmann machines (RBMs) through regularization. Regularizers are proposed to induce sparsity, selectivity and topographic organization in the coding, improving discrimination and invariance. The second direction introduces the notion of gradually transitioning from unsupervised layer-wise learning to supervised deep learning, achieved through the integration of bottom-up information with top-down signals. Two novel implementations supporting this notion are explored. The first method uses top-down regularization to train a deep network of RBMs. The second method combines predictive and reconstructive loss functions to optimize a stack of encoder-decoder networks. The proposed deep learning techniques are applied to the image classification problem. The bag-of-words model is adopted due to its strengths in image modeling through the use of local image descriptors and spatial pooling schemes. Deep learning with spatial aggregation is used to learn a hierarchical visual dictionary for encoding the image descriptors into mid-level representations. This method achieves leading image classification performance for object and scene images; the learned dictionaries are diverse and non-redundant, and inference is fast. A further optimization is then performed for the subsequent pooling step by introducing a differentiable pooling parameterization and applying the error backpropagation algorithm. This thesis represents one of the first attempts to synthesize deep learning and the bag-of-words model. This union raises many challenging research problems, leaving much room for further study in this area.
    Recent advances in deep learning and image processing present an opportunity to unify these two complementary research fields to better solve the problem of classifying images into semantic categories. Deep learning brings to image processing the representational power needed to improve the performance of image classification methods. This thesis proposes new methods for learning deep visual representations for this task. Deep learning is approached from two angles. First, we are interested in the unsupervised learning, from input data, of latent representations with certain properties. The idea is to integrate prior knowledge, through a regularization term, into the training of a restricted Boltzmann machine. We propose several forms of regularization that induce different properties such as sparsity, selectivity and topographic organization. The second aspect is the gradual transition from unsupervised to supervised learning of deep networks, achieved by introducing semantic category information as a form of supervision. Two new methods are proposed. The first is based on top-down regularization of deep belief networks built from restricted Boltzmann machines. The second trains deep autoencoders by optimizing a cost that combines a reconstruction criterion with a supervision criterion. The proposed methods were applied to the image classification problem. We adopted the bag-of-words model as the base model because it offers important possibilities through robust local descriptors and spatial pyramid pooling, which take the image's spatial information into account. Deep learning with spatial aggregation is used to learn a hierarchical dictionary for encoding mid-level visual representations. This method gives very competitive results in scene and image classification. The learned visual dictionaries contain diverse, non-redundant information with a coherent spatial structure, and inference is also very fast. We subsequently optimized the pooling step, applied to the codes produced by the previously learned hierarchical dictionary, by introducing a new differentiable parameterization of the pooling operation that allows learning by gradient descent using the backpropagation algorithm. This is the first attempt to unify deep learning and the bag-of-words model. Although this fusion may seem obvious, uniting the many aspects of deep visual representation learning remains a complex task in many respects and still requires a significant research effort.
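    The differentiable pooling idea lends itself to a short illustration. Below is a minimal sketch under assumed conventions, not the thesis's exact formulation: a learnable inverse temperature `beta` interpolates between average pooling (beta near 0) and max-like pooling (large beta), so the pooling step can be trained by backpropagation together with the dictionary. The `SoftPool` name and all sizes are hypothetical.

```python
import torch
import torch.nn as nn

class SoftPool(nn.Module):
    """Differentiable pooling: softmax-weighted sum over descriptors.

    beta -> 0 recovers average pooling; beta -> infinity approaches max
    pooling, and beta is learned jointly with the rest of the model.
    """
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(1.0))

    def forward(self, codes):
        # codes: (n_descriptors, dict_size) mid-level codes for one region.
        weights = torch.softmax(self.beta * codes, dim=0)
        return (weights * codes).sum(dim=0)  # (dict_size,) pooled code

# Usage: pooled = SoftPool()(torch.rand(100, 1024))
```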