
    Few-Shot Image Recognition by Predicting Parameters from Activations

    In this paper, we are interested in the few-shot learning problem. In particular, we focus on a challenging scenario where the number of categories is large and the number of examples per novel category is very limited, e.g. 1, 2, or 3. Motivated by the close relationship between the parameters and the activations in a neural network associated with the same category, we propose a novel method that can adapt a pre-trained neural network to novel categories by directly predicting the parameters from the activations. Zero training is required in adaptation to novel categories, and fast inference is realized by a single forward pass. We evaluate our method on few-shot image recognition on the ImageNet dataset, where it achieves state-of-the-art classification accuracy on novel categories by a significant margin while keeping comparable performance on the large-scale categories. We also test our method on the MiniImageNet dataset, where it strongly outperforms the previous state-of-the-art methods.
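    The idea of deriving classifier parameters from activations can be illustrated with a minimal sketch. It is not the paper's learned mapping: as a stated simplification, it takes the mean of the unit-normalized activations of a category's few examples as that category's weight vector. The function names and the toy two-dimensional activations are hypothetical.

```python
import math

def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def predict_weights(activations):
    """Assumed simplification: average the unit-normalized activations
    of the few available examples to obtain one weight vector."""
    unit = [normalize(a) for a in activations]
    dim = len(unit[0])
    return [sum(u[i] for u in unit) / len(unit) for i in range(dim)]

def classify(activation, class_weights):
    """Pick the category whose weight vector has the largest dot product
    with the normalized query activation (a single forward pass)."""
    query = normalize(activation)
    scores = {
        name: sum(q * w for q, w in zip(query, weights))
        for name, weights in class_weights.items()
    }
    return max(scores, key=scores.get)

# Toy activations for two novel categories, two examples each (made up).
support = {
    "cat": [[1.0, 0.1], [0.9, 0.2]],
    "dog": [[0.1, 1.0], [0.2, 0.8]],
}
class_weights = {name: predict_weights(acts) for name, acts in support.items()}
label = classify([0.8, 0.3], class_weights)
```

    No gradient step is taken for the novel categories here; the weights come directly from the example activations.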

    The Emotional Impact of Audio - Visual Stimuli

    Induced affect is the emotional effect of an object on an individual. It can be quantified through two metrics: valence and arousal. Valence quantifies how positive or negative something is, while arousal quantifies the intensity from calm to exciting. These metrics enable researchers to study how people opine on various topics. Affective content analysis of visual media is a challenging problem due to differences in perceived reactions. Industry-standard machine learning classifiers such as Support Vector Machines can be used to help determine user affect. The best affect-annotated video datasets are often analyzed by feeding large amounts of visual and audio features through machine-learning algorithms. The goal is to maximize accuracy, with the hope that each feature will contribute useful information. We depart from this approach to quantify how different modalities such as visual, audio, and text description information can aid in understanding affect. To that end, we train independent models for visual, audio, and text description. Each is a convolutional neural network paired with a support vector machine to classify valence and arousal. We also train various ensemble models that combine multi-modal information, with the hope that the information from independent modalities benefits each other. We find that our visual network alone achieves state-of-the-art valence classification accuracy and that our audio network, when paired with our visual network, achieves competitive results on arousal classification. Each network is much stronger on one metric than the other. This may lead to more sophisticated multimodal approaches to accurately identifying affect in video data. This work also contributes to induced emotion classification by augmenting existing sizable media datasets and providing a robust framework for classifying the same.

    Characterizing the impact of geometric properties of word embeddings on task performance

    Analysis of word embedding properties to inform their use in downstream NLP tasks has largely been studied by assessing nearest neighbors. However, geometric properties of the continuous feature space contribute directly to the use of embedding features in downstream models, and are largely unexplored. We consider four properties of word embedding geometry, namely: position relative to the origin, distribution of features in the vector space, global pairwise distances, and local pairwise distances. We define a sequence of transformations to generate new embeddings that expose subsets of these properties to downstream models and evaluate change in task performance to understand the contribution of each property to NLP models. We transform publicly available pretrained embeddings from three popular toolkits (word2vec, GloVe, and FastText) and evaluate on a variety of intrinsic tasks, which model linguistic information in the vector space, and extrinsic tasks, which use vectors as input to machine learning models. We find that intrinsic evaluations are highly sensitive to absolute position, while extrinsic tasks rely primarily on local similarity. Our findings suggest that future embedding models and post-processing techniques should focus primarily on similarity to nearby points in vector space. Comment: Appearing in the Third Workshop on Evaluating Vector Space Representations for NLP (RepEval 2019). 7 pages + reference
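    Why position relative to the origin can be separated from pairwise distances is easy to see with a small sketch. It is not the paper's transformation sequence; it only shows, on made-up two-dimensional vectors, that a translation leaves Euclidean distances unchanged while altering cosine similarity, which depends on the origin. All names and values are illustrative assumptions.

```python
import math

def translate(vectors, offset):
    """Shift every vector by the same offset (moves the cloud away from the origin)."""
    return [[x + o for x, o in zip(vector, offset)] for vector in vectors]

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine similarity, which is measured relative to the origin."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (made up for illustration).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
shifted = translate(embeddings, [5.0, 5.0])

distance_before = euclidean(embeddings[0], embeddings[1])
distance_after = euclidean(shifted[0], shifted[1])
cosine_before = cosine(embeddings[0], embeddings[1])
cosine_after = cosine(shifted[0], shifted[1])
```

    In this toy case the first two vectors are orthogonal before the shift and nearly parallel after it, although their distance is unchanged.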

    Optimized deep encoder-decoder methods for crack segmentation

    Continuous maintenance of concrete infrastructure is an important task that is needed to keep these structures operating safely. One kind of defect that occurs on surfaces in these structures is cracking. Automatic detection of cracks poses a challenging computer vision task, as the background, shape, colour and size of cracks vary. In this work we propose optimized deep encoder-decoder methods consisting of a combination of techniques which yield an increase in crack segmentation performance. Specifically, we propose a new design for the decoder part in encoder-decoder based deep learning architectures for semantic segmentation. We study its composition and how to achieve increased performance by exploring components such as deep supervision and upsampling strategies. Then we examine the optimal encoder to use in conjunction with this decoder and determine that pretrained encoders lead to an increase in performance. We propose a data augmentation strategy to increase the amount of available training data and carry out the performance evaluation of the designed architecture on four publicly available crack segmentation datasets. Additionally, we introduce two techniques not previously used in the field of surface crack segmentation: generating results using test-time augmentation and performing a statistical result analysis over multiple training runs. The former approach generally yields increased performance, whereas the latter makes a method's results more reproducible and more representative. Using these strategies with our proposed encoder-decoder architecture, we are able to achieve new state-of-the-art results on all datasets.
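    Test-time augmentation can be sketched in a few lines. The abstract does not say which augmentations are used, so the horizontal flip below is an assumption, and `toy_model` is a hypothetical stand-in for a trained segmentation network that returns one score per pixel.

```python
def hflip(grid):
    """Mirror a 2D grid left to right."""
    return [list(reversed(row)) for row in grid]

def predict_with_tta(model, image):
    """Average the prediction on the image with the prediction on its
    horizontally flipped copy, flipped back into the original frame."""
    plain = model(image)
    flipped_back = hflip(model(hflip(image)))
    return [
        [(p + f) / 2.0 for p, f in zip(plain_row, flipped_row)]
        for plain_row, flipped_row in zip(plain, flipped_back)
    ]

def toy_model(image):
    """Hypothetical stand-in for a network: each pixel's score is the mean
    of its own intensity and its left neighbour's (zero at the border)."""
    scores = []
    for row in image:
        scores.append([
            0.5 * value + 0.5 * (row[j - 1] if j > 0 else 0.0)
            for j, value in enumerate(row)
        ])
    return scores

# A tiny image with a vertical "crack" in the middle column (made up).
image = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
single_pass = toy_model(image)
averaged = predict_with_tta(toy_model, image)
```

    The stand-in model is deliberately asymmetric, so its single-pass output smears the crack to the right; averaging with the flipped pass gives a response centred on the crack column.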