    Fisher Vectors Derived from Hybrid Gaussian-Laplacian Mixture Models for Image Annotation

    In the traditional object recognition pipeline, descriptors are densely sampled over an image, pooled into a high-dimensional non-linear representation, and then passed to a classifier. In recent years, Fisher Vectors have proven empirically to be the leading representation for a large variety of applications. The Fisher Vector is typically taken as the gradient of the log-likelihood of the descriptors with respect to the parameters of a Gaussian Mixture Model (GMM). Motivated by the assumption that different distributions should be applied to different datasets, we present two other mixture models and derive their Expectation-Maximization and Fisher Vector expressions. The first is a Laplacian Mixture Model (LMM), which is based on the Laplacian distribution. The second is a Hybrid Gaussian-Laplacian Mixture Model (HGLMM), which is based on a weighted geometric mean of the Gaussian and Laplacian distributions. An interesting property of the Expectation-Maximization algorithm for the latter is that in the maximization step, each dimension in each component is chosen to be either a Gaussian or a Laplacian. Finally, by using the new Fisher Vectors derived from HGLMMs, we achieve state-of-the-art results for both the image annotation and the image-search-by-sentence tasks. Comment: the new version includes text synthesis by an RNN and experiments on the COCO benchmark.
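
    As a concrete reference point for the gradient-of-log-likelihood encoding described above, here is a minimal sketch of the standard GMM Fisher Vector (gradients with respect to the component means and variances, with the usual power- and L2-normalisation). It is not the paper's LMM or HGLMM derivation, and the descriptor dimensionality, component count and random data are placeholders for the example.

    # Minimal GMM Fisher Vector sketch; the LMM/HGLMM variants discussed above would
    # substitute the gradients of their own log-likelihoods.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fisher_vector(descriptors, gmm):
        """descriptors: (T, D) local descriptors; gmm: fitted diagonal-covariance GMM."""
        T = descriptors.shape[0]
        gamma = gmm.predict_proba(descriptors)                   # (T, K) posteriors
        mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_  # (K, D), (K, D), (K,)
        diff = (descriptors[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
        g_mu = (gamma[:, :, None] * diff).sum(0) / (T * np.sqrt(w)[:, None])
        g_var = (gamma[:, :, None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None])
        fv = np.hstack([g_mu.ravel(), g_var.ravel()])
        fv = np.sign(fv) * np.sqrt(np.abs(fv))                   # power normalisation
        return fv / (np.linalg.norm(fv) + 1e-12)                 # L2 normalisation

    # toy usage: 64-dimensional descriptors, 16 mixture components
    rng = np.random.default_rng(0)
    gmm = GaussianMixture(n_components=16, covariance_type='diag').fit(rng.standard_normal((500, 64)))
    fv = fisher_vector(rng.standard_normal((200, 64)), gmm)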

    Annotation Graphs and Servers and Multi-Modal Resources: Infrastructure for Interdisciplinary Education, Research and Development

    Annotation graphs and annotation servers offer infrastructure to support the analysis of human language resources in the form of time-series data such as text, audio and video. This paper outlines areas of common need among empirical linguists and computational linguists. After reviewing examples of data and tools used or under development in each of several areas, it proposes a common framework for future tool development, data annotation and resource sharing based upon annotation graphs and servers. Comment: 8 pages, 6 figures.
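
    The annotation-graph formalism referenced above can be pictured as a small data structure: nodes optionally anchored to time offsets in the signal, and labelled arcs spanning pairs of nodes, one arc per annotation layer. The sketch below is only an illustration of that idea, not the paper's own toolkit or server API, and all names are invented for the example.

    # Hedged sketch of an annotation graph: time-anchored nodes plus labelled arcs.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Node:
        id: int
        time: Optional[float] = None   # seconds into the audio/video; None if unanchored

    @dataclass
    class Arc:
        start: int                     # id of the start node
        end: int                       # id of the end node
        tier: str                      # annotation layer, e.g. "word", "phone", "gesture"
        label: str

    @dataclass
    class AnnotationGraph:
        nodes: List[Node] = field(default_factory=list)
        arcs: List[Arc] = field(default_factory=list)

    # toy usage: a two-word transcript aligned to the signal
    g = AnnotationGraph(
        nodes=[Node(0, 0.00), Node(1, 0.42), Node(2, 0.95)],
        arcs=[Arc(0, 1, "word", "hello"), Arc(1, 2, "word", "world")],
    )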

    Restructuring multimodal corrective feedback through Augmented Reality (AR)-enabled videoconferencing in L2 pronunciation teaching

    The problem of cognitive overload is particularly pertinent in multimedia L2 classroom corrective feedback (CF), which involves rich communicative tools to help the class notice the mismatch between the target input and learners’ pronunciation. Based on multimedia design principles, this study developed a new multimodal CF model delivered through augmented reality (AR)-enabled videoconferencing to eliminate extraneous cognitive load and guide learners’ attention to the essential material. Using a quasi-experimental design, the study examines the effectiveness of this new CF model in improving Chinese L2 students’ production and auditory identification of the targeted English consonants (dark /ɫ/, /ð/ and /θ/), as well as their attitudes towards this application. Results indicated that the online multimodal CF environment, equipped with AR annotation and filters, played a significant role in improving the participants’ production of the target segments. However, this advantage was not found in the auditory identification tests when compared with the offline multimedia CF class. In addition, the learners reported that the new CF model helped direct their attention to the articulatory gestures of the student being corrected and enhanced class efficiency. Implications for computer-assisted pronunciation training and the construction of online/offline multimedia learning environments are also discussed.

    Semantic Embedding Space for Zero-Shot Action Recognition

    The number of categories for action recognition is growing rapidly. It is thus becoming increasingly hard to collect sufficient training data to learn conventional models for each category. This issue may be ameliorated by the increasingly popular 'zero-shot learning' (ZSL) paradigm. In this framework a mapping is constructed between visual features and a human-interpretable semantic description of each category, allowing categories to be recognised in the absence of any training data. Existing ZSL studies focus primarily on image data and attribute-based semantic representations. In this paper, we address zero-shot recognition in contemporary video action recognition tasks, using a semantic word vector space as the common space in which to embed videos and category labels. This is more challenging because the mapping between the semantic space and the space-time features of videos containing complex actions is harder to learn. We demonstrate that a simple self-training and data augmentation strategy can significantly improve the efficacy of this mapping. Experiments on the human action datasets HMDB51 and UCF101 demonstrate that our approach achieves state-of-the-art zero-shot action recognition performance. Comment: 5 pages.
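
    As a rough illustration of the pipeline outlined above (regress video features into a word-vector space, then label unseen categories by the nearest class-name embedding), here is a hedged sketch. The ridge regressor, the feature and embedding dimensions, and the random stand-ins for extracted features and word vectors are assumptions made for the example; the self-training and data augmentation steps are omitted.

    # Hedged zero-shot sketch: map visual features to a semantic word-vector space,
    # then classify unseen categories by cosine similarity to class-name embeddings.
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.preprocessing import normalize

    def train_visual_to_semantic(X_train, label_vectors):
        """X_train: (N, D) video features; label_vectors: (N, E) word vectors of each sample's class name."""
        return Ridge(alpha=1.0).fit(X_train, label_vectors)

    def zero_shot_predict(reg, X_test, unseen_prototypes):
        """unseen_prototypes: (C, E) word vectors of the unseen class names."""
        Z = normalize(reg.predict(X_test))        # project test videos into the semantic space
        P = normalize(unseen_prototypes)
        return (Z @ P.T).argmax(axis=1)           # index of the nearest unseen prototype

    # toy usage with random stand-ins for extracted features and word vectors
    rng = np.random.default_rng(0)
    model = train_visual_to_semantic(rng.standard_normal((300, 512)), rng.standard_normal((300, 300)))
    pred = zero_shot_predict(model, rng.standard_normal((50, 512)), rng.standard_normal((10, 300)))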