71 research outputs found

    Multilevel Language and Vision Integration for Text-to-Clip Retrieval

    We address the problem of text-based activity retrieval in video. Given a sentence describing an activity, our task is to retrieve matching clips from an untrimmed video. To capture the inherent structures present in both text and video, we introduce a multilevel model that integrates vision and language features earlier and more tightly than prior work. First, we inject text features early on when generating clip proposals, to help eliminate unlikely clips and thus speed up processing and boost performance. Second, to learn a fine-grained similarity metric for retrieval, we use visual features to modulate the processing of query sentences at the word level in a recurrent neural network. A multi-task loss is also employed by adding query re-generation as an auxiliary task. Our approach significantly outperforms prior work on two challenging benchmarks: Charades-STA and ActivityNet Captions. Comment: AAAI 201

    Joint Alignment and Modeling of Correlated Behavior Streams

    The Variable Time-Shift Hidden Markov Model (VTS-HMM) is proposed for learning and modeling pairs of correlated streams. Unlike previous coupled models for time series, the VTS-HMM accounts for varying time shifts between correlated events in pairs of streams having different properties. The VTS-HMM is learned on a set of pairs of unaligned streams and, thus, learning entails simultaneous estimation of the varying time shifts and of the parameters of the model. The formulation is demonstrated in the analysis of videos of dyadic social interactions between children and adults in the Multimodal Dyadic Behavior Dataset (MMDB). In dyadic social interactions, an agent starts an interaction with one or more “initiating behaviors” that elicit one or more “responding behaviors” from the partner within a temporal window. The proposed VTS-HMM explicitly accounts for varying time shifts between initiating and responding behaviors in these behavior streams. The experiments confirm that modeling of these varying time shifts in the VTS-HMM can yield improved estimation of the level of engagement of the child and adult and more accurate discrimination among complex activities.

    MULE: Multimodal Universal Language Embedding

    Existing vision-language methods typically support two languages at a time at most. In this paper, we present a modular approach which can easily be incorporated into existing vision-language methods in order to support many languages. We accomplish this by learning a single shared Multimodal Universal Language Embedding (MULE) which has been visually-semantically aligned across all languages. Then we learn to relate MULE to visual data as if it were a single language. Our method is not architecture specific, unlike prior work which typically learned separate branches for each language, enabling our approach to easily be adapted to many vision-language methods and tasks. Since MULE learns a single language branch in the multimodal model, we can also scale to support many languages, and languages with fewer annotations can take advantage of the good representation learned from other (more abundant) language data. We demonstrate the effectiveness of MULE on the bidirectional image-sentence retrieval task, supporting up to four languages in a single model. In addition, we show that Machine Translation can be used for data augmentation in multilingual learning, which, combined with MULE, improves mean recall by up to 21.9% on a single language compared to prior work, with the most significant gains seen on languages with relatively few annotations. Our code is publicly available. Comment: Accepted as an oral at AAAI 202

    Leveraging Affect Transfer Learning for Behavior Prediction in an Intelligent Tutoring System

    In the context of building an intelligent tutoring system (ITS), which improves student learning outcomes by intervention, we set out to improve prediction of student problem outcome. In essence, we want to predict the outcome of a student answering a problem in an ITS from a video feed by analyzing their face and gestures. For this, we present a novel transfer learning facial affect representation and a user-personalized training scheme that unlocks the potential of this representation. We model the temporal structure of video sequences of students solving math problems using a recurrent neural network architecture. Additionally, we extend the largest dataset of student interactions with an intelligent online math tutor by a factor of two. Our final model, coined ATL-BP (Affect Transfer Learning for Behavior Prediction), achieves an increase in mean F-score over the state of the art of 45% on this new dataset in the general case and 50% in a more challenging leave-users-out experimental setting when we use a user-personalized training scheme.

    Memetic electromagnetism algorithm for surface reconstruction with rational bivariate Bernstein basis functions

    Surface reconstruction is a very important issue with outstanding applications in fields such as medical imaging (computer tomography, magnetic resonance), biomedical engineering (customized prosthesis and medical implants), computer-aided design and manufacturing (reverse engineering for the automotive, aerospace and shipbuilding industries), rapid prototyping (scale models of physical parts from CAD data), computer animation and film industry (motion capture, character modeling), archaeology (digital representation and storage of archaeological sites and assets), virtual/augmented reality, and many others. In this paper we address the surface reconstruction problem by using rational Bézier surfaces. This problem is by far more complex than the case for curves we solved in a previous paper. In addition, we deal with data points subjected to measurement noise and irregular sampling, replicating the usual conditions of real-world applications. Our method is based on a memetic approach combining a powerful metaheuristic method for global optimization (the electromagnetism algorithm) with a local search method. This method is applied to a benchmark of five illustrative examples exhibiting challenging features. Our experimental results show that the method performs very well, and it can recover the underlying shape of surfaces with very good accuracy. This research is kindly supported by the Computer Science National Program of the Spanish Ministry of Economy and Competitiveness, Project #TIN2012-30768, Toho University, and the University of Cantabria. The authors are particularly grateful to the Department of Information Science of Toho University for all the facilities given to carry out this work. We also thank the Editor and the two anonymous reviewers who helped us to improve our paper with several constructive comments and suggestions.