26 research outputs found

    UTS CETC D2DCRC submission at the TRECVID 2018 video to text description task

    In this paper, we report our methods for the video to text description task of TRECVID 2018 [1]. The task consists of two subtasks, i.e., description generation and matching & ranking. In the description generation subtask, because no standard training data was provided, we focused principally on improving the generalization ability of our model. Instead of exploring complex models, we investigated the widely used LSTM-based sequence-to-sequence model [10] and some of its variants, which are simple yet robust. We also reviewed several training strategies to improve the generalization ability of our model. In the matching and ranking subtask, we designed a two-branch deep model [6] to embed visual content and semantic content respectively. The model projects information from the different modalities into a common embedding space. Further, we examined several metric-learning losses, such as the triplet loss and its variants, in our experiments.
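    The matching-and-ranking approach above (a joint embedding space trained with a triplet loss) can be sketched as follows; the margin value and toy vectors are illustrative, not the submission's actual settings.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss over embedded vectors.

    Pulls the matching (video, caption) pair together and pushes a
    non-matching caption at least `margin` further away. The margin
    value here is illustrative, not the authors' setting.
    """
    d_pos = np.linalg.norm(anchor - positive)  # distance to the true caption
    d_neg = np.linalg.norm(anchor - negative)  # distance to a wrong caption
    return max(0.0, d_pos - d_neg + margin)

# Toy 3-d embeddings: the matching caption is already much closer
# than the non-matching one, so the hinge is inactive and the loss is 0.
video = np.array([1.0, 0.0, 0.0])
caption_pos = np.array([0.9, 0.1, 0.0])
caption_neg = np.array([0.0, 1.0, 0.0])
loss = triplet_loss(video, caption_pos, caption_neg)
```

    In training, such a loss is minimised over many (video, matching caption, non-matching caption) triplets so that distances in the shared space reflect semantic similarity across modalities.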

    Deep Reinforcement Sequence Learning for Visual Captioning

    Methods to describe an image or video with natural language, namely image and video captioning, have recently converged on an encoder-decoder architecture. The encoder is a deep convolutional neural network (CNN) that learns a fixed-length representation of the input image, and the decoder is a recurrent neural network (RNN), initialised with this representation, that generates a description of the scene in natural language. Traditional training mechanisms for this architecture usually optimise models with a cross-entropy loss, which suffers from two major problems. First, it inherently presents exposure bias (the model is only exposed to real descriptions, never to its own words), causing errors to accumulate at test time. Second, the ultimate objective is not directly optimised, because the scoring metrics are non-differentiable and cannot be used in the procedure. New applications of reinforcement learning algorithms, such as self-critical training, overcome the exposure bias while directly optimising non-differentiable sequence-based test metrics. This thesis reviews and analyses the performance of these optimisation algorithms. Experiments with the self-critical loss show the importance of using reward metrics that are robust to gaming; otherwise the qualitative performance is completely undermined. With that addressed, the results do not show a large quality improvement; rather, expressiveness worsens and the vocabulary moves closer to that of the references. Subsequent experiments with a greatly improved encoder yield only a marginal improvement in the overall results, suggesting that the learned policy is heavily constrained by the decoder language model. The thesis concludes that further analysis with higher-capacity language models is needed.
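    The self-critical idea mentioned above can be sketched as computing an advantage for each sampled caption: its metric score minus the score of the model's own greedy decode, which then scales the policy gradient. The scores below are made-up metric values, not results from the thesis.

```python
import numpy as np

def scst_advantages(sample_rewards, greedy_rewards):
    """Self-critical baseline: reward of each sampled caption minus the
    reward of the greedy (test-time) caption from the same model.

    Captions that score above the model's own greedy output receive a
    positive weight (reinforced); the rest receive a negative weight
    (suppressed). No separate learned baseline is needed.
    """
    return np.asarray(sample_rewards) - np.asarray(greedy_rewards)

# Made-up metric scores for three sampled captions vs. their greedy baselines.
advantages = scst_advantages([0.8, 0.4, 0.6], [0.5, 0.5, 0.5])
```

    Because the baseline is the model's own test-time behaviour, training directly pushes sampled captions to outperform what the model would actually emit at inference, which is what removes the exposure-bias mismatch.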

    A Framework for Information Accessibility in Large Video Repositories

    Online videos are a medium of choice for young adults to access or receive information, and recent work has highlighted that, owing to its visual nature, it is a particularly effective medium for adults with intellectual disability. Reflecting on a case study presenting fieldwork observations of how adults with intellectual disability engage with videos on the YouTube platform, we propose a framework to define and evaluate the accessibility of such large video repositories from an informational perspective. The proposed framework distinguishes the concept of information accessibility from that of the accessibility of information-access interfaces themselves (generally catered for under web accessibility guidelines) and from that of the documents (generally covered by general accessibility guidelines). It also includes a notion of search (or browsing) accessibility, which reflects the ability to reach the document containing the information. In the context of large information repositories, this concept goes beyond how the documents are organized to how automated processes (browsing or searching) can support users. In addition to the framework, we detail specifics of document accessibility for videos. The framework suggests a multi-dimensional approach to information accessibility evaluation that includes both cognitive and sensory aspects. It can serve as a basis for practitioners when designing video information repositories accessible to people with intellectual disability, and extends information-presentation guidelines such as those suggested by the WCAG.

    Multimodal Classification of Urban Micro-Events

    In this paper we seek methods to effectively detect urban micro-events: events which occur in cities, have limited geographical coverage, and typically affect only a small group of citizens. Because of their scale, these are difficult to identify in most data sources. However, by using citizen sensing to gather data, detecting them becomes feasible. The data gathered by citizen sensing is often multimodal and, as a consequence, the information required to detect urban micro-events is distributed over multiple modalities. This makes it essential to have a classifier capable of combining them. In this paper we explore several methods of creating such a classifier, including early, late, and hybrid fusion, as well as representation learning using multimodal graphs. We evaluate performance on a real-world dataset obtained from a live citizen reporting system. We show that a multimodal approach yields higher performance than unimodal alternatives. Furthermore, we demonstrate that our hybrid combination of early and late fusion with multimodal embeddings performs best in the classification of urban micro-events.
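    The early/late fusion distinction named above can be illustrated minimally; the feature values, dimensions, and the late-fusion weight are made up, and the paper's actual classifiers are not reproduced here.

```python
import numpy as np

def early_fusion(text_feat, image_feat):
    """Early fusion: concatenate per-modality features into one joint
    vector before any classifier sees them."""
    return np.concatenate([text_feat, image_feat])

def late_fusion(text_probs, image_probs, w=0.5):
    """Late fusion: combine the class probabilities of separate
    unimodal classifiers; `w` weights the text branch."""
    return w * np.asarray(text_probs) + (1 - w) * np.asarray(image_probs)

text_feat = np.array([0.2, 0.7])        # e.g. a text embedding
image_feat = np.array([0.9, 0.1, 0.4])  # e.g. an image embedding
joint = early_fusion(text_feat, image_feat)      # 5-d joint feature vector
probs = late_fusion([0.6, 0.4], [0.2, 0.8])      # fused class probabilities
```

    A hybrid scheme, as in the paper, combines both: joint features feed one classifier whose output is then merged with unimodal predictions.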

    The development of a video retrieval system using a clinician-led approach

    Patient video taken at home can provide valuable insights into recovery progress during a programme of physical therapy, but is very time-consuming for clinicians to review. Our work focussed on (i) enabling any patient to share information about progress at home, simply by sharing video, and (ii) building intelligent systems to support Physical Therapists (PTs) in reviewing this video data and extracting the necessary detail. This paper reports the development of the system, appropriate for future clinical use without reliance on a technical team, and the clinician involvement in that development. We contribute an interactive content-based video retrieval system that significantly reduces the time taken for clinicians to review videos, using human head movement as an example. The system supports query-by-movement (clinicians move their own body to define search queries) and retrieves the essential fine-grained movements needed for clinical interpretation. This is done by comparing sequences of image-based pose estimates (here head rotations) through a distance metric (here Fréchet distance) and presenting a ranked list of similar movements to clinicians for review. In contrast to existing intelligent systems for retrospective review of human movement, the system supports a flexible analysis where clinicians can look for any movement that interests them. Evaluation by a group of PTs with expertise in training movement control showed that 96% of all relevant movements were identified, with time savings of as much as 99.1% compared to reviewing target videos in full. The novelty of this contribution includes retrospective progress monitoring that preserves context through video, and content-based video retrieval that supports both fine-grained human actions and query-by-movement.
Future research, including large clinician-led studies, will refine the technical aspects and explore the benefits in terms of patient outcomes, PT time, and financial savings over the course of a programme of therapy. It is anticipated that this clinician-led approach will mitigate the reported slow clinical uptake of technology, with resulting patient benefit.
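    The sequence comparison described above can be sketched with the standard discrete Fréchet distance; representing head rotation as a 1-D angle sequence (in degrees) is an assumption for illustration, not the paper's exact pose representation.

```python
def discrete_frechet(p, q):
    """Discrete Fréchet distance between two trajectories, here 1-D
    sequences of head-rotation angles. Standard dynamic programme over
    couplings of the two sequences; lower means more similar."""
    n, m = len(p), len(q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = abs(p[i] - q[j])  # pointwise distance between angles
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[i][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][j], d)
            else:
                ca[i][j] = max(min(ca[i - 1][j], ca[i][j - 1],
                                   ca[i - 1][j - 1]), d)
    return ca[-1][-1]

# A query movement (head turns right and back) against two candidate
# clips: the clip containing the similar movement ranks first.
query = [0, 15, 30, 15, 0]
similar = [0, 10, 28, 12, 0]
different = [0, 0, 0, 40, 40]
ranked = sorted([("similar", discrete_frechet(query, similar)),
                 ("different", discrete_frechet(query, different))],
                key=lambda t: t[1])
```

    Ranking candidate segments by this distance and presenting the top matches is what lets a clinician review only the retrieved movements rather than whole videos.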