
    HBST: A Hamming Distance embedding Binary Search Tree for Visual Place Recognition

    Reliable and efficient Visual Place Recognition is a major building block of modern SLAM systems. Building on our prior work, in this paper we present a Hamming Distance embedding Binary Search Tree (HBST) approach for binary Descriptor Matching and Image Retrieval. HBST allows for descriptor Search and Insertion in logarithmic time by exploiting particular properties of binary Feature descriptors. We support the idea behind our search structure with a thorough analysis of the exploited descriptor properties and their effects on completeness and complexity of search and insertion. To validate our claims we conducted comparative experiments for HBST and several state-of-the-art methods on a broad range of publicly available datasets. HBST is available as a compact open-source C++ header-only library. Comment: Submitted to IEEE Robotics and Automation Letters (RA-L) 2018 with International Conference on Intelligent Robots and Systems (IROS) 2018 option, 8 pages, 10 figures
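    The idea of a binary search tree over binary descriptors can be illustrated with a minimal Python sketch. This is not the HBST library's API and not the authors' implementation: the names `Node`, `insert`, `search`, `MAX_LEAF` and `NUM_BITS` are hypothetical, descriptors are assumed to be 8-bit integers, and the rule "split on the bit whose index equals the node depth" is a simplifying assumption made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_LEAF = 4   # hypothetical leaf capacity before a split is attempted
NUM_BITS = 8   # hypothetical descriptor length in bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


@dataclass
class Node:
    depth: int = 0
    bit: Optional[int] = None            # None marks a leaf
    left: Optional["Node"] = None        # descriptors with the split bit = 0
    right: Optional["Node"] = None       # descriptors with the split bit = 1
    items: List[int] = field(default_factory=list)


def descend(node: Node, d: int) -> Node:
    """Follow one bit test per level until a leaf is reached."""
    while node.bit is not None:
        node = node.right if (d >> node.bit) & 1 else node.left
    return node


def insert(root: Node, d: int) -> None:
    leaf = descend(root, d)
    leaf.items.append(d)
    # Split an over-full leaf while unused bits remain (assumed rule).
    if len(leaf.items) > MAX_LEAF and leaf.depth < NUM_BITS:
        leaf.bit = leaf.depth
        leaf.left = Node(depth=leaf.depth + 1)
        leaf.right = Node(depth=leaf.depth + 1)
        for x in leaf.items:
            child = leaf.right if (x >> leaf.bit) & 1 else leaf.left
            child.items.append(x)
        leaf.items = []


def search(root: Node, q: int) -> Optional[int]:
    """Return the closest descriptor (Hamming distance) in the reached leaf."""
    leaf = descend(root, q)
    return min(leaf.items, key=lambda x: hamming(x, q), default=None)
```

    Because `search` inspects only the single leaf reached by the bit tests, it can miss the true nearest neighbour when that descriptor differs from the query in one of the tested bits. This is the kind of trade-off between completeness and complexity that the abstract refers to.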

    Image Labeling and Classification by Semantic Tag Analysis

    Image classification and retrieval play a significant role in dealing with large multimedia data on the Internet. Social networks, image sharing websites and mobile applications require categorizing multimedia items for more efficient search and storage. Therefore, image classification and retrieval methods have gained great importance for researchers and companies. Image classification can be performed in a supervised or semi-supervised manner. To categorize an unknown image, a statistical model created using pre-labeled samples is fed with the numerical representation of the visual features of the image. A supervised approach requires a set of labeled data to create a statistical model, and subsequently classify an unlabeled test set. However, labeling images manually requires a great deal of time and effort. Therefore, a major research activity has gravitated towards finding efficient methods to reduce the time and effort for image labeling. Most images on social websites have associated tags that somewhat describe their content. These tags may provide significant content descriptors if a semantic bridge can be established between image content and tags. In this thesis, we focus on cases where accurate class labels are scarce or even absent and only some associated tags are present. The goal is to analyze and utilize available tags to categorize database images to form a training dataset over which a dedicated classifier is trained and then used for image classification. Our framework contains a semantic text analysis tool based on WordNet to measure the semantic relatedness between the associated image tags and predefined class labels, and a novel method for labeling the corresponding images. The classifier is trained using only low-level visual image features. The experimental results using 7 classes from the MirFlickr dataset demonstrate that semantically analyzing tags attached to images significantly improves the image classification accuracy by providing additional training data.

    Vision and language understanding with localized evidence

    Enabling machines to solve computer vision tasks with natural language components can greatly improve human interaction with computers. In this thesis, we address vision and language tasks with deep learning methods that explicitly localize relevant visual evidence. Spatial evidence localization in images enhances the interpretability of the model, while temporal localization in video is necessary to remove irrelevant content. We apply our methods to various vision and language tasks, including visual question answering, temporal activity detection, dense video captioning and cross-modal retrieval. First, we tackle the problem of image question answering, which requires the model to predict answers to questions posed about images. We design a memory network with a question-guided spatial attention mechanism which assigns higher weights to regions that are more relevant to the question. The visual evidence used to derive the answer can be shown by visualizing the attention weights in images. We then address the problem of localizing temporal evidence in videos. For most language/vision tasks, only part of the video is relevant to the linguistic component, so we need to detect these relevant events in videos. We propose an end-to-end model for temporal activity detection, which can detect arbitrary-length activities by coordinate regression with respect to anchors and contains a proposal stage to filter out background segments, saving computation time. We further extend activity category detection to event captioning, which can express richer semantic meaning than a class label. This leads to the problem of dense video captioning, which involves two sub-problems: localizing distinct events in long video and generating captions for the localized events. We propose an end-to-end hierarchical captioning model with vision and language context modeling in which the captioning training affects the activity localization. Lastly, the task of text-to-clip video retrieval requires one to localize the specified query instead of detecting and captioning all events. We propose a model based on the early fusion of words and visual features, outperforming standard approaches which embed the whole sentence before performing late feature fusion. Furthermore, we use queries to regulate the proposal network to generate query-related proposals. In conclusion, our proposed visual localization mechanism applies across a variety of vision and language tasks and achieves state-of-the-art results. Together with the inference module, our work can contribute to solving other tasks such as video question answering in future research.
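    The question-guided spatial attention mentioned in this abstract can be illustrated with a minimal sketch. It assumes plain dot-product scoring followed by a softmax, which is a common attention formulation and not necessarily the one used in the thesis. The function names and the toy feature vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def question_guided_attention(
    question: Sequence[float],
    regions: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Weight image regions by their relevance to a question embedding.

    Assumption: relevance is the dot product between the question vector
    and each region feature vector (all vectors share one dimension).
    Returns the attention weights and the weighted sum of region features.
    """
    scores = [sum(q * r for q, r in zip(question, region)) for region in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    attended = [
        sum(w * region[i] for w, region in zip(weights, regions))
        for i in range(dim)
    ]
    return weights, attended
```

    The returned weights are the quantity one would visualize over the image to show which regions supported the answer.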

    A Survey on Video-based Graphics and Video Visualization


    Region-based Multimedia Indexing and Retrieval Framework

    Many systems have been proposed for automatic description and indexing of digital data for later retrieval. One such content-based indexing-and-retrieval system, and the one used as a framework in this thesis, is the MUVIS system, which was developed at Tampere University of Technology, in Finland. Moreover, Content-based Image Retrieval (CBIR) utilising frame-based and region-based features has been a dynamic research area in recent years. Several systems have been developed using their specific segmentation, feature extraction, and retrieval methods. In this thesis, a framework for region-based CBIR is presented. The framework does not specify or fix the segmentation and local feature extraction methods, which are instead considered as “black-boxes” so as to allow the application of any segmentation method and visual descriptor. The proposed framework adopts a grouping approach in order to correct possible over-segmentation faults, and a spatial feature called region proximity is introduced to describe region topology in a visual scene by a block-based approach. Using the MUVIS system, a prototype system of the proposed framework is implemented as a region-based feature extraction module, which integrates simple colour segmentation and region-based feature description based on colour and texture. The spatial region proximity feature represents regions and describes their topology by a novel metric proposed in this thesis based on the block-based approach and average distance calculation. After the region-based feature extraction step, a feature vector is formed which holds information about all image regions with their local low-level and spatial properties. During the retrieval process, those feature vectors are used for computing the (dis-)similarity distances between two images, taking into account each of their individual components. In this case a many-to-one matching scheme between regions characterised by a similarity maximisation approach is integrated into a query-by-example scheme. Retrieval performance is compared between frame-based feature combination and the proposed framework with two different grouping approaches. Experiments are carried out on synthetic and natural image databases and the results indicate that a promising retrieval performance can be obtained as long as a reasonable segmentation quality is obtained. The integration of the region proximity feature further improves the retrieval performance, especially for divisible, object-based image content. Finally, frame-based and region-based texture extraction schemes are compared to evaluate the effect of a region on the texture description and retrieval performance utilising the proposed framework. Results show that retrieval performance degrades significantly with region-based texture descriptors compared with the frame-based approaches.
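    The many-to-one matching by similarity maximisation described in this abstract can be sketched as follows. The abstract does not give the thesis' actual metric, so the per-region similarity (an inverse of Euclidean distance between feature vectors) and the averaging step are assumptions, and both function names are hypothetical.

```python
import math
from typing import Sequence


def region_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Hypothetical similarity in (0, 1]: 1 for identical feature vectors."""
    return 1.0 / (1.0 + math.dist(a, b))


def many_to_one_similarity(
    query_regions: Sequence[Sequence[float]],
    target_regions: Sequence[Sequence[float]],
) -> float:
    """Match every query region to its most similar target region.

    Several query regions may select the same target region (many-to-one).
    The image-level score is assumed to be the mean of the best matches.
    """
    if not query_regions or not target_regions:
        return 0.0
    best = [
        max(region_similarity(q, t) for t in target_regions)
        for q in query_regions
    ]
    return sum(best) / len(best)
```

    In a query-by-example setting, this score would be computed between the query image and every database image, and the database images would then be ranked by it.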

    MEMEX: Detecting Explanatory Evidence for Memes via Knowledge-Enriched Contextualization

    Memes are a powerful tool for communication over social media. Their affinity for evolving across politics, history, and sociocultural phenomena makes them an ideal communication vehicle. To comprehend the subtle message conveyed within a meme, one must understand the background that facilitates its holistic assimilation. Besides digital archiving of memes and their metadata by a few websites like knowyourmeme.com, currently, there is no efficient way to deduce a meme's context dynamically. In this work, we propose a novel task, MEMEX: given a meme and a related document, the aim is to mine the context that succinctly explains the background of the meme. First, we develop MCC (Meme Context Corpus), a novel dataset for MEMEX. Further, to benchmark MCC, we propose MIME (MultImodal Meme Explainer), a multimodal neural framework that uses a common-sense-enriched meme representation and a layered approach to capture the cross-modal semantic dependencies between the meme and the context. MIME surpasses several unimodal and multimodal systems and yields an absolute improvement of ~4% F1-score over the best baseline. Lastly, we conduct detailed analyses of MIME's performance, highlighting the aspects that could lead to optimal modeling of cross-modal contextual associations. Comment: 9 pages main + 1 ethics + 3 pages ref. + 4 pages app (Total: 17 pages)