
    HBST: A Hamming Distance embedding Binary Search Tree for Visual Place Recognition

    Reliable and efficient Visual Place Recognition is a major building block of modern SLAM systems. Building on our prior work, in this paper we present a Hamming-distance embedding Binary Search Tree (HBST) approach for binary descriptor matching and image retrieval. HBST allows for descriptor search and insertion in logarithmic time by exploiting particular properties of binary feature descriptors. We support the idea behind our search structure with a thorough analysis of the exploited descriptor properties and their effects on the completeness and complexity of search and insertion. To validate our claims we conducted comparative experiments for HBST and several state-of-the-art methods on a broad range of publicly available datasets. HBST is available as a compact open-source C++ header-only library. Comment: Submitted to IEEE Robotics and Automation Letters (RA-L) 2018 with International Conference on Intelligent Robots and Systems (IROS) 2018 option; 8 pages, 10 figures.
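
    The bit-splitting idea behind such a tree can be sketched as follows: each internal node partitions descriptors on a single bit, and leaves hold small buckets that are scanned linearly by Hamming distance, giving logarithmic descent on balanced splits. This is a minimal Python sketch under assumed choices (balanced bit selection, a leaf capacity of 32), not the authors' C++ implementation.

        # Minimal sketch of a bit-splitting search tree for binary descriptors.
        # Descriptors are tuples of 0/1 bits paired with an arbitrary payload.
        class HammingTreeNode:
            def __init__(self, descriptors, depth=0, max_leaf_size=32, num_bits=256):
                self.left = self.right = None
                self.bit = None
                self.descriptors = descriptors  # list of (bits, payload) pairs
                if len(descriptors) > max_leaf_size and depth < num_bits:
                    # Split on the bit whose 0/1 partition is most balanced (one simple policy).
                    ones = [sum(d[0][b] for d in descriptors) for b in range(num_bits)]
                    self.bit = min(range(num_bits),
                                   key=lambda b: abs(ones[b] - len(descriptors) / 2))
                    zeros = [d for d in descriptors if d[0][self.bit] == 0]
                    rest = [d for d in descriptors if d[0][self.bit] == 1]
                    self.left = HammingTreeNode(zeros, depth + 1, max_leaf_size, num_bits)
                    self.right = HammingTreeNode(rest, depth + 1, max_leaf_size, num_bits)
                    self.descriptors = None

            def search(self, query, max_distance):
                if self.bit is None:  # leaf: linear scan by Hamming distance
                    best = None
                    for bits, payload in self.descriptors:
                        dist = sum(a != b for a, b in zip(bits, query))
                        if dist <= max_distance and (best is None or dist < best[0]):
                            best = (dist, payload)
                    return best
                child = self.left if query[self.bit] == 0 else self.right
                return child.search(query, max_distance)

        # Example with toy 8-bit descriptors.
        tree = HammingTreeNode([((0, 0, 1, 1, 0, 1, 0, 1), "img_a"),
                                ((1, 1, 0, 0, 1, 0, 1, 0), "img_b")], num_bits=8)
        print(tree.search((1, 1, 0, 0, 1, 0, 1, 1), max_distance=2))  # -> (1, 'img_b')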

    Image Labeling and Classification by Semantic Tag Analysis

    Image classification and retrieval play a significant role in dealing with large multimedia collections on the Internet. Social networks, image-sharing websites and mobile applications require categorizing multimedia items for more efficient search and storage. Therefore, image classification and retrieval methods have gained great importance for researchers and companies. Image classification can be performed in a supervised or semi-supervised manner: in order to categorize an unknown image, a statistical model created from pre-labeled samples is fed with the numerical representation of the visual features of images. A supervised approach requires a set of labeled data to create a statistical model and subsequently classify an unlabeled test set. However, labeling images manually requires a great deal of time and effort. Therefore, a major research activity has gravitated towards finding efficient methods to reduce the time and effort needed for image labeling. Most images on social websites have associated tags that somewhat describe their content. These tags may provide significant content descriptors if a semantic bridge can be established between image content and tags. In this thesis, we focus on cases where accurate class labels are scarce or even absent while associated tags are present. The goal is to analyze and utilize the available tags to categorize database images and form a training dataset over which a dedicated classifier is trained and then used for image classification. Our framework contains a semantic text analysis tool based on WordNet to measure the semantic relatedness between the associated image tags and predefined class labels, and a novel method for labeling the corresponding images. The classifier is trained using only low-level visual image features. The experimental results using 7 classes from the MIRFlickr dataset demonstrate that semantically analyzing the tags attached to images significantly improves image classification accuracy by providing additional training data.
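
    A hedged sketch of the tag-analysis step is given below: it scores the relatedness between a tag and a class label with a WordNet similarity measure and assigns the class only when the best score clears a threshold. The Wu-Palmer measure, the 0.7 threshold and the function names are illustrative assumptions using NLTK, not the thesis' exact tool or parameters.

        # Requires NLTK with the WordNet corpus: nltk.download('wordnet')
        from nltk.corpus import wordnet as wn

        def relatedness(tag, label):
            """Maximum Wu-Palmer similarity over all synset pairs of two words."""
            best = 0.0
            for s1 in wn.synsets(tag):
                for s2 in wn.synsets(label):
                    sim = s1.wup_similarity(s2)
                    if sim is not None and sim > best:
                        best = sim
            return best

        def label_image(tags, class_labels, threshold=0.7):
            """Assign the class whose label is most related to any attached tag."""
            scores = {c: max((relatedness(t, c) for t in tags), default=0.0)
                      for c in class_labels}
            best_class, best_score = max(scores.items(), key=lambda kv: kv[1])
            return best_class if best_score >= threshold else None

        # Example: user tags on a photo vs. predefined classes.
        print(label_image(["puppy", "grass"], ["dog", "car", "sky"]))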

    Vision and language understanding with localized evidence

    Enabling machines to solve computer vision tasks with natural language components can greatly improve human interaction with computers. In this thesis, we address vision and language tasks with deep learning methods that explicitly localize relevant visual evidence. Spatial evidence localization in images enhances the interpretability of the model, while temporal localization in video is necessary to remove irrelevant content. We apply our methods to various vision and language tasks, including visual question answering, temporal activity detection, dense video captioning and cross-modal retrieval. First, we tackle the problem of image question answering, which requires the model to predict answers to questions posed about images. We design a memory network with a question-guided spatial attention mechanism which assigns higher weights to regions that are more relevant to the question. The visual evidence used to derive the answer can be shown by visualizing the attention weights in images. We then address the problem of localizing temporal evidence in videos. For most language/vision tasks, only part of the video is relevant to the linguistic component, so we need to detect these relevant events in videos. We propose an end-to-end model for temporal activity detection, which can detect arbitrary-length activities by coordinate regression with respect to anchors and contains a proposal stage to filter out background segments, saving computation time. We further extend activity category detection to event captioning, which can express richer semantic meaning than a class label. This leads to the problem of dense video captioning, which involves two sub-problems: localizing distinct events in long video and generating captions for the localized events. We propose an end-to-end hierarchical captioning model with vision and language context modeling in which the captioning training affects the activity localization. Lastly, the task of text-to-clip video retrieval requires one to localize the specified query instead of detecting and captioning all events. We propose a model based on the early fusion of words and visual features, outperforming standard approaches which embed the whole sentence before performing late feature fusion. Furthermore, we use queries to regulate the proposal network to generate query-related proposals. In conclusion, our proposed visual localization mechanism applies across a variety of vision and language tasks and achieves state-of-the-art results. Together with the inference module, our work can contribute to solving other tasks such as video question answering in future research.
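
    The question-guided spatial attention can be sketched roughly as follows; the additive scoring form and layer sizes are assumptions for illustration, not the exact memory-network architecture of the thesis.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class QuestionGuidedAttention(nn.Module):
            """Weights image regions by their relevance to an encoded question."""
            def __init__(self, region_dim=2048, question_dim=512, hidden_dim=512):
                super().__init__()
                self.proj_v = nn.Linear(region_dim, hidden_dim)
                self.proj_q = nn.Linear(question_dim, hidden_dim)
                self.score = nn.Linear(hidden_dim, 1)

            def forward(self, regions, question):
                # regions: (B, N, region_dim) grid/region features
                # question: (B, question_dim) question embedding
                q = self.proj_q(question).unsqueeze(1)                     # (B, 1, H)
                joint = torch.tanh(self.proj_v(regions) + q)               # (B, N, H)
                weights = F.softmax(self.score(joint).squeeze(-1), dim=1)  # (B, N)
                attended = (weights.unsqueeze(-1) * regions).sum(dim=1)    # (B, region_dim)
                return attended, weights  # weights can be visualized as spatial evidence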

    A Survey on Video-based Graphics and Video Visualization


    Region-based Multimedia Indexing and Retrieval Framework

    Many systems have been proposed for the automatic description and indexing of digital data for subsequent retrieval. One such content-based indexing-and-retrieval system, and the one used as a framework in this thesis, is the MUVIS system, developed at Tampere University of Technology in Finland. Moreover, Content-based Image Retrieval (CBIR) utilising frame-based and region-based features has been a dynamic research area in recent years. Several systems have been developed using their own specific segmentation, feature extraction, and retrieval methods. In this thesis, a regionalised CBIR framework is presented. The framework does not specify or fix the segmentation and local feature extraction methods, which are instead treated as “black boxes” so as to allow the application of any segmentation method and visual descriptor. The proposed framework adopts a grouping approach in order to correct possible over-segmentation faults, and a spatial feature called region proximity is introduced to describe the topology of regions in a visual scene through a block-based approach. Using the MUVIS system, a prototype of the proposed framework is implemented as a region-based feature extraction module, which integrates simple colour segmentation and region-based feature description based on colour and texture. The spatial region proximity feature represents regions and describes their topology by a novel metric proposed in this thesis based on the block-based approach and average distance calculation. After the region-based feature extraction step, a feature vector is formed which holds information about all image regions together with their local low-level and spatial properties. During the retrieval process, these feature vectors are used for computing the (dis-)similarity distances between two images, taking into account each of their individual components. A many-to-one matching scheme between regions, characterised by a similarity maximisation approach, is integrated into a query-by-example scheme. Retrieval performance is evaluated between frame-based feature combinations and the proposed framework with two different grouping approaches. Experiments are carried out on synthetic and natural image databases, and the results indicate that a promising retrieval performance can be achieved as long as a reasonable segmentation quality is obtained. The integration of the region proximity feature further improves the retrieval performance, especially for divisible, object-based image content. Finally, frame-based and region-based texture extraction schemes are compared to evaluate the effect of a region on texture description and retrieval performance using the proposed framework. Results show that significant degradations in retrieval performance occur for region-based texture descriptors compared with frame-based approaches.
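
    The many-to-one matching by similarity maximisation can be illustrated with the short sketch below, where each query region is matched to its most similar database region and the per-region distances are averaged into an image-level dissimilarity. The Euclidean metric and plain feature vectors are stand-ins for the actual MUVIS colour, texture and region proximity descriptors.

        import numpy as np

        def image_distance(query_regions, db_regions):
            """query_regions: (M, D) and db_regions: (N, D) region feature vectors."""
            diffs = query_regions[:, None, :] - db_regions[None, :, :]
            pairwise = np.linalg.norm(diffs, axis=-1)   # (M, N) region-to-region distances
            best_per_query = pairwise.min(axis=1)       # many-to-one: best match per query region
            return float(best_per_query.mean())         # image-level dissimilarity

        # Example with random 16-dimensional region features.
        rng = np.random.default_rng(0)
        print(image_distance(rng.random((5, 16)), rng.random((7, 16))))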

    MEMEX: Detecting Explanatory Evidence for Memes via Knowledge-Enriched Contextualization

    Memes are a powerful tool for communication over social media. Their affinity for evolving across politics, history, and sociocultural phenomena makes them an ideal communication vehicle. To comprehend the subtle message conveyed within a meme, one must understand the background that facilitates its holistic assimilation. Besides the digital archiving of memes and their metadata by a few websites like knowyourmeme.com, there is currently no efficient way to deduce a meme's context dynamically. In this work, we propose a novel task, MEMEX: given a meme and a related document, the aim is to mine the context that succinctly explains the background of the meme. First, we develop MCC (Meme Context Corpus), a novel dataset for MEMEX. Further, to benchmark MCC, we propose MIME (MultImodal Meme Explainer), a multimodal neural framework that uses a commonsense-enriched meme representation and a layered approach to capture the cross-modal semantic dependencies between the meme and the context. MIME surpasses several unimodal and multimodal systems and yields an absolute improvement of ~4% F1-score over the best baseline. Lastly, we conduct detailed analyses of MIME's performance, highlighting the aspects that could lead to optimal modeling of cross-modal contextual associations. Comment: 9 pages main + 1 page ethics + 3 pages references + 4 pages appendix (total: 17 pages).
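
    As a rough illustration of the evidence-mining setup (not the MIME architecture itself), the sketch below scores each sentence of the related document against a meme embedding and keeps the top-scoring sentences as candidate explanatory context; the random embeddings stand in for whatever image-text and sentence encoders are used.

        import torch
        import torch.nn.functional as F

        def rank_evidence(meme_embedding, sentence_embeddings, top_k=2):
            """meme_embedding: (D,), sentence_embeddings: (S, D); returns top-k sentence indices."""
            scores = F.cosine_similarity(meme_embedding.unsqueeze(0),
                                         sentence_embeddings, dim=-1)   # (S,)
            return torch.topk(scores, k=min(top_k, scores.numel())).indices

        # Example with placeholder embeddings.
        meme = torch.randn(256)
        sentences = torch.randn(10, 256)
        print(rank_evidence(meme, sentences))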