Fusion of Text and Image in Multimedia Information Retrieval System
Multimedia Information Retrieval underpins many everyday applications, most of which handle multimedia data such as images, text, audio and video. A multimedia information retrieval system is used to search for images. Different data can carry the same meaning, a mismatch known as the semantic gap. This problem is addressed by fusing text-based image retrieval with content-based image retrieval. Weighted Mean, OWA and WOWA are the aggregation operators used in this system to fuse the numeric scores derived from text and image. Scale-Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF) are two feature-extraction algorithms; SURF is used to increase the speed of the system. Bag of Words and Bag of Visual Words approaches are used in this system for retrieving images.
DOI: 10.17762/ijritcc2321-8169.15066
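To make the aggregation step concrete, here is a minimal sketch of the Weighted Mean and OWA operators mentioned above, assuming the text-based and content-based branches each produce a relevance score already normalised to [0, 1]; the weight vectors are illustrative choices, not values from the paper (WOWA, not shown, combines both kinds of weights).

```python
import numpy as np

def weighted_mean(scores, p):
    """Plain weighted mean: p[i] is the importance of the i-th source."""
    return float(np.dot(scores, p))

def owa(scores, w):
    """Ordered Weighted Averaging: w[j] weights the j-th LARGEST score,
    so weights attach to rank positions rather than to sources."""
    return float(np.dot(np.sort(scores)[::-1], w))

# Hypothetical fused score for one image: text score 0.9, visual score 0.4.
scores = np.array([0.9, 0.4])
p = np.array([0.6, 0.4])  # source-importance weights (text trusted more)
w = np.array([0.7, 0.3])  # OWA weights favouring the stronger evidence

print(weighted_mean(scores, p))  # 0.9*0.6 + 0.4*0.4 = 0.70
print(owa(scores, w))            # 0.9*0.7 + 0.4*0.3 = 0.75
```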
Unsupervised Visual and Textual Information Fusion in Multimedia Retrieval - A Graph-based Point of View
Multimedia collections are growing in size and diversity more than ever. Effective multimedia retrieval systems are thus critical for end-users to access these datasets in a scalable way. We are interested in repositories of image/text multimedia objects and we study multimodal information fusion techniques in the context of content-based multimedia information retrieval. We focus on graph-based methods, which have been shown to provide state-of-the-art performance. We particularly examine two such methods: cross-media similarities and random-walk-based scores. From a theoretical viewpoint, we propose a unifying graph-based framework which encompasses the two aforementioned approaches. Our proposal allows us to highlight the core features one should consider when using a graph-based technique for the combination of visual and textual information. We compare cross-media and random-walk-based results using three different real-world datasets. From a practical standpoint, our extended empirical analysis allows us to provide insights and guidelines about the use of graph-based methods for multimodal information fusion in content-based multimedia information retrieval.
Comment: an extended version of the paper "Visual and Textual Information Fusion in Multimedia Retrieval using Semantic Filtering and Graph based Methods", by J. Ah-Pine, G. Csurka and S. Clinchant, submitted to ACM Transactions on Information Systems.
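As a rough illustration of the random-walk family of methods discussed above, the sketch below runs a random walk with restart over a fused visual/textual similarity graph. The matrix names, the linear mixing of the two similarity matrices, and the parameter values are assumptions for illustration, not the paper's unified framework.

```python
import numpy as np

def random_walk_scores(S_visual, S_text, query_scores,
                       alpha=0.5, restart=0.15, iters=50):
    """Random walk with restart over a fused similarity graph.

    S_visual, S_text : (n, n) nonnegative similarity matrices over the same
                       n image/text objects (assumed precomputed).
    query_scores     : length-n initial relevance of each object to the query.
    """
    S = alpha * S_visual + (1.0 - alpha) * S_text   # fuse the two modalities
    P = S / S.sum(axis=1, keepdims=True)            # row-stochastic transitions
    r = query_scores / query_scores.sum()           # restart distribution
    x = r.copy()
    for _ in range(iters):                          # power iteration
        x = (1.0 - restart) * (P.T @ x) + restart * r
    return x                                        # walk-based relevance scores
```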
Visual and geographical data fusion to classify landmarks in geo-tagged images
High-level semantic image recognition and classification is a challenging task and currently a very active research domain. Computers struggle to identify objects and scenes within digital images accurately in unconstrained environments. In this paper, we present experiments that aim to overcome the limitations of computer-vision algorithms by combining them with novel contextual features describing geo-tagged imagery. We adopt a machine-learning algorithm with the aim of classifying geographical landmarks within digital images. We use community-contributed image sets downloaded from Flickr and provide a thorough investigation, the results of which are presented in an evaluation section.
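A minimal sketch of how visual and geographical evidence could be fused at the feature level, assuming precomputed visual descriptors and geo-context features; the feature choices and the random-forest classifier are illustrative assumptions, not necessarily the algorithm used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical arrays: v_feats are visual descriptors (e.g. bag-of-visual-words
# histograms), g_feats are contextual geo features (e.g. distance to the
# nearest known landmark, photo density around the geotag). Early fusion is
# simply concatenation before a standard classifier.
rng = np.random.default_rng(0)
v_feats = rng.random((200, 128))          # 200 images, 128-D visual vectors
g_feats = rng.random((200, 4))            # 4 geo-context features per image
labels  = rng.integers(0, 5, size=200)    # 5 landmark classes

X = np.hstack([v_feats, g_feats])         # early (feature-level) fusion
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(clf.predict(X[:3]))                 # predicted landmark classes
```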
Learning to detect video events from zero or very few video examples
In this work we deal with the problem of high-level event detection in video. Specifically, we study the challenging problems of i) learning to detect video events solely from a textual description of the event, without using any positive video examples, and ii) additionally exploiting very few positive training samples together with a small number of "related" videos. For learning only from an event's textual description, we first identify a general learning framework and then study the impact of different design choices at various stages of this framework. For additionally learning from example videos, when true positive training samples are scarce, we employ an extension of the Support Vector Machine that allows us to exploit "related" event videos by automatically introducing different weights for subsets of the videos in the overall training set. Experimental evaluations performed on the large-scale TRECVID MED 2014 video dataset provide insight into the effectiveness of the proposed methods.
Comment: Image and Vision Computing Journal, Elsevier, 2015, accepted for publication.
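A hedged sketch of the few-example setting: per-sample weights in a standard SVM can approximate the described extension that down-weights "related" videos relative to true positives. The data, weight values, and use of scikit-learn's SVC are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical setup: a handful of true positives, some "related" videos
# treated as weak positives, and many negatives (features are random
# stand-ins for real video representations).
rng = np.random.default_rng(1)
X_pos     = rng.normal(1.0, 1.0, (5, 50))    # 5 true positive videos
X_related = rng.normal(0.7, 1.0, (20, 50))   # 20 "related" videos
X_neg     = rng.normal(0.0, 1.0, (200, 50))  # 200 negatives

X = np.vstack([X_pos, X_related, X_neg])
y = np.array([1] * 25 + [0] * 200)
w = np.array([1.0] * 5 + [0.3] * 20 + [1.0] * 200)  # down-weight "related"

clf = SVC(kernel="linear", probability=True)
clf.fit(X, y, sample_weight=w)               # weighted-SVM approximation
```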
TRECVid 2006 experiments at Dublin City University
In this paper we describe our retrieval system and experiments performed for the automatic search task in TRECVid 2006. We submitted the following six automatic runs:
• F A 1 DCU-Base 6: Baseline run using only ASR/MT text features.
• F A 2 DCU-TextVisual 2: Run using text and visual features.
• F A 2 DCU-TextVisMotion 5: Run using text, visual, and motion features.
• F B 2 DCU-Visual-LSCOM 3: Text and visual features combined with concept detectors.
• F B 2 DCU-LSCOM-Filters 4: Text, visual, and motion features with concept detectors.
• F B 2 DCU-LSCOM-2 1: Text, visual, motion, and concept detectors with negative concepts.
The experiments were designed to study the effect of adding motion features and separately constructed semantic concept models to runs using only textual and visual features, and to establish a baseline for the manually-assisted search runs performed within the collaborative K-Space project and described in the corresponding TRECVid 2006 notebook paper. The results of the experiments indicate that the performance of automatic search can be improved with suitable concept models. This is, however, very topic-dependent, and the questions of when to include such models, and which concept models to include, remain unanswered. Secondly, using motion features did not lead to a performance improvement in our experiments. Finally, we observed that our text features, despite their rather poor overall performance, may still be useful even for generic search topics.
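For context, combining retrieval experts such as the text, visual, and motion features above is often done with a weighted linear (CombSUM-style) late fusion. The sketch below shows the general pattern under assumed score normalisation; it is illustrative and not the specific fusion used in these runs.

```python
def fuse_experts(expert_scores, weights):
    """Weighted linear (CombSUM-style) late fusion of retrieval experts.

    expert_scores : dict expert_name -> {shot_id: normalised score}
    weights       : dict expert_name -> fusion coefficient
    """
    fused = {}
    for name, scores in expert_scores.items():
        w = weights.get(name, 0.0)
        for shot, s in scores.items():
            fused[shot] = fused.get(shot, 0.0) + w * s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical scores for three shots from two experts.
ranking = fuse_experts(
    {"text":   {"shot1": 0.9, "shot2": 0.2},
     "visual": {"shot1": 0.4, "shot3": 0.8}},
    {"text": 0.6, "visual": 0.4},
)
print(ranking)  # shot1 first: 0.9*0.6 + 0.4*0.4 = 0.70
```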
Inexpensive fusion methods for enhancing feature detection
Recent successful approaches to high-level feature detection in image and video data have treated the problem as a pattern-classification task. These typically leverage techniques from statistical machine learning, coupled with ensemble architectures that create multiple feature-detection models. Once created, co-occurrence between learned features can be captured to further boost performance. At multiple stages throughout these frameworks, various pieces of evidence can be fused together to boost performance. These approaches, whilst very successful, are computationally expensive and, depending on the task, require significant computational resources. In this paper we propose two fusion methods that aim to combine the output of an initial basic statistical machine-learning approach with a lower-quality information source, in order to gain diversity in the classified results whilst requiring only modest computing resources. Our approaches, validated experimentally on TRECVid data, are designed to be complementary to existing frameworks and can be regarded as possible replacements for the more computationally expensive combination strategies used elsewhere.
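One example of an inexpensive, model-free way to combine a classifier's ranking with a lower-quality source is rank-level (Borda-count) fusion; the sketch below is illustrative and is not claimed to be either of the two methods proposed in the paper.

```python
def borda_fuse(rankings):
    """Borda-count fusion: each ranked list awards points inversely
    proportional to rank position; cheap and model-free."""
    points = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, item in enumerate(ranking):
            points[item] = points.get(item, 0) + (n - pos)
    return sorted(points, key=points.get, reverse=True)

# Hypothetical: a strong classifier's ranking fused with a weaker source.
strong = ["shotA", "shotB", "shotC", "shotD"]
weak   = ["shotB", "shotA", "shotD", "shotC"]
print(borda_fuse([strong, weak]))  # ties broken by first-seen order
```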
TRECVid 2007 experiments at Dublin City University
In this paper we describe our retrieval system and experiments performed for the automatic search task in TRECVid 2007. We submitted the following six automatic runs:
• F A 1 DCU-TextOnly6: Baseline run using only ASR/MT text features.
• F A 1 DCU-ImgBaseline4: Baseline visual expert only run, no ASR/MT used. Made use of query-time generation of retrieval expert coefficients for fusion.
• F A 2 DCU-ImgOnlyEnt5: Automatic generation of retrieval expert coefficients for fusion at index time.
• F A 2 DCU-imgOnlyEntHigh3: Coefficients generated by combining the query-time and index-time approaches, with greater weight given to the index-time coefficient.
• F A 2 DCU-imgOnlyEntAuto2: As above, except that greater weight is given to the query-time coefficient.
• F A 2 DCU-autoMixed1: Query-time expert coefficient generation that used both visual and text experts.
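The "Ent" run names suggest entropy-based coefficient generation. Purely as a speculative illustration, one way to derive a fusion coefficient from an expert's score distribution is to reward peaked (low-entropy) distributions, on the reasoning that a peaked distribution indicates a discriminating expert; this sketch is an assumption, not the method used in the DCU runs.

```python
import numpy as np

def expert_coefficient(scores, eps=1e-12):
    """Map an expert's score distribution to a fusion coefficient in [0, 1]:
    low normalised entropy (peaked scores) -> high coefficient.
    Illustrative assumption only, not the DCU method."""
    p = np.asarray(scores, dtype=float)
    p = p / (p.sum() + eps)                      # normalise to a distribution
    entropy = -np.sum(p * np.log(p + eps))       # Shannon entropy
    return 1.0 - entropy / np.log(len(p))        # 1 = fully peaked, 0 = flat

print(expert_coefficient([0.9, 0.05, 0.03, 0.02]))  # peaked -> large coeff
print(expert_coefficient([0.25, 0.25, 0.25, 0.25])) # flat   -> ~0.0
```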
Zero-Shot Event Detection by Multimodal Distributional Semantic Embedding of Videos
We propose a new zero-shot event detection method based on multi-modal distributional semantic embedding of videos. Our model embeds object and action concepts, as well as other available modalities from videos, into a distributional semantic space. To our knowledge, this is the first zero-shot event detection model built on top of distributional semantics, and it extends them in the following directions: (a) semantic embedding of multimodal information in videos (with a focus on the visual modalities), (b) automatically determining the relevance of concepts/attributes to a free-text query, which could be useful for other applications, and (c) retrieving videos by a free-text event query (e.g., "changing a vehicle tire") based on their content. We embed videos into a distributional semantic space and then measure the similarity between videos and the event query in free-text form. We validated our method on the large TRECVID MED (Multimedia Event Detection) challenge. Using only the event title as a query, our method outperformed the state-of-the-art approach that uses longer event descriptions, improving the MAP metric from 12.6% to 13.5% and ROC-AUC from 0.73 to 0.83. It is also an order of magnitude faster.
Comment: To appear in AAAI 201
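A minimal sketch of the general idea, assuming concept detectors with confidence scores and an off-the-shelf distributional word-vector space: embed a video as the confidence-weighted average of its detected concepts' vectors, embed the query likewise, and rank by cosine similarity. All names and the toy vectors are hypothetical, not the paper's model.

```python
import numpy as np

def embed_video(concept_scores, concept_vectors):
    """Video embedding: confidence-weighted average of the word vectors
    of the concepts detected in it (hypothetical interface)."""
    v = sum(s * concept_vectors[c] for c, s in concept_scores.items())
    return v / np.linalg.norm(v)

def embed_query(query_words, word_vectors):
    """Query embedding: average of in-vocabulary word vectors."""
    v = sum(word_vectors[w] for w in query_words if w in word_vectors)
    return v / np.linalg.norm(v)

# Toy 3-D "word vectors" standing in for a real distributional space.
wv = {"car":   np.array([1.0, 0.1, 0.0]),
      "tire":  np.array([0.9, 0.3, 0.1]),
      "wheel": np.array([0.8, 0.4, 0.0]),
      "dog":   np.array([0.0, 0.1, 1.0])}

video = embed_video({"wheel": 0.8, "car": 0.5}, wv)
query = embed_query(["changing", "tire"], wv)   # OOV words are skipped
print(float(video @ query))                     # cosine similarity (unit vectors)
```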