Reliability and effectiveness of clickthrough data for automatic image annotation
Automatic image annotation using supervised learning is performed by concept classifiers trained on labelled example images. This work proposes the use of clickthrough data collected from search logs as a source for the automatic generation of concept training data, thus avoiding the expensive manual annotation effort. We investigate and evaluate this approach using a collection of 97,628 photographic images. The results indicate that search-log-based training data contribute positively despite their inherent noise; in particular, the combination of manual and automatically generated training data outperforms the use of manual data alone. It is therefore possible to use clickthrough data to perform large-scale image annotation with little manual annotation effort or, depending on the required performance, using only the automatically generated training data. An extensive presentation of the experimental results and the accompanying data can be accessed at http://olympus.ee.auth.gr/~diou/civr2009/
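As a rough illustration of the idea, the sketch below turns raw (query, image, clicks) log records into noisy per-concept positive training sets; the record format, function name, and click threshold are illustrative assumptions, not details from the paper.

```python
from collections import defaultdict

def build_training_sets(click_log, min_clicks=5):
    """Aggregate (query, image_id, clicks) records into noisy positive
    example sets per concept; thresholding on total clicks filters
    some of the noise inherent in search logs."""
    totals = defaultdict(int)
    for query, image_id, clicks in click_log:
        totals[(query.lower(), image_id)] += clicks

    positives = defaultdict(set)  # concept -> image ids
    for (concept, image_id), count in totals.items():
        if count >= min_clicks:
            positives[concept].add(image_id)
    return positives

log = [("beach", "img_001", 7), ("beach", "img_002", 2), ("dog", "img_003", 9)]
print(dict(build_training_sets(log)))
# {'beach': {'img_001'}, 'dog': {'img_003'}}
```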
Image Retrieval Based on Complex Descriptive Queries
The amount of visual data, such as images and videos, available over the web has increased exponentially over the last few years. In order to efficiently organize and exploit these massive collections, a system should be capable not only of answering simple classification-based questions, such as whether a specific object is present (or absent) in an image, but also of searching images and videos based on more complex descriptive questions. There is also a considerable amount of structure in the visual world which, if effectively utilized, can help achieve this goal. To this end, we first present an approach for image ranking and retrieval based on queries consisting of multiple semantic attributes. We further show that significant correlations exist between these attributes and that accounting for them leads to superior performance. Next, we extend this by proposing an image retrieval framework for descriptive queries composed of object categories, semantic attributes, and spatial relationships. The proposed framework also includes a unique multi-view hashing technique, which enables query specification in three different modalities: image, sketch, and text.
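A minimal sketch of attribute-based ranking with a pairwise correlation term, assuming per-image attribute classifier scores and a precomputed attribute correlation matrix are available; the scoring form and the `alpha` weight are illustrative, not the paper's exact model.

```python
import numpy as np

def rank_images(attr_scores, query_attrs, corr, alpha=0.5):
    """Rank images for a query made of several semantic attributes.

    attr_scores: (n_images, n_attrs) classifier confidences
    query_attrs: indices of the queried attributes
    corr:        (n_attrs, n_attrs) attribute correlation matrix
    The pairwise term rewards images whose scores agree with the
    correlations between the queried attributes."""
    unary = attr_scores[:, query_attrs].sum(axis=1)
    pairwise = np.zeros(attr_scores.shape[0])
    for i in query_attrs:
        for j in query_attrs:
            if i < j:
                pairwise += corr[i, j] * attr_scores[:, i] * attr_scores[:, j]
    return np.argsort(-(unary + alpha * pairwise))  # best match first

# Toy query over attributes 0 ("smiling") and 2 ("outdoor")
scores = np.array([[0.9, 0.1, 0.8], [0.2, 0.7, 0.3]])
print(rank_images(scores, [0, 2], np.eye(3)))  # image 0 ranked first
```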
We also demonstrate the effectiveness of leveraging contextual information to reduce the supervision requirements for learning object and scene recognition models. We present an active learning framework to simultaneously learn appearance and contextual models for scene understanding. Within this framework we introduce new kinds of labeling questions that are designed to collect appearance as well as contextual information, and which mimic the way in which humans actively learn about their environment. Furthermore, we explicitly model the contextual interactions between the regions within an image and select the question that leads to the maximum reduction in the combined entropy of all the regions in the image (the image entropy).
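The entropy-based question selection can be sketched as follows; the `posterior_fn` that models how an answer updates the per-region label distributions is a placeholder for whatever contextual model is used.

```python
import numpy as np

def region_entropy(p):
    """Shannon entropy of one region's label distribution."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum()

def image_entropy(region_probs):
    """Combined entropy of all regions in an image."""
    return sum(region_entropy(p) for p in region_probs)

def select_question(region_probs, questions, posterior_fn):
    """Pick the labeling question whose answer reduces the image
    entropy the most, given posterior_fn(q, probs) -> updated probs."""
    h0 = image_entropy(region_probs)
    gains = [h0 - image_entropy(posterior_fn(q, region_probs))
             for q in questions]
    return questions[int(np.argmax(gains))]

# Toy posterior: asking about region q makes its label certain
def toy_posterior(q, probs):
    updated = [p.copy() for p in probs]
    updated[q] = np.eye(len(probs[q]))[0]
    return updated

regions = [np.array([0.5, 0.5]), np.array([0.9, 0.1])]
print(select_question(regions, [0, 1], toy_posterior))  # 0, the most uncertain
```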
Video Content Understanding Using Text
The rise of social media and the video streaming industry has provided a plethora of videos together with corresponding descriptive information in the form of concepts (words) and textual video captions. Given the vast amount of available videos and textual data, this is an opportune time to study computer vision and machine learning problems that relate videos and text. In this dissertation, we tackle multiple problems associated with the joint understanding of videos and text. We first address the task of multi-concept video retrieval, where the input is a set of words as concepts, and the output is a ranked list of full-length videos. This approach deals with multi-concept input and the prolonged length of videos by incorporating multiple latent variables that tie the information within each shot (a short clip of a full video) and across shots. Secondly, we address the problem of video question answering, in which the task is to answer a question, in the form of Fill-In-the-Blank (FIB), given a video. Answering a question amounts to retrieving a word from a dictionary (all possible words suitable as an answer) based on the input question and video. Following the FIB problem, we introduce a new problem, called Visual Text Correction (VTC), i.e., detecting and replacing an inaccurate word in the textual description of a video. We propose a deep network that detects an inaccuracy in a sentence, benefiting from 1D-CNNs/LSTMs to encode short- and long-term dependencies, and fixes it by replacing the inaccurate word(s). Finally, we tackle the problem of video generation from user-input natural language sentences. Our proposed video generation method constructs two distributions from the input text, corresponding to the latent representations of the first and last frames. We generate high-fidelity videos by interpolating these latent representations and decoding them with a sequence of CNN-based up-pooling blocks.
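The first/last-frame interpolation idea can be sketched as follows; the latent dimensionality and the sampling of `z_first`/`z_last` from the text-conditioned distributions are stand-ins, and the CNN decoder is omitted.

```python
import numpy as np

def interpolate_latents(z_first, z_last, n_frames):
    """Linearly interpolate between the latent codes of the first and
    last frames, yielding one latent code per output frame."""
    return [(1 - t) * z_first + t * z_last
            for t in np.linspace(0.0, 1.0, n_frames)]

# z_first / z_last would be sampled from the two text-conditioned
# distributions; a decoder of CNN up-pooling blocks (omitted here)
# would then map each latent code to a frame.
z_first, z_last = np.random.randn(128), np.random.randn(128)
frame_latents = interpolate_latents(z_first, z_last, n_frames=16)
print(len(frame_latents))  # 16
```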
Video metadata extraction in a videoMail system
Currently the world is swiftly adapting to visual communication. Online services like YouTube and Vine show that video is no longer the domain of broadcast television only. Video is used for different purposes such as entertainment, information, education, and communication. The rapid growth of today's video archives, with editorial data only sparsely available, makes retrieval a major problem. Humans perceive a video as a complex interplay of cognitive concepts; as a result, there is a need to build a bridge between numeric values and semantic concepts, a connection that will facilitate video retrieval by humans. The critical aspect of this bridge is video annotation. The process can be done manually or automatically. Manual annotation is tedious, subjective, and expensive, so automatic annotation is being actively studied. In this thesis we focus on automatic annotation of multimedia content, namely the use of information retrieval and analysis techniques to automatically extract metadata from video in a videomail system, including the identification of text, people, actions, spaces, and objects (among them animals and plants). This makes it possible to align the multimedia content with the text of the email message and to create applications for semantic video database indexing and retrieval.
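As a toy illustration of the alignment step, assuming the extracted metadata is already available as a flat list of concept labels (a guess at the representation, not the thesis's actual data model):

```python
import re

def align_with_message(video_concepts, message_text):
    """Return the concepts extracted from the video that also occur
    in the accompanying email message, as simple index terms."""
    words = set(re.findall(r"[a-z0-9]+", message_text.lower()))
    return sorted(c for c in video_concepts if c.lower() in words)

concepts = ["dog", "park", "running", "beach"]
print(align_with_message(concepts, "Look at my dog running in the park!"))
# ['dog', 'park', 'running']
```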
Semantic multimedia analysis using knowledge and context
The difficulty of semantic multimedia analysis can be attributed to the extended diversity in form and appearance exhibited by the majority of semantic concepts and to the difficulty of expressing them using a finite number of patterns. In meeting this challenge there has been a scientific debate on whether the problem should be addressed from the perspective of using overwhelming amounts of training data to capture all possible instantiations of a concept, or from the perspective of using explicit knowledge about the concepts' relations to infer their presence. In this thesis we address three problems of pattern recognition and propose solutions that combine the knowledge extracted implicitly from training data with the knowledge provided explicitly in structured form. First, we propose a Bayesian network (BN) modelling approach that defines a conceptual space where both domain-related evidence and evidence derived from content analysis can be jointly considered to support or disprove a hypothesis. The use of this space leads to significant gains in performance compared to analysis methods that cannot handle combined knowledge. Then, we present an unsupervised method that exploits the collective nature of social media to automatically obtain large amounts of annotated image regions. By proving that the quality of the obtained samples can be almost as good as that of manually annotated images when working with large datasets, we contribute significantly towards scalable object detection. Finally, we introduce a method that treats images, visual features and tags as the three observable variables of an aspect model and extracts a set of latent topics that incorporates the semantics of both the visual and the tag information space. By showing that the cross-modal dependencies of tagged images can be exploited to increase the semantic capacity of the resulting space, we advocate the use of all existing information facets in the semantic analysis of social media.
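As a minimal sketch of how domain evidence and content-analysis evidence can jointly support or disprove a hypothesis in a Bayesian network, here is the simplest case, a naive-Bayes combination of independent evidence; the thesis's actual model is richer than this.

```python
import numpy as np

def hypothesis_posterior(prior, evidence):
    """Naive-Bayes combination of independent evidence for a
    hypothesis H: P(H | e1..en) is proportional to P(H) * prod P(ei | H).
    evidence is a list of (P(ei | H), P(ei | not H)) pairs."""
    p_h = prior * np.prod([l for l, _ in evidence])
    p_not = (1.0 - prior) * np.prod([l for _, l in evidence])
    return p_h / (p_h + p_not)

# One piece of content-analysis evidence and one piece of domain
# evidence, both moderately supporting the hypothesis:
print(hypothesis_posterior(0.3, [(0.8, 0.2), (0.6, 0.4)]))  # 0.72
```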
Learning Transferable Representations for Visual Recognition
In the last half-decade, a new renaissance of machine learning has arisen from the application of convolutional neural networks to visual recognition tasks. It is believed that a combination of big curated data and novel deep learning techniques can lead to unprecedented results. However, even increasingly large training data is still a drop in the ocean compared with scenarios in the wild. In this dissertation, we focus on learning transferable representations in neural networks so that models stay robust even under different data distributions. We present three exemplar topics in three chapters, respectively: zero-shot learning, domain adaptation, and generalizable adversarial attack. With zero-shot learning, we enable models to predict labels not seen in the training phase. With domain adaptation, we improve a model's performance on the target domain by mitigating its discrepancy from a labeled source model, without any target annotation. Finally, the generalizable adversarial attack focuses on learning an adversarial camouflage that would ideally work in every possible scenario. Despite sharing the same transfer-learning philosophy, each of the proposed topics poses a unique challenge requiring a unique solution. In each chapter, we introduce the problem and present our solution. We also discuss other researchers' approaches and compare our solution to theirs in the experiments.
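As a toy sketch of the zero-shot idea, an unseen class can be predicted by matching an image embedding against semantic class embeddings (attribute vectors here); the embeddings and the cosine-similarity choice are illustrative, not the dissertation's specific method.

```python
import numpy as np

def zero_shot_predict(image_emb, class_embs):
    """Assign an image to the unseen class whose semantic embedding
    (e.g., attribute vector) is most similar to the image embedding."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max(class_embs, key=lambda name: cos(image_emb, class_embs[name]))

# Attribute axes: striped, four-legged, aquatic
classes = {"zebra": np.array([1.0, 1.0, 0.0]),
           "whale": np.array([0.0, 0.0, 1.0])}
print(zero_shot_predict(np.array([0.9, 0.8, 0.1]), classes))  # zebra
```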
Interactive video retrieval using implicit user feedback.
In recent years, the rapid development of digital technologies and the low cost of recording media have led to a great increase in the availability of multimedia content worldwide. This availability creates demand for the development of advanced search engines. Traditionally, manual annotation of video was one of the usual practices to support retrieval. However, the vast amounts of multimedia content make such practices very expensive in terms of human effort. At the same time, the availability of low-cost wearable sensors delivers a plethora of user-machine interaction data. There is therefore an important challenge in exploiting implicit user feedback (such as user navigation patterns and eye movements) during interactive multimedia retrieval sessions, with a view to improving video search engines. In this thesis, we focus on automatically annotating video content by exploiting the aggregated implicit feedback of past users, expressed as click-through data and gaze movements. Towards this goal, we have conducted interactive video retrieval experiments in order to collect click-through and eye-movement data in environments that were not strictly controlled. First, we generate semantic relations between the multimedia items by proposing a graph representation of aggregated past interaction data, and we exploit these relations to generate recommendations as well as to improve content-based search. Then, we investigate the role of user gaze movements in interactive video retrieval and propose a methodology for inferring user interest by employing support vector machines and gaze-movement-based features. Finally, we propose an automatic video annotation framework, which combines query clustering into topics, by constructing gaze-movement-driven random forests and temporally enhanced dominant sets, with video shot classification for predicting the relevance of viewed items with respect to a topic. The results show that exploiting heterogeneous implicit feedback from past users is of added value for future users of interactive video retrieval systems.
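As a toy sketch of the graph representation of aggregated click-through data, items co-clicked within a session can be linked and the resulting weights used for recommendation; the session format and co-click weighting are illustrative simplifications of the thesis's graph model.

```python
from collections import defaultdict

def build_click_graph(sessions):
    """Link items clicked within the same retrieval session;
    the edge weight is the co-click frequency across sessions."""
    graph = defaultdict(lambda: defaultdict(int))
    for clicked in sessions:
        for a in clicked:
            for b in clicked:
                if a != b:
                    graph[a][b] += 1
    return graph

def recommend(graph, item, k=3):
    """Return the k items most strongly co-clicked with `item`."""
    ranked = sorted(graph[item].items(), key=lambda kv: -kv[1])
    return [name for name, _ in ranked[:k]]

sessions = [["shot1", "shot2"], ["shot1", "shot2", "shot5"], ["shot2", "shot7"]]
print(recommend(build_click_graph(sessions), "shot2"))
# ['shot1', 'shot5', 'shot7']
```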