104,696 research outputs found

    Multi modal multi-semantic image retrieval

    Get PDF
    PhD thesis. The rapid growth in the volume of visual information, e.g. images and video, can overwhelm users’ ability to find and access the specific visual information of interest to them. In recent years, ontology knowledge-based (KB) image information retrieval techniques have been adopted in order to extract knowledge from these images and enhance retrieval performance. A KB framework is presented to promote semi-automatic annotation and semantic image retrieval using multimodal cues (visual features and text captions). In addition, a hierarchical structure for the KB allows metadata to be shared and supports multiple semantics (polysemy) for concepts. The framework builds up an effective knowledge base pertaining to a domain-specific image collection, e.g. sports, and is able to disambiguate and assign high-level semantics to ‘unannotated’ images. Local feature analysis of visual content, namely using Scale Invariant Feature Transform (SIFT) descriptors, has been deployed in the ‘Bag of Visual Words’ (BVW) model as an effective method to represent visual content information and to enhance its classification and retrieval. Local features are more useful than global features, e.g. colour, shape or texture, as they are invariant to image scale, orientation and camera angle. An innovative approach is proposed for the representation, annotation and retrieval of visual content using a hybrid technique based upon the use of unstructured visual words and a (structured) hierarchical ontology KB model. The structural model facilitates the disambiguation of unstructured visual words and a more effective classification of visual content, compared to a vector space model, by exploiting local conceptual structures and their relationships.

    The key contributions of this framework in using local features for image representation are: first, a method to generate visual words using the semantic local adaptive clustering (SLAC) algorithm, which takes term weights and the spatial locations of keypoints into account, so that semantic information is preserved. Second, a technique to detect the domain-specific ‘non-informative visual words’, which are ineffective at representing the content of visual data and degrade its categorisation. Third, a method to combine an ontology model with a visual word model to resolve synonym (visual heterogeneity) and polysemy problems. The experimental results show that this approach can discover semantically meaningful visual content descriptions and efficiently recognise specific events, e.g. sports events, depicted in images.

    Since discovering the semantics of an image is an extremely challenging problem, one promising approach to enhance visual content interpretation is to use any associated textual information that accompanies an image as a cue to predict its meaning, by transforming this textual information into a structured annotation, e.g. using XML, RDF, OWL or MPEG-7. Although text and images are distinct modalities of information representation, there are strong, invariant, implicit connections between images and any accompanying text. Semantic analysis of image captions can therefore be used by image retrieval systems to retrieve selected images more precisely. To do this, Natural Language Processing (NLP) is first exploited to extract concepts from image captions. Next, an ontology-based knowledge model is deployed to resolve natural language ambiguities. To deal with the accompanying text, two methods to extract knowledge from textual information are proposed. First, metadata can be extracted automatically from text captions and restructured with respect to a semantic model. Second, the use of Latent Semantic Indexing (LSI) in relation to a domain-specific ontology-based knowledge model enables the combined framework to tolerate ambiguities and variations (incompleteness) in metadata. The ontology-based knowledge model allows the system to find indirectly relevant concepts in image captions and thus leverage these to represent the semantics of images at a higher level. Experimental results show that the proposed framework significantly enhances image retrieval and narrows the semantic gap between the lower-level, machine-derived and the higher-level, human-understandable conceptualisation.
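
    The SLAC algorithm is specific to the thesis, but the bag-of-visual-words pipeline it plugs into is standard. The sketch below illustrates that pipeline with a plain k-means codebook standing in for SLAC, plus a document-frequency filter in the spirit of the ‘non-informative visual words’ detection; it assumes opencv-python (with SIFT available) and scikit-learn, and every name and threshold is illustrative rather than the thesis’s own.

```python
# Bag-of-visual-words sketch: SIFT descriptors -> k-means codebook ->
# per-image word histograms. Plain k-means stands in for the thesis's
# SLAC clustering, which is not reproduced here.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def bovw_histograms(image_paths, n_words=200):
    sift = cv2.SIFT_create()
    per_image = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)
        per_image.append(desc if desc is not None
                         else np.empty((0, 128), dtype=np.float32))
    codebook = KMeans(n_clusters=n_words, n_init=10).fit(np.vstack(per_image))
    hists = np.zeros((len(image_paths), n_words))
    for i, desc in enumerate(per_image):
        if len(desc):
            hists[i] = np.bincount(codebook.predict(desc), minlength=n_words)
    return hists, codebook

def non_informative_words(hists, df_threshold=0.9):
    # Visual words occurring in nearly every image discriminate poorly,
    # analogous to textual stop words; flag them for removal.
    doc_freq = (hists > 0).mean(axis=0)
    return np.where(doc_freq > df_threshold)[0]
```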

    Adapting Visual Question Answering Models for Enhancing Multimodal Community Q&A Platforms

    Full text link
    Question categorization and expert retrieval methods have been crucial for information organization and accessibility in community question answering (CQA) platforms. Research in this area, however, has dealt with only the text modality. With the increasing multimodal nature of web content, we focus on extending these methods to CQA questions accompanied by images. Specifically, we leverage the success of representation learning for text and images in the visual question answering (VQA) domain, and adapt the underlying concept and architecture for automated category classification and expert retrieval on image-based questions posted on Yahoo! Chiebukuro, the Japanese counterpart of Yahoo! Answers. To the best of our knowledge, this is the first work to tackle the multimodality challenge in CQA, and to adapt VQA models for tasks on a more ecologically valid source of visual questions. Our analysis of the differences between visual QA and community QA data drives our proposal of novel augmentations of an attention method tailored for CQA, and of auxiliary tasks for learning better grounding features. Our final model markedly outperforms the text-only and VQA model baselines on both tasks of classification and expert retrieval on real-world multimodal CQA data. Comment: Submitted for review at CIKM 201
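
    The paper’s exact architecture is not given in the abstract, so the following is only a hedged sketch of the general idea it builds on: soft attention over image-region features conditioned on the question embedding, fused for category classification. It assumes PyTorch; all layer sizes, the attention form and the class count are illustrative.

```python
# Question-guided attention over image regions, fused with the question
# embedding for CQA category classification (illustrative dimensions).
import torch
import torch.nn as nn

class MultimodalCategoryClassifier(nn.Module):
    def __init__(self, text_dim=768, region_dim=2048, hidden=512, n_classes=20):
        super().__init__()
        self.q_proj = nn.Linear(text_dim, hidden)
        self.v_proj = nn.Linear(region_dim, hidden)
        self.att = nn.Linear(hidden, 1)
        self.classifier = nn.Sequential(
            nn.Linear(hidden * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes))

    def forward(self, question, regions):
        # question: (B, text_dim); regions: (B, R, region_dim)
        qh = self.q_proj(question)                            # (B, H)
        vh = self.v_proj(regions)                             # (B, R, H)
        scores = self.att(torch.tanh(vh + qh.unsqueeze(1)))   # (B, R, 1)
        alpha = torch.softmax(scores, dim=1)                  # attention weights
        v_att = (alpha * vh).sum(dim=1)                       # attended image vector
        return self.classifier(torch.cat([qh, v_att], dim=-1))
```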

    Semantics of video shots for content-based retrieval

    Get PDF
    Content-based video retrieval research combines expertise from many different areas, such as signal processing, machine learning, pattern recognition, and computer vision. As video extends into both the spatial and the temporal domain, we require techniques for the temporal decomposition of footage so that specific content can be accessed. This content may then be semantically classified - ideally in an automated process - to enable filtering, browsing, and searching. An important aspect that must be considered is that pictorial representation of information may be interpreted differently by individual users because it is less specific than its textual representation. In this thesis, we address several fundamental issues of content-based video retrieval for effective handling of digital footage.

    Temporal segmentation, the common first step in handling digital video, is the decomposition of video streams into smaller, semantically coherent entities. This is usually performed by detecting the transitions that separate single camera takes. While abrupt transitions - cuts - can be detected relatively well with existing techniques, effective detection of gradual transitions remains difficult. We present our approach to temporal video segmentation, proposing a novel algorithm that evaluates sets of frames using a relatively simple histogram feature. Our technique has been shown to rank among the best existing shot segmentation algorithms in large-scale evaluations.

    The next step is semantic classification of each video segment to generate an index for content-based retrieval in video databases. Machine learning techniques can be applied effectively to classify video content. However, these techniques require manually classified examples for training before automatic classification of unseen content can be carried out. Manually classifying training examples is not trivial because of the inherent ambiguity of visual content. We propose an unsupervised learning approach based on latent class modelling in which we obtain multiple judgements per video shot and model the users' response behaviour over a large collection of shots. This technique yields a more generic classification of the visual content. Moreover, it enables quality assessment of the classification, and maximises the number of training examples by resolving disagreement. We apply this approach to data from a large-scale, collaborative annotation effort and present ways to improve the effectiveness of manual annotation of visual content through better design and specification of the process.

    Automatic speech recognition techniques, along with semantic classification of video content, can be used to implement video search using textual queries. This requires the application of text search techniques to video and the combination of different information sources. We explore several text-based query expansion techniques for speech-based video retrieval, and propose a fusion method to improve overall effectiveness. To combine both text and visual search approaches, we explore a fusion technique that combines spoken information and visual information using semantic keywords automatically assigned to the footage based on the visual content. The techniques that we propose help to facilitate effective content-based video retrieval and highlight the importance of considering different user interpretations of visual content. This allows better understanding of video content and a more holistic approach to multimedia retrieval in the future.
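
    As a concrete illustration of histogram-based cut detection (the thesis’s algorithm evaluates sets of frames and also targets gradual transitions, which this does not reproduce), here is a minimal pairwise sketch assuming opencv-python; the threshold is illustrative.

```python
# Abrupt-transition (cut) detection: compare colour histograms of
# consecutive frames and flag large jumps as shot boundaries.
import cv2

def detect_cuts(video_path, threshold=0.5):
    cap = cv2.VideoCapture(video_path)
    cuts, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            # Bhattacharyya distance is near 0 for visually similar frames
            d = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if d > threshold:
                cuts.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return cuts
```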

    Image processing and understanding based on graph similarity testing: algorithm design and software development

    Get PDF
    Image processing and understanding is a key task of the human visual system. Among all related topics, content-based image retrieval and classification is the most typical and important problem. Successful image retrieval/classification models require an effective fundamental step of image representation and feature extraction. While traditional methods are not capable of capturing all structural information in an image, using a graph to represent the image is not only biologically plausible but also has certain advantages. Graphs have been widely used in image-related applications. Traditional graph-based image analysis models include pixel-based graph-cut techniques for image segmentation, low-level and high-level image feature extraction based on graph statistics, and other related approaches that utilize the idea of graph similarity testing. To compare images through their graph representations, a graph similarity testing algorithm is essential. Most existing graph similarity measurement tools are not designed for generic tasks such as image classification and retrieval, and other models are either not scalable or not always effective. Graph spectral theory is a powerful analytical tool for capturing and representing structural information of a graph, but using it for image understanding remains a challenge.

    In this dissertation, we focus on developing fast and effective image analysis models based on spectral graph theory and other graph-related mathematical tools. We first propose a fast graph similarity testing method based on the idea of heat content and the mathematical theory of diffusion over manifolds. We then demonstrate the ability of our similarity testing model by comparing random graphs and power-law graphs. Based on our graph analysis model, we develop a graph-based image representation and understanding framework. We first propose the image heat content feature and then discuss several approaches to further improve the model. The first component in our improved framework is a novel graph generation model. The proposed model greatly reduces the size of the traditional pixel-based image graph representation and is shown to still be effective in representing an image. Meanwhile, we propose and discuss several low-level and high-level image features based on spectral graph information, including oscillatory image heat content, weighted eigenvalues and the weighted heat content spectrum. Experiments show that the proposed models are invariant to non-structural changes in images and perform well in standard image classification benchmarks. Furthermore, our image features are robust to small distortions and changes of viewpoint. The model is also capable of capturing important structural information in the image and performs well alone or in combination with other traditional techniques. We then introduce two real-world software development projects that use graph-based image processing techniques. Finally, we discuss the pros, cons and intuition of our proposed model by demonstrating the properties of the proposed image features and the correlations between different image features.
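
    The dissertation’s heat content feature and its exact boundary conditions are not spelled out in the abstract, so the sketch below only illustrates the family of ideas: build a graph from an image, take the normalized Laplacian spectrum, and evaluate a heat-diffusion quantity (here the heat trace, sum_i exp(-t*lambda_i)) at several scales. It assumes NumPy/SciPy, a small grayscale image, and illustrative parameters throughout.

```python
# Spectral features from an image graph: 4-connected grid graph with
# Gaussian intensity-affinity weights, normalized Laplacian spectrum,
# heat trace evaluated at several diffusion times.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import laplacian

def grid_graph(img, sigma=0.1):
    h, w = img.shape
    rows, cols, vals = [], [], []
    for y in range(h):
        for x in range(w):
            i = y * w + x
            for dy, dx in ((0, 1), (1, 0)):      # right and down neighbours
                yy, xx = y + dy, x + dx
                if yy < h and xx < w:
                    j = yy * w + xx
                    wgt = np.exp(-((img[y, x] - img[yy, xx]) ** 2) / sigma)
                    rows += [i, j]; cols += [j, i]; vals += [wgt, wgt]
    return sp.csr_matrix((vals, (rows, cols)), shape=(h * w, h * w))

def heat_trace_features(img, times=(0.1, 1.0, 10.0)):
    A = grid_graph(img.astype(float) / 255.0)
    L = laplacian(A, normed=True)
    lam = np.linalg.eigvalsh(L.toarray())        # dense solve; small images only
    return np.array([np.exp(-t * lam).sum() for t in times])
```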

    Highly efficient low-level feature extraction for video representation and retrieval.

    Get PDF
    PhD thesis. Witnessing the omnipresence of digital video media, the research community has raised the question of its meaningful use and management. Stored in immense multimedia databases, digital videos need to be retrieved and structured in an intelligent way, relying on the content and the rich semantics involved. Current content-based video indexing and retrieval systems face the problem of the semantic gap between the simplicity of the available visual features and the richness of user semantics. This work focuses on the issues of efficiency and scalability in video indexing and retrieval, to facilitate a video representation model capable of semantic annotation. A highly efficient algorithm for temporal analysis and key-frame extraction is developed. It is based on prediction information extracted directly from compressed-domain features and on robust scalable analysis in the temporal domain. Furthermore, a hierarchical quantisation of the colour features in the descriptor space is presented. Derived from the extracted set of low-level features, a video representation model that enables semantic annotation and contextual genre classification is designed. Results demonstrate the efficiency and robustness of the temporal analysis algorithm, which runs in real time while maintaining high precision and recall in the detection task. Adaptive key-frame extraction and summarisation achieve a good overview of the visual content, while the colour quantisation algorithm efficiently creates a hierarchical set of descriptors. Finally, the video representation model, supported by the genre classification algorithm, achieves excellent results in an automatic annotation system by linking video clips with a limited lexicon of related keywords.
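
    The compressed-domain temporal analysis requires a bitstream parser and is not sketched here; the hierarchical colour quantisation, however, lends itself to a short illustration. Below is a hedged sketch, assuming scikit-learn, in which recursive 2-means splits of the pixel colours build coarse-to-fine codebooks; the depth and split rule are illustrative, not the thesis’s exact scheme.

```python
# Hierarchical colour quantisation: recursively split the colour cloud
# with 2-means, yielding one codebook per tree level (coarse to fine).
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_colour_codebooks(pixels, depth=3):
    # pixels: (N, 3) float array of colour values
    levels = [[pixels]]
    for _ in range(depth):
        next_level = []
        for cluster in levels[-1]:
            if len(cluster) < 2:                 # too small to split further
                next_level.append(cluster)
                continue
            labels = KMeans(n_clusters=2, n_init=10).fit_predict(cluster)
            next_level += [cluster[labels == 0], cluster[labels == 1]]
        levels.append(next_level)
    # one centroid per cluster at each level = one codebook per level
    return [np.array([c.mean(axis=0) for c in lvl if len(c)]) for lvl in levels]
```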

    Semantical representation and retrieval of natural photographs and medical images using concept and context-based feature spaces

    Get PDF
    The growth of image content production and distribution across the world has exploded in recent years. This creates a compelling need to develop innovative tools for managing and retrieving images for many applications, such as digital libraries, web image search engines, medical decision support systems, and so on. Until now, content-based image retrieval (CBIR) has addressed the problem of finding images by automatically extracting low-level visual features, such as color, texture and shape, with limited success. The main limitation is the large semantic gap that currently exists between the high-level semantic concepts that users naturally associate with images and the low-level visual features the system relies upon. Research on the retrieval of images by semantic content is still in its infancy. A successful solution to bridge, or at least narrow, the semantic gap requires the investigation of techniques from multiple fields. In addition, specialized retrieval solutions need to emerge, each of which should focus on certain types of image domains, users' search requirements and application objectives.

    This work is motivated by a multi-disciplinary research effort and focuses on semantic-based image search from a domain perspective, with an emphasis on natural photography and biomedical image databases. More precisely, we propose novel image representation and retrieval methods that transform low-level feature spaces into concept-based feature spaces using statistical learning techniques. To this end, we perform supervised classification to model semantic concepts, and unsupervised clustering to construct a codebook of visual concepts, representing images at higher levels of abstraction for effective retrieval. Generalizing upon the vector space model of information retrieval, we also investigate automatic query expansion techniques from a new perspective to reduce the concept mismatch problem, by analyzing correlation information at both local and global levels in a collection. In addition, to perform retrieval at a fully semantic level, we propose an adaptive fusion-based retrieval technique over content- and context-based feature spaces, driven by relevance feedback from users. We developed a prototype image retrieval system as part of the CINDI (Concordia INdexing and DIscovery system) digital library project, performed exhaustive experimental evaluations, and show the effectiveness of our retrieval approaches in both narrow and broad domains of application.
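
    A minimal sketch of the supervised half of this idea, assuming scikit-learn: one probabilistic classifier per semantic concept maps a low-level feature vector to a vector of concept scores, and that score vector is the concept-based representation used for retrieval. The concept list, classifier choice and interface are illustrative, not the CINDI system’s actual design.

```python
# Concept-based feature space: per-concept binary classifiers whose
# probability outputs become the image's semantic representation.
import numpy as np
from sklearn.linear_model import LogisticRegression

class ConceptSpace:
    def __init__(self, concepts):
        self.concepts = concepts
        self.models = {c: LogisticRegression(max_iter=1000) for c in concepts}

    def fit(self, X, labels):
        # X: (N, d) low-level features; labels: concept -> (N,) binary vector
        for c in self.concepts:
            self.models[c].fit(X, labels[c])
        return self

    def transform(self, X):
        # Column k holds P(concept_k | image): the concept-based features.
        return np.column_stack(
            [self.models[c].predict_proba(X)[:, 1] for c in self.concepts])
```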

    Audio-Video Detection and Fusion of Broadcasting Information

    Get PDF
    In the last few decades of multimedia information systems, audio-video data has become a growing part of many digital computer applications, and audio-video classification has become a focus of research in audio-video processing and pattern recognition. Automatic audio-video classification is very useful for audio-video indexing, content-based audio-video retrieval and online audio-video distribution such as online audio-video shopping, but it is a challenge to extract the most similar and salient themes from huge volumes of audio-video data. In this paper, we propose effective algorithms to automatically segment and classify audio-video clips into one of six classes: advertisement, cartoon, songs, serial, movie and news. For these categories, a number of acoustic and visual features, including Mel Frequency Cepstral Coefficients (MFCCs) and color histograms, are extracted to characterize the audio and video data. The autoassociative neural network (AANN) model is used to capture the distribution of the acoustic and visual feature vectors. The AANN model captures the distribution of the acoustic and visual features of a class, and the back-propagation learning algorithm is used to adjust the weights of the network to minimize the mean square error for each feature vector. Keywords: Audio and Video Detection, Audio and Video Fusion, Mel Frequency Cepstral Coefficient, Color Histogram, Autoassociative Neural Network Model (AANN)
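
    The AANN idea lends itself to a short sketch: one autoencoder per class is trained to reconstruct that class’s acoustic/visual feature vectors, and a clip is assigned to the class whose network reconstructs its features with the lowest error. The version below uses scikit-learn’s MLPRegressor (trained by backpropagation) as a stand-in; layer sizes and the bottleneck width are illustrative, not the paper’s configuration.

```python
# One autoassociative (identity-mapping) network per class; classify by
# minimum reconstruction error across the class-specific networks.
import numpy as np
from sklearn.neural_network import MLPRegressor

class AANNClassifier:
    def __init__(self, classes, bottleneck=16):
        self.nets = {c: MLPRegressor(hidden_layer_sizes=(64, bottleneck, 64),
                                     max_iter=2000) for c in classes}

    def fit(self, features_by_class):
        # features_by_class: class name -> (N_c, d) feature matrix
        for c, X in features_by_class.items():
            self.nets[c].fit(X, X)               # learn to reproduce the input
        return self

    def predict(self, X):
        classes = list(self.nets)
        err = np.stack([np.mean((self.nets[c].predict(X) - X) ** 2, axis=1)
                        for c in classes])       # (n_classes, n_samples)
        return [classes[i] for i in err.argmin(axis=0)]
```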

    Learning the structure of image collections with latent aspect models

    Get PDF
    The approach to indexing an image collection depends on the type of data to organize. Satellite images are likely to be searched by latitude and longitude coordinates, medical images are often searched with an image example that serves as a visual query, and personal image collections are generally browsed by event. A more general retrieval scenario is based on the use of textual keywords to search for images containing a specific object, or representing a given scene type. This requires the manual annotation of each image in the collection to allow for the retrieval of relevant visual information based on a text query. This time-consuming and subjective process is the current price to pay for a reliable and convenient text-based image search.

    This dissertation investigates the use of probabilistic models to assist the automatic organization of image collections, attempting to link the visual content of digital images with a potential textual description. Relying on robust, patch-based image representations that have proven to capture a variety of visual content, our work proposes to model images as mixtures of 'latent aspects'. These latent aspects are defined by multinomial distributions that capture patch co-occurrence information observed in the collection. An image is not represented by the direct count of its constituent elements, but as a mixture of latent aspects that can be estimated with principled, generative unsupervised learning methods. An aspect-based image representation therefore incorporates contextual information from the whole collection that can be exploited. This emerging concept is explored for several fundamental tasks related to image retrieval - namely classification, clustering, segmentation, and annotation - in what represents one of the first coherent and comprehensive studies of the subject.

    We first investigate the possibility of classifying images based on their estimated aspect mixture weights, interpreting latent aspect modelling as an unsupervised feature extraction process. Several image categorization tasks are considered, where images are classified based on the objects present or according to their global scene type. We demonstrate that the concept of latent aspects makes it possible to take advantage of non-labeled data to infer a robust image representation that achieves higher classification performance than the original patch-based representation. Second, further exploring the concept, we show that aspects can correspond to an interesting soft clustering of an image collection that can serve as a browsing structure. Images can be ranked given an aspect, illustrating the corresponding co-occurrence context visually. Third, we derive a principled method that relies on latent aspects to classify image patches into different categories, producing an image segmentation based on the resulting spatial class densities. We finally propose to model images and their captions with a single aspect model, merging the co-occurrence contexts of the visual and textual modalities in different ways. Once a model has been learned, the distribution of words given an unseen image is inferred based on its visual representation, and serves as textual indexing. Overall, we demonstrate with extensive experiments that the co-occurrence context captured by latent aspects is suitable for the above-mentioned tasks, making it a promising approach for multimedia indexing.
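
    As a rough illustration of the latent aspect representation (the dissertation uses PLSA-style models; the sketch below substitutes scikit-learn’s LatentDirichletAllocation, a closely related generative aspect model), the inferred per-image aspect mixture weights can directly replace raw visual-word counts as features for classification or clustering:

```python
# Aspect mixtures from visual-word histograms: each image becomes a
# low-dimensional distribution over latent aspects.
from sklearn.decomposition import LatentDirichletAllocation

def aspect_mixtures(counts, n_aspects=20):
    # counts: (n_images, n_visual_words) matrix of visual-word frequencies
    lda = LatentDirichletAllocation(n_components=n_aspects, random_state=0)
    theta = lda.fit_transform(counts)    # (n_images, n_aspects) mixture weights
    return theta, lda
```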