13 research outputs found

    Modeling, classifying and annotating weakly annotated images using bayesian network

    Get PDF
    International audienceIn this paper, we propose a probabilistic graphical model to represent weakly annotated images. We consider an image as weakly annotated if the number of keywords defined for it is less than the maximum number defined in the ground truth. This model is used to classify images and automatically extend existing annotations to new images by taking into account semantic relations between keywords. The proposed method has been evaluated in visual-textual classification and automatic annotation of images. The visualtextual classification is performed by using both visual and textual information. The experimental results, obtained from a database of more than 30000 images, show an improvement by 50.5% in terms of recognition rate against only visual information classification. Taking into account semantic relations between keywords improves the recognition rate by 10.5%. Moreover, the proposed model can be used to extend existing annotations to weakly annotated images, by computing distributions of missing keywords. Semantic relations improve the mean rate of good annotations by 6.9%. Finally, the proposed method is competitive with a state-of-art model

    Flexible photo retrieval (FlexPhoReS) : a prototype for multimodel personal digital photo retrieval

    Get PDF
    Digital photo technology is developing rapidly and is motivating more people to have large personal collections of digital photos. However, effective and fast retrieval of digital photos is not always easy, especially when the collections grow into thousands. World Wide Web (WWW) is one of the platforms that allows digital photo users to publish a collection of photos in a centralised and organised way. Users typically find their photos by searching or browsing uSing a keyboard and mouse. Also in development at the moment are alternative user interfaces such as graphical user interfaces with speech (S/GUI) and other multimodal user interfaces which offer more flexibility to users. The aim of this research was to design and evaluate a flexible user interface for a web based personal digital photo retrieval system. A model of a flexible photo retrieval system (FlexPhoReS) was developed based on a review of the literature and a small-scale user study. A prototype, based on the model, was built using MATLAB and WWW technology. FlexPhoReS is a web based personal digital photo retrieval prototype that enables digital photo users to . accomplish photo retrieval tasks (browsing, keyword and visual example searching (CBI)) using either mouse and keyboard input modalities or mouse and speech input modalities. An evaluation with 20 digital photo users was conducted using usability testing methods. The result showed that there was a significant difference in search performance between using mouse and keyboard input modalities and using mouse and speech input modalities. On average, the reduction in search performance time due to using mouse and speech input modalities was 37.31%. Participants were also significantly more satisfied with mouse and speech input modalities than with mouse and keyboard input modalities although they felt that both were complementary. This research demonstrated that the prototype was successful in providing a flexible model of the photo retrieval process by offering alternative input modalities through a multimodal user interface in the World Wide Web environment.EThOS - Electronic Theses Online ServiceGBUnited Kingdo

    Multi modal multi-semantic image retrieval

    Get PDF
    PhDThe rapid growth in the volume of visual information, e.g. image, and video can overwhelm users’ ability to find and access the specific visual information of interest to them. In recent years, ontology knowledge-based (KB) image information retrieval techniques have been adopted into in order to attempt to extract knowledge from these images, enhancing the retrieval performance. A KB framework is presented to promote semi-automatic annotation and semantic image retrieval using multimodal cues (visual features and text captions). In addition, a hierarchical structure for the KB allows metadata to be shared that supports multi-semantics (polysemy) for concepts. The framework builds up an effective knowledge base pertaining to a domain specific image collection, e.g. sports, and is able to disambiguate and assign high level semantics to ‘unannotated’ images. Local feature analysis of visual content, namely using Scale Invariant Feature Transform (SIFT) descriptors, have been deployed in the ‘Bag of Visual Words’ model (BVW) as an effective method to represent visual content information and to enhance its classification and retrieval. Local features are more useful than global features, e.g. colour, shape or texture, as they are invariant to image scale, orientation and camera angle. An innovative approach is proposed for the representation, annotation and retrieval of visual content using a hybrid technique based upon the use of an unstructured visual word and upon a (structured) hierarchical ontology KB model. The structural model facilitates the disambiguation of unstructured visual words and a more effective classification of visual content, compared to a vector space model, through exploiting local conceptual structures and their relationships. The key contributions of this framework in using local features for image representation include: first, a method to generate visual words using the semantic local adaptive clustering (SLAC) algorithm which takes term weight and spatial locations of keypoints into account. Consequently, the semantic information is preserved. Second a technique is used to detect the domain specific ‘non-informative visual words’ which are ineffective at representing the content of visual data and degrade its categorisation ability. Third, a method to combine an ontology model with xi a visual word model to resolve synonym (visual heterogeneity) and polysemy problems, is proposed. The experimental results show that this approach can discover semantically meaningful visual content descriptions and recognise specific events, e.g., sports events, depicted in images efficiently. Since discovering the semantics of an image is an extremely challenging problem, one promising approach to enhance visual content interpretation is to use any associated textual information that accompanies an image, as a cue to predict the meaning of an image, by transforming this textual information into a structured annotation for an image e.g. using XML, RDF, OWL or MPEG-7. Although, text and image are distinct types of information representation and modality, there are some strong, invariant, implicit, connections between images and any accompanying text information. Semantic analysis of image captions can be used by image retrieval systems to retrieve selected images more precisely. To do this, a Natural Language Processing (NLP) is exploited firstly in order to extract concepts from image captions. Next, an ontology-based knowledge model is deployed in order to resolve natural language ambiguities. To deal with the accompanying text information, two methods to extract knowledge from textual information have been proposed. First, metadata can be extracted automatically from text captions and restructured with respect to a semantic model. Second, the use of LSI in relation to a domain-specific ontology-based knowledge model enables the combined framework to tolerate ambiguities and variations (incompleteness) of metadata. The use of the ontology-based knowledge model allows the system to find indirectly relevant concepts in image captions and thus leverage these to represent the semantics of images at a higher level. Experimental results show that the proposed framework significantly enhances image retrieval and leads to narrowing of the semantic gap between lower level machinederived and higher level human-understandable conceptualisation

    HIERARCHICAL LEARNING OF DISCRIMINATIVE FEATURES AND CLASSIFIERS FOR LARGE-SCALE VISUAL RECOGNITION

    Get PDF
    Enabling computers to recognize objects present in images has been a long standing but tremendously challenging problem in the field of computer vision for decades. Beyond the difficulties resulting from huge appearance variations, large-scale visual recognition poses unprecedented challenges when the number of visual categories being considered becomes thousands, and the amount of images increases to millions. This dissertation contributes to addressing a number of the challenging issues in large-scale visual recognition. First, we develop an automatic image-text alignment method to collect massive amounts of labeled images from the Web for training visual concept classifiers. Specif- ically, we first crawl a large number of cross-media Web pages containing Web images and their auxiliary texts, and then segment them into a collection of image-text pairs. We then show that near-duplicate image clustering according to visual similarity can significantly reduce the uncertainty on the relatedness of Web images’ semantics to their auxiliary text terms or phrases. Finally, we empirically demonstrate that ran- dom walk over a newly proposed phrase correlation network can help to achieve more precise image-text alignment by refining the relevance scores between Web images and their auxiliary text terms. Second, we propose a visual tree model to reduce the computational complexity of a large-scale visual recognition system by hierarchically organizing and learning the classifiers for a large number of visual categories in a tree structure. Compared to previous tree models, such as the label tree, our visual tree model does not require training a huge amount of classifiers in advance which is computationally expensive. However, we experimentally show that the proposed visual tree achieves results that are comparable or even better to other tree models in terms of recognition accuracy and efficiency. Third, we present a joint dictionary learning (JDL) algorithm which exploits the inter-category visual correlations to learn more discriminative dictionaries for image content representation. Given a group of visually correlated categories, JDL simul- taneously learns one common dictionary and multiple category-specific dictionaries to explicitly separate the shared visual atoms from the category-specific ones. We accordingly develop three classification schemes to make full use of the dictionaries learned by JDL for visual content representation in the task of image categoriza- tion. Experiments on two image data sets which respectively contain 17 and 1,000 categories demonstrate the effectiveness of the proposed algorithm. In the last part of the dissertation, we develop a novel data-driven algorithm to quantitatively characterize the semantic gaps of different visual concepts for learning complexity estimation and inference model selection. The semantic gaps are estimated directly in the visual feature space since the visual feature space is the common space for concept classifier training and automatic concept detection. We show that the quantitative characterization of the semantic gaps helps to automatically select more effective inference models for classifier training, which further improves the recognition accuracy rates

    Contributions for the automatic description of multimodal scenes

    Get PDF
    Tese de doutoramento. Engenharia Electrotécnica e de Computadores. Faculdade de Engenharia. Universidade do Porto. 200

    Music information retrieval: conceptuel framework, annotation and user behaviour

    Get PDF
    Understanding music is a process both based on and influenced by the knowledge and experience of the listener. Although content-based music retrieval has been given increasing attention in recent years, much of the research still focuses on bottom-up retrieval techniques. In order to make a music information retrieval system appealing and useful to the user, more effort should be spent on constructing systems that both operate directly on the encoding of the physical energy of music and are flexible with respect to users’ experiences. This thesis is based on a user-centred approach, taking into account the mutual relationship between music as an acoustic phenomenon and as an expressive phenomenon. The issues it addresses are: the lack of a conceptual framework, the shortage of annotated musical audio databases, the lack of understanding of the behaviour of system users and shortage of user-dependent knowledge with respect to high-level features of music. In the theoretical part of this thesis, a conceptual framework for content-based music information retrieval is defined. The proposed conceptual framework - the first of its kind - is conceived as a coordinating structure between the automatic description of low-level music content, and the description of high-level content by the system users. A general framework for the manual annotation of musical audio is outlined as well. A new methodology for the manual annotation of musical audio is introduced and tested in case studies. The results from these studies show that manually annotated music files can be of great help in the development of accurate analysis tools for music information retrieval. Empirical investigation is the foundation on which the aforementioned theoretical framework is built. Two elaborate studies involving different experimental issues are presented. In the first study, elements of signification related to spontaneous user behaviour are clarified. In the second study, a global profile of music information retrieval system users is given and their description of high-level content is discussed. This study has uncovered relationships between the users’ demographical background and their perception of expressive and structural features of music. Such a multi-level approach is exceptional as it included a large sample of the population of real users of interactive music systems. Tests have shown that the findings of this study are representative of the targeted population. Finally, the multi-purpose material provided by the theoretical background and the results from empirical investigations are put into practice in three music information retrieval applications: a prototype of a user interface based on a taxonomy, an annotated database of experimental findings and a prototype semantic user recommender system. Results are presented and discussed for all methods used. They show that, if reliably generated, the use of knowledge on users can significantly improve the quality of music content analysis. This thesis demonstrates that an informed knowledge of human approaches to music information retrieval provides valuable insights, which may be of particular assistance in the development of user-friendly, content-based access to digital music collections

    Geographic information extraction from texts

    Get PDF
    A large volume of unstructured texts, containing valuable geographic information, is available online. This information – provided implicitly or explicitly – is useful not only for scientific studies (e.g., spatial humanities) but also for many practical applications (e.g., geographic information retrieval). Although large progress has been achieved in geographic information extraction from texts, there are still unsolved challenges and issues, ranging from methods, systems, and data, to applications and privacy. Therefore, this workshop will provide a timely opportunity to discuss the recent advances, new ideas, and concepts but also identify research gaps in geographic information extraction

    Discrete language models for video retrieval

    Get PDF
    Finding relevant video content is important for producers of television news, documentanes and commercials. As digital video collections become more widely available, content-based video retrieval tools will likely grow in importance for an even wider group of users. In this thesis we investigate language modelling approaches, that have been the focus of recent attention within the text information retrieval community, for the video search task. Language models are smoothed discrete generative probability distributions generally of text and provide a neat information retrieval formalism that we believe is equally applicable to traditional visual features as to text. We propose to model colour, edge and texture histogrambased features directly with discrete language models and this approach is compatible with further traditional visual feature representations. We provide a comprehensive and robust empirical study of smoothing methods, hierarchical semantic and physical structures, and fusion methods for this language modelling approach to video retrieval. The advantage of our approach is that it provides a consistent, effective and relatively efficient model for video retrieval
    corecore