43 research outputs found
Toward Large Scale Semantic Image Understanding and Retrieval
Semantic image retrieval is a multifaceted, highly complex problem. Not only does the solution to this problem require advanced image processing and computer vision techniques, but it also requires knowledge beyond what can be inferred from the image content alone. In contrast, traditional image retrieval systems are based upon keyword searches on filenames or metadata tags, e.g. Google image search, Flickr search, etc. These conventional systems do not analyze the image content and their keywords are not guaranteed to represent the image. Thus, there is significant need for a semantic image retrieval system that can analyze and retrieve images based upon the content and relationships that exist in the real world.In this thesis, I present a framework that moves towards advancing semantic image retrieval in large scale datasets. At a conceptual level, semantic image retrieval requires the following steps: viewing an image, understanding the content of the image, indexing the important aspects of the image, connecting the image concepts to the real world, and finally retrieving the images based upon the index concepts or related concepts. My proposed framework addresses each of these components in my ultimate goal of improving image retrieval. The first task is the essential task of understanding the content of an image. Unfortunately, typically the only data used by a computer algorithm when analyzing images is the low-level pixel data. But, to achieve human level comprehension, a machine must overcome the semantic gap, or disparity that exists between the image data and human understanding. This translation of the low-level information into a high-level representation is an extremely difficult problem that requires more than the image pixel information. I describe my solution to this problem through the use of an online knowledge acquisition and storage system. This system utilizes the extensible, visual, and interactable properties of Scalable Vector Graphics (SVG) combined with online crowd sourcing tools to collect high level knowledge about visual content.I further describe the utilization of knowledge and semantic data for image understanding. Specifically, I seek to incorporate knowledge in various algorithms that cannot be inferred from the image pixels alone. This information comes from related images or structured data (in the form of hierarchies and ontologies) to improve the performance of object detection and image segmentation tasks. These understanding tasks are crucial intermediate steps towards retrieval and semantic understanding. However, the typical object detection and segmentation tasks requires an abundance of training data for machine learning algorithms. The prior training information provides information on what patterns and visual features the algorithm should be looking for when processing an image. In contrast, my algorithm utilizes related semantic images to extract the visual properties of an object and also to decrease the search space of my detection algorithm. Furthermore, I demonstrate the use of related images in the image segmentation process. Again, without the use of prior training data, I present a method for foreground object segmentation by finding the shared area that exists in a set of images. I demonstrate the effectiveness of my method on structured image datasets that have defined relationships between classes i.e. parent-child, or sibling classes.Finally, I introduce my framework for semantic image retrieval. I enhance the proposed knowledge acquisition and image understanding techniques with semantic knowledge through linked data and web semantic languages. This is an essential step in semantic image retrieval. For example, a car class classified by an image processing algorithm not enhanced by external knowledge would have no idea that a car is a type of vehicle which would also be highly related to a truck and less related to other transportation methods like a train . However, a query for modes of human transportation should return all of the mentioned classes. Thus, I demonstrate how to integrate information from both image processing algorithms and semantic knowledge bases to perform interesting queries that would otherwise be impossible. The key component of this system is a novel property reasoner that is able to translate low level image features into semantically relevant object properties. I use a combination of XML based languages such as SVG, RDF, and OWL in order to link to existing ontologies available on the web. My experiments demonstrate an efficient data collection framework and novel utilization of semantic data for image analysis and retrieval on datasets of people and landmarks collected from sources such as IMDB and Flickr. Ultimately, my thesis presents improvements to the state of the art in visual knowledge representation/acquisition and computer vision algorithms such as detection and segmentation toward the goal of enhanced semantic image retrieval
Recommended from our members
Human machine collaboration for foreground segmentation in images and videos
Foreground segmentation is defined as the problem of generating pixel level foreground masks for all the objects in a given image or video. Accurate foreground segmentations in images and videos have several potential applications such as improving search, training richer object detectors, image synthesis and re-targeting, scene and activity understanding, video summarization, and post-production video editing.
One effective way to solve this problem is human-machine collaboration. The main idea is to let humans guide the segmentation process through some partial supervision. As humans, we are extremely good at perception and can easily identify the foreground regions. Computers, on the other hand, lack this capability, but are extremely good at continuously processing large volumes of data at the lowest level of detail with great efficiency. Bringing these complementary strengths together can lead to systems which are accurate and cost-effective at the same time. However, in any such human-machine collaboration system, cost effectiveness and higher accuracy are competing goals. While more involvement from humans can certainly lead to higher accuracy, it also leads to increased cost both in terms of time and money. On the other hand, relying more on machines is cost-effective, but algorithms are still nowhere near human-level performance. Balancing this cost versus accuracy trade-off holds the key behind success for such a hybrid system.
In this thesis, I develop foreground segmentation algorithms which effectively and efficiently make use of human guidance for accurately segmenting foreground objects in images and videos. The algorithms developed in this thesis actively reason about the best modalities or interactions through which a user can provide guidance to the system for generating accurate segmentations. At the same time, these algorithms are also capable of prioritizing human guidance on instances where it is most needed. Finally, when structural similarity exists within data (e.g., adjacent frames in a video or similar images in a collection), the algorithms developed in this thesis are capable of propagating information from instances which have received human guidance to the ones which did not. Together, these characteristics result in a substantial savings in human annotation cost while generating high quality foreground segmentations in images and videos.
In this thesis, I consider three categories of segmentation problems all of which can greatly benefit from human-machine collaboration. First, I consider the problem of interactive image segmentation. In traditional interactive methods a human annotator provides a coarse spatial annotation (e.g., bounding box or freehand outlines) around the object of interest to obtain a segmentation. The mode of manual annotation used affects both its accuracy and ease-of-use. Whereas existing methods assume a fixed form of input no matter the image, in this thesis I propose a data-driven algorithm which learns whether an interactive segmentation method will succeed if initialized with a given annotation mode. This allows us to predict the modality that will be sufficiently strong to yield a high quality segmentation for a given image and results in large savings in annotation costs. I also propose a novel interactive segmentation algorithm called Click Carving which can accurately segment objects in images and videos using a very simple form of human interaction---point clicks. It outperforms several state-of-the-art methods and requires only a fraction of human effort in comparison.
Second, I consider the problem of segmenting images in a weakly supervised image collection. Here, we are given a collection of images all belonging to the same object category and the goal is to jointly segment the common object from all the images. For this, I develop a stagewise active approach to segmentation propagation: in each stage, the images that appear most valuable for human annotation are actively determined and labeled by human annotators, then the foreground estimates are revised in all unlabeled images accordingly. In order to identify images that, once annotated, will propagate well to other examples, I introduce an active selection procedure that operates on the joint segmentation graph over all images. It prioritizes human intervention for those images that are uncertain and influential in the graph, while also mutually diverse. Building on this, I also introduce the problem of measuring compatibility between image pairs for joint segmentation. I show that restricting the joint segmentation to only compatible image pairs results in an improved joint segmentation performance.
Finally, I propose a semi-supervised approach for segmentation propagation in video. Given human supervision in some frames of a video, this information can be propagated through time. The main challenge is that the foreground object may move quickly in the scene at the same time its appearance and shape evolves over time. To address this, I propose a higher order supervoxel label consistency potential which leverages bottom-up supervoxels to enforce long-range temporal consistency during propagation. I also introduce the notion of a generic pixel-level objectness in images and videos by training a deep neural network which uses appearance and motion to automatically assign a score to each pixel capturing its likelihood to be an "object" or "background". I show that the human guidance in the semi-supervised propagation algorithm can be further augmented with the generic pixel-objectness scores to obtain an even more accurate foreground segmentation in videos.
Throughout, I provide extensive evaluation on challenging datasets and also compare with many state-of-the-art methods and other baselines validating the strengths of proposed algorithms. The outcomes across several different experiments show that the proposed human-machine collaboration algorithms achieve accurate segmentation of foreground objects in images and videos while saving a large amount of human annotation effort.Computer Science
Segmentation multi-vues d'objet
There has been a growing interest for multi-camera systems and many interesting works have tried to tackle computer vision problems in this particular configuration. The general objective is to propose new multi-view oriented methods instead of applying limited monocular approaches independently for each viewpoint. The work in this thesis is an attempt to have a better understanding of the multi-view object segmentation problem and to propose an alternative approach making maximum use of the available information from different viewpoints.Multiple view segmentation consists in segmenting objects simultaneously in several views. Classic monocular segmentation approaches reason on a single image and do not benefit from the presence of several viewpoints. A key issue in that respect is to ensure propagation of segmentation information between views while minimizing complexity and computational cost. In this work, we first investigate the idea that examining measurements at the projections of a sparse set of 3D points is sufficient to achieve this goal. The proposed algorithm softly assigns each of these 3D samples to the scene background if it projects on the background region in at least one view, or to the foreground if it projects on foreground region in all views. A complete probabilistic framework is proposed to estimate foreground/background color models and the method is tested on various datasets from state of the art.Two different extensions of the sparse 3D sampling segmentation framework are proposed in two scenarios. In the first, we show the flexibility of the sparse sampling framework, by using variational inference to integrate Gaussian mixture models as appearance models. In the second scenario, we propose a study of how to incorporate depth measurements in multi-view segmentation. We present a quantitative evaluation, showing that typical color-based segmentation robustness issues due to color-space ambiguity between foreground and background, can be at least partially mitigated by using depth, and that multi-view color depth segmentation also improves over monocular color depth segmentation strategies.The various tests also showed the limitations of the proposed 3D sparse sampling approach which was the motivation to propose a new method based on a richer description of image regions using superpixels. This model, that expresses more subtle relationships of the problem trough a graph construction linking superpixels and 3D samples, is one of the contributions of this work. In this new framework, time related information is also integrated. With static views, results compete with state of the art methods but they are achieved with significantly fewer viewpoints. Results on videos demonstrate the benefit of segmentation propagation through geometric and temporal cues.Finally, the last part of the thesis explores the possibilities of tracking in uncalibrated multi-view scenarios. A summary of existing methods in this field is presented, in both mono-camera and multi-camera scenarios. We investigate the potential of using self-similarity matrices to describe and compare motion in the context of multi-view tracking.L’utilisation de systèmes multi-caméras est de plus en plus populaire et il y a un intérêt croissant à résoudre les problèmes de vision par ordinateur dans ce contexte particulier. L’objectif étant de ne pas se limiter à l’application des méthodes monoculaires mais de proposer de nouvelles approches intrinsèquement orientées vers les systèmes multi-caméras. Le travail de cette thèse a pour objectif une meilleure compréhension du problème de segmentation multi-vues, pour proposer une nouvelle approche qui tire meilleur parti de la redondance d’information inhérente à l’utilisation de plusieurs points de vue.La segmentation multi-vues est l’identification de l’objet observé simultanément dans plusieurs caméras et sa séparation de l’arrière-plan. Les approches monoculaires classiques raisonnent sur chaque image de manière indépendante et ne bénéficient pas de la présence de plusieurs points de vue. Une question clé de la segmentation multi-vues réside dans la propagation d’information sur la segmentation entres les images tout en minimisant la complexité et le coût en calcul. Dans ce travail, nous investiguons en premier lieu l’utilisation d’un ensemble épars d’échantillons de points 3D. L’algorithme proposé classe chaque point comme "vide" s’il se projette sur une région du fond et "occupé" s’il se projette sur une région avant-plan dans toutes les vues. Un modèle probabiliste est proposé pour estimer les modèles de couleur de l’avant-plan et de l’arrière-plan, que nous testons sur plusieurs jeux de données de l’état de l’art.Deux extensions du modèle sont proposées. Dans la première, nous montrons la flexibilité de la méthode proposée en intégrant les mélanges de Gaussiennes comme modèles d’apparence. Cette intégration est possible grâce à l’utilisation de l’inférence variationelle. Dans la seconde, nous montrons que le modèle bayésien basé sur les échantillons 3D peut aussi être utilisé si des mesures de profondeur sont présentes. Les résultats de l’évaluation montrent que les problèmes de robustesse, typiquement causés par les ambigüités couleurs entre fond et forme, peuvent être au moins partiellement résolus en utilisant cette information de profondeur. A noter aussi qu’une approche multi-vues reste meilleure qu’une méthode monoculaire utilisant l’information de profondeur.Les différents tests montrent aussi les limitations de la méthode basée sur un échantillonnage éparse. Cela a montré la nécessité de proposer un modèle reposant sur une description plus riche de l’apparence dans les images, en particulier en utilisant les superpixels. L’une des contributions de ce travail est une meilleure modélisation des contraintes grâce à un schéma par coupure de graphes liant les régions d’images aux échantillons 3D. Dans le cas statique, les résultats obtenus rivalisent avec ceux de l’état de l’art mais sont obtenus avec beaucoup moins de points de vue. Les résultats dans le cas dynamique montrent l’intérêt de la propagation de l’information de segmentation à travers la géométrie et le mouvement.Enfin, la dernière partie de cette thèse explore la possibilité d’améliorer le suivi dans les systèmes multi-caméras non calibrés. Un état de l’art sur le suivi monoculaire et multi-caméras est présenté et nous explorons l’utilisation des matrices d’autosimilarité comme moyen de décrire le mouvement et de le comparer entre plusieurs caméras