7 research outputs found

    Enhanced Random Forest with Image/Patch-Level Learning for Image Understanding

Image understanding is an important research domain in computer vision due to its wide range of real-world applications. For an image understanding framework that uses the Bag-of-Words representation, the visual codebook is an essential component. The random forest (RF), as a tree-structured discriminative codebook, has been a popular choice. However, RF performance can degrade when the local patch labels are poorly assigned. In this paper, we tackle this problem with a novel way to update RF codebook learning: we introduce soft class labels, estimated from a pLSA model through a feedback scheme, to learn a more discriminative codebook. The feedback scheme operates on both the image and patch levels, in contrast to state-of-the-art RF codebook learning that focuses on either the image or the patch level only. Experiments on the 15-Scene and C-Pascal datasets show the effectiveness of the proposed method for image understanding. Comment: Accepted at ICPR 2014 (Oral).
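As a rough illustration of the feedback loop described above, the sketch below approximates pLSA with scikit-learn's LatentDirichletAllocation (an assumption; the paper uses pLSA proper) and feeds image-level topic posteriors back into random forest training as soft patch labels via argmax labels with confidence weights. All data, shapes, and parameters are toy placeholders, not the paper's setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
X_patches = rng.random((2000, 64))         # toy local patch descriptors
y_patches = rng.integers(0, 15, 2000)      # initial (possibly noisy) patch labels
image_ids = np.repeat(np.arange(100), 20)  # 100 images, 20 patches each

# 1) Train the RF codebook on the initial hard patch labels.
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X_patches, y_patches)

# 2) Encode each patch by its leaf indices, pool into a per-image histogram.
leaves = rf.apply(X_patches)               # (n_patches, n_trees)
bow = np.zeros((100, leaves.max() + 1))
np.add.at(bow, (image_ids[:, None], leaves), 1.0)

# 3) A topic model over the histograms yields soft labels (the feedback).
lda = LatentDirichletAllocation(n_components=15, random_state=0)
soft_image = lda.fit_transform(bow)        # image-level topic posteriors
soft_patch = soft_image[image_ids]         # propagated down to patch level

# 4) Retrain the RF with the soft labels: argmax topic as the label,
#    posterior confidence as the sample weight.
rf_refined = RandomForestClassifier(n_estimators=50, random_state=0)
rf_refined.fit(X_patches, soft_patch.argmax(axis=1),
               sample_weight=soft_patch.max(axis=1))
```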

    IST Austria Thesis

The human ability to recognize objects in complex scenes has driven research in computer vision for the past couple of decades. This thesis focuses on the object recognition task in images: given an image, we want the computer system to predict the class of the object that appears in it. A recent successful attempt to bridge the semantic understanding of images by humans and by computers uses attribute-based models. Attributes are semantic properties of objects, shared across different categories, on which both humans and computers can agree. To explore attribute-based models we take a statistical machine learning approach and address two key learning challenges for the object recognition task: learning augmented attributes as a mid-level discriminative feature representation, and learning with attributes as privileged information. Our main contributions are parametric and non-parametric models and algorithms for these two settings. In the parametric approach, we explore an autoencoder model combined with the large-margin nearest neighbor principle for mid-level feature learning, and linear support vector machines for learning with privileged information. In the non-parametric approach, we propose a supervised Indian Buffet Process for automatic augmentation of semantic attributes, and explore the Gaussian process classification framework for learning with privileged information. A thorough experimental analysis shows the effectiveness of the proposed models in both the parametric and non-parametric views.
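Of the learning settings above, attributes as privileged information is the easiest to sketch compactly. The toy example below illustrates the setting via generalized distillation, a stand-in technique (the thesis itself studies SVM and Gaussian process formulations): a teacher trained on privileged features transfers its confidence to a student that sees only the regular features. All data and names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.random((500, 32))        # regular features, available at test time
X_priv = rng.random((500, 8))    # privileged features, training time only
y = rng.integers(0, 2, 500)

# Teacher sees the privileged representation and produces soft beliefs.
teacher = LogisticRegression().fit(X_priv, y)
soft = teacher.predict_proba(X_priv)[:, 1]

# Student trains on regular features only, weighting examples by how
# confidently the privileged teacher separates them (easy vs. hard).
weights = np.abs(soft - 0.5) * 2.0
student = LogisticRegression().fit(X, y, sample_weight=weights)
```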

    Characterizing Objects in Images using Human Context

Humans have an unmatched capability of interpreting detailed information about existent objects by just looking at an image. In particular, they can effortlessly perform the following tasks: 1) localizing various objects in the image and 2) assigning functionalities to the parts of localized objects. This dissertation addresses the problem of helping vision systems accomplish these two goals. The first part of the dissertation concerns object detection in a Hough-based framework. To this end, the independence assumption between features is addressed by grouping them in a local neighborhood. We study the complementary nature of individual and grouped features and combine them to achieve improved performance. Further, we consider the challenging case of detecting small and medium-sized household objects under human-object interactions. We first evaluate appearance-based star and tree models. While the tree model is slightly better, appearance-based methods continue to suffer from deficiencies caused by human interactions. To address this, we successfully incorporate automatically extracted human pose as a form of context for object detection. The second part of the dissertation addresses the tedious process of manually annotating objects to train fully supervised detectors. We observe that videos of human-object interactions with activity labels can serve as weakly annotated examples of household objects. Since such objects cannot be localized through appearance or motion alone, we propose a framework that uses human-centric functionality to retrieve the common object. Designed to maximize data utility by detecting multiple instances of an object per video, the framework achieves performance comparable to its fully supervised counterpart. The final part of the dissertation concerns localizing functional regions, or affordances, within objects by casting the problem as semantic image segmentation. To this end, we introduce a dataset of human-object interactions with strong (pixel-level) and weak (click-point and image-level) affordance annotations. We propose a framework that utilizes both forms of weak labels and demonstrate that the effort of weak annotation can be further optimized using human context.
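To make the Hough-based framework of the first part concrete, here is a minimal voting sketch (offsets and weights are toy values; in the dissertation's setting they would come from learned individual and grouped features): each local feature casts a weighted vote for the object center, and detection is a peak in the accumulator.

```python
import numpy as np

H, W = 100, 100
accumulator = np.zeros((H, W))

# (feature position, voted offset to center, vote weight) triples;
# in a real system these come from a learned codebook, not hand-set values.
votes = [((30, 40), (10, 5), 0.9),
         ((50, 60), (-10, -15), 0.8),
         ((42, 47), (-2, -2), 0.7)]

for (fy, fx), (dy, dx), w in votes:
    cy, cx = fy + dy, fx + dx
    if 0 <= cy < H and 0 <= cx < W:
        accumulator[cy, cx] += w

# Detection = peak in the accumulator.
center = np.unravel_index(accumulator.argmax(), accumulator.shape)
print("detected center:", center)
```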

    Semi-supervised learning for image classification

Object class recognition is an active topic in computer vision that still presents many challenges. In most approaches, this task is addressed by supervised learning algorithms that need a large quantity of labels to perform well. This leads either to small datasets (< 10,000 images) that capture only a subset of the real-world class distribution (but with a controlled and verified labeling procedure), or to large datasets that are more representative but also contain more label noise. Semi-supervised learning is therefore a promising direction: it requires only a few labels while simultaneously making use of the vast number of images available today. We address object class recognition with semi-supervised learning. These algorithms depend on the underlying structure given by the data, the image description, and the similarity measure, as well as on the quality of the labels. This insight leads to the main research questions of this thesis: Is the structure given by labeled and unlabeled data more important than the algorithm itself? Can we improve this neighborhood structure with a better similarity metric or with more representative unlabeled data? Is there a connection between label quality and overall performance, and how can we get more representative labels? We answer all these questions: we provide an extensive evaluation, we propose several graph improvements, and we introduce a novel active learning framework to get more representative labels.
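A minimal example of the graph-based semi-supervised setup this kind of work evaluates, using scikit-learn's LabelSpreading over a k-NN graph (the descriptors, the number of neighbors, and all sizes below are assumptions). Unlabeled images are marked with -1.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(2)
X = rng.random((300, 128))     # toy image descriptors
y = rng.integers(0, 5, 300)    # ground-truth classes (for evaluation only)

y_train = y.copy()
y_train[30:] = -1              # only the first 30 images are labeled

# Propagate labels over the k-NN neighborhood structure.
model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_train)

acc = (model.transduction_[30:] == y[30:]).mean()
print("accuracy on unlabeled images:", acc)
```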

Combining visual recognition and computational linguistics: linguistic knowledge for visual recognition and natural language descriptions of visual content

Extensive efforts are being made to improve visual recognition and the semantic understanding of language. However, surprisingly little has been done to exploit the mutual benefits of combining both fields. In this thesis we show how the two fields of research can profit from each other. First, we scale recognition to 200 unseen object classes and show how to extract robust semantic relatedness from linguistic resources. Our novel approach extends zero-shot to few-shot recognition and exploits unlabeled data by adopting label propagation for transfer learning. Second, we capture the high variability but low availability of composite activity videos by extracting the essential information from text descriptions. For this we recorded and annotated a corpus for fine-grained activity recognition. We show improvements in the supervised case, but we are also able to recognize unseen composite activities. Third, we present a corpus of videos and aligned descriptions. We use it for grounding activity descriptions and for learning how to automatically generate natural language descriptions for a video. We show that our proposed approach is also applicable to image description and that it outperforms baselines and related work. In summary, this thesis presents a novel approach for automatic video description and shows the benefits of extracting linguistic knowledge for object and activity recognition, as well as the advantage of visual recognition for understanding activity descriptions.
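A hedged sketch of the zero-shot idea in the first contribution: images are regressed into a semantic embedding space built from linguistic resources, and unseen classes are predicted by the nearest class embedding. The embeddings and data below are random placeholders, and ridge regression is a simple stand-in rather than the thesis's actual transfer model.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
class_emb = rng.random((10, 20))    # semantic class vectors, e.g. from text
seen, unseen = np.arange(7), np.arange(7, 10)

X_train = rng.random((700, 64))     # toy image features of seen classes
y_train = rng.choice(seen, 700)

# Learn an image -> semantic-space mapping on seen classes only.
reg = Ridge().fit(X_train, class_emb[y_train])

# Zero-shot prediction: nearest unseen-class embedding.
X_test = rng.random((30, 64))
sem = reg.predict(X_test)
dists = ((sem[:, None, :] - class_emb[unseen][None]) ** 2).sum(-1)
pred = unseen[dists.argmin(1)]      # predicted unseen-class labels
```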

    Extracting Structures in Image Collections for Object Recognition

Many computer vision methods rely on annotated image databases without taking advantage of the increasing number of unlabeled images available. This paper explores an alternative approach involving unsupervised structure discovery and semi-supervised learning (SSL) in image collections. Focusing on object classes, the first part of the paper contributes an extensive evaluation of state-of-the-art image representations, underlining the decisive influence of the local neighborhood structure, its direct consequences for SSL results, and the importance of developing powerful object representations. In the second part, we propose and explore promising directions for improving results by looking at the local topology between images and at feature combination strategies.
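One simple feature-combination strategy of the kind the paper explores (the exact scheme below is an assumption): normalize per-feature distance matrices so neither representation dominates, sum them, and build the k-NN graph that the SSL stage then operates on.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(4)
feat_a = rng.random((200, 64))   # e.g. a bag-of-words descriptor (toy)
feat_b = rng.random((200, 32))   # e.g. a color or GIST descriptor (toy)

d_a = pairwise_distances(feat_a)
d_b = pairwise_distances(feat_b)

# Scale-balanced combination so neither distance matrix dominates.
combined = d_a / d_a.mean() + d_b / d_b.mean()

# k-NN graph from the combined distances (skip column 0 = self).
k = 7
knn = np.argsort(combined, axis=1)[:, 1:k + 1]
```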