
    Learning Mid-Level Representations for Visual Recognition

    The objective of this thesis is to enhance visual recognition for objects and scenes through the development of novel mid-level representations and accompanying learning algorithms. In particular, this work focuses on category-level recognition, which remains a very challenging and largely unsolved task. One crucial component of a visual recognition system is the representation of objects and scenes. Depending on the representation, however, suitable learning strategies need to be developed that make it possible to learn new categories automatically from training data. The aim of this thesis is therefore to extend low-level representations with mid-level representations and to develop suitable learning mechanisms.

    A popular kind of mid-level representation is higher-order statistics such as self-similarity and co-occurrence statistics. While these descriptors satisfy the demand for higher-level object representations, they also exhibit very large and ever-increasing dimensionality. This thesis suggests a new object representation, based on curvature self-similarity, that goes beyond the currently popular approximation of objects using straight lines. However, like all descriptors using second-order statistics, it exhibits high dimensionality. Although discriminability improves, the high dimensionality becomes a critical issue due to a lack of generalization ability and the curse of dimensionality. Given only a limited amount of training data, even sophisticated learning algorithms such as the popular kernel methods are unable to suppress noisy or superfluous dimensions of such high-dimensional data. Consequently, there is a natural need for feature selection when using present-day informative features and, particularly, curvature self-similarity. We therefore suggest an embedded feature selection method for support vector machines that reduces complexity and improves the generalization capability of object models. The proposed curvature self-similarity representation, together with the embedded feature selection, is successfully integrated into a widely used state-of-the-art object detection framework.

    The influence of higher-order statistics on category-level object recognition is investigated further by learning co-occurrences between foreground and background to reduce the number of false detections. While the suggested curvature self-similarity descriptor improves the model through a more detailed description of the foreground, higher-order statistics are now shown to be suitable for explicitly modeling the background as well. This is of particular use for the popular chamfer matching technique, since it is prone to accidental matches in dense clutter. As clutter only interferes with the foreground model contour, we learn where to place the background contours with respect to the foreground object boundary. The co-occurrence of background contours is integrated into a max-margin framework; the suggested approach thus combines the advantage of accurately detecting object parts via chamfer matching with the robustness of max-margin learning. While chamfer matching is a very efficient technique for object detection, parts are detected based only on a simple distance measure. In contrast, mid-level parts and patches are explicitly trained to distinguish true positives in the foreground from false positives in the background. Because mid-level patches and parts are independent of each other, it is possible to train a large number of instance-specific part classifiers.
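The embedded feature selection for support vector machines is not spelled out in the abstract; as a hedged illustration of the general idea of selecting informative dimensions inside the max-margin objective itself, the sketch below uses an L1-regularized linear SVM from scikit-learn as a stand-in, with placeholder data for the high-dimensional descriptors.

```python
# Minimal sketch of embedded feature selection in a max-margin
# classifier, using an L1-penalized linear SVM as a stand-in for
# the thesis' method. X and y are hypothetical placeholders.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5000))   # e.g. curvature self-similarity descriptors
y = rng.integers(0, 2, size=200)   # object vs. background labels

# The L1 penalty drives the weights of noisy or superfluous
# dimensions to zero, so selection is embedded in training.
svm = LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=5000).fit(X, y)
selected = np.flatnonzero(svm.coef_ != 0)
print(f"kept {selected.size} of {X.shape[1]} dimensions")
```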
Training many independent part classifiers is in contrast to the currently most powerful discriminative approaches, which are typically feasible only for a small number of parts, since they model the spatial dependencies between them. Because of their number, we cannot directly train a single powerful classifier to combine all parts. Instead, parts are randomly grouped into fewer, overlapping compositions that are trained using a maximum-margin approach. In contrast to the common rationale of compositional approaches, we do not aim for semantically meaningful ensembles. Rather, we seek randomized compositions that are discriminative and generalize over all instances of a category. All compositions are finally combined by a non-linear decision function, which completes the powerful hierarchy of discriminative classifiers. In summary, this thesis improves the visual recognition of objects and scenes by developing novel mid-level representations on top of different kinds of low-level representations. Furthermore, it investigates the development of suitable learning algorithms to deal with the new challenges arising from the novel object representations presented in this work.
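As a rough sketch of this randomized-composition idea (not the thesis' actual pipeline), the snippet below groups per-part classifier scores into overlapping random compositions, trains one max-margin classifier per composition, and combines the composition scores with a non-linear decision function; all names and sizes are hypothetical.

```python
# Hypothetical sketch of randomized, overlapping part compositions
# combined by a non-linear decision function.
import numpy as np
from sklearn.svm import SVC, LinearSVC

rng = np.random.default_rng(1)
n_samples, n_parts, n_comps, comp_size = 300, 500, 20, 50
part_scores = rng.normal(size=(n_samples, n_parts))  # per-part classifier outputs
labels = rng.integers(0, 2, size=n_samples)

# Randomly group parts into fewer, overlapping compositions and
# train one max-margin classifier per composition.
groups = [rng.choice(n_parts, size=comp_size, replace=False)
          for _ in range(n_comps)]
comp_clfs = [LinearSVC(dual=False).fit(part_scores[:, g], labels)
             for g in groups]

# A non-linear (RBF-kernel) decision function on top of the
# composition scores completes the classifier hierarchy.
comp_scores = np.column_stack(
    [clf.decision_function(part_scores[:, g])
     for clf, g in zip(comp_clfs, groups)])
final_clf = SVC(kernel="rbf").fit(comp_scores, labels)
```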

    Beyond the Sum of Parts: Shape-based Object Detection and its Applications

    The grand goal of Computer Vision is to generate an automatic description of an image based on its visual content. Such a description would enable many exciting capabilities, for example searching through images based on their visual content rather than the textual tags attached to them. Images and videos take an ever-increasing share of the total information content in archives and on the internet, so such automatic descriptions would provide powerful tools for organizing and indexing by visual content. Category-level object detection is an important step towards generating such automatic image descriptions. The major part of this thesis addresses the problems encountered in popular lines of approaches that utilize shape in various ways for object detection, namely i) Hough Voting, ii) contour-based object detection, and iii) Chamfer Matching. The problems are tackled using the principle of emergence, which states that the whole is more than the sum of its parts.

    Hough Voting methods are popular because they efficiently handle the high complexity of multi-scale, category-level object detection in cluttered scenes. The primary weakness of this approach, however, is that mutually dependent local observations vote independently for intrinsically global object properties such as object scale. All votes are added up to obtain object hypotheses; the assumption is thus that an object hypothesis is a sum of independent part votes. Popular representation schemes are, however, based on an overlapping sampling of semi-local image features with large spatial support (e.g. SIFT or geometric blur), so features are mutually dependent. The question arises how to incorporate these feature dependencies into the Hough Voting framework. In this thesis, the feature dependencies are modelled by an objective function that combines three intimately related problems: i) grouping of mutually dependent parts, ii) solving the correspondence problem jointly for dependent parts, and iii) finding concerted object hypotheses using extended groups rather than local observations alone.

    While voting with dependent groups brings a significant improvement over standard Hough Voting, interest points are still grouped in the query image during the detection stage. The grouping process can be made robust by grouping densely sampled interest points in training images, yielding contours, and evaluating the utility of contours over the full ensemble of training images. However, contour-based object detection poses significant challenges for category-level object detection in cluttered scenes: object form is an emergent property that cannot be perceived locally but only becomes available once the whole object has been detected and segregated from the background. To tackle this challenge, this thesis addresses the detection of objects and the assembling of their shape simultaneously, while avoiding fragile bottom-up grouping in query images altogether. Rather, the challenging problems of finding meaningful contours and discovering their spatially consistent placement are both shifted into the training stage, where they can be handled better using an ensemble of training samples rather than just a single query image. A dictionary of meaningful contours is then discovered using grouping based on co-activation patterns in all training images. Spatially consistent compositions of all contours are learned using maximum-margin multiple instance learning.
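The last step, learning spatially consistent contour compositions with maximum-margin multiple instance learning, can be illustrated with a toy mi-SVM style alternation; this is a standard MIL heuristic and not necessarily the thesis' exact formulation, and all data below are placeholders.

```python
# Toy sketch of max-margin multiple instance learning for contour
# compositions: each training image is a bag of candidate contour
# placements carrying only an image-level label (mi-SVM style).
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
n_bags, bag_size, dim = 40, 8, 64
bags = rng.normal(size=(n_bags, bag_size, dim))  # candidate contour features
bag_labels = rng.integers(0, 2, size=n_bags)     # image-level labels only

X = bags.reshape(-1, dim)
y = np.repeat(bag_labels, bag_size)              # initial instance labels
for _ in range(5):
    clf = LinearSVC(dual=False, max_iter=5000).fit(X, y)
    y = np.zeros(len(X), dtype=int)
    for i in np.flatnonzero(bag_labels):
        # In each positive bag, only the best-scoring placement is
        # kept as a positive for the next training round.
        best = clf.decision_function(bags[i]).argmax()
        y[i * bag_size + best] = 1
```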
During recognition, objects are detected and their shape is explained simultaneously by optimizing a single cost function. For finding the placement of an object template or one of its parts in an edge map, Chamfer matching is a widely used technique because of its simplicity and speed. However, it treats an object as a mere sum of the distance-transform values of all its contour pixels, leading to spurious matches. This thesis takes account of the fact that boundary pixels are not all equally important by applying a discriminative approach to chamfer distance computation, thereby increasing its robustness. While this improves the behaviour in the foreground, chamfer matching is still prone to accidental responses in spurious background clutter. To estimate the accidentalness of a match, a small dictionary of simple background contours is utilized. These background elements are trained to focus on locations where, relative to the foreground, accidental matches typically occur. Finally, a max-margin classifier is employed to learn the co-placement of all background contours and the foreground template. Both contributions bring significant improvements over state-of-the-art chamfer matching on standard benchmark datasets.

The final part of the thesis presents a case study in which shape-based object representations provided art historians with semantic understanding of medieval manuscripts. For this case study, a novel image dataset has been assembled from illuminations of 15th-century manuscripts, with ground-truth information about various objects of artistic interest such as crowns and swords. An approach has been developed for automatically extracting potential objects (e.g. crowns) from the large image collection and then analysing the intra-class variability of objects by means of a low-dimensional embedding. With the help of the resulting plot, the art historians were able to confirm different artistic workshops within the manuscript and to verify the variations of art within a particular school. Obtaining such insights manually is a tedious task, as one has to go through and analyse all object types on all pages of the manuscript. In addition, a semi-supervised approach has been developed for analysing the variations within an artistic workshop, and extended further to understand the transitions across artistic styles by means of a 1-d ordering of objects.
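As background for the chamfer matching improvements described above, a minimal version of the classic chamfer score is sketched below: the query edge map is distance-transformed, and a template placement is scored by the mean distance under its contour pixels, which is exactly the "sum of contour pixels" scoring the thesis argues against. Edge map and template are placeholders.

```python
# Minimal sketch of classic chamfer matching: score a template
# placement by the average distance-transform value under its
# contour pixels. All inputs are hypothetical placeholders.
import numpy as np
from scipy.ndimage import distance_transform_edt

edges = np.zeros((64, 64), dtype=bool)
edges[32, 10:50] = True                   # toy query edge map
dist = distance_transform_edt(~edges)     # distance to the nearest edge pixel

template = np.array([(0, 0), (0, 5), (0, 10)])  # contour pixels (row, col)

def chamfer_score(dist, template, top_left):
    r0, c0 = top_left
    return dist[template[:, 0] + r0, template[:, 1] + c0].mean()

print(chamfer_score(dist, template, (32, 20)))  # 0.0: lies exactly on the edge
```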

    Stability and Expressiveness of Deep Generative Models

    In recent years, deep learning has revolutionized both machine learning and computer vision. Many classical computer vision tasks (e.g. object detection and semantic segmentation), which traditionally were very challenging, can now be solved using supervised deep learning techniques. While supervised learning is a powerful tool when labeled data is available and the task under consideration has a well-defined output, these conditions are not always satisfied. One promising approach in this case is given by generative modeling. In contrast to purely discriminative models, generative models can deal with uncertainty and learn powerful models even when labeled training data is not available.
However, while current approaches to generative modeling achieve promising results, they suffer from two aspects that limit their expressiveness: (i) some of the most successful approaches to modeling image data are no longer trained using optimization algorithms, but instead employ algorithms whose dynamics were previously not well understood, and (ii) generative models are often limited by the memory requirements of the output representation. We address both problems in this thesis. In the first part, we introduce a theory which enables us to better understand the training dynamics of Generative Adversarial Networks (GANs), one of the most promising approaches to generative modeling. We tackle this problem by introducing minimal example problems of GAN training which can be understood analytically. Subsequently, we gradually increase the complexity of these examples. By doing so, we gain new insights into the training dynamics of GANs and derive new regularizers that also work well for general GANs. Our new regularizers enable us, for the first time, to train a GAN at one-megapixel resolution without having to gradually increase the resolution of the training distribution. In the second part of this thesis, we consider output representations in 3D for generative models and 3D reconstruction techniques. By introducing implicit representations to deep learning, we are able to extend many techniques that work in 2D to the 3D domain without sacrificing their expressiveness.
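The abstract does not spell out the new regularizers; as a hedged illustration of the kind of gradient-based penalty used to stabilize GAN training dynamics, the sketch below implements a gradient penalty on real data in PyTorch. Whether this matches the thesis' exact regularizer is an assumption, and all names are placeholders.

```python
# Hedged sketch of a gradient penalty on real data for stabilizing
# GAN training; whether this is the thesis' exact regularizer is an
# assumption made for illustration.
import torch

def gradient_penalty(discriminator, real_images, weight=10.0):
    x = real_images.detach().requires_grad_(True)
    scores = discriminator(x)
    # Gradient of the discriminator output w.r.t. the real inputs.
    (grad,) = torch.autograd.grad(scores.sum(), x, create_graph=True)
    # Penalizing the squared gradient norm around the data keeps the
    # discriminator from forming steep slopes there, which is one way
    # to stabilize the training dynamics.
    return 0.5 * weight * grad.pow(2).flatten(start_dim=1).sum(dim=1).mean()

# Hypothetical use in the discriminator step of a training loop:
#   d_loss = bce(d(real), ones) + bce(d(fake), zeros)
#   d_loss = d_loss + gradient_penalty(d, real)
```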