    Efficient Decomposition of Image and Mesh Graphs by Lifted Multicuts

    Formulations of the Image Decomposition Problem as a Multicut Problem (MP) w.r.t. a superpixel graph have received considerable attention. In contrast, instances of the MP w.r.t. a pixel grid graph have received little attention, firstly because the MP is NP-hard and instances w.r.t. a pixel grid graph are hard to solve in practice, and secondly because the objective function of the MP lacks long-range terms. We propose a generalization of the MP with long-range terms, the Lifted Multicut Problem (LMP). We design and implement two efficient algorithms (primal feasible heuristics) for the MP and LMP, which allow us to study instances of both problems w.r.t. the pixel grid graphs of the images in the BSDS-500 benchmark. The decompositions we obtain do not differ significantly from the state of the art, suggesting that the LMP is a competitive formulation of the Image Decomposition Problem. To demonstrate the generality of the LMP, we also apply it to the Mesh Decomposition Problem posed by the Princeton benchmark, obtaining state-of-the-art decompositions.

    Multicut Algorithms for Neurite Segmentation

    Correlation clustering, or multicut partitioning, is widely used for image segmentation and graph partitioning. Given an undirected edge-weighted graph with positive and negative weights, correlation clustering partitions the graph such that the sum of cut edge weights is minimized. Since the optimal number of clusters is chosen automatically, multicut partitioning is well suited for clustering neural structures in EM connectomics datasets, where the optimal number of clusters is unknown a priori. Due to the NP-hardness of optimizing the multicut objective, exact solvers do not scale and approximate solvers often give unsatisfactory results. In chapter 2 we investigate scalable methods for correlation clustering. To this end we define fusion moves for the multicut objective function, which iteratively fuse the current and a proposed partitioning and monotonically improve the partitioning. Fusion moves scale to larger datasets, give near-optimal solutions and at the same time show state-of-the-art anytime performance. In chapter 3 we generalize the fusion move framework to the lifted multicut objective, a generalization of the multicut objective which can penalize or reward all decompositions of a graph for which any given pair of nodes are in distinct components. The proposed framework scales well to large datasets and has cutting-edge anytime performance. In chapter 4 we propose a framework for automatic segmentation of neural structures in 3D EM connectomics data, where a membrane probability is predicted for each pixel with a neural network and superpixels are computed based on this probability map. Finally, the superpixels are merged into neurites using the techniques described in chapter 3. The proposed pipeline is validated with an extensive set of experiments and a detailed lesion study. This work substantially narrows the accuracy gap between humans and computers for neurite segmentation. In chapter 5 we summarize the software written for this thesis. The implementations of the algorithms and techniques described in chapters 2 to 4, together with many other algorithms, resulted in a software library for graph partitioning, image segmentation and discrete optimization.
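The multicut objective described in this abstract (minimize the sum of the weights of cut edges, with the number of clusters left free) can be illustrated on a toy graph. The following is a minimal sketch, not the fusion move solver from the thesis: it enumerates every labeling exhaustively, which is only feasible for a handful of nodes. The function names and the three-node example graph are assumptions made for illustration.

```python
from itertools import product


def multicut_cost(edges, labels):
    """Sum of the weights of all edges whose endpoints carry different labels."""
    return sum(w for (u, v), w in edges.items() if labels[u] != labels[v])


def best_partition(nodes, edges):
    """Exhaustive search over node labelings (toy-sized graphs only).

    Positive weights make cutting an edge costly (attractive edge);
    negative weights reward cutting it (repulsive edge).
    """
    best_labels, best_cost = None, float("inf")
    for assignment in product(range(len(nodes)), repeat=len(nodes)):
        labels = dict(zip(nodes, assignment))
        cost = multicut_cost(edges, labels)
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels, best_cost


# Hypothetical example: a and b attract each other, b and c repel strongly.
nodes = ["a", "b", "c"]
edges = {("a", "b"): 2.0, ("b", "c"): -3.0, ("a", "c"): 1.0}
labels, cost = best_partition(nodes, edges)
```

In this example the search keeps `a` and `b` together and separates `c`, because cutting the repulsive edge `b`-`c` outweighs the cost of also cutting `a`-`c`. The number of clusters is never given as input; it falls out of the edge weights, which is the property the abstract highlights.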

    Lifted edges as connectivity priors for multicut and disjoint paths

    This work studies graph decompositions and their representation by 0/1 labeling of edges. We study two problems. The first is multicut (MC), which represents decompositions of undirected graphs (clustering of nodes into connected components). The second is disjoint paths (DP) in directed acyclic graphs, where the clusters correspond to node-disjoint paths. Unlike an alternative representation by node labeling, the number of clusters is not part of the input but is fully determined by the costs of edges. Our main interest is to study connectivity priors represented by so-called lifted edges in the two problems. The cost of a lifted edge expresses whether its endpoints should belong to the same cluster (path) in the optimal decomposition. We call the resulting problems lifted multicut (LMC) and lifted disjoint paths (LDP). The extension of MC to LMC was originally motivated by image segmentation, where the information about the connectivity between non-neighboring pixels or superpixels led to a significant quality improvement. After that, LMC was successfully applied to other problems such as multiple object tracking (MOT), which is also the main application of our proposed LDP model. Our study of lifted multicut concentrates on partial LMC represented by labeling of a subset of (lifted) edges. Given a partial labeling, we conclude that deciding whether a complete LMC consistent with the partial labels exists is NP-complete. Similarly, we conclude that deciding whether an unlabeled edge exists such that its label is determined by the labels of other edges is NP-hard. After that, we present metrics for comparing (partial) graph decompositions. Finally, we study the properties of the LMC polytope. The largest part of this work is dedicated to the proposed LDP problem. We prove that this problem is NP-hard and propose an optimal integer linear programming (ILP) solver. 
In order to enable its global optimization, we formulate several classes of linear inequalities that produce a high-quality LP relaxation. Additionally, we propose efficient cutting plane algorithms for separating the proposed linear inequalities. Despite the advanced constraints and efficient separation routines, the general time complexity of our optimal ILP solver remains exponential. In order to solve even larger instances, we introduce an approximate LDP solver based on Lagrange decomposition. LDP is a convenient model for MOT because the underlying disjoint paths model naturally leads to trajectories of objects. Moreover, lifted edges encode long-range temporal interactions and thus help to prevent id switches and re-identify persons. Our tracker using the optimal LDP solver achieves nearly optimal assignments w.r.t. input detections. Consequently, it was a leading tracker on the three MOT challenge benchmarks MOT15/16/17, improving significantly over the state of the art at the time of its publication. Our approximate LDP solver enables us to process the MOT15/16/17 benchmarks without sacrificing solution quality and allows for solving the large and dense instances of the challenging MOT20 dataset. On all four of these standard MOT benchmarks we achieved performance comparable to or better than state-of-the-art methods (at the time of publication), including our tracker based on the optimal LDP solver.

    Towards accurate multi-person pose estimation in the wild

    In this thesis we are concerned with the problem of articulated human pose estimation and pose tracking in images and video sequences. Human pose estimation is the task of localising the major joints of a human skeleton in natural images. It is one of the most important visual recognition tasks in scenes containing humans, with numerous applications in robotics, virtual and augmented reality, gaming and healthcare, among others. Articulated human pose tracking requires tracking multiple persons in a video sequence while simultaneously estimating full body poses. This task is important for analysing surveillance footage, activity recognition, sports analytics, etc. Most of the prior work focused on the pose estimation of single pre-localised humans, whereas here we address the case of multiple people in real-world images, which entails several challenges such as person-person overlaps in highly crowded scenes, an unknown number of people, or people entering and leaving video sequences. The first contribution is a multi-person pose estimation algorithm based on the bottom-up detection-by-grouping paradigm. Unlike the widespread top-down approaches, our method detects body joints and pairwise relations between them in a single forward pass of a convolutional neural network. Multi-person parsing is performed by optimizing a joint objective based on a multicut graph partitioning framework. Secondly, we extend our pose estimation approach to articulated multi-person pose tracking in videos. Our approach performs multi-target tracking and pose estimation in a holistic manner by optimising a single objective. We further simplify and refine the formulation, which allows us to reach close to real-time performance. Thirdly, we propose a large-scale dataset and a benchmark for articulated multi-person tracking. It is the first dataset of video sequences comprising complex multi-person scenes and fully annotated tracks with 2D keypoints. 
Our fourth contribution is a method for estimating 3D body pose using on-body wearable cameras. Our approach uses a pair of downward-facing, head-mounted cameras and captures the entire body. This egocentric approach is free of the limitations of traditional setups with external cameras and can estimate body poses in very crowded environments. Our final contribution goes beyond human pose estimation and is in the field of deep learning of 3D object shapes. In particular, we address the case of reconstructing 3D objects from weak supervision. Our approach represents objects as 3D point clouds and is able to learn them with 2D supervision only, without requiring camera pose information at training time. We design a differentiable renderer of point clouds as well as a novel loss formulation for dealing with camera pose ambiguity.

    Towards Accurate and Efficient Cell Tracking During Fly Wing Development

    Understanding the development, organization, and function of tissues is a central goal in developmental biology. With modern time-lapse microscopy, it is now possible to image entire tissues during development and thereby localize subcellular proteins. A particularly productive area of research is the study of single-layer epithelial tissues, which can be simply described as a 2D manifold. For example, the apical band of cell adhesions in epithelial cell layers actually forms a 2D manifold within the tissue and provides a 2D outline of each cell. The Drosophila melanogaster wing has become an important model system, because its 2D cell organization has the potential to reveal mechanisms that create the final fly wing shape. Other examples include structures that naturally localize at the surface of the tissue, such as the ciliary components of planarians. Data from these time-lapse movies typically consists of mosaics of overlapping 3D stacks. This is necessary because the surface of interest exceeds the field of view of today's microscopes. To quantify cellular tissue dynamics, these mosaics need to be processed in three main steps: (a) extracting, correcting, and stitching individual stacks into a single, seamless 2D projection per time point, (b) obtaining cell characteristics that occur at individual time points, and (c) determining cell dynamics over time. It is therefore necessary that the applied methods are capable of handling large amounts of data efficiently, while still producing accurate results. This task is made especially difficult by the low signal-to-noise ratios that are typical in live-cell imaging. In this PhD thesis, I develop algorithms that cover all three processing tasks mentioned above and apply them in the analysis of polarity and tissue dynamics in large epithelial cell layers, namely the Drosophila wing and the planarian epithelium. First, I introduce an efficient pipeline that preprocesses raw image mosaics. This pipeline accurately extracts the stained surface of interest from each raw image stack and projects it onto a single 2D plane. It then corrects uneven illumination, aligns all mosaic planes, and adjusts brightness and contrast before finally stitching the processed images together. This preprocessing not only significantly reduces the data quantity, but also simplifies downstream data analyses. Here, I apply this pipeline to datasets of the developing fly wing as well as a planarian epithelium. I additionally address the problem of determining cell polarities in chemically fixed samples of planarians. Here, I introduce a method that automatically estimates cell polarities by computing the orientation of rootlets in motile cilia. With this technique one can for the first time routinely measure and visualize how tissue polarities are established and maintained in entire planarian epithelia. Finally, I analyze cell migration patterns in the entire developing wing tissue in Drosophila. At each time point, cells are segmented using a progressive merging approach with merging criteria that take typical cell shape characteristics into account. The method enforces biologically relevant constraints to improve the quality of the resulting segmentations. For cases where full cell tracking is desired, I introduce a pipeline using a tracking-by-assignment approach. This allows me to link cells over time while considering critical events such as cell divisions or cell death. This work presents a very accurate large-scale cell tracking pipeline and opens up many avenues for further study, including several in-vivo perturbation experiments as well as biophysical modeling. The methods introduced in this thesis are examples of computational pipelines that catalyze biological insights by enabling the quantification of tissue-scale phenomena and dynamics. I not only provide detailed descriptions of the methods, but also show how they perform on concrete biological research projects.

    Applications of Markov Random Field Optimization and 3D Neural Network Pruning in Computer Vision

    Recent years have witnessed the rapid development of Convolutional Neural Networks (CNNs) in various computer vision applications that were traditionally addressed by Markov Random Field (MRF) optimization methods. Even though CNN-based methods achieve high accuracy in these tasks, fine-grained results remain difficult to achieve. For instance, a pairwise MRF optimization method is capable of segmenting objects with auxiliary edge information through the second-order terms, which a deep neural network is not guaranteed to achieve. MRF optimization methods, however, are able to enhance the performance with explicit theoretical and experimental support using iterative energy minimization. Secondly, such an edge detector can be learned by CNNs, and thus transferring a CNN trained for one task to another task becomes valuable. It is desirable to fuse the superpixel contours from a state-of-the-art CNN with semantic segmentation results from another state-of-the-art CNN, so that the fusion aligns the object contours in the semantic segmentation with the superpixel contours. This kind of fusion is not limited to semantic segmentation but also applies to other tasks that benefit from the collective effect of multiple off-the-shelf CNNs. While fusing multiple CNNs is useful to enhance the performance, each of these CNNs is usually specifically designed and trained with an empirical configuration of resources. With such a large batch size, however, joint CNN training may run out of GPU memory. This problem commonly arises when CNNs must be trained efficiently with limited resources. The issue is more obvious and severe in 3D CNNs than in 2D CNNs due to their high requirement of training resources. To solve the first problem, we propose two fast and differentiable message passing algorithms, namely Iterative Semi-Global Matching Revised (ISGMR) and Parallel Tree-Reweighted Message Passing (TRWP), for both energy minimization problems and deep learning applications. Our experiments on a stereo vision dataset and an image inpainting dataset validate the effectiveness and efficiency of our methods, with minimum energies comparable to the state-of-the-art algorithm TRWS and greatly improved forward and backward propagation speed using CUDA programming on massively parallel trees. Applying these two methods to deep learning semantic segmentation on PASCAL VOC 2012 with Canny edges achieves enhanced segmentation results measured by mean Intersection over Union (mIoU). For the second problem, to effectively fuse and finetune multiple CNNs, we present a transparent initialization module that identically maps the output of a multiple-layer module to its input at the early stage of finetuning. The pretrained model parameters then gradually diverge in training as the loss decreases. This transparent initialization has a higher initialization rate than Net2Net and a higher recovery rate compared with random initialization and Xavier initialization. Our experiments validate the effectiveness of the proposed transparent initialization and the sparse encoder with sparse matrix operations. The edges of segmented objects achieve a higher performance ratio and a higher F-measure than other comparable methods. For the third problem, to compress a CNN effectively, especially resource-inefficient 3D CNNs, we propose a single-shot neuron pruning method with resource constraints. The pruning principle is to remove the neurons with low neuron importance, corresponding to small connection sensitivities. The reweighting strategy with the layerwise consumption of memory or FLOPs improves the pruning ability by avoiding infeasible pruning of whole layers. Our experiments on the point cloud dataset ShapeNet and the medical image dataset BraTS'18 demonstrate the effectiveness of our method. Applying our method to video classification on the UCF101 dataset using MobileNetV2 and I3D further confirms the benefits of our method.

    People detection and tracking in crowded scenes

    People are often a central element of visual scenes, particularly in real-world street scenes. Thus it has been a long-standing goal in Computer Vision to develop methods aiming at analyzing humans in visual data. Due to the complexity of real-world scenes, visual understanding of people remains challenging for machine perception. In this thesis we focus on advancing the techniques for people detection and tracking in crowded street scenes. We also propose new models for human pose estimation and motion segmentation in realistic images and videos. First, we propose detection models that are jointly trained to detect single persons as well as pairs of people under varying degrees of occlusion. The learning algorithm of our joint detector facilitates a tight integration of tracking and detection, because it is designed to address common failure cases during tracking due to long-term inter-object occlusions. Second, we propose novel multi-person tracking models that formulate tracking as a graph partitioning problem. Our models jointly cluster detection hypotheses in space and time, eliminating the need for a heuristic non-maximum suppression. Furthermore, for crowded scenes, our tracking model encodes long-range person re-identification information into the detection clustering process in a unified and rigorous manner. Third, we explore the visual tracking task at different granularities. We present a tracking model that simultaneously clusters object bounding boxes and pixel-level trajectories over time. This approach provides a rich understanding of the motion of objects in the scene. Last, we extend our tracking model to the multi-person pose estimation task. We introduce a joint subset partitioning and labelling model where we simultaneously estimate the poses of all the people in the scene. In summary, this thesis addresses a number of diverse tasks that aim to enable vision systems to analyze people in realistic images and videos. In particular, the thesis proposes several novel ideas and rigorous mathematical formulations, pushes the boundary of the state of the art and results in superior performance.

    Lifting of Multicuts

