16 research outputs found

    MTRNet: A Generic Scene Text Eraser

    Full text link
    Text removal algorithms have been proposed for uni-lingual scripts with regular shapes and layouts. However, to the best of our knowledge, a generic text removal method that can remove all or user-specified text regions regardless of font, script, language, or shape is not available. Developing such a generic text eraser for real scenes is a challenging task, since it inherits all the challenges of multi-lingual and curved text detection and inpainting. To fill this gap, we propose a mask-based text removal network (MTRNet). MTRNet is a conditional generative adversarial network (cGAN) with an auxiliary mask. The introduced auxiliary mask not only makes the cGAN a generic text eraser, but also enables stable training and early convergence on a challenging large-scale synthetic dataset initially proposed for text detection in real scenes. Moreover, MTRNet achieves state-of-the-art results on several real-world datasets, including ICDAR 2013, ICDAR 2017 MLT, and CTW1500, outperforming previous state-of-the-art methods trained directly on these datasets, without being explicitly trained on them. Comment: Presented at the ICDAR 2019 conference.
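As a rough illustration of the mask-conditioning idea, the sketch below concatenates a binary text mask to an RGB image as a fourth channel, the kind of conditional input a mask-conditioned generator would consume. This is a NumPy sketch of our own; the function name and channel layout are assumptions, not taken from the paper.

```python
import numpy as np

def make_generator_input(image, text_mask):
    """Concatenate an RGB image (H, W, 3) with a binary text mask (H, W)
    along the channel axis, yielding the (H, W, 4) conditional input for
    a mask-conditioned generator."""
    if image.shape[:2] != text_mask.shape:
        raise ValueError("image and mask must share spatial dimensions")
    mask = text_mask.astype(image.dtype)[..., None]  # (H, W, 1)
    return np.concatenate([image, mask], axis=-1)    # (H, W, 4)

# Toy example: a 4x4 image with text marked at the top-left pixel.
img = np.random.rand(4, 4, 3)
msk = np.zeros((4, 4))
msk[0, 0] = 1.0
x = make_generator_input(img, msk)
print(x.shape)  # (4, 4, 4)
```

Because the mask travels with the image through every layer, the generator can be pointed at all text or only at user-specified regions by editing the mask alone.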

    Video inpainting for non-repetitive motion

    Get PDF
    Master's thesis (Master of Science)

    Towards Scalable Unpaired Virtual Try-On via Patch-Routed Spatially-Adaptive GAN

    Get PDF
    Source at https://proceedings.neurips.cc/paper/2021/hash/151de84cca69258b17375e2f44239191-Abstract.html. Image-based virtual try-on is one of the most promising applications of human-centric image generation due to its tremendous real-world potential. Yet, as most try-on approaches fit in-shop garments onto a target person, they require the laborious and restrictive construction of a paired training dataset, severely limiting their scalability. While a few recent works attempt to transfer garments directly from one person to another, alleviating the need to collect paired datasets, their performance is impacted by the lack of paired (supervised) information. In particular, disentangling the style and spatial information of the garment becomes a challenge, which existing methods address either by requiring auxiliary data or through extensive online optimization procedures, thereby still inhibiting their scalability. To achieve a scalable virtual try-on system that can transfer arbitrary garments between a source and a target person in an unsupervised manner, we thus propose a texture-preserving end-to-end network, the PAtch-routed SpaTially-Adaptive GAN (PASTA-GAN), that facilitates real-world unpaired virtual try-on. Specifically, to disentangle the style and spatial information of each garment, PASTA-GAN consists of an innovative patch-routed disentanglement module for successfully retaining garment texture and shape characteristics. Guided by the source person's keypoints, the patch-routed disentanglement module first decouples garments into normalized patches, thus eliminating the inherent spatial information of the garment, and then reconstructs the normalized patches into the warped garment, complying with the target person's pose. Given the warped garment, PASTA-GAN further introduces novel spatially-adaptive residual blocks that guide the generator to synthesize more realistic garment details.
Extensive comparisons with paired and unpaired approaches demonstrate the superiority of PASTA-GAN, highlighting its ability to generate high-quality try-on images when faced with a large variety of garments (e.g., vests, shirts, pants), taking a crucial step towards real-world scalable try-on.
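The patch-routing idea can be caricatured in a few lines: cut a fixed-size window around each pose keypoint so the patch retains garment texture but loses its absolute position. The NumPy sketch below is our simplification; the function and parameter names are hypothetical, and the warping and reconstruction steps are omitted.

```python
import numpy as np

def route_patches(garment, keypoints, size=2):
    """Extract a (2*size x 2*size) window around each keypoint, discarding
    the patch's absolute position in the image -- a toy stand-in for the
    patch-routed disentanglement step."""
    patches = []
    for (y, x) in keypoints:
        y0 = max(0, y - size)
        x0 = max(0, x - size)
        patches.append(garment[y0:y0 + 2 * size, x0:x0 + 2 * size].copy())
    return patches

# Toy garment image and two keypoints.
garment = np.arange(64, dtype=float).reshape(8, 8)
pts = [(2, 2), (5, 5)]
normalized = route_patches(garment, pts)
print([p.shape for p in normalized])  # [(4, 4), (4, 4)]
```

Once patches are normalized this way, only the target pose (via the keypoints) reintroduces spatial layout, which is the disentanglement the abstract describes.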

    Novel Video Completion Approaches and Their Applications

    Get PDF
    Video completion refers to automatically restoring damaged or removed objects in a video sequence, with applications ranging from the removal of undesired static or dynamic objects to the correction of missing or corrupted frames in old movies and the synthesis of new frames to add, modify, or generate a new visual story. The video completion problem can be solved using texture synthesis and/or data interpolation to fill in the holes of the sequence. This thesis makes a distinction between still image completion and video completion: the latter requires visually pleasing consistency by taking the temporal information into account. Based on their underlying concepts, video completion techniques are categorized as inpainting or texture synthesis, and we present a bandlet transform-based technique for each category. The proposed inpainting-based technique is a 3D volume regularization scheme that takes advantage of bandlet bases to exploit anisotropic regularities when reconstructing a damaged video. The proposed exemplar-based approach, on the other hand, performs video completion using precise patch fusion in the bandlet domain instead of patch replacement. The video completion task is extended to two important applications in video restoration. First, we develop an automatic video text detection and removal method that benefits from the proposed inpainting scheme and a novel video text detector. Second, we propose a novel video super-resolution technique that employs the inpainting algorithm spatially in conjunction with an effective structure tensor generated using bandlet geometry. The experimental results show good performance of the proposed video inpainting method and demonstrate the effectiveness of bandlets in video completion tasks. The proposed video text detector and video super-resolution scheme also perform well in comparison with existing methods.
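The matching step at the heart of exemplar-based completion, finding the source patch most similar to a target region, can be sketched as a sum-of-squared-differences search. This is a plain-pixel-domain simplification of the thesis's bandlet-domain patch fusion, with names of our own choosing.

```python
import numpy as np

def best_match(target, candidates):
    """Return the index of the candidate patch with the smallest sum of
    squared differences (SSD) to the target -- the core matching step of
    exemplar-based completion."""
    ssd = [float(np.sum((target - c) ** 2)) for c in candidates]
    return int(np.argmin(ssd))

# Toy example: fill a bright 2x2 hole from three candidate source patches.
target = np.ones((2, 2))
candidates = [np.zeros((2, 2)), np.full((2, 2), 0.9), np.full((2, 2), 2.0)]
print(best_match(target, candidates))  # 1
```

Patch fusion (as opposed to plain replacement) would blend the best matches in a transform domain rather than copying one winner verbatim; the search step above is shared by both variants.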

    MODELING AND ANALYSIS OF WRINKLES ON AGING HUMAN FACES

    Get PDF
    The analysis and modeling of aging human faces has been studied extensively in the past decade. Most of this work is based on machine learning techniques focused on the appearance of faces at different ages, incorporating facial features such as face shape/geometry and patch-based texture features. However, little work has been done on the analysis of facial wrinkles, either in general or specific to a person. The goal of this dissertation is to analyze and model facial wrinkles for different applications. Facial wrinkles are challenging low-level image features to analyze. In general, skin texture has drastically varying appearance due to its characteristic physical properties: a skin patch looks very different when viewed or illuminated from different angles. This makes subtle skin features like facial wrinkles difficult to detect in images acquired in uncontrolled imaging settings. In this dissertation, we examine the image properties of wrinkles, i.e., intensity gradients and geometric properties, and use them for several applications, including low-level image processing for automatic detection/localization of wrinkles, soft biometrics, and removal of wrinkles using digital inpainting. First, we present results for the detection/localization of wrinkles in images using a Marked Point Process (MPP). Wrinkles are modeled as sequences of line segments in a Bayesian framework, which incorporates a prior probability model based on the likely geometric properties of wrinkles and a data likelihood term based on image intensity gradients. Wrinkles are localized by sampling the posterior probability using a Reversible Jump Markov Chain Monte Carlo (RJMCMC) algorithm. We also present an evaluation algorithm to quantitatively evaluate the detection and false alarm rates of our method, and conduct experiments with images taken in uncontrolled settings.
The MPP model, despite its promising localization results, requires a large number of RJMCMC iterations to reach the global minimum, resulting in considerable computation time. This motivated us to adopt a deterministic approach based on image morphology for fast localization of facial wrinkles. We propose image features based on Gabor filter banks to highlight the subtle curvilinear discontinuities in skin texture caused by wrinkles. Image morphology is then used to incorporate geometric constraints, localizing the curvilinear shapes of wrinkles at image sites with large Gabor filter responses. We conduct experiments on two sets of low- and high-resolution images to demonstrate faster and visually better localization results than those obtained by MPP modeling. As a next application, we investigate the discriminative power of user-drawn and automatically detected wrinkle patterns as a soft biometric for recognizing subjects from their wrinkles alone. The set of facial wrinkles in an image is treated as a curve pattern. Given the wrinkle patterns from query and gallery images, possible correspondences between curves from the two patterns are found using a simple bipartite graph matching algorithm, and several similarity metrics, based on the Hausdorff distance and curve-to-curve correspondences, are then computed between the two patterns. We conduct experiments on datasets of both hand-drawn and automatically detected wrinkles. Finally, we apply digital inpainting to automatically remove wrinkles from facial images. Digital image inpainting refers to filling in holes of arbitrary shape in images so that they appear to be part of the original image. Inpainting methods target either the structure or the texture of an image, or both.
There are two limitations of existing inpainting methods for the removal of wrinkles. First, the differing attributes of structure and texture require different inpainting methods, and facial wrinkles do not fall strictly into either category; they can be considered somewhere in between. Second, almost all image inpainting techniques are supervised, i.e., the area to be filled is provided by user interaction and the algorithm attempts to find suitable image content automatically. We present an unsupervised image inpainting method in which facial regions with wrinkles are detected automatically using their characteristic intensity gradients and removed by painting the regions with the surrounding skin texture.
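The Hausdorff-distance comparison of wrinkle curves mentioned above can be sketched as follows, treating each curve as a sampled point set. This is the generic symmetric Hausdorff formulation, not necessarily the dissertation's exact metric.

```python
import numpy as np

def directed_hausdorff(a, b):
    """Max over points of a of the distance to the nearest point of b.
    a and b are (N, 2) and (M, 2) arrays of curve samples."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise
    return float(d.min(axis=1).max())

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two sampled curves."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Two parallel horizontal "wrinkles" one unit apart.
curve_a = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
curve_b = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
print(hausdorff(curve_a, curve_b))  # 1.0
```

In a recognition setting, such distances would be computed between corresponding curves (after bipartite matching) and aggregated into a pattern-level similarity score.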

    Adversarial content manipulation for analyzing and improving model robustness

    Get PDF
    The recent rapid progress in machine learning systems has opened up many real-world applications --- from recommendation engines on web platforms to safety-critical systems like autonomous vehicles. A model deployed in the real world will often encounter inputs far from its training distribution. For example, a self-driving car might come across a black stop sign in the wild. To ensure safe operation, it is vital to quantify the robustness of machine learning models to such out-of-distribution data before releasing them into the real world. However, the standard paradigm of benchmarking machine learning models with fixed-size test sets drawn from the same distribution as the training data is insufficient to identify these corner cases efficiently. In principle, if we could generate all valid variations of an input and measure the model's response, we could quantify and guarantee model robustness locally. Yet doing this with real-world data is not scalable. In this thesis, we propose an alternative: using generative models to create synthetic data variations at scale and testing the robustness of target models to these variations. We explore methods to generate semantic data variations in a controlled fashion across visual and text modalities. We build generative models capable of performing controlled manipulation of data, such as changing visual context, editing the appearance of an object in an image, or changing the writing style of text. Leveraging these generative models, we propose tools to study the robustness of computer vision systems to input variations and systematically identify failure modes. In the text domain, we deploy these generative models to improve the diversity of image captioning systems and perform writing-style manipulation to obfuscate private attributes of the user. Our studies quantifying model robustness explore two kinds of input manipulations: model-agnostic and model-targeted.
The model-agnostic manipulations leverage human knowledge to choose the kinds of changes without considering the target model being tested. This includes automatically editing images to remove objects not directly relevant to the task and creating variations in visual context. Alternatively, in the model-targeted approach, the input variations are directly adversarially guided by the target model. For example, we adversarially manipulate the appearance of an object in the image to fool an object detector, guided by the gradients of the detector. Using these methods, we measure and improve the robustness of various computer vision systems -- specifically image classification, segmentation, object detection, and visual question answering systems -- to semantic input variations.
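The gradient-guided, model-targeted manipulation can be illustrated with an FGSM-style step on a toy linear model: perturb the input in the direction of the sign of the loss gradient to increase the model's error. This minimal NumPy example is our own, not the thesis's attack on object detectors.

```python
import numpy as np

def fgsm_perturb(x, w, y, eps=0.1):
    """One FGSM-style step for a linear scorer f(x) = w.x with squared
    loss L = (f(x) - y)^2: shift x by eps along the sign of dL/dx."""
    grad = 2.0 * (w @ x - y) * w          # dL/dx for the linear model
    return x + eps * np.sign(grad)

w = np.array([1.0, -2.0])                 # fixed "model" weights
x = np.array([0.5, 0.5])                  # clean input, f(x) = -0.5
y = 0.0                                   # label the model should predict
loss = (w @ x - y) ** 2
x_adv = fgsm_perturb(x, w, y, eps=0.1)
loss_adv = (w @ x_adv - y) ** 2
print(loss < loss_adv)  # True: the small perturbation increased the loss
```

Attacks on detectors follow the same recipe, except the gradient is taken through the full network with respect to the pixels (or the object's appearance parameters) being manipulated.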

    Photographic Mediation as a Mode of Production: Investigating the Agency of Commercial Institutions in Contemporary Vernacular Photography

    Get PDF
    This dissertation argues that to understand what is at stake in contemporary vernacular photography, it is vital to account for the commercial imperatives that are invested in our photographic apparatus. The vernacular is often seen as emerging from the milieu of everyday life, operating outside of institutional constraints. However, commercial institutions have always played a vital role in shaping the meaning and matter of vernacular photography, producing the extended network of devices and protocols through which photographic activity takes place. Vernacular photography should therefore be seen to encapsulate a series of complex negotiations between individual desires and commercial imperatives. Through an examination of three central case studies - Kodak, Snapchat and Ditto Labs - this thesis aims to elucidate how the productive potential of vernacular photography is instrumentalized as a means of generating value. Bringing together approaches from western Marxism with contemporary theories of networked media and photography, the argument is made that photographic mediation can be usefully framed as a mode of production. Photographic mediation, referring to the processual and material dynamics of photography, is employed to investigate the circuits of labour, value and desire that flow through our photographic apparatus. In performing this analysis, the concept of deterritorialization is applied as a way of understanding how photographic mediation has become more productive through destabilizing the boundaries between photography, subjectivity and the everyday. As photography proliferates and disperses into the rhythms and atmospheres that constitute daily life, it is increasingly imbricated into the performance and production of identities, relationships and desires. 
Under these circumstances, it becomes all the more vital that we recognize the role of commercial actors in shaping not only our photographic apparatus, but also our ways of being in, and relating to, the world.