High Resolution Face Editing with Masked GAN Latent Code Optimization
Face editing represents a popular research topic within the computer vision
and image processing communities. While significant progress has been made
recently in this area, existing solutions: (i) are still largely focused on
low-resolution images, (ii) often generate editing results with visual
artefacts, or (iii) lack fine-grained control and alter multiple (entangled)
attributes at once, when trying to generate the desired facial semantics. In
this paper, we aim to address these issues through a novel attribute editing
approach called MaskFaceGAN. The proposed approach is based on an optimization
procedure that directly optimizes the latent code of a pre-trained
(state-of-the-art) Generative Adversarial Network (i.e., StyleGAN2) with
respect to several constraints that ensure: (i) preservation of relevant image
content, (ii) generation of the targeted facial attributes, and (iii)
spatially-selective treatment of local image areas. The constraints are
enforced with the help of a (differentiable) attribute classifier and face
parser that provide the necessary reference information for the optimization
procedure. MaskFaceGAN is evaluated in extensive experiments on the CelebA-HQ,
Helen and SiblingsDB-HQf datasets and in comparison with several
state-of-the-art techniques from the literature, i.e., StarGAN, AttGAN, STGAN,
and two versions of InterFaceGAN. Our experimental results show that the
proposed approach is able to edit face images with respect to several facial
attributes with unprecedented image quality and at high resolutions
(1024x1024), while exhibiting considerably fewer problems with attribute
entanglement than competing solutions. The source code is made freely available
from: https://github.com/MartinPernus/MaskFaceGAN.
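To make the described optimization more concrete, here is a minimal sketch of masked latent-code optimization of the kind the abstract outlines. All loader functions, the attribute and region indices, and the loss weights are hypothetical placeholders for illustration, not the authors' actual implementation (see the repository for that).

```python
import torch
import torch.nn.functional as F

# Hypothetical pre-trained, differentiable components (placeholders,
# not the actual modules from the MaskFaceGAN repository).
generator = load_stylegan2()            # latent code -> (1, 3, 1024, 1024) image
attr_clf = load_attribute_classifier()  # image -> per-attribute logits
parser = load_face_parser()             # image -> (1, R, H, W) soft region masks

TARGET_ATTR = 9   # hypothetical index of the edited attribute (e.g. blond hair)
REGION = 13       # hypothetical parser channel of the corresponding face region

x_orig = load_image("face.png")         # (1, 3, 1024, 1024), values in [-1, 1]
w = generator.encode(x_orig).clone().requires_grad_(True)
opt = torch.optim.Adam([w], lr=0.01)

for step in range(200):
    x = generator(w)
    mask = parser(x)[:, REGION:REGION + 1]        # soft mask of the edited area

    # (ii) push the targeted attribute towards its desired (present) state
    loss_attr = F.binary_cross_entropy_with_logits(
        attr_clf(x)[:, TARGET_ATTR], torch.ones(1))
    # (i) preserve relevant image content outside the edited region
    loss_content = F.l1_loss(x * (1 - mask), x_orig * (1 - mask))
    # (iii) keep the edit spatially selective (penalize a large mask)
    loss_area = mask.mean()

    # loss weights are illustrative, not tuned values from the paper
    loss = loss_attr + 10.0 * loss_content + 0.1 * loss_area
    opt.zero_grad()
    loss.backward()
    opt.step()

x_edited = generator(w).detach()        # final high-resolution edit
```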
Automatic image editing with generative neural network models based on linguistic descriptions
In recent years, the fields of computer vision and artificial intelligence have made great strides in image generation using deep-learning methods. Behind these results are generative deep neural network models that are capable of generating photorealistic and visually convincing images of different objects and even complex scenes. Despite advances in image generation, the understanding of generative models and their application to image editing are still limited. Here, we use the term understanding to denote the ability to train generative models stably and the link between the latent and the target (image) probability distributions of the data.
General image editing still lacks an automated control mechanism that would allow only specific image properties to be edited. Systems that allow image editing with generative models based on linguistic descriptions would contribute significantly to applications in various fields such as autonomous driving, robotics, manufacturing, design, entertainment, and animation. In such systems, the user could influence the appearance and semantic content of an image by means of a textual or spoken description of the visual scene.
The main topic of the PhD thesis is building a generative neural network system in combination with linguistic descriptions, where the goal is to extract information about desired features or changes to images from linguistic descriptions and then use this information for image editing. The starting point for our research is generative neural networks, which we build so that they create or edit a desired image given linguistic or more structured information. We present several original contributions as part of the PhD thesis.
The first original contribution is a new method for editing facial attributes called MaskFaceGAN. Given a generative image model, the presented method allows the manipulation of different facial features (e.g. hair colour, eyebrow type, nose size). The target linguistic information required for face editing is given in the form of the selection and intensity of a particular facial feature. By designing a special generative-network inversion process, the proposed solution enables high-resolution face editing, including simultaneous editing of multiple features and resizing of individual facial parts. Experiments and a user study on different datasets show the advantages of the proposed MaskFaceGAN method over competing technologies.
The next original contribution is the ChildNet method, a model that is able to predict the appearance of children given images of their parents. ChildNet synthesizes an image of a child given input images of the parents, where additional linguistic information can be supplied to the model in the form of extra requirements on the child's appearance (age and gender). We also present a new high-resolution dataset designed for training image-synthesis models from kinship relationships. We evaluate ChildNet against competing technologies, where our method is shown to more accurately estimate the appearance of the child, producing images of high quality and resolution.
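The abstract does not describe ChildNet's architecture; purely as an illustration of one common way to frame kinship synthesis in the latent space of a pre-trained generator, the sketch below blends the parents' latent codes and shifts the result along assumed age and gender directions. Every name here is a hypothetical placeholder, not the thesis' actual design.

```python
import torch

# Hypothetical placeholders; the abstract does not specify ChildNet's
# actual components, so this only illustrates a general framing.
generator = load_face_generator()   # latent code -> high-resolution face image
encoder = load_face_encoder()       # face image -> latent code

w_mother = encoder(load_image("mother.png"))
w_father = encoder(load_image("father.png"))

# Blend the parents' latent codes; alpha sets each parent's contribution.
alpha = 0.5
w_child = alpha * w_mother + (1.0 - alpha) * w_father

# Inject the extra linguistic requirements (age, gender) as shifts along
# assumed, precomputed semantic directions in latent space.
dir_age = torch.load("age_direction.pt")
dir_gender = torch.load("gender_direction.pt")
w_child = w_child + 0.8 * dir_age - 1.2 * dir_gender  # illustrative magnitudes

child_image = generator(w_child)
```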
The last original contribution presents the FICE method, which addresses text-based fashion image editing. The linguistic information here is given in its most raw form, i.e. as a textual description. The method is capable of processing textual descriptions that can express a wide vocabulary. The concept of image editing is based on the inversion of a generative network, which we extend with several constraints, such as a semantic constraint, an image-composition constraint, a pose constraint, and latent-code regularization. The constraints are realized with pre-trained differentiable neural networks, and the model itself is specialised for editing fashion images. To evaluate the quality of the method, we propose several different metrics focusing on image quality, pose preservation, semantic relevance, and identity preservation. We compare the method with other text-based image-editing technologies, where the FICE method is shown to outperform them in all tested metrics.
In summary, all the original contributions focus on understanding and building generative models, or on developing systems where the target linguistic information is fed into our model to generate the desired image. The results of the research demonstrate the potential of generative models for image editing and the importance of understanding the link between the latent and target probability distributions. The proposed methods and systems have the potential to contribute significantly to a wide range of applications in various fields.
Comparison of Short-Term and Long-Term Trackers in the Waterborne Environment
In this work, an analysis of the performance of short-term and long-term trackers on marine video sequences is presented. To this end, we used existing marine datasets, recorded on the sea surface from a small vessel, to create two new datasets. The first dataset is suitable for short-term tracking analysis, while the other is suitable for long-term tracking analysis. We perform experiments on these datasets using some of the top-performing short-term and long-term trackers, selected based on their results in the Visual Object Tracking competitions. The results of short-term tracking are presented in the form of expected average overlap, average overlap, number of failures, and expected-average-overlap curves. The processing speed of the short-term trackers is also evaluated. In addition to the supervised experiment, a real-time experiment is performed as well. The results for the long-term trackers are presented as precision-recall graphs and an F-score graph. The long-term trackers are evaluated based on their processing speed as well. The results are analyzed and compared to the results of similar studies.
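The abstract does not spell out the exact evaluation protocol; as a rough illustration of how the long-term precision-recall and F-score measures are typically computed in VOT-style evaluations, here is a minimal sketch (function and variable names are our own, not from the thesis).

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def f_score(predictions, groundtruth, tau):
    """Long-term tracking F-score at confidence threshold tau.

    predictions: per-frame (box, confidence); box is None when the tracker
                 reports the target as absent.
    groundtruth: per-frame box, or None on frames where the target is absent.
    """
    overlap_reported, n_reported = 0.0, 0
    overlap_visible, n_visible = 0.0, 0
    for (box, conf), gt in zip(predictions, groundtruth):
        ov = iou(box, gt) if box is not None and gt is not None else 0.0
        if box is not None and conf >= tau:   # tracker reports the target
            overlap_reported += ov
            n_reported += 1
        if gt is not None:                    # target actually visible
            overlap_visible += ov
            n_visible += 1
    precision = overlap_reported / n_reported if n_reported else 0.0
    recall = overlap_visible / n_visible if n_visible else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The reported F-score is usually the maximum over confidence thresholds:
# max(f_score(pred, gt, tau) for tau in candidate_thresholds)
```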
FICE
Fashion-image editing is a challenging computer-vision task where the goal is to incorporate selected apparel into a given input image. Most existing techniques, known as Virtual Try-On methods, deal with this task by first selecting an example image of the desired apparel and then transferring the clothing onto the target person. Conversely, in this paper, we consider editing fashion images with text descriptions. Such an approach has several advantages over example-based virtual try-on techniques: (i) it does not require an image of the target fashion item, and (ii) it allows the expression of a wide variety of visual concepts through the use of natural language. Existing image-editing methods that work with language inputs are heavily constrained by their requirement for training sets with rich attribute annotations, or are only able to handle simple text descriptions. We address these constraints by proposing a novel text-conditioned editing model called FICE (Fashion Image CLIP Editing) that is capable of handling a wide variety of text descriptions to guide the editing procedure. Specifically, with FICE, we extend the common GAN-inversion process by including semantic, pose-related, and image-level constraints when generating images. We leverage the CLIP model to enforce the text-provided semantics, owing to its impressive image-text association capabilities. We furthermore propose a latent-code regularization technique that provides the means to better control the fidelity of the synthesized images. We validate FICE through rigorous experiments on a combination of VITON images and Fashion-Gen text descriptions and in comparison with several state-of-the-art, text-conditioned, image-editing approaches. Experimental results demonstrate that FICE generates highly realistic fashion images and leads to better editing than the existing, competing approaches. The source code is publicly available from: https://github.com/MartinPernus/FICE
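To make the extended GAN-inversion objective more concrete, here is a minimal sketch of how the described constraints could be combined during latent optimization. The CLIP usage follows the public openai/clip package; the generator, pose network, segmentation helper, and all loss weights are assumptions for illustration, not FICE's actual implementation (see the repository for that).

```python
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda"
clip_model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical pre-trained, differentiable helpers (placeholders).
generator = load_fashion_gan()           # latent code -> fashion image
pose_net = load_pose_estimator()         # image -> pose keypoint heatmaps
x_orig = load_image("person.png").to(device)
keep_mask = segment_non_garment(x_orig)  # regions to preserve (head, arms, ...)

text = clip.tokenize(["a red floral summer dress"]).to(device)
text_feat = F.normalize(clip_model.encode_text(text), dim=-1)

w = generator.encode(x_orig).clone().requires_grad_(True)
w_init = w.detach().clone()
opt = torch.optim.Adam([w], lr=0.02)

for step in range(150):
    x = generator(w)

    # semantic constraint: CLIP similarity between the image and the text
    x224 = F.interpolate(x, size=224, mode="bilinear", align_corners=False)
    img_feat = F.normalize(clip_model.encode_image(x224), dim=-1)
    loss_sem = 1.0 - (img_feat * text_feat).sum()

    # pose-related constraint: keep the person's pose unchanged
    loss_pose = F.mse_loss(pose_net(x), pose_net(x_orig))

    # image-level (composition) constraint: keep non-garment regions intact
    loss_comp = F.l1_loss(x * keep_mask, x_orig * keep_mask)

    # latent-code regularization: stay near the initial inversion
    loss_reg = (w - w_init).pow(2).mean()

    # weights are illustrative only
    loss = loss_sem + loss_pose + loss_comp + 0.1 * loss_reg
    opt.zero_grad()
    loss.backward()
    opt.step()

x_edited = generator(w).detach()
```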
Perceptography unveils the causal contribution of inferior temporal cortex to visual perception
Neurons in the inferotemporal (IT) cortex respond selectively to complex visual features, implying their role in object perception. However, perception is subjective and cannot be read out from neural responses; thus, bridging the causal gap between neural activity and perception demands an independent characterization of perception. Historically, though, the complexity of the perceptual alterations induced by artificial stimulation of IT cortex has rendered them impossible to quantify. To address this long-standing problem, we tasked male macaque monkeys to detect and report optical impulses delivered to their IT cortex. Combining machine learning with high-throughput behavioral optogenetics, we generated complex and highly specific images that were hard for the animal to distinguish from the state of being cortically stimulated. These images, which we name "perceptograms", reveal and depict the contents of the complex hallucinatory percepts induced by local neural perturbation in IT cortex. Furthermore, we found that the nature and magnitude of these hallucinations depend strongly on the concurrent visual input, the stimulation location, and its intensity. Objective characterization of stimulation-induced perceptual events opens the door to developing a mechanistic theory of visual perception. Further, it enables us to build better visual prosthetic devices and gain a greater understanding of visual hallucinations in mental disorders.
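The abstract stays at a conceptual level; purely as a hypothetical illustration of the image-search idea (machine learning used to find images the animal confuses with cortical stimulation), the sketch below performs gradient ascent on a generator's latent code against an assumed behavioral model. None of these components come from the paper.

```python
import torch

# Entirely hypothetical placeholders for illustration only.
generator = load_image_generator()    # latent z -> image
report_model = load_behavior_model()  # image -> P("animal reports stimulation")

z = torch.randn(1, 512, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.05)

for step in range(300):
    image = generator(z)
    p_stim = report_model(image)      # predicted "stimulation detected" rate
    loss = -torch.log(p_stim + 1e-8)  # maximize confusion with stimulation
    opt.zero_grad()
    loss.backward()
    opt.step()

perceptogram = generator(z).detach()  # candidate perceptogram
```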