26 research outputs found

    IMPUS: Image Morphing with Perceptually-Uniform Sampling Using Diffusion Models

    We present a diffusion-based image morphing approach with perceptually-uniform sampling (IMPUS) that produces smooth, direct, and realistic interpolations given an image pair. A latent diffusion model has distinct conditional distributions and data embeddings for each of the two images, especially when they are from different classes. To bridge this gap, we interpolate in the locally linear and continuous text embedding space and Gaussian latent space. We first optimize the endpoint text embeddings and then map the images to the latent space using a probability flow ODE. Unlike existing work that takes an indirect morphing path, we show that our model adaptation yields a direct path and suppresses ghosting artifacts in the interpolated images. To achieve this, we propose an adaptive bottleneck constraint based on a novel relative perceptual path diversity score that automatically controls the bottleneck size and balances the diversity along the path with its directness. We also propose a perceptually-uniform sampling technique that enables visually smooth changes between the interpolated images. Extensive experiments validate that our IMPUS can achieve smooth, direct, and realistic image morphing and can be applied to other image generation tasks.
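
    As a rough illustration of the interpolation scheme described above, the minimal NumPy sketch below spherically interpolates between two Gaussian latent codes and linearly interpolates between two text embeddings at a set of blending weights. The arrays and the uniform schedule are placeholder assumptions; the endpoint embedding optimisation, the probability flow ODE inversion, the adaptive bottleneck constraint, and the perceptually-uniform spacing are specific to IMPUS and are not reproduced here.

        import numpy as np

        def slerp(z0, z1, t, eps=1e-8):
            # Spherical interpolation between two Gaussian latent codes.
            a, b = z0.ravel(), z1.ravel()
            cos_theta = np.clip(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps), -1.0, 1.0)
            theta = np.arccos(cos_theta)
            if theta < eps:
                return (1 - t) * z0 + t * z1      # endpoints nearly parallel: plain lerp
            return (np.sin((1 - t) * theta) * z0 + np.sin(t * theta) * z1) / np.sin(theta)

        def lerp(e0, e1, t):
            # Linear interpolation in the (locally linear) text-embedding space.
            return (1 - t) * e0 + t * e1

        rng = np.random.default_rng(0)
        # Placeholder endpoints standing in for the ODE-inverted latents and optimised embeddings.
        z_a, z_b = rng.standard_normal((4, 64, 64)), rng.standard_normal((4, 64, 64))
        e_a, e_b = rng.standard_normal(768), rng.standard_normal(768)

        # Uniform schedule for brevity; IMPUS instead spaces t so that perceptual change is uniform.
        for t in np.linspace(0.0, 1.0, 7):
            z_t, e_t = slerp(z_a, z_b, t), lerp(e_a, e_b, t)
            # z_t and e_t would be passed to the diffusion model to render the intermediate frame.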

    DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing

    Diffusion models have achieved remarkable image generation quality surpassing previous generative models. However, a notable limitation of diffusion models, in comparison to GANs, is their difficulty in smoothly interpolating between two image samples, due to their highly unstructured latent space. Such a smooth interpolation is intriguing as it naturally serves as a solution for the image morphing task with many applications. In this work, we present DiffMorpher, the first approach enabling smooth and natural image interpolation using diffusion models. Our key idea is to capture the semantics of the two images by fitting two LoRAs to them respectively, and interpolate between both the LoRA parameters and the latent noises to ensure a smooth semantic transition, where correspondence automatically emerges without the need for annotation. In addition, we propose an attention interpolation and injection technique and a new sampling schedule to further enhance the smoothness between consecutive images. Extensive experiments demonstrate that DiffMorpher achieves starkly better image morphing effects than previous methods across a variety of object categories, bridging a critical functional gap that distinguished diffusion models from GANs.
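
    The core idea of interpolating both the latent noise and the per-image LoRA parameters can be sketched as follows, assuming two hypothetical LoRA state dicts with matching keys and two noise tensors. The attention interpolation and injection technique and the paper's sampling schedule are omitted, so this is an illustrative outline rather than the DiffMorpher implementation.

        import torch

        def interpolate_lora(lora_a: dict, lora_b: dict, alpha: float) -> dict:
            # Linearly blend two LoRA state dicts with matching keys (an assumption here).
            return {k: (1 - alpha) * lora_a[k] + alpha * lora_b[k] for k in lora_a}

        def slerp(z0: torch.Tensor, z1: torch.Tensor, alpha: float) -> torch.Tensor:
            # Spherical interpolation of the two latent noises.
            z0f, z1f = z0.flatten(), z1.flatten()
            cos_theta = torch.clamp(torch.dot(z0f, z1f) / (z0f.norm() * z1f.norm()), -1.0, 1.0)
            theta = torch.arccos(cos_theta)
            if theta < 1e-6:
                return (1 - alpha) * z0 + alpha * z1
            return (torch.sin((1 - alpha) * theta) * z0 + torch.sin(alpha * theta) * z1) / torch.sin(theta)

        # Placeholder LoRA weights and latent noises for the two endpoint images.
        lora_a = {"attn.lora_A": torch.randn(4, 320), "attn.lora_B": torch.randn(320, 4)}
        lora_b = {k: torch.randn_like(v) for k, v in lora_a.items()}
        noise_a, noise_b = torch.randn(4, 64, 64), torch.randn(4, 64, 64)

        for alpha in torch.linspace(0, 1, 5):
            lora_t = interpolate_lora(lora_a, lora_b, float(alpha))   # blended adapter
            noise_t = slerp(noise_a, noise_b, float(alpha))           # blended latent noise
            # lora_t would be loaded into the denoiser and noise_t denoised to render this frame.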

    Multiview Regenerative Morphing with Dual Flows

    This paper aims to address a new task of image morphing under a multiview setting, which takes two sets of multiview images as input and generates intermediate renderings that not only exhibit smooth transitions between the two input sets but also ensure visual consistency across different views at any transition state. To achieve this goal, we propose a novel approach called Multiview Regenerative Morphing that formulates the morphing process as an optimization to solve for rigid transformation and optimal-transport interpolation. Given the multiview input images of the source and target scenes, we first learn a volumetric representation that models the geometry and appearance for each scene to enable the rendering of novel views. Then, the morphing between the two scenes is obtained by solving optimal transport between the two volumetric representations under the Wasserstein metric. Our approach does not rely on user-specified correspondences or 2D/3D input meshes, and we do not assume any predefined categories of the source and target scenes. The proposed view-consistent interpolation scheme directly works on multiview images to yield a novel and visually plausible effect of multiview free-form morphing.
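
    The optimal-transport interpolation at the heart of the method can be illustrated with a generic toy: an entropically regularised (Sinkhorn) transport plan between two weighted 2D point clouds, followed by moving each source point toward the barycentre of its transported mass. The point clouds below are synthetic stand-ins for the volumetric scene representations, and the rigid-transformation solve and the rendering steps are not shown.

        import numpy as np

        def sinkhorn(a, b, cost, reg=0.1, n_iter=200):
            # Entropically regularised optimal transport plan between histograms a and b.
            K = np.exp(-cost / reg)
            u = np.ones_like(a)
            for _ in range(n_iter):
                v = b / (K.T @ u)
                u = a / (K @ v)
            return u[:, None] * K * v[None, :]      # transport plan; rows sum to a, columns to b

        def displacement_interpolation(x, y, plan, t):
            # Barycentric projection of the plan: where, on average, the mass from x[i] goes.
            targets = (plan @ y) / plan.sum(axis=1, keepdims=True)
            return (1 - t) * x + t * targets

        rng = np.random.default_rng(0)
        x = rng.normal(loc=[-1.0, 0.0], scale=0.3, size=(100, 2))   # toy "source" point cloud
        y = rng.normal(loc=[1.0, 0.5], scale=0.3, size=(120, 2))    # toy "target" point cloud
        a = np.full(len(x), 1.0 / len(x))
        b = np.full(len(y), 1.0 / len(y))
        cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)       # squared Euclidean cost matrix

        plan = sinkhorn(a, b, cost)
        frames = [displacement_interpolation(x, y, plan, t) for t in np.linspace(0, 1, 5)]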

    Object-centric generative models for robot perception and action

    The system of robot manipulation involves a pipeline consisting of the perception of objects in the environment and the planning of actions in 3D space. Deep learning approaches are employed to segment scenes into components of objects and then learn object-centric features to predict actions for downstream tasks. Despite having achieved promising performance in several manipulation tasks, supervised approaches lack inductive biases related to general properties of objects. Recent advances show that by encoding and reconstructing scenes in an object-centric fashion, a model can discover object-like entities from raw data without human supervision. Moreover, by reconstructing the discovered objects, the model can learn a variational latent space that captures the various shapes and textures of the objects, regularised by a chosen prior distribution. In this thesis, we investigate the properties of this learned object-centric latent space and develop novel object-centric generative models (OCGMs) that can be applied to real-world robotics scenarios. In the first part of the thesis, we investigate a tool-synthesis task that leverages a learned latent space to optimise a wide range of tools for a reaching task. Given an image that illustrates the obstacles and the reaching target in the scene, an affordance predictor is trained to predict the feasibility of a tool for the given task. To imitate human tool-use experience, feasibility labels are acquired from simulated trial and error on the reaching task. We found that by employing an activation maximisation step, the model can synthesise appropriate tools for the given tasks with high accuracy. Moreover, the tool-synthesis process indicates the existence of a task-relevant trajectory in the learned latent space that can be found by a trained affordance predictor. The second part of the thesis focuses on the development of novel OCGMs and their application to robotic tasks. We first introduce a 2D OCGM that is deployed on robot manipulation datasets in both simulated and real-world scenarios. Despite the intensive interactions between the robot arm and objects, we find that the model discovers meaningful object entities from the raw observations without any human supervision. We then extend the 2D OCGM to 3D by leveraging NeRFs as decoders to explicitly model the 3D geometry of objects and the background. To disentangle the spatial information of an object from its appearance information, we propose a minimum volume principle for unsupervised 6D pose estimation of the objects. Considering occlusion in the scene, we further improve the pose estimation by introducing a shape completion module that imagines the unobserved parts of the objects before the pose estimation step. Finally, we apply the model in real-world robotics scenarios and compare its performance against several baselines on tasks including 3D reconstruction, object-centric latent representation learning, and 6D pose estimation for object rearrangement. We find that, despite being unsupervised, our model achieves improved performance across a range of real-world tasks.
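
    The activation-maximisation step used for tool synthesis can be sketched generically as gradient ascent on a latent code so that a learned affordance predictor scores the decoded tool as feasible. The decoder and predictor below are placeholder PyTorch modules rather than the thesis models, and the conditioning on the scene image is omitted.

        import torch
        import torch.nn as nn

        latent_dim = 16

        # Placeholder decoder and affordance predictor standing in for the trained models.
        decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, 64 * 64))
        affordance = nn.Sequential(nn.Linear(64 * 64, 64), nn.ReLU(), nn.Linear(64, 1))

        z = torch.zeros(1, latent_dim, requires_grad=True)        # start from the prior mean
        optimizer = torch.optim.Adam([z], lr=0.05)

        for step in range(200):
            optimizer.zero_grad()
            tool_image = decoder(z)                               # decode a candidate tool
            feasibility = affordance(tool_image)                  # predicted task feasibility (logit)
            loss = -feasibility.mean() + 1e-3 * z.pow(2).sum()    # maximise feasibility, stay near the prior
            loss.backward()
            optimizer.step()

        # After optimisation, decoder(z) is the synthesised tool for the task encoded upstream.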

    Efficient image-based rendering

    Recent advancements in real-time ray tracing and deep learning have significantly enhanced the realism of computer-generated images. However, conventional 3D computer graphics (CG) can still be time-consuming and resource-intensive, particularly when creating photo-realistic simulations of complex or animated scenes. Image-based rendering (IBR) has emerged as an alternative approach that utilizes pre-captured images from the real world to generate realistic images in real time, eliminating the need for extensive modeling. Although IBR has its advantages, it faces challenges in providing the same level of control over scene attributes as traditional CG pipelines and in accurately reproducing complex scenes and objects with different materials, such as transparent objects. This thesis endeavors to address these issues by harnessing the power of deep learning and incorporating the fundamental principles of graphics and physically based rendering. It offers an efficient solution that enables interactive manipulation of real-world dynamic scenes captured from sparse views, lighting positions, and times, as well as a physically based approach that facilitates accurate reproduction of the view-dependent effects resulting from the interaction between transparent objects and their surrounding environment. Additionally, this thesis develops a visibility metric that can identify artifacts in the reconstructed IBR images without observing the reference image, thereby contributing to the design of an effective IBR acquisition pipeline. Lastly, a perception-driven rendering technique is developed to provide high-fidelity visual content in virtual reality displays while retaining computational efficiency.

    Animation and Interaction of Responsive, Expressive, and Tangible 3D Virtual Characters

    This thesis is framed within the field of 3D character animation. Virtual characters are used in many human-computer interaction applications, such as video games and serious games, where they move and act in similar ways to humans, controlled by users through some form of interface or by artificial intelligence. This work addresses the challenges of developing smoother movements and more natural behaviors, of driving motion in real time intuitively and accurately, and of exploring the interaction between virtual characters and intelligent objects. Researching these subjects contributes to creating more responsive, expressive, and tangible virtual characters. Navigation within virtual worlds uses locomotion such as walking, running, etc. To achieve maximum realism, actors' movements are captured and reused to animate virtual characters. This is the philosophy of motion graphs: a structure that embeds movements and generates a continuous motion stream by concatenating motion pieces. However, locomotion synthesis using motion graphs involves a trade-off between the number of possible transitions between different kinds of locomotion and the quality of those transitions, i.e., how smoothly the connected poses blend. To overcome this drawback, we propose progressive transitions using Body Part Motion Graphs (BPMGs). This method deals with partial movements and generates specific, synchronized transitions for each body part (group of joints) within a window of time. Therefore, the connectivity within the system is not tied to the similarity between global poses, allowing us to find more and better-quality transition points while increasing the speed of response and execution of these transitions compared with the standard motion graph method. Secondly, beyond faster transitions and smoother movements, virtual characters also interact with each other and with users by speaking, which requires generating gestures appropriate to the voice they reproduce. Gestures are the nonverbal language that accompanies spoken language, and the credibility of speaking virtual characters is linked to how naturally their movements match the speech and its intonation. Consequently, we analyzed the relationship between speech and the gestures performed with it. We defined intensity indicators for both gestures (GSI, Gesture Strength Indicator) and speech (PSI, Pitch Strength Indicator), and studied the relationship in time and intensity between these cues in order to establish synchrony and intensity rules. We then applied these rules in the Gesture Motion Graph (GMG) to select gestures appropriate to the speech input (text tagged from the speech signal). The evaluation of the resulting animations shows the importance of relating the intensity of speech and gestures, beyond time synchronization, to generate believable animations. Subsequently, we present BodySpeech, a system for automatic generation of gestures and facial animation from a speech signal. This system also includes animation improvements, such as greater reuse of the input data and more flexible time synchronization, and new features such as editing the style of the output animations. In addition, the facial animation takes speech intonation into account. Finally, we have moved virtual characters from virtual environments to the physical world in order to explore their interaction with real objects. To this end, we present AvatARs, virtual characters that have a tangible representation and are integrated into reality through augmented-reality apps on mobile devices. Users manipulate a physical object to select and configure the animation, and the object also serves as a support for the represented virtual character. We then explored the interaction of AvatARs with intelligent physical objects such as the Pleo social robot. Pleo is used to assist hospitalized children in therapy or simply for play. Despite its benefits, there is a lack of emotional relationship and interaction between the children and Pleo, which eventually makes the children lose interest. We therefore created a mixed-reality scenario in which Vleo (an AvatAR in the form of Pleo, the virtual element) and Pleo (the real element) interact naturally. This scenario has been tested, and the results show that AvatARs enhance children's motivation to play with Pleo, opening a new horizon in the interaction between virtual characters and robots.
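
    A toy sketch of the intensity-matching idea behind the GSI/PSI rules: from a small gesture library annotated with a gesture strength indicator, pick the gesture whose strength is closest to the pitch strength of the current speech segment while roughly respecting its duration. The library, indicator values, and matching rule below are illustrative assumptions rather than the thesis implementation.

        # Toy gesture selection by intensity matching (illustrative values only).
        gesture_library = [
            {"name": "small_nod",  "gsi": 0.2, "duration": 0.6},
            {"name": "open_palm",  "gsi": 0.5, "duration": 0.9},
            {"name": "wide_sweep", "gsi": 0.9, "duration": 1.4},
        ]

        def select_gesture(psi, segment_duration, max_stretch=0.3):
            # Keep gestures that fit the segment once stretched by at most max_stretch,
            # then pick the one whose strength (GSI) best matches the speech strength (PSI).
            candidates = [g for g in gesture_library
                          if abs(g["duration"] - segment_duration) <= max_stretch * segment_duration]
            if not candidates:
                candidates = gesture_library
            return min(candidates, key=lambda g: abs(g["gsi"] - psi))

        # Tagged speech segments: (pitch strength indicator, duration in seconds).
        speech_segments = [(0.15, 0.7), (0.85, 1.3), (0.45, 0.8)]
        timeline = [select_gesture(psi, dur)["name"] for psi, dur in speech_segments]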

    Detecting a visual object in the presence of other objects: the flanker facilitation effect in contour integration

    When an observer views a complex visual scene and tries to identify an object, his or her visual system must decide which regions of the visual field correspond to the object of interest and which do not. One aspect of this process involves grouping local contrast information (e.g., orientation, position, and frequency) into a smooth contour object. This thesis investigated whether the presence of other flanking objects affected the contour integration of a central target contour. To test this, a set of Gaborized contour shapes was embedded in a randomised Gabor noise field. The detectability of the contours was altered by adjusting the alignment of the Gabor patches in the contour (orientation jitter) until a participant was unable to distinguish between a field with and without a target shape (2-AFC procedure). By varying the magnitude of this jitter, detection thresholds were determined for target contours under various experimental conditions. These thresholds were used to investigate whether contour integration is sensitive to shared shape information between objects across the visual field. The thesis found that the presence of flanking contours of a shape similar to the target facilitated the detection of a noisy target contour. The results suggest that this facilitation does not involve simple template matching or shape priming but is associated with the integration of shape-level information in the detection of the most likely smooth closed contour. The magnitude of this flanker facilitation effect was sensitive to a number of factors (e.g., numerosity, the relative position of the flankers, and perimeter complexity/compactness). The implication of these findings is that the processing of highly localised contrast and orientation information originating from a single object is subject to modulation from other sources of shape information across the whole of the visual field.
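
    The threshold procedure described above can be illustrated with a simulated adaptive staircase of the kind commonly used for such 2-AFC measurements: the orientation jitter is increased after two consecutive correct responses and decreased after an error, converging near a fixed performance level. The simulated observer model and step sizes are assumptions for illustration and do not reproduce the thesis's exact procedure.

        import math
        import random

        def p_correct(jitter_deg, threshold_deg=18.0, slope=0.25, guess=0.5):
            # Toy psychometric function: accuracy falls from ~1.0 toward chance (0.5) as jitter grows.
            return guess + (1 - guess) / (1 + math.exp(slope * (jitter_deg - threshold_deg)))

        def staircase_2down_1up(start_jitter=0.0, step=2.0, n_trials=120, seed=1):
            # 2-down/1-up staircase on orientation jitter; converges near the ~70.7%-correct level.
            random.seed(seed)
            jitter, streak, reversals, last_direction = start_jitter, 0, [], None
            for _ in range(n_trials):
                if random.random() < p_correct(jitter):          # simulated correct response
                    streak += 1
                    if streak < 2:
                        continue
                    jitter, streak, direction = jitter + step, 0, "harder"
                else:                                            # simulated error
                    jitter, streak, direction = max(0.0, jitter - step), 0, "easier"
                if last_direction is not None and direction != last_direction:
                    reversals.append(jitter)
                last_direction = direction
            tail = reversals[-6:] or [jitter]
            return sum(tail) / len(tail)                         # threshold estimate (degrees of jitter)

        estimated_threshold_deg = staircase_2down_1up()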