6 research outputs found

    A learning-based framework for depth ordering

    Depth ordering is instrumental for understanding the 3D geometry of an image. As humans, we are surprisingly good at depth ordering, even with abstract 2D line drawings. In this paper we propose a learning-based framework for discrete depth-ordering inference. Boundary and junction characteristics are important clues for this task, and we have developed new features based on these attributes. Although each feature individually can produce reasonable depth-ordering results, each still has limitations, and we achieve better performance by combining them. In practice, local depth-ordering inferences can be contradictory. Therefore, we propose a Markov Random Field model with terms that are more global than in previous work, and use graph optimization to encourage a globally consistent ordering. In addition, to produce better object segmentation for the task of depth ordering, we propose to explicitly enforce closed loops and long edges in occlusion boundary detection. We collect a new depth-order dataset for this problem, including more than a thousand human-labeled images with different everyday objects in various configurations. The proposed algorithm gives promising performance over conventional methods on both synthetic and real scenes.

    Occlusion-based depth ordering on monocular images with binary partition tree

    This paper proposes a system to relate objects in an image using occlusion cues and arrange them according to depth. The system does not rely on any a priori knowledge of the scene structure and focuses on detecting specific points, such as T-junctions, to infer the depth relationships between objects in the scene. The system makes extensive use of the Binary Partition Tree (BPT) as the segmentation tool, jointly with a new approach for T-junction estimation. Following a bottom-up strategy, regions (initially individual pixels) are iteratively merged until only one region is left. At each merging step, the system estimates the probability of observing a T-junction, a cue of occlusion, where three regions meet. When the BPT is constructed and the pruning is performed, this information is used for depth ordering. Although the proposed system relies on only one low-level depth cue and does not involve any learning process, it shows performance similar to the state of the art.
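    The bottom-up merging that builds the BPT can be sketched minimally. The toy below works on a 1-D strip of pixel intensities with a mean-intensity merge criterion; the actual system operates on image regions and additionally scores T-junction evidence at each merge, which is omitted here:

```python
def build_bpt(pixels):
    """Greedy bottom-up merging: repeatedly fuse the two most similar
    adjacent regions until one region remains. Returns the merge list
    as (left_id, right_id, new_id) tuples, i.e. the tree's internal nodes."""
    regions = [{"id": i, "sum": v, "n": 1} for i, v in enumerate(pixels)]
    next_id, merges = len(pixels), []
    while len(regions) > 1:
        # Most similar pair of adjacent regions by mean-intensity distance.
        best = min(range(len(regions) - 1),
                   key=lambda i: abs(regions[i]["sum"] / regions[i]["n"]
                                     - regions[i + 1]["sum"] / regions[i + 1]["n"]))
        a, b = regions[best], regions[best + 1]
        merges.append((a["id"], b["id"], next_id))
        regions[best:best + 2] = [{"id": next_id,
                                   "sum": a["sum"] + b["sum"],
                                   "n": a["n"] + b["n"]}]
        next_id += 1
    return merges
```

    The last merge produced is the root of the tree; pruning the tree at chosen nodes recovers a segmentation, which the paper then orders by accumulated T-junction evidence.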

    Automatic generation of relative depth maps using dynamic occlusions

    The shortage of 3D content is a major obstacle to the expansion of 3D television. Automatically generating 3D content from ordinary 2D content is one possible solution to this problem. Indeed, several depth cues are present in 2D images and videos, which makes automatic 2D-to-3D conversion possible. Among these cues, dynamic occlusions, which make it possible to assign a relative order to adjacent objects, have the advantage of being both reliable and present in all types of scenes. The 2D-to-3D conversion approach proposed in this thesis relies on this cue to generate relative depth maps. Analyzing forward and backward motion between two consecutive frames yields the dynamic occlusions. The motion is computed with a modified version of the EpicFlow optical flow proposed by Revaud et al. in 2015. The modifications to this optical flow make it forward-backward consistent without degrading its performance. Thanks to this new property, occlusions are computed more simply than in existing approaches in the literature. Indeed, unlike the approach of Salembier and Palou in 2014, the proposed occlusion computation does not require the costly operation of per-region motion estimation under a quadratic model. Once the occlusion relations are obtained, they allow the depth order of the objects in the image to be deduced. These objects are obtained by a segmentation that considers both color and motion. The proposed method automatically generates relative depth maps in the presence of scene object motion, and obtains results comparable to those of Salembier and Palou without requiring per-region motion estimation.

    Computational Models of Perceptual Organization and Bottom-up Attention in Visual and Audio-Visual Environments

    Figure Ground Organization (FGO) - inferring the spatial depth ordering of objects in a visual scene - involves determining which side of an occlusion boundary (OB) is figure (closer to the observer) and which is ground (further away from the observer). Attention, the process that governs how only some part of sensory information is selected for further analysis based on behavioral relevance, can be exogenous, driven by stimulus properties such as an abrupt sound or a bright flash, the processing of which is purely bottom-up; or endogenous (goal-driven or voluntary), where top-down factors such as familiarity and aesthetic quality determine attentional selection. The two main objectives of this thesis are developing computational models of: (i) FGO in visual environments; (ii) bottom-up attention in audio-visual environments. In the visual domain, we first identify Spectral Anisotropy (SA), characterized by an anisotropic distribution of oriented high-frequency spectral power on the figure side and a lack of it on the ground side, as a novel FGO cue that can determine Figure/Ground (FG) relations at an OB with an accuracy exceeding 60%. Next, we show that a non-linear Support Vector Machine classifier trained on the SA features achieves an accuracy close to 70% in determining FG relations, the highest for a stand-alone local cue. We then show that SA can be computed in a biologically plausible manner by pooling the Complex cell responses of different scales in a specific orientation, which also achieves an accuracy greater than or equal to 60% in determining FG relations. Next, we present a biologically motivated, feed-forward model of FGO incorporating convexity, surroundedness, and parallelism as global cues, and SA and T-junctions as local cues, where SA is computed in a biologically plausible manner. Each local cue, when added alone, gives a statistically significant improvement in the model's performance.
The model with both local cues achieves higher accuracy than models with either individual cue in determining FG relations, indicating that SA and T-junctions are not mutually contradictory. Compared to the model with no local cues, the model with both local cues achieves an improvement of at least 8.78% in determining FG relations at every border location of images in the BSDS dataset. In the audio-visual domain, we first build a simple computational model to explain how visual search can be aided by providing concurrent, co-spatial auditory cues. Our model shows that adding a co-spatial, concurrent auditory cue can enhance the saliency of a weakly visible target among prominent visual distractors, the behavioral effect of which could be a faster reaction time and/or better search accuracy. Lastly, a bottom-up, feed-forward, proto-object-based audiovisual saliency map (AVSM) for the analysis of dynamic natural scenes is presented. We demonstrate that the performance of the proto-object-based AVSM in detecting and localizing salient objects/events is in agreement with human judgment. In addition, we show that the AVSM, computed as a linear combination of visual and auditory feature conspicuity maps, captures a higher number of valid salient events than unisensory saliency maps.
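    The linear-combination step for the audiovisual saliency map can be sketched as follows; the normalization scheme and equal weights here are illustrative assumptions, not the thesis's exact values:

```python
import numpy as np

def combine_conspicuity(visual, auditory, w_v=0.5, w_a=0.5):
    """Peak-normalize each conspicuity map to [0, 1], then blend them
    into a single audiovisual saliency map by a weighted sum."""
    def norm(m):
        m = m - m.min()
        peak = m.max()
        return m / peak if peak > 0 else m
    return w_v * norm(visual) + w_a * norm(auditory)
```

    Normalizing before summing keeps one modality from dominating simply because its raw feature responses happen to have a larger dynamic range.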

    Monocular depth estimation in images and sequences using occlusion cues

    When humans observe a scene, they are able to perfectly distinguish the different parts composing it. Moreover, humans can easily reconstruct the spatial position of these parts and conceive a consistent structure. The mechanisms involved in visual perception have been studied since the beginning of neuroscience but, still today, not all the processes composing it are known. In usual situations, humans can make use of three different methods to estimate the scene structure. The first one, so-called divergence, makes use of both eyes. When objects lie in front of the observer at distances of up to a hundred meters, subtle differences in the image formed in each eye can be used to determine depth. When objects are not in the field of view of both eyes, other mechanisms must be used. In these cases, both visual cues and prior learned information can be used to determine depth. Even if these mechanisms are less accurate than divergence, humans can almost always infer the correct depth structure when using them. As examples of visual cues, occlusion, perspective, and object size provide a lot of information about the structure of the scene. A priori information depends on each observer, but it is normally used subconsciously by humans to detect commonly known regions such as the sky, the ground, or different types of objects. In recent years, as technology has become able to handle the processing burden of vision systems, much effort has been devoted to designing automated scene-interpretation systems. In this thesis we address the problem of depth estimation using only one point of view and only occlusion depth cues. The objective of the thesis is to detect occlusions present in the scene and combine them with a segmentation system so as to generate a relative depth-order map for the scene. We explore both static and dynamic situations, such as single images, frames within sequences, and full video sequences.
In the case where a full image sequence is available, a system exploiting motion information to recover depth structure is also designed. Results are promising and competitive with respect to the state-of-the-art literature, but there is still much room for improvement when compared to human depth perception performance.
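    The final step the abstract describes, combining pairwise occlusion relations with a segmentation to produce a relative depth-order map, can be sketched as follows. The label map, relations, and the crude ranking rule (counting how many regions occlude each region, rather than a full topological ordering) are illustrative assumptions:

```python
import numpy as np

def relative_depth_map(labels, occludes):
    """labels: integer segmentation label map (H, W).
    occludes: list of (front_region, back_region) pairs.
    Returns a per-pixel relative depth rank (smaller = closer)."""
    regions = np.unique(labels)
    # Crude rank: number of occlusion relations in which a region is behind.
    occluded_by = {r: 0 for r in regions}
    for front, back in occludes:
        occluded_by[back] += 1
    depth = np.zeros_like(labels, dtype=float)
    for r in regions:
        depth[labels == r] = occluded_by[r]
    return depth
```

    Painting ranks back onto the segmentation is what turns a set of sparse occlusion relations into the dense relative depth map the thesis targets.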