
    ImageNet Large Scale Visual Recognition Challenge

    The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to the present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground-truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements. Comment: 43 pages, 16 figures. v3 includes additional comparisons with PASCAL VOC (per-category comparisons in Table 3, distribution of localization difficulty in Fig. 16), a list of queries used for obtaining object detection images (Appendix C), and some additional references.
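    ILSVRC image classification is commonly scored with a top-5 error metric: a prediction counts as correct if the ground-truth label appears among a model's five highest-scoring classes. The snippet below is only a minimal illustration of that metric in NumPy (the array shapes and toy data are assumptions, not the official evaluation code):

        import numpy as np

        def top5_error(scores, labels):
            """Fraction of images whose ground-truth label is absent from the
            five highest-scoring predicted classes.

            scores: (N, C) array of class scores, one row per image.
            labels: (N,) array of ground-truth class indices.
            """
            top5 = np.argsort(scores, axis=1)[:, -5:]       # indices of the 5 best classes per image
            hits = (top5 == labels[:, None]).any(axis=1)    # True where the ground truth is among them
            return 1.0 - hits.mean()

        # toy usage: 4 images, 10 classes, random scores
        rng = np.random.default_rng(0)
        print(top5_error(rng.normal(size=(4, 10)), np.array([3, 1, 7, 0])))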

    I am here - are you there? Sense of presence and implications for virtual world design

    We use the language of presence and place when we interact online: in our instant-messaging windows we often post, "Are you there?" Research indicates the importance of the sense of presence for computer-supported collaborative virtual learning. To realize the potential of virtual worlds such as Second Life, which may have advantages over conventional text-based environments, we need an understanding of design and of how the sense of presence emerges. A construct was created for the sense of presence as a collaborative, action-based process (Spagnolli, Varotto, & Mantovani, 2003) with four dimensions (sense of place, social presence, individual agency, and mediated collaborative actions). Nine design principles were mapped against the four dimensions. The guiding question for the study's exploration of the sense of presence was: In the virtual world Second Life, what is the effect on the sense of presence in collaborative learning spaces designed according to the proposed sense of presence construct, using two of the nine design principles, wayfinding and annotation? Another question of interest was: What are the relationships, if any, among the four dimensions of presence? The research utilized both quantitative and qualitative measures. Twenty learners recruited from the Graduate School of Education and Psychology at Pepperdine University carried out three assigned collaborative activities in Second Life under design conditions foregrounding each of the two design principles and a combination of the two. Analyses of surveys, Second Life interactions, interviews, and a focus group were conducted to investigate how the various designed learning environments in the virtual world contributed to the sense of presence and to learners' ability to carry out collaborative learning. The major research findings were: (a) the construct appears robust, and future research applying it to other virtual worlds may be fruitful; (b) the experience of wayfinding (finding a path through a virtual space) resulted overall in an observed pattern of a slightly stronger sense of place; (c) the experience of annotation (building) resulted overall in an observed pattern of a slightly stronger sense of agency; and (d) there is a positive association between sense of place and sense of agency.

    3D visualization of cadastre: assessing the suitability of visual variables and enhancement techniques in the 3D model of condominium property units

    The 3D visualization of cadastral data has been exploited in many studies because it offers new possibilities for examining situations of vertical superposition of properties. Researchers active in this field consider that 3D visualization could give users a more intuitive understanding of situations where properties overlap, as well as a greater capacity to show potential problems of overlapping property units with less ambiguity. However, 3D visualization is an approach that brings many challenges compared with 2D visualization. Previous research in 3D cadastre that relies on 3D visualization has investigated very little the impact of the choice of visual variables (e.g. colour, style) on decision-making. With the aim of improving the 3D visualization of cadastral data, this doctoral thesis examines the suitability of the choice of visual variables and associated enhancement techniques in order to produce an optimal 3D condominium model for specific visualization tasks. The targeted tasks are those dedicated to understanding condominium property boundaries in 3D space; in this sense, mainly notarial tasks were targeted. In addition, the thesis highlights the differences in the impact of visual variables between 2D and 3D visualization. The thesis first identifies a theoretical framework for the interpretation of visual variables in the context of 3D visualization and cadastral data, based on a literature review. Second, experiments were carried out to test the performance of visual variables (e.g. colour, value, texture) and enhancement techniques (transparency, annotation, displacement). Three distinct approaches were used: 1) direct discussion with people working in geomatics, 2) face-to-face interviews with notaries, and 3) an online questionnaire with targeted groups. Usability, measured in terms of effectiveness, efficiency and satisfaction, served to compare the experiments. The main results of this research are: 1) a list of notarial visual tasks useful for delimiting property units in the context of 3D condominium visualization; 2) recommendations on the suitability of eight visual variables and three enhancement techniques for optimizing a number of notarial tasks; 3) a comparative analysis of the performance of these variables between 2D and 3D visualization.

    3D visualization is widely used in GIS (geographic information system) and CAD (computer-aided design) applications. It has also been introduced in cadastre studies to better communicate overlaps to the viewer, where property units vertically stretch over or cover one part of the land parcel. Researchers believe that 3D visualization could provide viewers with a more intuitive perception and that it has the capability to demonstrate overlapping property units in condominiums unambiguously. However, 3D visualization has many challenges compared with 2D visualization. Many cadastre researchers adopted 3D visualization without thoroughly investigating the potential users, the visual tasks for decision-making, and the appropriateness of their representation design. Neither designers nor users may be aware of the risk of producing an inadequate 3D visualization, especially in an era when 3D visualization is relatively novel in the cadastre domain. With the general aim of improving the 3D visualization of cadastre data, this dissertation addresses the design of the 3D cadastre model from a graphic semiotics viewpoint, including visual variables and enhancement techniques. The research questions are, firstly, what is the suitability of the visual variables and enhancement techniques in the 3D cadastre model to support the intended users' decision-making goal of delimiting condominium property units, and secondly, what are the perceptual properties of visual variables in 3D visualization compared with 2D visualization? The dissertation first identifies the theoretical framework for the interpretation of visual variables in 3D visualization, as well as cadastre-related knowledge, through a literature review. Then we carry out a preliminary evaluation of the feasibility of visual variables and enhancement techniques in the form of an expert-group review. Building on the results of this preliminary evaluation, the research follows a hypothetico-deductive approach, establishing a list of hypotheses regarding the suitability of visual variables and enhancement techniques in a cartographic representation of condominium property units for 3D visualization, to be validated by empirical tests. The evaluation is based on a usability specification containing three measurements: effectiveness, efficiency, and preference. Several empirical tests are conducted with cadastral users in the form of face-to-face interviews and online questionnaires, followed by statistical analysis. Size, shape, brightness, saturation, hue, orientation, texture, and transparency are the most discussed and used visual variables in existing cartographic research and implementations; these eight visual variables were therefore included in the tests. Their perceptual properties, exhibited in the empirical tests with concrete 3D models in this work, are compared with those in 2D visualization, derived from a literature-based synthesis. Three enhancement techniques (labeling, 3D explosion, and highlighting) are tested as well. There are three main outcomes of this work. First, we established a list of visual tasks adapted to notaries for delimiting property units in the context of 3D visualization of condominium cadastres. Second, we describe the suitability of the eight visual variables (size, shape, brightness, saturation, hue, orientation, texture, and transparency) of the property units and the three enhancement techniques (labeling, 3D explosion and highlighting) in the context of 3D visualization of condominium property units, based on the usability specification for the delimitation tasks. For example, brightness shows good performance only in helping users distinguish private and common parts in the context of 3D visualization of property units in condominiums, while colour hue and saturation are effective and preferred. The third outcome is a statement of the differences in the perceptual properties of visual variables between 3D and 2D visualization. For example, orientation is associative and selective in 2D according to Bertin's (1983) definition, yet it does not exhibit these properties in 3D visualization. In addition, 3D visualization affects the performance of brightness, making it marginally dissociative and selective.
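    Since the comparison above rests on a usability specification with three measures (effectiveness, efficiency, preference), a per-condition summary of trial data could be computed roughly as follows. This is only an illustrative sketch; the column names, rating scale and aggregation choices are assumptions, not the analysis performed in the thesis:

        import pandas as pd

        # Hypothetical trial log: one row per participant x task x visual-variable condition.
        trials = pd.DataFrame({
            "condition":  ["hue", "hue", "texture", "texture", "transparency", "transparency"],
            "correct":    [True,  True,  False,     True,      True,           False],  # task solved?
            "time_s":     [34.0,  41.5,  77.2,      59.8,      45.1,           66.3],   # completion time
            "preference": [4,     5,     2,         3,         4,              3],      # 1-5 rating
        })

        # Effectiveness = share of correct answers, efficiency = mean completion time,
        # preference = mean rating, each aggregated per visual-variable condition.
        summary = trials.groupby("condition").agg(
            effectiveness=("correct", "mean"),
            efficiency_s=("time_s", "mean"),
            preference=("preference", "mean"),
        )
        print(summary)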

    Articulated people detection and pose estimation in challenging real world environments

    In this thesis we are interested in the problems of articulated people detection and pose estimation, which are key ingredients towards understanding visual scenes containing people. First, we investigate how statistical 3D human shape models from computer graphics can be leveraged to ease training data generation. Second, we develop expressive models for 2D single- and multi-person pose estimation. Third, we introduce a novel human pose estimation benchmark that makes a significant advance in terms of diversity and difficulty. Thorough experimental evaluation on standard benchmarks demonstrates significant improvements due to the proposed data augmentation techniques and novel body models, while detailed performance analysis of competing approaches on our novel benchmark allows us to identify the most promising directions for improvement.

    In this thesis we investigate the problem of articulated people detection and pose estimation as key components of understanding visual scenes containing people. Although there are extensive efforts to tackle these problems, we have identified three promising directions that, in our view, have not received sufficient attention so far. First, we investigate how statistical 3D human shape models originating from computer graphics can be used effectively to ease the generation of training data. We propose a set of automatic data-generation techniques that allow relevant variations to be represented directly in the training data. By drawing samples from the underlying distribution of human shapes and from a large dataset of human poses, we generate a new task-relevant set with controllable variation in shape and pose. Moreover, we improve the latest 3D human shape model itself by rebuilding it from a large commercially available dataset of 3D bodies. Second, we develop expressive spatial and appearance models for 2D single- and multi-person pose estimation. We propose an expressive single-person model that incorporates higher-order part dependencies while remaining efficient. We strengthen this model with several kinds of strong appearance representations in order to substantially improve the body part hypotheses. Finally, we propose an expressive model for the joint pose estimation of multiple people. To this end we develop strong deep-learning-based body part detectors and an expressive fully connected spatial model. The proposed approach treats multi-person pose estimation as a problem of jointly partitioning and labeling a set of body part hypotheses: it infers the number of people in a scene, identifies occluded body parts, and unambiguously distinguishes the body parts of people who are close to one another. Third, we carry out a thorough evaluation and performance analysis of leading methods for human pose estimation and activity recognition. To this end we introduce a new benchmark that brings a significant advance in diversity and difficulty compared with previous datasets and contains over 40,000 annotated body poses and more than 1.5 million frames. In addition, we provide a rich set of annotations that are used for a detailed analysis of competing approaches, giving us insights into the successes and failures of these methods. In summary, this thesis presents a new approach to articulated people detection and pose estimation. A thorough experimental evaluation on standard benchmark datasets shows significant improvements due to the proposed data augmentation techniques and new body models, while a detailed performance analysis of competing approaches on our newly introduced large benchmark allows us to identify the most promising directions for improvement.
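    Benchmarks of this kind are often scored with a PCK-style keypoint metric, which counts a predicted joint as correct when it lies within some fraction of a per-pose reference length (e.g. head or torso size) of the ground truth. The sketch below illustrates such a metric under that assumption; it is not the evaluation protocol of the thesis itself:

        import numpy as np

        def pck(pred, gt, ref_len, alpha=0.5):
            """PCK-style score: fraction of keypoints whose prediction falls within
            alpha * ref_len of the ground-truth location.

            pred, gt: (N, K, 2) arrays of keypoint coordinates (N poses, K joints).
            ref_len:  (N,) per-pose reference lengths (e.g. head-segment or torso size).
            """
            dist = np.linalg.norm(pred - gt, axis=-1)     # (N, K) joint-wise errors
            thresh = alpha * ref_len[:, None]             # (N, 1) per-pose threshold
            return (dist <= thresh).mean()

        # toy usage: 2 poses with 3 joints each, reference length 6 px -> threshold 3 px
        gt = np.zeros((2, 3, 2))
        pred = gt + np.array([[[1, 0], [0, 4], [0, 0]],
                              [[0, 0], [2, 0], [10, 0]]], dtype=float)
        print(pck(pred, gt, ref_len=np.array([6.0, 6.0])))   # 4 of 6 joints within 3 px -> 0.666...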

    Semantic annotation services for 3D models of cultural heritage artefacts


    Using 3D Visual Data to Build a Semantic Map for Autonomous Localization

    Environment maps are essential for robots and intelligent gadgets to carry out tasks autonomously. Traditional maps built from visual sensors are either metric or topological. These maps are navigation-oriented and not adequate for service robots or intelligent gadgets that interact with or serve human users, who normally rely on conceptual knowledge or the semantic content of the environment. Semantic maps therefore become necessary for building an effective human-robot interface. Although researchers from both the robotics and computer vision domains have designed some promising systems, mapping with high accuracy and exploiting semantic information for localization remain challenging. This thesis describes several novel methodologies to address these problems. RGB-D visual data is used as system input, and deep learning techniques are combined with SLAM algorithms to achieve better system performance. Firstly, a traditional feature-based semantic mapping approach is presented. A novel matching-error rejection algorithm is proposed to increase both loop-closure detection and semantic information extraction accuracy. Evaluation experiments on a public benchmark dataset are carried out to analyze the system performance. Secondly, a visual odometry system based on a recurrent convolutional neural network is presented for more accurate and robust camera motion estimation. The proposed network uses an unsupervised end-to-end framework. The output transformation matrices are on an absolute scale, i.e. the true scale in the real world; no data labeling or matrix post-processing is required. Experiments show the proposed system outperforms other state-of-the-art VO systems. Finally, a novel topological localization approach based on the pre-built semantic maps is presented. Two streams of convolutional neural networks are applied to infer locations, and the additional semantic information in the maps is in turn used to further verify the localization results. Experiments show the system is robust to changes in viewpoint, lighting conditions and objects.
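    A visual odometry network of the kind described outputs a relative transformation between consecutive frames; the global trajectory is then obtained by chaining these transforms. The sketch below shows that accumulation step with plain 4x4 homogeneous matrices in NumPy (a generic illustration, not the thesis's implementation):

        import numpy as np

        def accumulate_trajectory(relative_poses):
            """Chain per-frame relative transforms T_{k-1,k} (4x4 homogeneous matrices)
            into absolute poses expressed in the first camera's frame."""
            pose = np.eye(4)
            trajectory = [pose]
            for T in relative_poses:
                pose = pose @ T            # compose with the next relative motion
                trajectory.append(pose)
            return trajectory

        # toy usage: two identical forward steps of 0.5 m along the camera z-axis
        step = np.eye(4)
        step[2, 3] = 0.5
        traj = accumulate_trajectory([step, step])
        print(traj[-1][:3, 3])             # final position: [0. 0. 1.]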

    Architectural survey between typological identification and semantic structuring. The Certosa di San Lorenzo at Padula in digital representation for Cultural Heritage.

    The digitization of built-heritage records has given rise to a new challenge of the contemporary era: the creation of intelligent, efficient and scalable systems for indexing, archiving, searching and managing digital collections and their documentation. Structuring information, spatialized according to semantic attributes, effectively answers the documentation and interoperability needs of interdisciplinary studies, strengthening the sharing of resources and the collaborative creation of knowledge. The analysis of the approaches formalized so far shows several open questions and points for reflection connected with the theoretical implications of a semantics-driven architectural segmentation, with the expressive and applicative potential of a more immediate interaction with the digital representation, and with the ways of accessing and retrieving spatialized documentation. A first aim of the research is the definition of a semantic annotation methodology that resolves ambiguity and encodes the uncertainty of labelling, refining the quality of the information. A second aim is the investigation of approaches that rationalize and optimize the annotation process, improving interoperability and interpretability by Artificial Intelligences.
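    One way to make "encoding the uncertainty of labelling" concrete is to attach an explicit confidence value, and optionally competing labels, to each annotated element. The record below is a purely hypothetical sketch; the field names and the 0-1 confidence scale are assumptions rather than the methodology proposed in the thesis:

        from dataclasses import dataclass, field

        @dataclass
        class SemanticAnnotation:
            """A label attached to one segment of a digital architectural model,
            carrying an explicit degree of confidence instead of a hard assignment."""
            segment_id: str                     # identifier of the annotated mesh/point-cloud segment
            label: str                          # preferred semantic class, e.g. "column"
            confidence: float                   # annotator's confidence in [0, 1]
            alternatives: dict = field(default_factory=dict)  # competing labels with confidences

            def most_likely(self):
                """Return the label with the highest confidence, counting the primary one."""
                candidates = {self.label: self.confidence, **self.alternatives}
                return max(candidates, key=candidates.get)

        # toy usage: an uncertain annotation where a competing label narrowly wins
        a = SemanticAnnotation("cell_07_wall_03", "column", 0.45, {"pilaster": 0.55})
        print(a.most_likely())                  # -> "pilaster"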

    e-Participation: Promoting Dialogue and Deliberation between Institutions and Civil Society

    Information and Communication Technologies (ICT) are becoming increasingly pervasive in people's lives, both for individual and collective use. It therefore becomes tempting to develop electronic tools that can constitute alternatives for enhancing citizenship, in particular tools that can be used in the context of societal debates on public policies. This report explores the conditions for deploying electronically based public participation methodologies and online, ICT-based participatory processes within public policy processes. It reflects on the challenges, promises and motivations of using ICT to promote dialogue and deliberation between institutions and civil society. It consists in part of a review of the techniques and tools commonly used in electronic public participation processes, referring to case studies where these techniques and tools were employed. Special attention was given to cases where the outcome of the public participation process supported a decision process. The conditions for deploying e-participation processes in public policy formulation were framed within the concept of quality, specifically the concept of "fitness for purpose". An electronic tool was designed and implemented, featuring not only dialogue components but also collaborative ones, in response to some of the identified challenges. Based on the review of e-participation tools and the preliminary usage of the tool developed, a protocol for quality assurance of e-participation tools is offered.

    From pixels to gestures: learning visual representations for human analysis in color and depth data sequences

    [cat] The visual analysis of people from images is a very important research topic, given its relevance to a large number of applications in computer vision, such as pedestrian detection, monitoring and surveillance, human-computer interaction, e-health or content-based image retrieval systems, among others. In this thesis we want to learn different visual representations of the human body that are useful for the visual analysis of people in images and videos. To that end, we analyze different image modalities, namely RGB colour images and depth images, and address the problem at different levels of abstraction, from pixels to gestures: human segmentation, human pose estimation and gesture recognition. First, we show how binary (object vs. background) segmentation of the human body in image sequences helps to remove noise belonging to the background of the scene. The presented method, based on Graph-cuts optimization, imposes spatio-temporal consistency on the segmentation masks obtained in consecutive frames. Second, we present a methodological framework for multi-class segmentation, with which we can obtain a more detailed description of the human body: instead of a simple binary representation separating the human body from the background, we can obtain more detailed segmentation masks that separate and categorize the different body parts. At a higher level of abstraction, our goal is to obtain simpler yet sufficiently descriptive representations of the human body. Human pose estimation methods are often based on skeletal models of the human body, formed by segments (or rectangles) that represent the body limbs, connected to one another following the kinematic constraints of the human body. In practice, these skeletal models must satisfy certain constraints so that inference methods can find the optimal solution efficiently, but at the same time those constraints strongly limit the expressiveness the models are able to capture. To address this problem, we propose a top-down approach to predict the position of the body parts of the skeletal model, introducing a mid-level part representation based on Poselets. Finally, we propose a methodological framework for gesture recognition based on bags of visual words. We exploit the advantages of RGB and depth images by combining modality-specific visual vocabularies using late fusion. We propose a new rotation-invariant descriptor for depth images that improves on the state of the art, and we use spatio-temporal pyramids to capture some of the spatial and temporal structure of the gestures. Additionally, we present a probabilistic reformulation of the Dynamic Time Warping method for gesture recognition in image sequences. More specifically, we model gestures with a Gaussian probabilistic model that implicitly encodes possible deformations in both the spatial and temporal domains.

    [eng] The visual analysis of humans from images is an important topic of interest due to its relevance to many computer vision applications like pedestrian detection, monitoring and surveillance, human-computer interaction, e-health or content-based image retrieval, among others. In this dissertation we are interested in learning different visual representations of the human body that are helpful for the visual analysis of humans in images and video sequences. To that end, we analyze both RGB and depth image modalities and address the problem from three different research lines, at different levels of abstraction, from pixels to gestures: human segmentation, human pose estimation and gesture recognition. First, we show how binary segmentation (object vs. background) of the human body in image sequences is helpful to remove all the background clutter present in the scene. The presented method, based on Graph-cuts optimization, enforces spatio-temporal consistency of the produced segmentation masks among consecutive frames. Secondly, we present a framework for multi-label segmentation for obtaining much more detailed segmentation masks: instead of just obtaining a binary representation separating the human body from the background, finer segmentation masks can be obtained separating the different body parts. At a higher level of abstraction, we aim for a simpler yet descriptive representation of the human body. Human pose estimation methods usually rely on skeletal models of the human body, formed by segments (or rectangles) that represent the body limbs, appropriately connected following the kinematic constraints of the human body. In practice, such skeletal models must fulfill some constraints in order to allow for efficient inference, while actually limiting the expressiveness of the model. In order to cope with this, we introduce a top-down approach for predicting the position of the body parts in the model, using a mid-level part representation based on Poselets. Finally, we propose a framework for gesture recognition based on the bag of visual words framework. We leverage the benefits of RGB and depth image modalities by combining modality-specific visual vocabularies in a late fusion fashion. A new rotation-invariant depth descriptor is presented, yielding better results than other state-of-the-art descriptors. Moreover, spatio-temporal pyramids are used to encode rough spatial and temporal structure. In addition, we present a probabilistic reformulation of Dynamic Time Warping for gesture segmentation in video sequences. A Gaussian-based probabilistic model of a gesture is learnt, implicitly encoding possible deformations in both the spatial and temporal domains.
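    As background for the probabilistic reformulation mentioned above, classical Dynamic Time Warping aligns two sequences with the recurrence D(i, j) = d(x_i, y_j) + min(D(i-1, j), D(i, j-1), D(i-1, j-1)). The sketch below implements this generic baseline (not the Gaussian-based variant proposed in the dissertation):

        import numpy as np

        def dtw_distance(x, y):
            """Classical Dynamic Time Warping cost between two feature sequences
            x (n, d) and y (m, d), using Euclidean frame-to-frame distances."""
            n, m = len(x), len(y)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = np.linalg.norm(x[i - 1] - y[j - 1])
                    # best alignment ending at (i, j): insertion, deletion or match
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[n, m]

        # toy usage: a sequence aligned against a time-stretched copy of itself
        a = np.array([[0.0], [1.0], [2.0], [1.0]])
        b = np.array([[0.0], [0.0], [1.0], [2.0], [2.0], [1.0]])
        print(dtw_distance(a, b))   # 0.0: perfect alignment despite different lengths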