101 research outputs found

    Um arcabouço multimodal para geocodificação de objetos digitais (A multimodal framework for geocoding digital objects)

    Advisor: Ricardo da Silva Torres. Doctoral thesis, Universidade Estadual de Campinas, Instituto de Computação.
    Abstract: Geographical information is often enclosed in digital objects (such as documents, images, and videos), and its use to support the implementation of different services is of great interest. For example, map-based browsing services and geographic searches may take advantage of the geographic locations associated with digital objects. The implementation of such services, however, demands geocoded data collections. This work investigates the combination of textual and visual content to geocode digital objects and proposes a rank-aggregation framework for multimodal geocoding. Textual and visual information associated with videos and images is used to define ranked lists. These lists are then combined, and the resulting ranked list is used to define appropriate locations. An architecture that implements the proposed framework is designed so that specific modules for each modality (e.g., textual and visual) can be developed and evolved independently. Another component is a data-fusion module responsible for seamlessly combining the ranked lists defined for each modality. A further contribution of this work is a new effectiveness evaluation measure named Weighted Average Score (WAS). The proposed measure is based on distance scores that are combined to assess how effective a designed or tested approach is, considering its overall geocoding results for a given test dataset. We validate the proposed framework in two contexts: the MediaEval 2012 Placing Task, whose objective is to automatically assign geographical coordinates to videos; and the task of geocoding photos of buildings from Virginia Tech (VT), USA. In the context of the Placing Task, the obtained results show how our multimodal approach improves geocoding compared with methods that rely on a single modality (either textual or visual descriptors). We also show that the proposed multimodal approach yields results comparable to the best submissions to the Placing Task in 2012 that likewise used no additional information beyond the available development/training data. In the context of geocoding VT building photos, the performed experiments demonstrate that some of the evaluated local descriptors yield effective results. The descriptor selection criteria and their combination improved the results when the knowledge base used had the same characteristics as the test set. PhD in Computer Science (Doutorado em Ciência da Computação).
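    The rank-aggregation step described above can be illustrated with a minimal sketch. The fusion rule shown here (reciprocal-rank fusion) and the candidate location names are illustrative assumptions, not the thesis's actual fusion module:

```python
def fuse_ranked_lists(ranked_lists, k=60):
    """Aggregate several ranked lists of candidate locations into one.

    ranked_lists: list of lists, each ordered best-first.
    Returns candidates ordered by fused score (best first).
    Reciprocal-rank fusion is an assumed stand-in for the thesis's
    data-fusion module.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, candidate in enumerate(ranking):
            # Earlier ranks contribute larger reciprocal-rank scores.
            scores[candidate] = scores.get(candidate, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked lists from a textual and a visual geocoder:
textual = ["Blacksburg", "Roanoke", "Richmond"]
visual = ["Blacksburg", "Norfolk", "Roanoke"]
fused = fuse_ranked_lists([textual, visual])  # "Blacksburg" ranks first
```

A candidate that appears near the top of both modality lists accumulates the largest fused score, which is the intuition behind combining modalities rather than relying on either alone.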

    Georeferencing text using social media


    GEO-REFERENCED VIDEO RETRIEVAL: TEXT ANNOTATION AND SIMILARITY SEARCH

    Ph.D. (Doctor of Philosophy)

    Wayang Authoring: A Web-based Authoring Tool for Visual Storytelling for Children

    This research focuses on the development of the Wayang Authoring tool, which aims to assist children in creating and performing stories, developing an appreciation for cultural artifacts, and enhancing intercultural empathy while building a young storyteller community in a virtual world. The study seeks a framework for the interaction design of an authoring medium appropriate to supporting children's narrative development. The concept of the tool is based on the narrative elements of the ancient Indonesian art form wayang, a traditional two-dimensional shadow puppet theater. To understand the users' requirements and the cultural dimension, children and professional story performers who use wayang were involved in the design process. To evaluate the tool, several workshops were conducted with children from different cultural backgrounds as well as with their teachers. Wayang Authoring is composed of three elements: the imagination-building element, the creative-acting element, and the social-interaction element. Children take existing materials as inspiration, imagine what they themselves want to tell, create a story based on their own ideas, play with their creations, share their stories and creations with others, and reflect on their experiences at the end. This virtual creative production tool is expected to provide a space for young people to change their role from simple users to (co-)creators in virtual and narrative worlds. The core contributions are in the field of web technology for storytelling. Web-based authoring media enable children to involve themselves in the process of developing stories. When they are connecting stories, they are connected with and immersed among other children as well. They act and play, alone or with others, within the stories in order to experience the narratives. They train the skills needed to interact, to share their ideas, and to collaborate constructively. This makes it possible for them to participate in today's media-driven culture. This research found that a better understanding of how stories are crafted and brought to life in a performance tradition leads to a better interaction design for authoring media. The handling of cultural artifacts supports the ability to understand different cultural codes and to pursue the learning process surrounding the original culture behind these artifacts.

    Suchbasierte automatische Bildannotation anhand geokodierter Community-Fotos (Search-based automatic image annotation using geocoded community photos)

    In the Web 2.0 era, platforms for sharing and collaboratively annotating images with keywords, called tags, became very popular. Tags are a powerful means for organizing and retrieving photos, but manual tagging is time-consuming. The sheer amount of user-tagged photos available on the Web has encouraged researchers to explore new techniques for automatic image annotation. The idea is to annotate an unlabeled image by propagating the labels of community photos that are visually similar to it. Most recently, an ever-increasing number of community photos are also associated with location information, i.e., geotagged. In this thesis, we exploit this location context and propose an approach for automatically annotating geotagged photos. Our objective is to address the main limitations of state-of-the-art approaches in terms of the quality of the produced tags and the speed of the complete annotation process. To achieve these goals, we first deal with the problem of collecting images and their associated metadata from online repositories. Accordingly, we introduce a data-crawling strategy that takes advantage of location information and the social relationships among the contributors of the photos. To improve the quality of the collected user tags, we present a method for resolving their ambiguity based on tag-relatedness information. In this respect, we propose an approach for representing tags as probability distributions based on the Laplacian score feature-selection algorithm. Furthermore, we propose a new metric for calculating the distance between tag probability distributions by extending the Jensen-Shannon divergence to account for statistical fluctuations. To efficiently identify the visual neighbors, the thesis introduces two extensions to the state-of-the-art image-matching algorithm known as Speeded Up Robust Features (SURF). To speed up the matching, we present a solution that reduces the number of compared SURF descriptors using classification techniques, while the accuracy of SURF is improved through an efficient method for iterative image matching. Furthermore, we propose a statistical model for ranking the mined annotations according to their relevance to the target image. This is achieved by combining multimodal information in a statistical framework based on Bayes' rule. Finally, the effectiveness of each of the mentioned contributions, as well as that of the complete automatic annotation process, is evaluated experimentally, and the performance of the proposed annotation approach as a whole is demonstrated in comprehensive experimental studies.
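    The tag-distance idea above can be sketched with the standard Jensen-Shannon divergence between two tag probability distributions. The thesis's extension for statistical fluctuations is not reproduced, and the tag co-occurrence distributions below are invented for illustration:

```python
import math

def jensen_shannon_divergence(p, q):
    """Standard Jensen-Shannon divergence between two discrete
    probability distributions given as dicts {context word: probability}.
    With log base 2 the result lies in [0, 1]; 0 means identical
    distributions."""
    keys = set(p) | set(q)
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms.
        return sum(a.get(k, 0.0) * math.log2(a.get(k, 0.0) / b[k])
                   for k in keys if a.get(k, 0.0) > 0.0)
    # Mixture distribution M = (P + Q) / 2.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical co-occurrence distributions for two related tags:
beach = {"sea": 0.6, "sand": 0.3, "city": 0.1}
coast = {"sea": 0.5, "sand": 0.4, "city": 0.1}
d = jensen_shannon_divergence(beach, coast)  # small: the tags are related
```

Two tags whose distributions nearly coincide, as here, get a divergence close to zero, which is the signal the thesis uses as tag-relatedness information for disambiguation.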

    Automated Building Information Extraction and Evaluation from High-resolution Remotely Sensed Data

    The two-dimensional (2D) footprints and three-dimensional (3D) structures of buildings are of great importance to city planning, natural disaster management, and virtual environmental simulation. As traditional manual methodologies for collecting 2D and 3D building information are often both time-consuming and costly, automated methods are required for efficient large-area mapping. Extracting building information from remotely sensed data is challenging, given the complex nature of urban environments and their intricate building structures. Most 2D evaluation methods focus on classification accuracy, while other dimensions of extraction accuracy are ignored. To assess 2D building extraction methods, a multi-criteria evaluation system has been designed, consisting of matched rate, shape similarity, and positional accuracy. Experimentation with four methods demonstrates that the proposed multi-criteria system is more comprehensive and effective than traditional accuracy assessment metrics. Building height is critical for 3D structure extraction. As data sources for height estimation, digital surface models (DSMs) derived from stereo images using existing software typically provide low-accuracy results in terms of rooftop elevations. Therefore, a new image matching method is proposed that adds building footprint maps as constraints. Validation demonstrates that the proposed matching method can estimate building rooftop elevation with one-third of the error encountered when using current commercial software. With an ideal input DSM, building height can be estimated from the elevation contrast inside and outside a building footprint. However, occlusions and shadows cause indistinct building edges in DSMs generated from stereo images. Therefore, a “building-ground elevation difference model” (EDM) has been designed, which describes the trend of the elevation difference between a building and its neighbours in order to find elevation values at bare ground. Experiments using this novel approach report estimated building heights with a 1.5 m residual, outperforming conventional filtering methods. Finally, 3D buildings are digitally reconstructed and evaluated. Current 3D evaluation methods do not capture the differences between 2D and 3D assessment well, and wall accuracy is traditionally ignored. To address these problems, this thesis designs an evaluation system with three components: volume, surface, and point. The resulting multi-criteria system provides an improved evaluation method for building reconstruction.
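    The footprint-constrained height-estimation idea can be illustrated with a deliberately simplified sketch: rooftop elevation as the median DSM value inside the footprint, ground elevation as the median outside. The EDM itself, which handles occlusions and shadows by modelling the elevation-difference trend, is not reproduced, and the grid values are invented:

```python
from statistics import median

def building_height(dsm, footprint):
    """Estimate a building's height from a DSM grid and a footprint mask.

    dsm: 2-D list of elevations (metres); footprint: 2-D list of 0/1 flags.
    Simplified assumption: the median elevation inside the footprint is the
    rooftop, the median outside is bare ground; real DSMs need the EDM-style
    handling of occluded and shadowed edges.
    """
    rows, cols = len(dsm), len(dsm[0])
    inside = [dsm[r][c] for r in range(rows) for c in range(cols)
              if footprint[r][c]]
    outside = [dsm[r][c] for r in range(rows) for c in range(cols)
               if not footprint[r][c]]
    return median(inside) - median(outside)

# Toy 4x4 DSM: a 12 m rooftop sitting on 2 m ground (height = 10 m).
dsm = [[2, 2, 2, 2],
       [2, 12, 12, 2],
       [2, 12, 12, 2],
       [2, 2, 2, 2]]
footprint = [[0, 0, 0, 0],
             [0, 1, 1, 0],
             [0, 1, 1, 0],
             [0, 0, 0, 0]]
h = building_height(dsm, footprint)  # → 10
```

Medians rather than means are used here so that a few noisy or occluded cells do not skew either elevation estimate, mirroring the robustness concern the abstract raises.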

    Irish Machine Vision and Image Processing Conference, Proceedings


    Fine-grained Incident Video Retrieval with Video Similarity Learning.

    PhD Thesis. In this thesis, we address the problem of Fine-grained Incident Video Retrieval (FIVR) using video similarity learning methods. FIVR is a video retrieval task that aims to retrieve all videos depicting the same incident as a given query video; related video retrieval tasks adopt either very narrow or very broad scopes, considering only near-duplicate or same-event videos. To formulate the case of same-incident videos, we define three video associations taking into account the spatio-temporal spans captured by video pairs. To cover the benchmarking needs of FIVR, we construct a large-scale dataset, called FIVR-200K, consisting of 225,960 YouTube videos from major news events crawled from Wikipedia. The dataset contains four annotation labels according to the FIVR definitions; hence, it can simulate several retrieval scenarios with the same video corpus. To address FIVR, we propose two video-level approaches leveraging features extracted from intermediate layers of Convolutional Neural Networks (CNN). The first is an unsupervised method that relies on a modified Bag-of-Words scheme, which generates video representations by aggregating frame descriptors based on learned visual codebooks. The second is a supervised method based on Deep Metric Learning, which learns an embedding function that maps videos into a feature space where relevant video pairs are closer than irrelevant ones. However, video-level approaches generate global video representations, losing all spatial and temporal relations between compared videos. Therefore, we propose a video similarity learning approach that captures fine-grained relations between videos for accurate similarity calculation. We train a CNN architecture to compute video-to-video similarity from refined frame-to-frame similarity matrices derived from a pairwise region-level similarity function. The proposed approaches have been extensively evaluated on FIVR-200K and other large-scale datasets, demonstrating their superiority over other video retrieval methods and highlighting the challenging aspects of the FIVR problem.
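    The frame-to-frame similarity matrix mentioned above can be sketched minimally. The cosine frame descriptors and the Chamfer-style aggregation below are illustrative stand-ins for the thesis's learned region-level similarity and trained CNN aggregator:

```python
import math

def cosine(u, v):
    """Cosine similarity between two frame descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def frame_similarity_matrix(video_a, video_b):
    """Pairwise frame-to-frame similarities between two videos, each
    given as a list of frame descriptor vectors."""
    return [[cosine(fa, fb) for fb in video_b] for fa in video_a]

def chamfer_similarity(sim_matrix):
    """Collapse the matrix into one video-level score: for each frame of
    video A take its best match in video B, then average.  A hand-crafted
    stand-in for the CNN the thesis trains on refined matrices."""
    return sum(max(row) for row in sim_matrix) / len(sim_matrix)

# Invented 2-frame videos with 2-D descriptors:
video_a = [[1.0, 0.0], [0.7, 0.7]]
video_b = [[1.0, 0.1], [0.0, 1.0]]
score = chamfer_similarity(frame_similarity_matrix(video_a, video_b))
```

Keeping the full matrix before aggregating preserves which frames of one video match which frames of the other, which is precisely the spatial-temporal structure that a single global video embedding discards.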