
    A multimodal framework for geocoding digital objects

    Advisor: Ricardo da Silva Torres. Doctoral thesis (Ph.D.), Universidade Estadual de Campinas, Instituto de Computação.
    Abstract: Geographical information is often enclosed in digital objects (such as documents, images, and videos), and its use to support the implementation of different services is of great interest. For example, map-based browsing services and geographic searches may take advantage of the geographic locations associated with digital objects. Implementing such services, however, demands geocoded data collections. This work investigates the combination of textual and visual content to geocode digital objects and proposes a rank-aggregation framework for multimodal geocoding. Textual and visual information associated with videos and images is used to define ranked lists. These lists are then combined, and the resulting ranked list is used to define appropriate locations. An architecture implementing the proposed framework is designed so that the modules specific to each modality (e.g., textual and visual) can be developed and evolved independently. A further component is a data fusion module responsible for seamlessly combining the ranked lists defined for each modality. Another contribution of this work is a new effectiveness evaluation measure named Weighted Average Score (WAS). The proposed measure is based on distance scores that are combined to assess how effective an approach is, considering its overall geocoding results for a given test dataset. We validate the proposed framework in two contexts: the MediaEval 2012 Placing Task, whose objective is to automatically assign geographic coordinates to videos, and the task of geocoding photos of buildings at Virginia Tech (VT), USA. In the Placing Task context, the results show how our multimodal approach improves geocoding compared with methods that rely on a single modality (either textual or visual descriptors). We also show that the proposed multimodal approach yields results comparable to the best 2012 Placing Task submissions that used no information beyond the available development/training data. For the VT building photos, the experiments demonstrate that some of the evaluated local descriptors yield effective results; the descriptor selection criteria and their combination improved these results when the knowledge base had the same characteristics as the test set.
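    The abstract does not give the thesis's fusion formula, so the sketch below uses reciprocal rank fusion purely as an illustration of how a fusion module might combine per-modality ranked lists; the function and the candidate location names are hypothetical, not the thesis's method.

```python
# Hypothetical sketch: combining per-modality ranked lists of candidate
# locations with reciprocal rank fusion (RRF). The thesis's actual
# aggregation rule is not specified in the abstract above.

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists (best candidate first) into one."""
    scores = {}
    for ranking in ranked_lists:
        for rank, candidate in enumerate(ranking, start=1):
            # Each list contributes more score to its top-ranked candidates.
            scores[candidate] = scores.get(candidate, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: the textual and visual modalities disagree on the top location.
textual_ranking = ["paris", "lyon", "nice"]
visual_ranking = ["lyon", "paris", "marseille"]
fused = reciprocal_rank_fusion([textual_ranking, visual_ranking])
print(fused[0])  # the fused top candidate defines the assigned location
```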

    Georeferencing flickr resources based on textual meta-data

    The task of automatically estimating the location of web resources is of central importance to location-based services on the Web. Much attention has focused on Flickr photos and videos, for which language modeling approaches have been found particularly suitable. In particular, state-of-the-art systems for georeferencing Flickr photos tend to cluster the locations on Earth into a relatively small set of disjoint regions, apply feature selection to identify location-relevant tags, use a form of text classification to identify which area is most likely to contain the true location of the resource, and finally attempt to find an appropriate location within the identified area. In this paper, we present a systematic discussion of each of these components, based on the lessons we learned from participating in the 2010 and 2011 editions of MediaEval’s Placing Task. Extensive experimental results allow us to analyze why certain methods work well on this task and show that a median error of just over 1 km can be achieved on a standard benchmark test set.
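    The four stages named above can be sketched end to end with scikit-learn. This is a minimal illustration under assumed stand-in choices (k-means regions, chi-squared tag selection, a multinomial Naive Bayes in place of the language model, region centroid as the final location), not the systems described in the paper; the toy data is invented.

```python
# Illustrative sketch of the georeferencing pipeline stages named above,
# using scikit-learn stand-ins for each component.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB  # unigram language-model stand-in

# Toy training data: photo tags and their known coordinates.
tags = ["eiffel tower paris", "louvre paris museum",
        "tower bridge london", "big ben london thames"]
coords = np.array([[48.86, 2.29], [48.86, 2.34], [51.51, -0.08], [51.50, -0.12]])

# 1) Cluster locations into a small set of disjoint regions.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords).labels_

# 2) Select location-relevant tags; 3) train a text classifier over regions.
vec = CountVectorizer()
X = vec.fit_transform(tags)
sel = SelectKBest(chi2, k=4).fit(X, labels)
clf = MultinomialNB().fit(sel.transform(X), labels)

# 4) Predict a region for a new photo, then pick a location inside it
# (here simply the centroid of the training photos in that region).
region = clf.predict(sel.transform(vec.transform(["paris seine"])))[0]
print(coords[labels == region].mean(axis=0))
```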

    Event Based Media Indexing

    Multimedia data, being multidimensional by nature, requires appropriate approaches to organization and sorting. The growing number of sensors that capture environmental conditions at the moment of media creation enriches data with context awareness. This unveils enormous potential for an event-centred multimedia processing paradigm. The essence of this paradigm lies in using events as the primary means for multimedia integration, indexing, and management. Events have the ability to semantically encode relationships between different informational modalities, including, but not limited to, time, space, and the agents and objects involved. As a consequence, event-based media processing facilitates information perception by humans, which in turn decreases the individual’s effort in annotation and organization. Moreover, events can be used to reconstruct missing data and to enrich information. The spatio-temporal component of events is key to contextual analysis. A variety of techniques have recently been presented to leverage contextual information for event-based analysis in multimedia. The content-based approach has demonstrated its weakness in the field of event analysis, especially for the event detection task; however, content-based media analysis remains important for object detection and recognition and can therefore play a role complementary to that of event-driven context recognition. The main contribution of the thesis lies in the investigation of a new event-based paradigm for multimedia integration, indexing, and management. In this dissertation we propose i) a novel model for event-based multimedia representation, ii) a robust approach for mining events from multimedia, and iii) the exploitation of detected events for data reconstruction and knowledge enrichment.
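    As a minimal sketch of what an event-centred media record could look like, assuming the modalities named above (time, space, agents, objects), the Python dataclass below uses illustrative field names, not the dissertation's actual model.

```python
# Hypothetical event-centred media record: events bind time, space,
# agents, objects, and the media items that document them.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Event:
    name: str
    start: datetime                      # temporal component
    end: datetime
    latitude: float                      # spatial component
    longitude: float
    agents: list[str] = field(default_factory=list)   # involved people
    objects: list[str] = field(default_factory=list)  # involved things
    media: list[str] = field(default_factory=list)    # attached photo/video IDs

concert = Event(
    name="open-air concert",
    start=datetime(2012, 7, 14, 20, 0),
    end=datetime(2012, 7, 14, 23, 0),
    latitude=45.44, longitude=12.33,
    agents=["band", "audience"],
    media=["IMG_0012.jpg", "VID_0003.mp4"],
)
# Indexing media by event lets queries like "photos taken during the
# concert" resolve through the event rather than through raw timestamps.
print(len(concert.media))
```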

    Faster and Accurate Compressed Video Action Recognition Straight from the Frequency Domain

    Human action recognition has become one of the most active fields of research in computer vision due to its wide range of applications, such as surveillance, medical and industrial environments, and smart homes. Recently, deep learning has been successfully used to learn powerful and interpretable features for recognizing human actions in videos. Most existing deep learning approaches have been designed to process video information as RGB image sequences, so a preliminary decoding step is required, since video data are usually stored in a compressed format. Decoding a video, however, demands a high computational load and memory usage. To overcome this problem, we propose a deep neural network capable of learning straight from compressed video. Our approach was evaluated on two public benchmarks, the UCF-101 and HMDB-51 datasets, demonstrating recognition performance comparable to state-of-the-art methods, with the advantage of running up to 2 times faster in terms of inference speed.
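    The paper's architecture is not detailed in this abstract. As a minimal stand-in, the PyTorch sketch below shows only the core idea of feeding frequency-domain inputs (e.g., 8x8 block-DCT coefficient maps from a compressed stream) to a CNN instead of decoded RGB frames; all layer sizes and shapes are illustrative assumptions.

```python
# Hypothetical stand-in (not the paper's network): a tiny CNN consuming
# frequency-domain inputs with one channel per DCT band, skipping the
# costly decode-to-RGB step.
import torch
import torch.nn as nn

class FrequencyDomainNet(nn.Module):
    def __init__(self, in_channels=64, num_classes=101):  # 64 = 8x8 DCT bands
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        # x: (batch, 64, H/8, W/8) DCT-coefficient maps; no RGB decoding.
        return self.classifier(self.features(x).flatten(1))

# One 28x28-block "frame" of DCT coefficients for a 224x224 video.
dct_maps = torch.randn(2, 64, 28, 28)
logits = FrequencyDomainNet()(dct_maps)
print(logits.shape)  # torch.Size([2, 101]); UCF-101 has 101 classes
```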

    Aisthēsis

    This exegesis explores the idea that subjectivity is mapped via physical sensations known in their raw state as ‘quales’, or qualia: that it arises from sensations being filtered through bodies. It explores the role of perception, which is seen as a hinge between sensation and the language we attach to it. While modern aesthetics is arguably dominated by visuality, the Ancient Greek concept of Aisthēsis included the full sensory spectrum. It is this more complete phenomenological approach to painting that I explore in terms of my practice: the way the process of structuring sensations is modelled in the making of art. Written in three sections, the exegesis outlines my methodological approach, performs these concepts in narrative form, and finally applies them to my body of work to provide a broader aesthetic framework.

    Context Aware Textual Entailment

    In conversations, stories, news reporting, and other forms of natural language, understanding requires participants to make assumptions (hypotheses) based on background knowledge, a process called entailment. These assumptions may then be supported, contradicted, or refined as a conversation or story progresses, additional facts become known, and the context changes. It is often the case that we do not know an aspect of the story with certainty but rather believe it to be the case; i.e., what we know is associated with uncertainty or ambiguity. In this research, a method has been developed to identify the different contexts of raw input text, along with specific features of those contexts such as time, location, and objects. The method includes a two-phase SVM classifier with a voting mechanism in the second phase to identify the contexts; rule-based algorithms are used to extract the context elements. This research also develops a new context-aware text representation that maintains the semantic aspects of sentences as well as the textual contexts and context elements, and that can offer both a graph representation and a First-Order Logic (FOL) representation, along with an XML representation of a text or series of texts. The method performs entailment using background knowledge from external sources (VerbOcean and WordNet), with resolution of conflicts between extracted clauses and handling of the role of context in resolving uncertain truth.
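    The abstract names a two-phase SVM classifier with second-phase voting but not its features or labels, so the scikit-learn sketch below is an assumed illustration: phase 1 flags context shifts, and phase 2 lets SVMs trained on different feature views vote on the context type. The sentences, labels, and feature views are all invented.

```python
# Illustrative two-phase SVM sketch (not the thesis's exact system):
# phase 1 detects a context shift; phase 2 votes on the context type.
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

sentences = ["the next morning they left", "meanwhile in the castle",
             "she picked up the sword", "ten years later the war ended"]
is_new_context = [1, 1, 0, 1]                           # phase-1 labels
context_type = ["time", "location", "object", "time"]   # phase-2 labels

# Phase 1: a single SVM flags sentences that open a new context.
phase1 = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(sentences, is_new_context)

# Phase 2: SVMs over different n-gram views vote on the context type.
views = [("uni", make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), LinearSVC())),
         ("bi", make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC()))]
phase2 = VotingClassifier(views, voting="hard").fit(sentences, context_type)

query = ["the following winter was harsh"]
if phase1.predict(query)[0]:
    print(phase2.predict(query)[0])  # majority vote over the feature views
```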

    Geo-Referenced Video Retrieval: Text Annotation and Similarity Search

    Ph.D. thesis (Doctor of Philosophy).