
    Learning to Evaluate Performance of Multi-modal Semantic Localization

    Semantic localization (SeLo) refers to the task of obtaining the most relevant locations in large-scale remote sensing (RS) images using semantic information such as text. As an emerging task based on cross-modal retrieval, SeLo achieves semantic-level retrieval with only caption-level annotation, which demonstrates its great potential for unifying downstream tasks. Although SeLo has been attempted in several works, no study has yet systematically explored and analyzed this urgent direction. In this paper, we thoroughly study this field and provide a complete benchmark, in terms of metrics and test data, to advance the SeLo task. First, based on the characteristics of this task, we propose multiple discriminative evaluation metrics to quantify SeLo performance. The devised significant area proportion, attention shift distance, and discrete attention distance are used to evaluate the generated SeLo map at the pixel and region levels. Next, to provide standard evaluation data for the SeLo task, we contribute a diverse, multi-semantic, multi-objective Semantic Localization Testset (AIR-SLT). AIR-SLT consists of 22 large-scale RS images and 59 test cases with different semantics, and aims to provide a comprehensive evaluation of retrieval models. Finally, we analyze the SeLo performance of RS cross-modal retrieval models in detail, explore the impact of different variables on this task, and provide a complete benchmark for the SeLo task. We have also established a new paradigm for RS referring expression comprehension, and demonstrated the great advantage of SeLo in semantics by combining it with tasks such as detection and road extraction. The proposed evaluation metrics, semantic localization test set, and corresponding scripts are openly accessible at github.com/xiaoyuan1996/SemanticLocalizationMetrics. Comment: 19 pages, 11 figures
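
The two pixel/region-level metrics named above can be illustrated with a short sketch. This is a hedged approximation of the idea only: the exact normalizations and formulas are defined in the paper and its repository, and `significant_area_proportion` / `attention_shift_distance` here are illustrative stand-ins, not the authors' implementations.

```python
import numpy as np

def significant_area_proportion(selo_map, threshold=0.5):
    """Fraction of pixels whose attention value exceeds a threshold.
    A rough pixel-level stand-in; the paper's normalization may differ."""
    return float((selo_map >= threshold).mean())

def attention_shift_distance(selo_map, gt_center):
    """Distance between the attention-weighted centroid of the SeLo map
    and a ground-truth region center (row, col), normalized by the image
    diagonal. The normalization is an assumption, not the paper's formula."""
    h, w = selo_map.shape
    total = selo_map.sum()
    if total == 0:
        return 1.0  # no attention at all: treat as worst case
    ys, xs = np.mgrid[0:h, 0:w]
    cy = (ys * selo_map).sum() / total  # attention centroid, row
    cx = (xs * selo_map).sum() / total  # attention centroid, col
    return float(np.hypot(cy - gt_center[0], cx - gt_center[1]) / np.hypot(h, w))
```

A SeLo map that concentrates all its attention exactly on the ground-truth region would score a shift distance of zero under this sketch.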

    A Review of Deep Learning Methods and Applications for Unmanned Aerial Vehicles

    Deep learning has recently shown outstanding results in solving a wide variety of robotic tasks in the areas of perception, planning, localization, and control. Its excellent capabilities for learning representations from the complex data acquired in real environments make it extremely suitable for many kinds of autonomous robotic applications. In parallel, Unmanned Aerial Vehicles (UAVs) are currently being extensively applied to several types of civilian tasks, in applications ranging from security, surveillance, and disaster rescue to parcel delivery and warehouse management. In this paper, a thorough review is performed of recently reported uses and applications of deep learning for UAVs, including the most relevant developments as well as their performances and limitations. In addition, a detailed explanation of the main deep learning techniques is provided. We conclude with a description of the main challenges for the application of deep learning to UAV-based solutions

    Automatic Caption Generation for Aerial Images: A Survey

    Aerial images have long attracted attention from the research community. Generating a caption for an aerial image that describes its content comprehensively is a less-studied but important task, with applications in agriculture, defence, disaster management, and many other areas. Though different approaches have been followed for natural image caption generation, generating a caption for an aerial image remains challenging due to its special nature. The use of emerging techniques from the Artificial Intelligence (AI) and Natural Language Processing (NLP) domains has resulted in captions of acceptable quality for aerial images. However, much remains to be done to fully realize the potential of the aerial image caption generation task. This paper presents a detailed survey of the various approaches followed by researchers for the aerial image caption generation task. The datasets available for experimentation, the criteria used for performance evaluation, and future directions are also discussed

    Self-supervised Audiovisual Representation Learning for Remote Sensing Data

    Many current deep learning approaches make extensive use of backbone networks pre-trained on large datasets like ImageNet, which are then fine-tuned to perform a certain task. In remote sensing, the lack of comparably large annotated datasets and the wide diversity of sensing platforms impede similar developments. In order to contribute towards the availability of pre-trained backbone networks in remote sensing, we devise a self-supervised approach for pre-training deep neural networks. By exploiting the correspondence between geo-tagged audio recordings and remote sensing imagery, this is done in a completely label-free manner, eliminating the need for laborious manual annotation. For this purpose, we introduce the SoundingEarth dataset, which consists of co-located aerial imagery and audio samples from all around the world. Using this dataset, we then pre-train ResNet models to map samples from both modalities into a common embedding space, which encourages the models to understand key properties of a scene that influence both visual and auditory appearance. To validate the usefulness of the proposed approach, we evaluate the transfer learning performance of the obtained pre-trained weights against weights obtained through other means. By fine-tuning the models on a number of commonly used remote sensing datasets, we show that our approach outperforms existing pre-training strategies for remote sensing imagery. The dataset, code and pre-trained model weights will be available at this URL: https://github.com/khdlr/SoundingEarth
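
Mapping two modalities into a common embedding space is typically trained with a contrastive objective over co-located pairs. The sketch below shows a symmetric InfoNCE loss in numpy as one common choice for this kind of pre-training; it is an assumption for illustration, not necessarily the exact objective used by the SoundingEarth authors.

```python
import numpy as np

def info_nce(img_emb, aud_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of image/audio embedding pairs.
    Row i of each modality comes from the same location (the positive pair);
    all other rows in the batch serve as negatives."""
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    aud = aud_emb / np.linalg.norm(aud_emb, axis=1, keepdims=True)
    logits = img @ aud.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(logits))         # positives lie on the diagonal

    def xent(l):
        # cross-entropy of each row against its diagonal target
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average over both retrieval directions: image->audio and audio->image
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls co-located image and audio embeddings together while pushing apart embeddings from different locations, which is what lets the backbone learn scene properties shared by both modalities.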

    Learning graph-based representations and correspondences for classification tasks

    Advisor: Ricardo da Silva Torres. PhD thesis (doutorado), Universidade Estadual de Campinas, Instituto de Computação. Abstract: Many real-world situations can be modeled through objects and their relationships, like the roads connecting cities on a map. Graph is a concept derived from the abstraction of these situations. Graphs are a powerful structural representation, which encodes relationships among objects and among their components into a single formalism. This representation is so powerful that it is applied to a wide range of applications, ranging from bioinformatics to social networks. Thus, several pattern recognition problems are modeled to use graph-based representations. In classification problems, the relationships among objects or among their components are exploited to achieve effective and/or efficient solutions. In this thesis, we investigate the use of graphs in classification problems. Two research venues are followed: 1) the proposal of graph-based multimodal object representations; and 2) the proposal of learning-based approaches to support graph matching. Firstly, we investigated the use of the recently proposed Bag-of-Visual-Graphs method in the representation of regions in a remote sensing classification problem, considering the spatial distribution of interest points within the image. When we combined color and texture representations, we obtained effective results on two datasets from the literature (Monte Santo and Campinas). Secondly, we proposed two new extensions of the Bag-of-Graphs method for the representation of multimodal objects. By using these approaches, we can combine complementary views of different modalities (e.g., visual and textual descriptions). We validated the use of these approaches on the flooding detection problem proposed by the MediaEval initiative, achieving 86.9% accuracy over the first 50 returned results (Precision@50). We addressed the graph matching problem by proposing an original framework to learn the cost function in a graph edit distance method. We also presented a couple of formulations using open-set recognition methods and complex network measurements to characterize local graph properties. To the best of our knowledge, we were the first to treat the cost learning process as an open-set recognition problem and to exploit complex network measurements in such problems. We achieved effective results, which are comparable to several baselines in graph classification problems. PhD in Computer Science (Doutorado em Ciência da Computação). Grants 2016/18429-1 and 141584/2016-5; funding agencies: CAPES, FAPESP, CNPq
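
The role of the learned cost function in graph edit distance can be illustrated with a toy sketch. For brevity this version matches node labels only and ignores edge operations, which the real framework also handles; `cost_sub` stands in for the learned substitution cost, and the brute-force assignment is only feasible for very small graphs.

```python
import itertools

def approx_ged(labels_a, labels_b, cost_sub, cost_ins_del=1.0):
    """Toy node-level edit distance between two labeled graphs.
    cost_sub(x, y): substitution cost between two node labels
    (the quantity learned in an edit-distance cost-learning framework).
    Unmatched nodes pay a fixed insertion/deletion cost."""
    if len(labels_a) < len(labels_b):
        labels_a, labels_b = labels_b, labels_a
    best = float("inf")
    # try every injective mapping of the smaller node set into the larger
    for perm in itertools.permutations(range(len(labels_a)), len(labels_b)):
        cost = sum(cost_sub(labels_a[i], b) for i, b in zip(perm, labels_b))
        cost += cost_ins_del * (len(labels_a) - len(labels_b))  # leftovers
        best = min(best, cost)
    return best
```

With a 0/1 substitution cost this reduces to counting label mismatches; a learned `cost_sub` instead reflects how interchangeable two labels are for the classification task at hand.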

    Hacking Antarctica

    Hacking Antarctica is an investigation focused on rendering aesthetic responses to Antarctica beyond normative representations of the sublime and the imperceptible. It is based on fieldwork in polar and subpolar areas over the last 9 years. At its core, the research uses Immanuel Kant's Critique of Judgement as a way of understanding what is meant by the sublime, and from that develops a practice that examines what a Kantian lack of access to nature implies. This key Kantian concept is explained and developed into artworks, then tested through concepts such as translation, transduction, infection and representation, using hacking methodologies informed by bricolage (Lévi-Strauss 1968) and diffraction (Barad 2007). The research expands on the taxonomies of the polar to reconsider the Antarctic as a border and periphery, bringing a conjunction of hacking methods and site-specific art that enables a performative causality with which to study the production of site. In other words, a performative approach, as Barad and other feminist writers recognize, questions the traditional causality of ends and means, and of observer and observed, and instead focuses on processes within discursive practices. Causality is reworked as a local externalization of the intra-acting relations of matter.
    Within the overall system of research for Antarctica, technical methods used included: Free Libre Open Source software and hardware techniques, black and white and infrared photography, ultraviolet light sensing, sound recordings, hydrophone recordings, very low frequency recordings, AM radio sensing, error in photography (light leakage, displaced focus), in text (cut-up compositions), in video (glitch) and error in bodies as infections; bio-sensing agents (including yeast and lactobacillus), point-array analysis, translation of images to raw data and from raw data to sound, land art performances, spatialization of sound, stereo panning, quadraphonic sound, interactive embroidery, radio broadcast and installations. Specific outputs include: Antarctica 1961-1986 (2017), an interactive embroidered map of Antarctica showing sites of mineral sources and mineral pollution. The map was installed as an interactive instrument that allowed visitors to participate in the live shaping of the spatialization of sounds recorded in Antarctica. A digital Theremin sensor attached to the map was interfaced with Pure Data software running on a GNU-Linux Debian station. All software was made visible, as well as the papers documenting the traces of the plutonium found there. Through an experimental set of hacking practices, the research supported the hypothesis that Antarctica can be represented outside the sublime through the polar-site produced by hashes of proxies and the diffraction produced when superposing modes of knowing. The interruption of the spectacle to respond to Antarctica is the result

    Deep Learning for Mobile Multimedia: A Survey

    Deep Learning (DL) has become a crucial technology for multimedia computing. It offers a powerful instrument to automatically produce high-level abstractions of complex multimedia data, which can be exploited in a number of applications, including object detection and recognition, speech-to-text, media retrieval, multimodal data analysis, and so on. The availability of affordable large-scale parallel processing architectures, and the sharing of effective open-source code implementing the basic learning algorithms, have caused a rapid diffusion of DL methodologies, bringing a number of new technologies and applications that outperform, in most cases, traditional machine learning technologies. In recent years, the possibility of implementing DL technologies on mobile devices has attracted significant attention. Thanks to this technology, portable devices may become smart objects capable of learning and acting. The path toward these exciting future scenarios, however, entails a number of important research challenges. DL architectures and algorithms are not easily adapted to the storage and computation resources of a mobile device. Therefore, there is a need for new generations of mobile processors and chipsets, small-footprint learning and inference algorithms, new models of collaborative and distributed processing, and a number of other fundamental building blocks. This survey reports the state of the art in this exciting research area, looking back to the evolution of neural networks and arriving at the most recent results in terms of methodologies, technologies, and applications for mobile environments
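
One of the most common small-footprint techniques the survey alludes to is reducing weight precision. The sketch below shows generic symmetric per-tensor int8 post-training quantization; it is a minimal illustration of the principle, not the scheme of any specific mobile framework.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization of a weight array.
    Stores weights in 1 byte each (vs. 4 for float32), trading a
    bounded rounding error for a 4x smaller memory footprint."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for computation."""
    return q.astype(np.float32) * scale
```

The worst-case reconstruction error per weight is half the scale step, which is why quantization works well for over-parameterized networks whose accuracy tolerates small weight perturbations.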