Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment
We introduce a method to train vision-language models for remote-sensing
images without using any textual annotations. Our key insight is to use
co-located internet imagery taken on the ground as an intermediary for
connecting remote-sensing images and language. Specifically, we train an image
encoder for remote sensing images to align with the image encoder of CLIP using
a large amount of paired internet and satellite images. Our unsupervised
approach enables the training of a first-of-its-kind large-scale
vision-language model (VLM) for remote sensing images at two different resolutions. We
show that these VLMs enable zero-shot, open-vocabulary image classification,
retrieval, segmentation and visual question answering for satellite images. On
each of these tasks, our VLM trained without textual annotations outperforms
existing VLMs trained with supervision, with gains of up to 20% for
classification and 80% for segmentation.
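The alignment idea in this abstract can be illustrated with a minimal sketch. The abstract does not state which training objective is used, so the contrastive (InfoNCE-style) loss below is an assumption, as are the temperature value and the function names `alignment_loss` and `zero_shot_classify`, which are hypothetical. Plain Python lists stand in for the outputs of the satellite encoder, the frozen CLIP image encoder, and the CLIP text encoder.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def alignment_loss(sat_embs, ground_embs, temperature=0.07):
    """InfoNCE-style loss (an assumed objective, not confirmed by the abstract).

    sat_embs[i] is the trainable satellite-encoder embedding of location i;
    ground_embs[i] is the frozen CLIP embedding of a co-located ground photo.
    Each satellite embedding is pulled toward its own ground image and pushed
    away from the ground images of the other locations in the batch.
    """
    sat = [normalize(v) for v in sat_embs]
    ground = [normalize(v) for v in ground_embs]
    total = 0.0
    for i, s in enumerate(sat):
        logits = [dot(s, g) / temperature for g in ground]
        # Numerically stable log-sum-exp over all ground images in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # negative log-probability of the true pair
    return total / len(sat)


def zero_shot_classify(sat_emb, text_embs):
    """Return the index of the text embedding most similar to the satellite one.

    Because the satellite encoder is aligned to CLIP's image space, CLIP text
    embeddings of class prompts can be compared with it directly.
    """
    s = normalize(sat_emb)
    scores = [dot(s, normalize(t)) for t in text_embs]
    return max(range(len(scores)), key=scores.__getitem__)
```

In this sketch no text ever enters the loss: language only appears at inference time, through the text embeddings passed to `zero_shot_classify`, which mirrors the abstract's claim of training without textual annotations.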
Visual Discovery from Spatio-Temporal Imagery
300 pages
From social media to street view and all the way to satellite images, we are capturing visual data at an unprecedented scale. These images tell a story about our planet. With advances in automatic recognition, we can build a collective understanding of world-scale events as recorded through visual media. Such insights can be useful to domain experts such as cultural anthropologists and ecologists. However, discovering rare yet interesting insights from the data is very challenging. First, it requires recognition models that have an expert-level understanding of these visual domains. Second, it requires tools that can leverage such models together with large-scale spatio-temporal data to discover novel insights.
In this dissertation, we first look at ways of building and improving automatic recognition models in such expert domains. Specifically, we look at how to efficiently learn a representation for these domains with either no supervision or with text- or attribute-based supervision. We work with domains that require expertise to understand, such as ornithology and remote sensing. These methods aim to make recognition models not only more cost-efficient but also more practical for experts to use. We present an unsupervised method for learning representations in the satellite image domain. We then look at an attribute-based model for bird classification (and other attribute-based domains) and introduce ways to make it more practical and label-efficient.
We then present methods that can discover novel insights without any supervision by looking at large-scale spatio-temporal visual data. These methods rely on domain-specific vision models to make the discoveries. In particular, we use them to understand fashion trends and to discover cultural phenomena and social events around the world from fashion images on social media. Broadening our domain to include satellite imagery, we introduce completely unsupervised techniques to discover interesting change events across the planet from satellite images. This general framework could be applied in visual domains ranging from sustainability to online commerce to discover interesting phenomena in those domains.