Deep Visual City Recognition Visualization
Understanding how cities visually differ from each other is interesting for
planners, residents, and historians. We investigate the interpretation of deep
features learned by convolutional neural networks (CNNs) for city recognition.
Given a trained city recognition network, we first generate weighted masks
using the well-known Grad-CAM technique to select the most discriminative
regions in the image. Since the image classification label is the city name
and therefore carries no information about which objects are
class-discriminative, we investigate the interpretability of the deep
representations with two methods. (i) An unsupervised method is used to
cluster the objects appearing in the visual explanations. (ii) A pretrained
semantic segmentation model is used to label objects at the pixel level, and
we then introduce statistical measures to quantitatively evaluate the
interpretability of the discriminative objects. We also study how network
architectures and random initializations during training influence the
interpretability of CNN features for city recognition. The results suggest
that network architectures affect the interpretability of the learned visual
representations more than different initializations do.
Comment: CVPR-19 workshop on Explainable AI
Revisiting IM2GPS in the Deep Learning Era
Image geolocalization, inferring the geographic location of an image, is a
challenging computer vision problem with many potential applications. The
recent state-of-the-art approach to this problem is a deep image classification
approach in which the world is spatially divided into cells and a deep network
is trained to predict the correct cell for a given image. We propose to combine
this approach with the original Im2GPS approach in which a query image is
matched against a database of geotagged images and the location is inferred
from the retrieved set. We estimate the geographic location of a query image by
applying kernel density estimation to the locations of its nearest neighbors in
the reference database. Interestingly, we find that the best features for our
retrieval task are derived from networks trained with classification loss even
though we do not use a classification approach at test time. Training with
classification loss outperforms several deep feature learning methods (e.g.
Siamese networks with contrastive or triplet loss) more typical for retrieval
applications. Our simple approach achieves state-of-the-art geolocalization
accuracy while also requiring significantly less training data.
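The kernel density estimation step is straightforward to sketch. A hedged example with scikit-learn, assuming the retrieval stage has already returned geotagged nearest neighbors; the coordinates, bandwidth, and search grid below are purely illustrative.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Locations (lat, lon) of the query's nearest neighbors in the geotagged
# reference database; the values here are made up for illustration.
neighbor_locs = np.radians([[48.8566, 2.3522],    # Paris matches
                            [48.8606, 2.3376],
                            [48.8529, 2.3500],
                            [51.5074, -0.1278]])  # one stray London match

# The haversine metric treats points as lying on a sphere; the bandwidth
# (in radians) is a guess, not a tuned value.
kde = KernelDensity(metric="haversine", bandwidth=0.005).fit(neighbor_locs)

# Score a coarse grid of candidate locations and take the density peak.
lats, lons = np.linspace(40, 55, 150), np.linspace(-5, 10, 150)
grid = np.radians([(la, lo) for la in lats for lo in lons])
best = grid[np.argmax(kde.score_samples(grid))]
print("estimated location (deg):", np.degrees(best))
```

The density peak, rather than a simple vote, lets a few tightly clustered neighbors outweigh scattered outliers.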
Changing Fashion Cultures
The paper presents a novel concept that analyzes and visualizes worldwide
fashion trends. Our goal is to reveal cutting-edge fashion trends without
displaying an ordinary fashion style. To achieve the fashion-based analysis, we
created a new fashion culture database (FCDB), which consists of 76 million
geo-tagged images in 16 cosmopolitan cities. By grasping a fashion trend of
mixed fashion styles, the paper also proposes an unsupervised fashion trend
descriptor (FTD) using a fashion descriptor, a codeword vector, and temporal
analysis. To unveil fashion trends in the FCDB, the temporal analysis in FTD
effectively emphasizes consecutive features between two different times. In
experiments, we clearly show the analysis of fashion trends and fashion-based
city similarity. As a result of large-scale data collection and an
unsupervised analyzer, the proposed approach achieves world-level fashion
visualization in a time series. The code, model, and FCDB will be publicly
available after the construction of the project page.
Comment: 9 pages, 9 figures
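A rough sketch of the codeword step under our own assumptions: the codebook size, plain k-means, and a simple histogram difference stand in for the paper's actual FTD and temporal analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in deep fashion descriptors for one city at two time windows.
feats_t0 = rng.normal(size=(500, 128))
feats_t1 = rng.normal(size=(500, 128)) + 0.3   # drifted style

# Build a visual codebook and encode each window as a codeword histogram.
codebook = KMeans(n_clusters=32, n_init=10, random_state=0)
codebook.fit(np.vstack([feats_t0, feats_t1]))

def codeword_vector(feats):
    """Normalized histogram over codebook assignments."""
    hist = np.bincount(codebook.predict(feats), minlength=32).astype(float)
    return hist / hist.sum()

# Temporal step (our simplification): differencing consecutive windows
# highlights codewords that are gaining ground, i.e., a trend signal.
trend = codeword_vector(feats_t1) - codeword_vector(feats_t0)
print("fastest-growing codewords:", np.argsort(trend)[-5:])
```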
A new approach for pedestrian density estimation using moving sensors and computer vision
An understanding of pedestrian dynamics is indispensable for numerous urban
applications, including the design of transportation networks and planning for
business development. Pedestrian counting often requires utilizing manual or
technical means to count individuals in each location of interest. However,
such methods do not scale to the size of a city, and we propose a new approach
here to fill this gap.
gap is here proposed. In this project, we used a large dense dataset of images
of New York City along with computer vision techniques to construct a
spatio-temporal map of relative person density. Due to the limitations of
state-of-the-art computer vision methods, such automatic detection of persons
is inherently subject to errors. We model these errors as a probabilistic process,
for which we provide theoretical analysis and thorough numerical simulations.
We demonstrate that, within our assumptions, our methodology can supply a
reasonable estimate of person densities and provide theoretical bounds for the
resulting error.
Comment: Submitted to ACM-TSA
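The kind of error model the abstract describes can be illustrated with a small simulation. The Poisson counts, miss rate, and false-positive rate below are our assumptions, not the paper's calibrated values.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed error model: Poisson person counts, a fixed per-person
# detection probability, and Poisson false positives per image.
true_rate = 3.0      # mean persons per image at this location
p_detect = 0.7       # probability a real person is detected
fp_rate = 0.2        # mean spurious detections per image
n_images = 10_000    # images sampled at this location

true_counts = rng.poisson(true_rate, n_images)
detected = rng.binomial(true_counts, p_detect) + rng.poisson(fp_rate, n_images)

# Under this model E[detected] = p_detect * true_rate + fp_rate,
# so the raw mean can be de-biased when the error rates are known.
raw = detected.mean()
corrected = (raw - fp_rate) / p_detect
print(f"raw mean {raw:.2f}, corrected {corrected:.2f}, truth {true_rate:.2f}")
```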
Fine-Grained Land Use Classification at the City Scale Using Ground-Level Images
We perform fine-grained land use mapping at the city scale using ground-level
images. Mapping land use is considerably more difficult than mapping land cover
and is generally not possible using overhead imagery as it requires close-up
views and seeing inside buildings. We postulate that the growing collections of
georeferenced, ground-level images suggest an alternate approach to this
geographic knowledge discovery problem. We develop a general framework that
uses Flickr images to map 45 different land-use classes for the City of San
Francisco. Individual images are classified using a novel convolutional neural
network containing two streams, one for recognizing objects and another for
recognizing scenes. This network is trained in an end-to-end manner directly on
the labeled training images. We propose several strategies to overcome the
noisiness of our user-generated data including search-based training set
augmentation and online adaptive training. We derive a ground truth map of San
Francisco in order to evaluate our method. We demonstrate the effectiveness of
our approach through geo-visualization and quantitative analysis. Our framework
achieves over 29% recall at the individual land parcel level, which represents
a strong baseline for the challenging 45-way land use classification problem,
especially given the noisiness of the image data.
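A compact sketch of such a two-stream classifier in PyTorch. Both streams are plain ResNet-18s here, whereas the paper's scene stream would typically start from scene-centric (e.g. Places-style) pretraining, so treat this as an assumption-laden illustration rather than the paper's architecture.

```python
import torch
import torch.nn as nn
from torchvision import models

class TwoStreamLandUse(nn.Module):
    """Object stream + scene stream, fused for 45 land-use classes."""
    def __init__(self, num_classes=45):
        super().__init__()
        # One stream intended for objects, one for scenes; both are
        # plain ResNet-18s in this sketch (an assumption).
        self.object_stream = models.resnet18(weights=None)
        self.scene_stream = models.resnet18(weights=None)
        feat_dim = self.object_stream.fc.in_features
        self.object_stream.fc = nn.Identity()
        self.scene_stream.fc = nn.Identity()
        self.head = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, x):
        # Concatenate object and scene features, then classify land use.
        fused = torch.cat([self.object_stream(x), self.scene_stream(x)], dim=1)
        return self.head(fused)

model = TwoStreamLandUse()
logits = model(torch.randn(2, 3, 224, 224))   # (2, 45)
```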
Learning to Interpret Satellite Images in Global Scale Using Wikipedia
Despite recent progress in computer vision, fine-grained interpretation of
satellite images remains challenging because of a lack of labeled training
data. To overcome this limitation, we construct a novel dataset called
WikiSatNet by pairing georeferenced Wikipedia articles with satellite imagery
of their corresponding locations. We then propose two strategies to learn
representations of satellite images by predicting properties of the
corresponding articles from the images. Leveraging this new multi-modal
dataset, we can drastically reduce the quantity of human-annotated labels and
time required for downstream tasks. On the recently released fMoW dataset, our
pre-training strategies can boost the performance of a model pre-trained on
ImageNet by up to 4.5% in F1 score.
Comment: Accepted to IJCAI 2019
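One way to realize "predicting properties of the corresponding articles from the images" is to regress a precomputed article embedding from the image. The 300-d embedding size and cosine loss below are our assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
from torchvision import models

# Image encoder regresses the (precomputed) embedding of the paired
# Wikipedia article; dimensions and loss are illustrative assumptions.
encoder = models.resnet18(weights=None)
encoder.fc = nn.Linear(encoder.fc.in_features, 300)

images = torch.randn(8, 3, 224, 224)   # stand-in satellite tiles
doc_embeddings = torch.randn(8, 300)   # stand-in article embeddings

pred = encoder(images)
# Cosine loss pulls each image representation toward its article's
# embedding, yielding transferable satellite features without labels.
loss = 1 - nn.functional.cosine_similarity(pred, doc_embeddings).mean()
loss.backward()
```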
StreetStyle: Exploring world-wide clothing styles from millions of photos
Each day billions of photographs are uploaded to photo-sharing services and
social media platforms. These images are packed with information about how
people live around the world. In this paper we exploit this rich trove of data
to understand fashion and style trends worldwide. We present a framework for
visual discovery at scale, analyzing clothing and fashion across millions of
images of people around the world and spanning several years. We introduce a
large-scale dataset of photos of people annotated with clothing attributes, and
use this dataset to train attribute classifiers via deep learning. We also
present a method for discovering visually consistent style clusters that
capture useful visual correlations in this massive dataset. Using these tools,
we analyze millions of photos to derive visual insight, producing a
first-of-its-kind analysis of global and per-city fashion choices and
spatio-temporal trends.
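The style-cluster discovery can be sketched with off-the-shelf clustering over the learned attribute features; the feature dimension and cluster count are placeholders, not StreetStyle's actual settings.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)

# Stand-in deep clothing-attribute features for a large photo collection.
features = rng.normal(size=(100_000, 64)).astype(np.float32)

# MiniBatchKMeans scales to millions of images; 100 clusters is a guess.
clusters = MiniBatchKMeans(n_clusters=100, batch_size=4096,
                           random_state=0, n_init=3).fit(features)

# Each cluster is a candidate "style"; per-city cluster frequencies over
# time then give the spatio-temporal trend signal the abstract describes.
labels = clusters.labels_
print("largest style cluster size:", np.bincount(labels).max())
```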
Learning Photography Aesthetics with Deep CNNs
Automatic photo aesthetic assessment is a challenging artificial intelligence
task. Existing computational approaches have focused on modeling a single
aesthetic score or a class (good or bad), however these do not provide any
details on why the photograph is good or bad, or which attributes contribute to
the quality of the photograph. To obtain both accuracy and human interpretation
of the score, we advocate learning the aesthetic attributes along with the
prediction of the overall score. For this purpose, we propose a novel multitask
deep convolutional neural network, which jointly learns eight aesthetic
attributes along with the overall aesthetic score. We report near human
performance in the prediction of the overall aesthetic score. To understand the
internal representation of these attributes in the learned model, we also
develop a visualization technique using backpropagation of gradients. These
visualizations highlight the important image regions for the corresponding
attributes, thus providing insights into the model's representation of these
attributes. We showcase the diversity and complexity associated with different
attributes through a qualitative analysis of the activation maps.
Comment: Accepted in The 28th Modern Artificial Intelligence and Cognitive
Science Conference
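The joint objective, eight attribute scores learned alongside the overall score, can be sketched as a shared trunk with two heads. The backbone, regression losses, and equal task weighting are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class AestheticNet(nn.Module):
    """Shared CNN trunk with an overall-score head and 8 attribute heads."""
    def __init__(self):
        super().__init__()
        self.trunk = models.resnet18(weights=None)
        dim = self.trunk.fc.in_features
        self.trunk.fc = nn.Identity()
        self.score_head = nn.Linear(dim, 1)   # overall aesthetic score
        self.attr_head = nn.Linear(dim, 8)    # eight attribute scores

    def forward(self, x):
        h = self.trunk(x)
        return self.score_head(h), self.attr_head(h)

model = AestheticNet()
imgs = torch.randn(4, 3, 224, 224)
score_t, attr_t = torch.rand(4, 1), torch.rand(4, 8)  # stand-in labels
score, attrs = model(imgs)
# Multitask objective: overall score plus attributes, equally weighted
# here (the paper's weighting may differ).
loss = nn.functional.mse_loss(score, score_t) + \
       nn.functional.mse_loss(attrs, attr_t)
loss.backward()
```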
Directional Statistics-based Deep Metric Learning for Image Classification and Retrieval
Deep distance metric learning (DDML), which is proposed to learn image
similarity metrics in an end-to-end manner based on the convolution neural
network, has achieved encouraging results in many computer vision
tasks. L2-normalization in the embedding space has been used to improve the
performance of several DDML methods. However, the commonly used Euclidean
distance is no longer an accurate metric for the L2-normalized embedding
space, i.e., a hyper-sphere. Another challenge of current DDML methods is that their
loss functions are usually based on rigid data formats, such as the triplet
tuple. Thus, an extra process is needed to prepare data in specific formats. In
addition, their losses are obtained from a limited number of samples, which
leads to a lack of a global view of the embedding space. In this paper, we
replace the Euclidean distance with the cosine similarity to better utilize
the L2-normalization, which is able to attenuate the curse of dimensionality.
More specifically, a novel loss function based on the von Mises-Fisher
distribution is proposed to learn a compact hyper-spherical embedding space.
Moreover, a new efficient learning algorithm is developed to better capture the
global structure of the embedding space. Experiments for both classification
and retrieval tasks on several standard datasets show that our method achieves
state-of-the-art performance with a simpler training procedure. Furthermore, we
demonstrate that, even with a small number of convolutional layers, our model
can still obtain significantly better classification performance than the
widely used softmax loss.
Comment: codes will come soon
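The key ingredient is the von Mises-Fisher likelihood p(x | mu, kappa) proportional to exp(kappa * mu^T x) on the unit hypersphere: with a fixed concentration kappa, the normalizing constant is shared across classes, so classification reduces to a softmax over kappa-scaled cosine similarities to per-class mean directions. A hedged sketch of that reduction follows; the fixed kappa and randomly initialized class means are our simplifications.

```python
import torch
import torch.nn.functional as F

def vmf_classification_loss(embeddings, labels, class_means, kappa=16.0):
    """Softmax over kappa-scaled cosine similarities to class means.

    Drops the vMF normalizing constant, which is shared across classes
    when kappa is fixed, so the loss reduces to cross-entropy on cosine
    logits over the unit hypersphere.
    """
    z = F.normalize(embeddings, dim=1)      # project onto the hypersphere
    mu = F.normalize(class_means, dim=1)    # unit class mean directions
    logits = kappa * z @ mu.t()             # kappa * cos(z, mu_c)
    return F.cross_entropy(logits, labels)

emb = torch.randn(8, 128, requires_grad=True)   # stand-in embeddings
means = torch.randn(10, 128)                    # stand-in class means
loss = vmf_classification_loss(emb, torch.randint(0, 10, (8,)), means)
loss.backward()
```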
An Interactive Insight Identification and Annotation Framework for Power Grid Pixel Maps using DenseU-Hierarchical VAE
Insights in power grid pixel maps (PGPMs) refer to important facility
operating states and unexpected changes in the power grid. Identifying insights
helps analysts understand the collaboration of various parts of the grid so
that preventive and corrective operations can be taken to avoid potential
accidents. Existing solutions for identifying insights in PGPMs are performed
manually, which may be laborious and expertise-dependent. In this paper, we
propose an interactive insight identification and annotation framework by
leveraging an enhanced variational autoencoder (VAE). In particular, a new
architecture, DenseU-Hierarchical VAE (DUHiV), is designed to learn
representations from large-sized PGPMs, which achieves a significantly tighter
evidence lower bound (ELBO) than existing Hierarchical VAEs with a Multilayer
Perceptron architecture. Our approach supports modulating the derived
representations in an interactive visual interface, discovering potential
insights, and creating multi-label annotations. Evaluations using real-world
PGPM datasets show that our framework outperforms the baseline models in
identifying and annotating insights.
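The ELBO being compared is the usual variational bound, ELBO = E_{q(z|x)}[log p(x|z)] - KL(q(z|x) || p(z)). A minimal single-level VAE sketch of that computation follows; DUHiV itself stacks hierarchical latents over a DenseU backbone, which this toy version does not reproduce.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """One-level VAE; DUHiV adds hierarchical latents and a DenseU net."""
    def __init__(self, d_in=784, d_z=16):
        super().__init__()
        self.enc = nn.Linear(d_in, 2 * d_z)   # outputs (mu, log_var)
        self.dec = nn.Linear(d_z, d_in)

    def elbo(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparam.
        # log p(x|z) under a Bernoulli decoder (an assumption here).
        recon = -F.binary_cross_entropy_with_logits(
            self.dec(z), x, reduction="none").sum(1)
        # Closed-form KL(q(z|x) || N(0, I)).
        kl = 0.5 * (mu.pow(2) + log_var.exp() - 1 - log_var).sum(1)
        return (recon - kl).mean()

vae = TinyVAE()
x = torch.rand(32, 784)            # stand-in flattened pixel maps
loss = -vae.elbo(x)                # maximize ELBO = minimize its negative
loss.backward()
```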