162 research outputs found
Automated annotation of landmark images using community contributed datasets and web resources
A novel solution to the challenge of automatic image annotation is described. Given an image with GPS data of its location of capture, our system returns a semantically-rich annotation comprising tags which both identify the landmark in the image, and provide an interesting fact about it, e.g. "A view of the Eiffel Tower, which was built in 1889 for an international exhibition in Paris". This exploits visual and textual web mining in combination with content-based image
analysis and natural language processing. In the first stage, an input image is matched to a set of community contributed images (with keyword tags) on the basis of its GPS information and image classification techniques. The depicted landmark is inferred from the keyword tags for the matched set. The system then takes advantage of the information written about landmarks available on the web at large to extract a fact about the landmark in the image. We report component evaluation results from an implementation of our solution on a mobile device. Image localisation and matching oers 93.6% classication accuracy; the selection of appropriate tags for use in annotation performs well (F1M of
0.59), and it subsequently automatically identies a correct toponym for use in captioning and fact extraction in 69.0% of the tested cases; finally the fact extraction returns an interesting caption in 78% of cases
An Autoencoder-Based Image Descriptor for Image Matching and Retrieval
Local image features are used in many computer vision applications. Many point detectors and descriptors have been proposed in recent years; however, creation of effective descriptors is still a topic of research. The Scale Invariant Feature Transform (SIFT) developed by David Lowe is widely used in image matching and image retrieval. SIFT detects interest points in an image based on Scale-Space analysis, which is invariant to change in image scale. A SIFT descriptor contains gradient information about an image patch centered at a point of interest. SIFT is found to provide a high matching rate, is robust to image transformations; however, it is found to be slow in image matching/retrieval. Autoencoder is a method for representation learning and is used in this project to construct a low-dimensional representation of a high-dimensional data while preserving the structure and geometry of the data. In many computer vision tasks, the high dimensionality of input data means a high computational cost. The main motivation in this project is to improve the speed and the distinctness of SIFT descriptors. To achieve this, a new descriptor is proposed that is based on Autoencoder. Our newly generated descriptors can reduce the size and complexity of SIFT descriptors, reducing the time required in image matching and image retrieval
MIDMs: Matching Interleaved Diffusion Models for Exemplar-based Image Translation
We present a novel method for exemplar-based image translation, called
matching interleaved diffusion models (MIDMs). Most existing methods for this
task were formulated as GAN-based matching-then-generation framework. However,
in this framework, matching errors induced by the difficulty of semantic
matching across cross-domain, e.g., sketch and photo, can be easily propagated
to the generation step, which in turn leads to degenerated results. Motivated
by the recent success of diffusion models overcoming the shortcomings of GANs,
we incorporate the diffusion models to overcome these limitations.
Specifically, we formulate a diffusion-based matching-and-generation framework
that interleaves cross-domain matching and diffusion steps in the latent space
by iteratively feeding the intermediate warp into the noising process and
denoising it to generate a translated image. In addition, to improve the
reliability of the diffusion process, we design a confidence-aware process
using cycle-consistency to consider only confident regions during translation.
Experimental results show that our MIDMs generate more plausible images than
state-of-the-art methods
Rethinking the sGLOH descriptor
sGLOH (shifting GLOH) is a histogram-based keypoint descriptor that can be associated to multiple quantized rotations of the keypoint patch without any recomputation. This property can be exploited to define the best distance between two descriptor vectors, thus avoiding computing the dominant orientation. In addition, sGLOH can reject incongruous correspondences by adding a global constraint on the rotations either as an a priori knowledge or based on the data. This paper thoroughly reconsiders sGLOH and improves it in terms of robustness, speed and descriptor dimension. The revised sGLOH embeds more quantized rotations, thus yielding more correct matches. A novel fast matching scheme is also designed, which significantly reduces both computation time and memory usage. In addition, a new binarization technique based on comparisons inside each descriptor histogram is defined, yielding a more compact, faster, yet robust alternative. Results on an exhaustive comparative experimental evaluation show that the revised sGLOH descriptor incorporating the above ideas and combining them according to task requirements, improves in most cases the state of the art in both image matching and object recognition
頑健な画像間対応付け及び視覚的位置推定のための深層学習手法
Tohoku University博士(情報科学)thesi
Towards Content-based Pixel Retrieval in Revisited Oxford and Paris
This paper introduces the first two pixel retrieval benchmarks. Pixel
retrieval is segmented instance retrieval. Like semantic segmentation extends
classification to the pixel level, pixel retrieval is an extension of image
retrieval and offers information about which pixels are related to the query
object. In addition to retrieving images for the given query, it helps users
quickly identify the query object in true positive images and exclude false
positive images by denoting the correlated pixels. Our user study results show
pixel-level annotation can significantly improve the user experience.
Compared with semantic and instance segmentation, pixel retrieval requires a
fine-grained recognition capability for variable-granularity targets. To this
end, we propose pixel retrieval benchmarks named PROxford and PRParis, which
are based on the widely used image retrieval datasets, ROxford and RParis.
Three professional annotators label 5,942 images with two rounds of
double-checking and refinement. Furthermore, we conduct extensive experiments
and analysis on the SOTA methods in image search, image matching, detection,
segmentation, and dense matching using our pixel retrieval benchmarks. Results
show that the pixel retrieval task is challenging to these approaches and
distinctive from existing problems, suggesting that further research can
advance the content-based pixel-retrieval and thus user search experience. The
datasets can be downloaded from
\href{https://github.com/anguoyuan/Pixel_retrieval-Segmented_instance_retrieval}{this
link}
Visual Place Recognition under Severe Viewpoint and Appearance Changes
Over the last decade, the eagerness of the robotic and computer vision research communities unfolded extensive advancements in long-term robotic vision. Visual localization is the constituent of this active research domain; an ability of an object to correctly localize itself while mapping the environment simultaneously, technically termed as Simultaneous Localization and Mapping (SLAM).
Visual Place Recognition (VPR), a core component of SLAM is a well-known paradigm. In layman terms, at a certain place/location within an environment, a robot needs to decide whether it’s the same place experienced before? Visual Place Recognition utilizing Convolutional Neural Networks (CNNs) has made a major contribution in the last few years. However, the image retrieval-based VPR becomes more challenging when the same places experience strong viewpoint and seasonal transitions. This thesis concentrates on improving the retrieval performance of VPR system, generally targeting the place correspondence.
Despite the remarkable performances of state-of-the-art deep CNNs for VPR, the significant computation- and memory-overhead limit their practical deployment for resource constrained mobile robots. This thesis investigates the utility of shallow CNNs for power-efficient VPR applications. The proposed VPR frameworks focus on novel image regions that can contribute in recognizing places under dubious environment and viewpoint variations.
Employing challenging place recognition benchmark datasets, this thesis further illustrates and evaluates the robustness of shallow CNN-based regional features against viewpoint and appearance changes coupled with dynamic instances, such as pedestrians, vehicles etc. Finally, the presented computation-efficient and light-weight VPR methodologies have shown boostup in matching performance in terms of Area under Precision-Recall curves (AUC-PR curves) over state-of-the-art deep neural network based place recognition and SLAM algorithms
- …