Distinctive-attribute Extraction for Image Captioning
Image captioning, an open research problem, has evolved with the progress of deep neural networks. In this line of research, convolutional neural networks (CNNs) compute image features and recurrent neural networks (RNNs) generate natural-language descriptions. In previous work, captions with richer semantic description have been generated by feeding additional information into the RNNs. Following this approach, we propose distinctive-attribute extraction (DaE), which explicitly encourages significant meanings so as to generate an accurate caption describing the overall content of the image and its particular situation. Specifically, the captions of training images are analyzed with term frequency-inverse document frequency (TF-IDF), and a model is trained on this semantic information to extract distinctive attributes for inferring captions. The proposed scheme is evaluated on challenge data and improves objective performance while describing images in more detail.
Comment: 14 main pages, 4 supplementary pages
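To make the TF-IDF step concrete, here is a minimal sketch (not the authors' code) of scoring caption terms per image with scikit-learn; grouping all captions of an image into one document and taking the top-k terms as distinctive attributes are assumptions.

```python
# Minimal sketch of the TF-IDF caption analysis described above (not the
# authors' code). Treating all captions of one image as a single document
# and keeping the top-k scoring terms are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer

def distinctive_attributes(captions_per_image, top_k=5):
    """captions_per_image: list of lists of caption strings, one list per image."""
    docs = [" ".join(caps) for caps in captions_per_image]  # one doc per image
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(docs)                 # shape: (n_images, vocab)
    vocab = vec.get_feature_names_out()
    attrs = []
    for row in tfidf:
        dense = row.toarray().ravel()
        top = dense.argsort()[::-1][:top_k]         # highest-scoring terms first
        attrs.append([vocab[i] for i in top if dense[i] > 0])
    return attrs

print(distinctive_attributes([
    ["a red bus on a city street", "a double decker bus driving downtown"],
    ["a cat sleeping on a sofa", "a tabby cat curled up on a couch"],
]))
```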
Portable extraction of partially structured facts from the web
A novel fact extraction task is defined to fill a gap between current information retrieval and information extraction technologies. It is shown that it is possible to extract useful partially structured facts about different kinds of entities in a broad domain, i.e. all kinds of places depicted in tourist images. Importantly, the approach does not rely on existing linguistic resources (gazetteers, taggers, parsers, etc.), and it was ported easily and cheaply between two very different languages (English and Latvian). Previous fact extraction from the web has focused on the extraction of structured data, e.g. (Building-LocatedIn-Town). In contrast, we extract richer and more interesting facts, such as a fact explaining why a building was built. Enough structure is maintained to facilitate subsequent processing of the information. For example, this partial structure enables straightforward template-based text generation. We report positive results for the correctness and interest of English and Latvian facts and for the utility of the extracted facts in enhancing image captions.
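As an illustration of how the retained partial structure enables template-based text generation, consider the following sketch; the fact schema (entity / attribute / free-text value) and the templates are hypothetical, not the paper's actual format.

```python
# Illustrative sketch of template-based caption enhancement from a partially
# structured fact, as described above. The fact schema and the templates are
# assumptions, not the paper's format.
TEMPLATES = {
    "reason_built": "{entity} was built {value}.",
    "located_in":   "{entity} is located in {value}.",
}

def render_fact(fact):
    template = TEMPLATES.get(fact["attribute"])
    if template is None:
        return None                      # no template for this attribute
    return template.format(entity=fact["entity"], value=fact["value"])

fact = {"entity": "The Freedom Monument",
        "attribute": "reason_built",
        "value": "to honour soldiers killed in the Latvian War of Independence"}
print(render_fact(fact))
```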
Unsupervised Text Extraction from G-Maps
This paper presents a text extraction method for Google Maps and other GIS maps/images. Because the approach is unsupervised, it requires no prior knowledge or training set concerning the textual and non-textual parts. Fuzzy C-Means clustering is used for image segmentation, and the Prewitt method is used to detect edges. Connected-component analysis and a gridding technique improve the correctness of the results. The proposed method reaches a 98.5% accuracy level on the experimental data sets.
Comment: Proc. IEEE Conf. #30853, International Conference on Human Computer Interactions (ICHCI'13), Chennai, India, 23-24 Aug., 2013
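A compact sketch of the described pipeline follows, assuming a hand-rolled fuzzy C-means over pixel intensities and scikit-image for the Prewitt edges and connected components; the input filename, the two-cluster setup, and the choice of which cluster holds text are assumptions, and the gridding step is omitted.

```python
# Sketch of the pipeline described above: fuzzy C-means segmentation,
# Prewitt edge detection, and connected-component analysis. The FCM
# implementation and cluster choices below are assumptions.
import numpy as np
from skimage import io, color, filters, measure

def fuzzy_cmeans(x, c=2, m=2.0, iters=50):
    """x: 1-D array of samples. Returns (cluster centers, membership matrix u)."""
    rng = np.random.default_rng(0)
    u = rng.random((c, x.size))
    u /= u.sum(axis=0)                               # memberships sum to 1
    for _ in range(iters):
        um = u ** m
        centers = um @ x / um.sum(axis=1)            # fuzzily weighted means
        d = np.abs(x[None, :] - centers[:, None]) + 1e-9
        u = 1.0 / d ** (2 / (m - 1))                 # standard FCM update
        u /= u.sum(axis=0)                           # renormalise memberships
    return centers, u

gray = color.rgb2gray(io.imread("map_tile.png"))     # hypothetical input file
_, u = fuzzy_cmeans(gray.ravel())
segmented = u.argmax(axis=0).reshape(gray.shape)     # hard labels from FCM
edges = filters.prewitt(gray)                        # Prewitt edge map
components = measure.label(segmented == 1)           # candidate text regions
print(components.max(), "connected components found")
```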
Human Attention in Image Captioning: Dataset and Analysis
In this work, we present a novel dataset consisting of eye movements and verbal descriptions recorded synchronously over images. Using this data, we study the differences in human attention during free-viewing and image captioning tasks. We look into the relationship between human attention and language constructs during perception and sentence articulation. We also analyse attention deployment mechanisms in the top-down soft attention approach that is argued to mimic human attention in captioning tasks, and investigate whether visual saliency can help image captioning. Our study reveals that (1) human attention behaviour differs between free-viewing and image description tasks: humans tend to fixate on a greater variety of regions in the latter task; (2) there is a strong relationship between described objects and attended objects ( of the described objects are attended); (3) a convolutional neural network used as the feature encoder accounts for human-attended regions during image captioning to a great extent (around ); (4) the soft-attention mechanism differs from human attention, both spatially and temporally, and there is low correlation between caption scores and attention-consistency scores, indicating a large gap between humans and machines with regard to top-down attention; and (5) by integrating the soft attention model with image saliency, we can significantly improve the model's performance on the Flickr30k and MSCOCO benchmarks. The dataset can be found at: https://github.com/SenHe/Human-Attention-in-Image-Captioning.
Comment: To appear at ICCV 2019
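For reference, here is a minimal sketch of the kind of top-down soft attention analysed above (in the Show, Attend and Tell style), with an optional saliency prior along the lines of finding (5); the dimensions and the log-prior weighting are assumptions, not the study's exact model.

```python
# Minimal sketch of top-down soft attention over CNN grid features, with an
# optional saliency prior. Dimensions and the prior term are assumptions.
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden, saliency=None):
        # feats: (B, L, feat_dim) CNN grid features; hidden: (B, hidden_dim)
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                           # (B, L) attention logits
        if saliency is not None:                 # optional image-saliency prior
            e = e + torch.log(saliency + 1e-9)   # bias toward salient cells
        alpha = torch.softmax(e, dim=1)          # spatial attention weights
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # (B, feat_dim)
        return context, alpha

attn = SoftAttention()
feats = torch.randn(2, 49, 512)      # e.g. a 7x7 feature grid, flattened
hidden = torch.randn(2, 512)
ctx, alpha = attn(feats, hidden)
print(ctx.shape, alpha.shape)        # torch.Size([2, 512]) torch.Size([2, 49])
```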