Image Captioning menurut Scientific Revolution Kuhn dan Popper (Image Captioning According to the Scientific Revolution of Kuhn and Popper)
Image captioning is an area of artificial intelligence that combines computer vision and natural language processing. It centers on a multi-layer neural network architecture whose task is to identify the objects in an image and generate a caption from those detections. This paper examines the connection between scientific revolutions and image captioning. Our methodology applies Kuhn's account of scientific revolutions and relates it to Popper's philosophy of science. We conclude that image captioning is genuinely a science, because many researchers have contributed successive improvements in the search for an effective deep learning method. Under Popper's philosophy of science, if its claims can be falsified, then image captioning is a science.
Non-Autoregressive Coarse-to-Fine Video Captioning
It is encouraged to see that progress has been made to bridge videos and
natural language. However, mainstream video captioning methods suffer from slow
inference speed due to the sequential manner of autoregressive decoding, and
prefer generating generic descriptions due to the insufficient training of
visual words (e.g., nouns and verbs) and inadequate decoding paradigm. In this
paper, we propose a non-autoregressive decoding based model with a
coarse-to-fine captioning procedure to alleviate these defects. In
implementations, we employ a bi-directional self-attention based network as our
language model for achieving inference speedup, based on which we decompose the
captioning procedure into two stages, where the model has different focuses.
Specifically, given that visual words determine the semantic correctness of
captions, we design a mechanism of generating visual words to not only promote
the training of scene-related words but also capture relevant details from
videos to construct a coarse-grained sentence "template". Thereafter, we devise
dedicated decoding algorithms that fill in the "template" with suitable words
and modify inappropriate phrasing via iterative refinement to obtain a
fine-grained description. Extensive experiments on two mainstream video
captioning benchmarks, i.e., MSVD and MSR-VTT, demonstrate that our approach
achieves state-of-the-art performance, generates diverse descriptions, and
obtains high inference efficiency. Our code is available at
https://github.com/yangbang18/Non-Autoregressive-Video-Captioning.
Comment: 9 pages, 6 figures, to be published in AAAI2021.
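The coarse-to-fine procedure described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a mask-predict style refinement loop, in which all masked positions are predicted in parallel and the least confident ones are re-masked for the next pass. The names `coarse_template`, `toy_predict`, and `refine`, the fixed toy caption, and the confidence rule are all hypothetical stand-ins for a trained bi-directional self-attention model.

```python
MASK = "<mask>"

# Hypothetical caption that the toy model "knows"; a real model would be learned.
_TOY_CAPTION = ["a", "man", "is", "playing", "a", "guitar"]


def coarse_template(length, visual_words):
    """Coarse stage: place visual words (nouns, verbs) and mask everything else."""
    return [visual_words.get(i, MASK) for i in range(length)]


def toy_predict(tokens):
    """Stand-in for a bidirectional language model (assumption, not the paper's).

    Returns one (word, confidence) pair per position. Confidence grows with
    the number of already-filled neighbours, mimicking two-sided context.
    """
    preds = []
    for i in range(len(tokens)):
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(tokens)]
        known = sum(tokens[j] != MASK for j in neighbours)
        word = _TOY_CAPTION[i] if known else "the"
        preds.append((word, 0.5 + 0.25 * known))
    return preds


def refine(template, predict, iterations=3):
    """Fine stage: fill masks in parallel, then re-mask the least confident."""
    tokens = list(template)
    # Positions not given by the template; visual words stay fixed in this sketch.
    free = [i for i, tok in enumerate(tokens) if tok == MASK]
    conf = {}
    for step in range(1, iterations + 1):
        preds = predict(tokens)  # one parallel pass over all positions
        for i in free:
            if tokens[i] == MASK:
                tokens[i], conf[i] = preds[i]
        # Linearly shrink the number of positions that get re-masked.
        n_mask = len(free) * (iterations - step) // iterations
        for i in sorted(free, key=lambda k: conf[k])[:n_mask]:
            tokens[i] = MASK
    return tokens


template = coarse_template(6, {1: "man", 3: "playing", 5: "guitar"})
caption = refine(template, toy_predict, iterations=3)
print(" ".join(caption))  # a man is playing a guitar
```

Because every pass predicts all masked positions at once, the number of model calls is set by `iterations` rather than by caption length, which is the source of the inference speedup that non-autoregressive decoding aims for.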