
    A Study on Visual Understanding Image Captioning using Different Word Embeddings and CNN-Based Feature Extractions

    Image captioning is the task of describing an image in natural language, with applications such as image indexing and virtual assistants. In this research, we compared the performance of three word embeddings (GloVe, Word2Vec, and FastText) and six CNN-based feature extraction architectures (Inception V3, InceptionResNet V2, ResNet152 V2, EfficientNet B3 V1, EfficientNet B7 V1, and NASNetLarge), each combined with an LSTM decoder to perform image captioning. To develop the model, we used images of ten household objects (bed, cell phone, chair, couch, oven, potted plant, refrigerator, sink, table, and tv) drawn from the MSCOCO dataset, and we created five new captions in Bahasa Indonesia for the selected images. The captions may describe the name, location, color, size, and characteristics of an object and its surrounding area. Across our 18 experimental models, each pairing a word embedding with a CNN-based feature extractor and an LSTM, the combination of Word2Vec + NASNetLarge generated better Indonesian captions than the other models according to the BLEU-4 metric.
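    The abstract does not include code, but the encoder-decoder it describes is easy to illustrate. Below is a minimal sketch (not the authors' implementation) of a merge-style captioning model, assuming precomputed CNN features (e.g., NASNetLarge's 4032-dimensional pooled output), an embedding layer that could be initialized from pretrained Word2Vec vectors, and an LSTM decoder; the vocabulary size, caption length, and embedding dimension are illustrative assumptions.

```python
# Hedged sketch of a merge-style image captioning model: a pretrained CNN
# encoder's features are combined with an LSTM language decoder. All sizes
# below are illustrative assumptions, not values from the paper.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 5000      # assumed vocabulary size
MAX_LEN = 30           # assumed maximum caption length
EMBED_DIM = 300        # pretrained Word2Vec/GloVe vectors are often 300-d
FEAT_DIM = 4032        # NASNetLarge global-pooled feature dimension

# Image branch: precomputed CNN features projected into the decoder space.
img_in = layers.Input(shape=(FEAT_DIM,), name="cnn_features")
img_vec = layers.Dense(256, activation="relu")(layers.Dropout(0.5)(img_in))

# Text branch: an embedding layer (which could be initialized from
# pretrained Word2Vec weights) feeding an LSTM.
txt_in = layers.Input(shape=(MAX_LEN,), name="caption_tokens")
emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(txt_in)
txt_vec = layers.LSTM(256)(layers.Dropout(0.5)(emb))

# Merge both modalities and predict the next word of the caption.
merged = layers.add([img_vec, txt_vec])
out = layers.Dense(VOCAB_SIZE, activation="softmax")(
    layers.Dense(256, activation="relu")(merged))

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.summary()
```

    In this merge-style design the image features condition the word predictor only at the output stage, so swapping the CNN backbone or the embedding (as the paper's 18 experiments do) changes only the two input branches.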

    Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation

    This paper surveys the current state of the art in Natural Language Generation (NLG), defined as the task of generating text or speech from non-linguistic input. A survey of NLG is timely in view of the changes the field has undergone over the past decade or so, especially the rise of (usually data-driven) methods and new applications of NLG technology. This survey therefore aims to (a) give an up-to-date synthesis of research on the core tasks in NLG and the architectures in which such tasks are organised; (b) highlight a number of relatively recent research topics that have arisen partly as a result of growing synergies between NLG and other areas of artificial intelligence; and (c) draw attention to the challenges of NLG evaluation, relating them to similar challenges in other areas of Natural Language Processing, with an emphasis on different evaluation methods and the relationships between them.

    Comment: Published in the Journal of AI Research (JAIR), volume 61, pp. 75-170. 118 pages, 8 figures, 1 table.

    Exploring the Internal Statistics: Single Image Super-Resolution, Completion and Captioning

    Image enhancement has drawn increasing attention as a means of improving image quality and interpretability. It aims to modify images to achieve better perception for the human visual system, or a representation better suited to further analysis in applications such as medical imaging, remote sensing, and video surveillance. Depending on the attributes of the input images, enhancement tasks vary: noise removal, deblurring, resolution enhancement, prediction of missing pixels, and so on. The latter two are usually referred to as image super-resolution and image inpainting (or completion). Both are numerically ill-posed problems. Multi-frame approaches exploit the aliasing present across multiple frames of the same scene; when only one input image is available, estimating the unknown pixel values is extremely challenging.

    In this dissertation, we target single image super-resolution and completion by exploring the internal statistics within the input image and across its scales. An internal gradient similarity-based single image super-resolution algorithm is first presented. We then demonstrate that the proposed framework naturally extends to performing super-resolution and completion simultaneously. Afterwards, a hybrid learning-based single image super-resolution approach is proposed that benefits from both external and internal statistics. This framework hinges on image-level hallucination from externally learned regression models, together with gradient-level pyramid self-awareness for refining edges and textures. The framework is then employed to break the resolution limitation of passive microwave imagery and to boost the tracking accuracy of sea ice movements. To extend our research to the quality enhancement of depth maps, a novel system is presented for circumstances where only one pair of registered low-resolution intensity and depth images is available; the system produces high-quality RGB and depth images. Extensive experimental results demonstrate the effectiveness of all the proposed frameworks, both quantitatively and qualitatively.

    Unlike image super-resolution and completion, which belong to low-level vision, image captioning is a high-level vision task concerned with the semantic understanding of an input image. It is natural for human beings but remains challenging from a computer vision point of view, especially because the task itself is ambiguous: in principle, a description of an image can address any of its visual aspects, from object attributes to scene features, or even refer to objects that are not depicted and to hidden interactions or connections that require common-sense knowledge to analyze. Learning-based image captioning is therefore, in general, a data-driven task that relies heavily on the training dataset. Descriptions in the majority of existing image-sentence datasets are written by humans under specific instructions. Real-world sentence data is rarely used directly for training, since it is often noisy and unbalanced, making it 'imperfect' for training image captioning models. In this dissertation, we present a novel image captioning framework for uncontrolled image-sentence datasets, where descriptions may be strongly or weakly correlated with the image content and of arbitrary length. A self-guiding learning process is proposed to fully reveal the internal statistics of the training dataset, to examine the learning process globally, and to generate descriptions that are syntactically correct and semantically sound.
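    The "internal statistics" the abstract builds on is the observation that small patches of a natural image tend to recur within and across its scales, so a coarse patch can be matched against a downscaled copy of the same image and the corresponding finer patch reused as a high-resolution prediction. The following is a toy sketch of that idea (a nearest-neighbour variant, not the dissertation's gradient-similarity algorithm; all function names and parameters are illustrative).

```python
# Toy illustration of cross-scale internal patch recurrence for single
# image super-resolution. Not the dissertation's method: a hedged,
# minimal nearest-neighbour variant for exposition only.
import numpy as np

def downscale(img: np.ndarray, s: int = 2) -> np.ndarray:
    """Box-filter downscaling by an integer factor s."""
    h, w = (img.shape[0] // s) * s, (img.shape[1] // s) * s
    return img[:h, :w].reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def extract_patches(img: np.ndarray, p: int) -> np.ndarray:
    """All p x p patches of a grayscale image, flattened to rows."""
    H, W = img.shape
    return np.stack([img[i:i + p, j:j + p].ravel()
                     for i in range(H - p + 1)
                     for j in range(W - p + 1)])

def sr_patch(lr_patch: np.ndarray, img: np.ndarray, p: int = 5, s: int = 2):
    """Match `lr_patch` against a downscaled copy of `img` and return the
    co-located (s*p x s*p) patch of `img` as the high-resolution guess."""
    small = downscale(img, s)
    cand = extract_patches(small, p)
    idx = np.argmin(((cand - lr_patch.ravel()) ** 2).sum(axis=1))
    n_cols = small.shape[1] - p + 1
    i, j = divmod(idx, n_cols)
    return img[i * s:i * s + p * s, j * s:j * s + p * s]

# Usage on a synthetic gradient image: super-resolve one 5x5 patch.
img = np.tile(np.linspace(0.0, 1.0, 64), (64, 1))
lr = downscale(img, 2)
hr_guess = sr_patch(lr[10:15, 10:15], img, p=5, s=2)
print(hr_guess.shape)  # (10, 10)
```

    A full method would aggregate overlapping patch predictions and enforce consistency with the input, but even this toy version shows how a single image can serve as its own training set.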