6 research outputs found

    Image Captioning menurut Scientific Revolution Kuhn dan Popper

    Image captioning is an area of artificial intelligence that brings together computer vision and natural language processing. The focus of this process is a neural network architecture with many layers that identifies the objects in an image and produces a caption. The task of this architecture is to generate a caption from the objects detected in the image. This paper examines the connection between scientific revolutions and image captioning. Our methodology follows Kuhn's account of scientific revolutions and relates it to Popper's philosophy of science. The conclusion is that image captioning is genuinely a science, because many researchers have contributed successive improvements in search of effective deep learning methods. From the standpoint of the philosophy of science, if its phenomena can be falsified, then image captioning is a science.

    Large scale datasets for Image and Video Captioning in Italian

    The application of attention-based deep neural architectures to the automatic captioning of images and videos is enabling the development of increasingly high-performing systems. Unfortunately, while image processing is language independent, the same does not hold for caption generation: training such architectures requires (possibly large-scale) language-specific resources, which are not available for many languages, including Italian. In this paper, we present MSCOCO-it and MSR-VTT-it, two large-scale resources for image and video captioning, derived by applying automatic machine translation to existing resources. Even though this approach is naive and exposed to noisy information (depending on the quality of the automatic translator), we show experimentally that it enables robust deep learning that is rather tolerant of such noise. In particular, we improve the state-of-the-art results for image captioning in Italian. Moreover, we discuss the training of a system that, to the best of our knowledge, is the first video captioning system for Italian.
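
    The derivation procedure described above, machine-translating an existing English caption resource, can be pictured with the minimal sketch below. It assumes the standard MSCOCO annotation JSON layout and a publicly available English-to-Italian MT model from the Hugging Face hub; it is an illustration, not the authors' actual pipeline.

```python
# Minimal sketch of deriving Italian captions by machine translation.
# Assumes the standard MSCOCO "captions_*.json" layout and the public
# Helsinki-NLP/opus-mt-en-it model; the authors' pipeline may differ.
import json
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-it"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate(texts, batch_size=16):
    out = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], return_tensors="pt",
                          padding=True, truncation=True)
        generated = model.generate(**batch)
        out.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return out

with open("captions_train2017.json") as f:           # MSCOCO annotation file
    coco = json.load(f)

captions = [a["caption"] for a in coco["annotations"]]
for ann, it_caption in zip(coco["annotations"], translate(captions)):
    ann["caption"] = it_caption                      # replace English with Italian

with open("captions_train2017_it.json", "w") as f:   # MSCOCO-it-style output
    json.dump(coco, f, ensure_ascii=False)
```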

    A news image captioning approach based on multimodal pointer-generator network

    News image captioning aims to generate captions or descriptions for news images automatically, which can serve as drafts when news image captions are written manually. News image captions differ from generic captions in that they contain more detailed information, such as entity names and events. Therefore, both the news image and the accompanying text are sources for generating the caption. The pointer-generator network is a neural model originally defined for text summarization. This article proposes a multimodal pointer-generator network that incorporates visual information into the original network for news image captioning. A multimodal attention mechanism splits attention into visual attention paid to the image and textual attention paid to the text. A multimodal pointer mechanism uses both textual and visual attention to compute pointer distributions, where visual attention is first transformed into textual attention via word-image relationships. A multimodal coverage mechanism is defined to reduce repetition in the attention weights and in the pointer distributions. Experiments on the DailyMail test dataset and the out-of-domain BBC test dataset show that the proposed model outperforms the original pointer-generator network, a generic image captioning method, an extractive news image captioning method, and an LDA-based method according to BLEU, METEOR, and ROUGE-L evaluations. Experiments also show that the proposed multimodal coverage mechanism improves the model, and that transforming visual attention into pointer distributions improves the model.
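
    As a rough illustration of the mechanism described above (separate textual and visual attention, with visual attention mapped onto source words before it contributes to the pointer distribution), the following PyTorch sketch shows one possible formulation. The layer names, dimensions, and the fixed mixing weight are assumptions for the example, not the paper's implementation.

```python
# Illustrative sketch of a multimodal attention/pointer step (not the
# paper's exact architecture): textual and visual attention are computed
# separately, and visual attention is projected onto source words through
# a word-image relation matrix before entering the pointer distribution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalPointerAttention(nn.Module):
    def __init__(self, dec_dim, txt_dim, img_dim):
        super().__init__()
        self.txt_score = nn.Linear(dec_dim + txt_dim, 1)   # textual attention scorer
        self.img_score = nn.Linear(dec_dim + img_dim, 1)   # visual attention scorer

    def forward(self, dec_state, txt_enc, img_enc, word_img_rel):
        # dec_state: (B, H)       decoder state at the current step
        # txt_enc:   (B, Lt, Dt)  encoded source words
        # img_enc:   (B, Li, Di)  encoded image regions
        # word_img_rel: (B, Li, Lt) soft word-image alignment (rows sum to 1)
        B, Lt, _ = txt_enc.shape
        Li = img_enc.size(1)
        s = dec_state.unsqueeze(1)
        a_txt = F.softmax(self.txt_score(
            torch.cat([s.expand(-1, Lt, -1), txt_enc], dim=-1)).squeeze(-1), dim=-1)
        a_img = F.softmax(self.img_score(
            torch.cat([s.expand(-1, Li, -1), img_enc], dim=-1)).squeeze(-1), dim=-1)
        # transform visual attention into attention over source words
        a_img_as_txt = torch.bmm(a_img.unsqueeze(1), word_img_rel).squeeze(1)  # (B, Lt)
        # pointer distribution over source words; equal mixing is an assumption,
        # a learned gate would normally be used here
        pointer_dist = 0.5 * a_txt + 0.5 * a_img_as_txt
        return a_txt, a_img, pointer_dist
```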

    Neural models for stepwise text illustration

    In this thesis, we investigate the task of sequence-to-sequence (seq2seq) retrieval: given a sequence of text passages as the query, retrieve a sequence of images that best describes and aligns with the query. This goes beyond traditional cross-modal retrieval, which treats each image-text pair independently and ignores the broader context. Since this is a difficult task, we break it into steps. We start with caption generation for images in news articles. Unlike the traditional image captioning task, where a text description is generated from an image alone, here a caption is generated conditioned on both the image and the news article in which it appears. We propose a novel neural-network-based methodology that takes into account both the news article content and the image semantics to generate a caption that best describes the image and its surrounding text context. Our results outperform existing approaches to caption generation. We then introduce two novel datasets, GutenStories and StepwiseRecipe, for the tasks of story picturing and sequential text illustration. GutenStories consists of around 90k text paragraphs, each accompanied by an image, aligned in around 18k visual stories, and covers a wide variety of images and story content styles. StepwiseRecipe is a similar dataset of sequenced image-text pairs but with domain-constrained (food-related) images only; it consists of 67k text paragraphs (cooking instructions), each accompanied by an image describing the step, aligned in 10k recipes. Both datasets are web-crawled and systematically filtered and cleaned. We propose a novel variational recurrent seq2seq (VRSS) retrieval model. At every step, the model encodes two streams of information: the contextual information from both the text and the images retrieved in previous steps, and the semantic meaning of the current input (text) as a latent vector. Together these guide the retrieval of a relevant image from the repository to match the semantics of the given text. The model has been evaluated on both the StepwiseRecipe and GutenStories datasets. Results on several automatic evaluation measures show that our model outperforms several competitive and relevant baselines. We also qualitatively analyse the model, both through human evaluation and by visualizing the representation space to judge its semantic meaningfulness. We further discuss the challenges posed by the more difficult GutenStories dataset and outline possible solutions.
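
    The retrieval step described above (a recurrent context state over previously retrieved text/image pairs plus a latent encoding of the current passage, jointly scoring candidate images) can be sketched as follows. This is a simplified, deterministic sketch under assumed dimensions, a GRU context cell, and cosine-similarity scoring; it is not the thesis's exact variational formulation.

```python
# Rough sketch of one retrieval step in the spirit of the VRSS model:
# a recurrent context state summarises previously retrieved text/image
# pairs, the current passage is encoded into a latent vector, and the
# two together score candidate images. The GRU cell, dimensions and
# cosine scoring are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StepwiseRetriever(nn.Module):
    def __init__(self, txt_dim, img_dim, ctx_dim, latent_dim):
        super().__init__()
        self.to_latent = nn.Linear(txt_dim, latent_dim)         # current passage -> latent vector
        self.context = nn.GRUCell(txt_dim + img_dim, ctx_dim)   # carries story context across steps
        self.query = nn.Linear(latent_dim + ctx_dim, img_dim)   # joint query in image space

    def forward(self, txt_feat, prev_img_feat, ctx_state, candidate_imgs):
        # txt_feat: (B, Dt); prev_img_feat: (B, Di); ctx_state: (B, Dc)
        # candidate_imgs: (N, Di) repository of image embeddings
        z = torch.tanh(self.to_latent(txt_feat))                 # latent semantics of the passage
        ctx_state = self.context(torch.cat([txt_feat, prev_img_feat], dim=-1), ctx_state)
        q = self.query(torch.cat([z, ctx_state], dim=-1))        # (B, Di) query vector
        scores = F.cosine_similarity(q.unsqueeze(1), candidate_imgs.unsqueeze(0), dim=-1)
        return scores.argmax(dim=-1), ctx_state                  # chosen image per step, new context
```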

    FrameNet annotation for multimodal corpora: devising a methodology for the semantic representation of text-image interactions in audiovisual productions

    Multimodal analyses have been growing in importance within several approaches to Cognitive Linguistics and applied fields such as Natural Language Understanding. Nonetheless, fine-grained semantic representations of multimodal objects are still lacking, especially in terms of integrating areas such as Natural Language Processing and Computer Vision, which are key for the implementation of multimodality in Computational Linguistics. In this dissertation, we propose a methodology for extending FrameNet annotation to the multimodal domain, since FrameNet can provide fine-grained semantic representations, particularly with a database enriched by Qualia and other interframal and intraframal relations, as is the case of FrameNet Brasil. To make FrameNet Brasil able to conduct multimodal analysis, we outline the hypothesis that, similarly to the way words in a sentence evoke frames and organize their elements in the syntactic locality accompanying them, visual elements in video shots may also evoke frames and organize their elements on screen, or work complementarily with the frame evocation patterns of the sentences narrated simultaneously with their appearance on screen, providing different profiling and perspective options for meaning construction. The corpus annotated to test the hypothesis is composed of episodes of a Brazilian TV travel series critically acclaimed as an exemplar of good practices in audiovisual composition. The TV genre chosen also provides a novel experimental setting for research on integrated image and text comprehension, since, in this corpus, the text is not a direct description of the image sequence but correlates with it indirectly in a myriad of ways. The dissertation also reports on an eye-tracking experiment conducted to validate the proposed text-oriented annotation approach. The experiment showed that it is not possible to determine that the text directly impacts gaze, which reinforces the approach of valuing the combination of modes. Last, we present the Frame2 dataset, the product of the annotation task carried out on the corpus following the proposed methodology and guidelines. The results demonstrate that, at least for this TV genre but possibly also for others, a fine-grained semantic annotation tackling the diverse correlations that take place in a multimodal setting provides a new perspective on multimodal comprehension modeling. Moreover, multimodal annotation also enriches the development of FrameNets, to the extent that the correlations found between modalities can attest to the modeling choices made by those building frame-based resources.
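
    Purely as a hypothetical illustration of what a multimodal frame annotation record of the kind described above might contain, consider the sketch below. The field names and values are invented for the example and do not reproduce the Frame2 dataset schema.

```python
# Hypothetical multimodal frame annotation record; field names are
# invented for illustration and are not the Frame2 schema.
from dataclasses import dataclass, field

@dataclass
class FrameAnnotation:
    frame: str                       # evoked frame, e.g. "Travel"
    evoking_element: str             # word or visual element that evokes it
    modality: str                    # "text" or "video_shot"
    time_span: tuple                 # (start_s, end_s) within the episode
    frame_elements: dict = field(default_factory=dict)

# A textual and a visual annotation covering the same shot, illustrating how
# the two modalities can evoke the same frame complementarily.
narration = FrameAnnotation("Travel", "viajar", "text", (12.0, 15.5),
                            {"Traveler": "presenter"})
shot = FrameAnnotation("Travel", "airplane_on_screen", "video_shot", (12.0, 15.5),
                       {"Vehicle": "airplane"})
```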

    Neural Caption Generation for News Images

    Automatic caption generation for images has gained significant interest, since it enables many interesting image-related applications. For example, it could help in image and video retrieval and in managing the vast amount of multimedia data available on the Internet. It could also support the development of tools that help visually impaired individuals access multimedia content. In this paper, we focus on news images and propose a methodology for automatically generating captions for newspaper articles consisting of a text paragraph and an image. We propose several deep neural network architectures built upon recurrent neural networks. Results on a BBC News dataset show that our proposed approach outperforms a traditional method based on Latent Dirichlet Allocation, using both automatic evaluation based on BLEU scores and human evaluation.
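
    A generic version of the kind of RNN-based architecture described above conditions a recurrent decoder on encodings of both the article paragraph and the image. The sketch below is a minimal illustration with assumed encoders, dimensions and fusion step; it is not the paper's specific design.

```python
# Generic sketch of caption generation conditioned on article text and
# image features; encoders, dimensions and the fusion step are assumed
# for illustration and are not the paper's specific architectures.
import torch
import torch.nn as nn

class NewsCaptioner(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, img_dim=2048, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.article_enc = nn.GRU(emb_dim, hid_dim, batch_first=True)   # encodes the paragraph
        self.img_proj = nn.Linear(img_dim, hid_dim)                     # projects CNN image features
        self.fuse = nn.Linear(2 * hid_dim, hid_dim)                     # combines text + image context
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)       # generates the caption
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, article_ids, img_feat, caption_ids):
        # article_ids: (B, La) word ids; img_feat: (B, img_dim); caption_ids: (B, T)
        _, h_art = self.article_enc(self.embed(article_ids))            # final state (1, B, H)
        h0 = torch.tanh(self.fuse(torch.cat([h_art.squeeze(0),
                                             self.img_proj(img_feat)], dim=-1)))
        dec_out, _ = self.decoder(self.embed(caption_ids), h0.unsqueeze(0))
        return self.out(dec_out)                                        # (B, T, vocab) logits
```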