Recurrent Fusion Network for Image Captioning
Recently, much progress has been made in image captioning, and an
encoder-decoder framework has been adopted by all the state-of-the-art models.
Under this framework, an input image is encoded by a convolutional neural
network (CNN) and then translated into natural language with a recurrent neural
network (RNN). Existing models that rely on this framework employ only one
kind of CNN, e.g., ResNet or Inception-X, which describes image contents from
only one specific viewpoint. Thus, the semantic meaning of an input image
cannot be comprehensively understood, which restricts the performance of
captioning. In this paper, in order to exploit the complementary information
from multiple encoders, we propose a novel Recurrent Fusion Network (RFNet) for
tackling image captioning. The fusion process in our model can exploit the
interactions among the outputs of the image encoders and then generate new
compact yet informative representations for the decoder. Experiments on the
MSCOCO dataset demonstrate the effectiveness of our proposed RFNet, which sets
a new state-of-the-art for image captioning.
Comment: ECCV-1
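The idea of folding the outputs of several image encoders into one compact representation can be illustrated with a toy example. The following is a minimal sketch only, not RFNet's actual architecture: the function names, the random untrained weights, and the two feature vectors are all hypothetical, and a plain tanh recurrence stands in for whatever fusion unit the paper uses.

```python
import math
import random


def make_matrix(rng, rows, cols):
    """Random weight matrix with small entries (illustrative, untrained)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def fuse_step(weights, hidden, features):
    """One recurrent step: new_hidden = tanh(W @ [hidden; features])."""
    joined = hidden + features  # list concatenation of the two vectors
    return [math.tanh(sum(w * x for w, x in zip(row, joined))) for row in weights]


def fuse_encoders(encoder_outputs, hidden_size, seed=0):
    """Fold feature vectors from several encoders into one compact vector."""
    rng = random.Random(seed)
    hidden = [0.0] * hidden_size
    for features in encoder_outputs:
        # Encoders may differ in feature size, so each gets its own weights.
        weights = make_matrix(rng, hidden_size, hidden_size + len(features))
        hidden = fuse_step(weights, hidden, features)
    return hidden


# Hypothetical feature vectors standing in for two different CNN encoders.
resnet_like = [0.5, -1.2, 0.3, 0.8]
inception_like = [1.1, 0.0, -0.7]
fused = fuse_encoders([resnet_like, inception_like], hidden_size=2)
print(len(fused))
```

In this sketch the fused vector has a fixed size regardless of how many encoders contribute, which is the property a downstream decoder would need.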
Control and Analysis for Sequential Information based on Machine Learning
Sequential information is crucial for real-world applications that are related to time; it corresponds to a time series, that is, sequence data arranged in temporal order at regular intervals. In this thesis, we consider four major sequential-information tasks: sequential trend prediction, control strategy optimisation, visual-temporal interpolation and visual-semantic sequential alignment. We develop machine learning theories and provide state-of-the-art models for various real-world applications that involve sequential processes, including the industrial batch process, sequential video inpainting, and sequential visual-semantic image captioning. The ultimate goal is to design a hybrid framework that can unify diverse sequential information analysis and control systems.
For industrial processes, control algorithms rely on simulations to find the optimal control strategy. However, few machine learning techniques can control the process using raw data, although some works use ML to predict trends. Most control methods rely on large amounts of previous experience and cannot exploit future information to optimise the control strategy. To improve the effectiveness of the industrial process, we propose improved reinforcement learning approaches that can modify the control strategy. We also propose a hybrid reinforcement virtual learning approach to optimise the long-term control strategy. This approach creates a virtual space that interacts with reinforcement learning to predict a virtual strategy without conducting any real experiments, thereby improving and optimising control efficiency.
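The general idea of letting a learner improve its strategy from virtual rather than real experience can be sketched with Dyna-Q, a standard model-based reinforcement learning scheme. This is a swapped-in illustration, not the thesis's hybrid reinforcement virtual learning method: the five-state toy process, the function names, and all hyperparameters are assumptions chosen for brevity.

```python
import random


def step(state, action, n_states=5):
    """Toy 'real' process: move left (0) or right (1); reward 1 at the last state."""
    nxt = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if nxt == n_states - 1 else 0.0
    return nxt, reward


def dyna_q(real_steps=200, virtual_updates=10, n_states=5,
           alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Q-learning on real steps plus extra updates replayed from a learned model."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    model = {}  # (state, action) -> (next_state, reward), learned from real data
    state = 0
    for _ in range(real_steps):
        # Epsilon-greedy action on the real process; ties are broken randomly.
        if rng.random() < epsilon or q[state][0] == q[state][1]:
            action = rng.randrange(2)
        else:
            action = 0 if q[state][0] > q[state][1] else 1
        nxt, reward = step(state, action, n_states)
        q[state][action] += alpha * (reward + gamma * max(q[nxt]) - q[state][action])
        model[(state, action)] = (nxt, reward)
        # Virtual experience: replay remembered transitions, no real experiment.
        for _ in range(virtual_updates):
            (s, a), (s2, r) = rng.choice(list(model.items()))
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
        # Restart the episode once the goal state is reached.
        state = 0 if nxt == n_states - 1 else nxt
    return q


q_values = dyna_q()
policy = [0 if left > right else 1 for left, right in q_values[:-1]]
print(policy)
```

Each real step here is followed by ten updates drawn from the learned model, so most of the learning signal comes from simulated transitions rather than from interaction with the process itself.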
For sequential visual information analysis, we propose a dual-fusion transformer model to tackle the sequential visual-temporal encoding in video inpainting tasks. Our framework includes a flow-guided transformer with dual attention fusion, and we observe that the sequential information is effectively processed, resulting in promising inpainted videos.
Finally, we propose a cycle-based captioning model for the analysis of sequential visual-semantic information. This model augments data from two views to optimise caption generation from an image, addressing new few-shot and zero-shot settings. The proposed model can generate more accurate and informative captions by leveraging sequential visual-semantic information.
Overall, the thesis contributes to analysing and manipulating sequential information in multi-modal real-world applications. Our flexible framework design provides a unified theoretical foundation to deploy sequential information systems in distinct application domains. Considering the diversity of challenges addressed in this thesis, we believe our technique paves the way towards versatile AI in the new era.