Multimodal Machine Learning for Personalized Interaction with Cultural Heritage
Multimodal machine learning involving textual and visual data is a fundamental research topic at the intersection of natural language processing and computer vision. Its use for personalized interaction in the cultural heritage domain has attracted increasing attention in recent years. On the one hand, it can improve the experience of visitors in a physical museum when adopted in tasks such as multimodal question answering. On the other hand, it can ease the online navigation of artworks, for example by enabling the retrieval of artwork images from keywords and vice versa. In this thesis, we propose approaches based on graphical models and neural networks for multimodal machine learning, and we investigate their value for personalized interaction with cultural heritage in applications including multimodal question answering, image captioning and fine-grained cross-modal retrieval.
In multimodal question answering for cultural heritage, the goal is to retrieve the passages containing the correct answers to users' multimodal questions, where a multimodal question consists of an image and a textual question about that image. We make three main contributions to this task. First, we build a dataset in which the questions are collected from real users and the multimodal documents from which passages are retrieved are downloaded from the web. Three simple baseline retrieval models are constructed to experiment on the dataset; they perform text-to-text, text-to-image and multimodal matching between a multimodal question and a passage with a vector space model. We show that the model using both images and text for matching performs best, as it provides more evidence for identifying the correct passage. Second, we design a graphical model based on Markov networks that allows different sources of evidence to be encoded together. Experiments demonstrate the importance of the interplay between different evidence sources in the graph for improving retrieval capability. Finally, we add captions generated for the question images into the graphical model and show its potential to boost retrieval performance.
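The text-to-text side of the vector space matching above can be sketched as a cosine-similarity ranking; the function and data names below are illustrative and not the thesis implementation.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two term vectors.
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

def rank_passages(question_vec, passage_vecs):
    # Return passage indices sorted by similarity to the question, best first.
    scores = [cosine(question_vec, p) for p in passage_vecs]
    return sorted(range(len(passage_vecs)), key=lambda i: -scores[i])

question = np.array([1.0, 0.0, 2.0])       # toy term-frequency vector
passages = [np.array([2.0, 0.0, 4.0]),     # shares both question terms
            np.array([0.0, 3.0, 0.0])]     # shares no question terms
print(rank_passages(question, passages))   # [0, 1]
```

The multimodal variant would compute an analogous similarity on the visual side and combine the two rankings.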
In image captioning for ancient artworks, the aim is to generate a caption sentence for a given image of an artwork. To this end, we propose an artwork-type-enriched image captioning model based on a neural encoder-decoder framework. The encoder represents an input artwork image as a 512-dimensional vector, and the decoder generates a corresponding caption from this vector. The artwork type is first predicted by a convolutional neural network classifier and then merged into the decoder. We adopt multiple approaches to integrate the artwork type into the captioning model; the one that applies a step-wise weighted sum of the artwork type vector and the hidden representation vector of the decoder performs best.
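The step-wise weighted sum can be illustrated as follows; the scalar mixing weight `alpha` and the random vectors are assumptions for the sketch (in the model the weighting is learned, and the merge happens at every decoding step).

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = rng.standard_normal(512)    # decoder hidden state at one time step
type_vec = rng.standard_normal(512)  # embedding of the predicted artwork type
alpha = 0.7                          # mixing weight (illustrative scalar)

# Step-wise weighted sum: the artwork-type information is blended into the
# decoder state before the next caption word is predicted.
merged = alpha * hidden + (1.0 - alpha) * type_vec
print(merged.shape)  # (512,)
```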
Moving from multimodal question answering to image captioning, we shift the textual granularity of our task from passages to sentences. In fine-grained cross-modal retrieval of ancient artifacts, we move to a finer-grained level on both the visual and textual side, conducting retrieval between fragments of images and text describing cultural items. Here we introduce a weakly supervised alignment model in which the correspondence between the input training image fragments and phrases is not known, but two items referring to the same artwork are treated as a positive pair. The model exploits the latent alignment between fragments across modalities using attention mechanisms: it first projects them into a shared semantic space and is then trained to make the image-text similarity of positive pairs larger in that space. During this process, we encode the inputs of our model with hierarchical encodings and remove irrelevant fragments with different indicator functions. Because training data is limited, we also study techniques to augment it with synthetic relevant fragments. At test time, we rank image fragments and noun phrases by their inter-modal similarity in the learned common space. This work demonstrates the importance of removing irrelevant fragments when representing an image and a text as a combination of their fragments.
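Training "to make the image-text similarity of positive pairs larger" is commonly implemented with a max-margin (hinge) loss over negative pairs; the sketch below shows that generic objective, not the exact loss used in the thesis.

```python
def hinge_loss(sim_pos, sim_negs, margin=0.2):
    # The positive image-text pair must beat each negative pair's
    # similarity by at least `margin`; violations contribute to the loss.
    return sum(max(0.0, margin - sim_pos + s) for s in sim_negs)

print(hinge_loss(0.9, [0.3, 0.5]))  # 0.0 -- negatives already well separated
print(hinge_loss(0.4, [0.3, 0.5]))  # ~0.4 -- both negatives inside the margin
```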
Multimodal machine learning is challenging when applied to the cultural heritage field because of noisy text, large variance in image patterns and limited training data. The contributions in this dissertation therefore serve as a case study and a starting point to guide future research.
Simple baseline models for multimodal question answering in the cultural heritage domain
With the increasing use of mobile devices, taking pictures has become an easy and natural way for people to interact with cultural objects. In such circumstances, we propose multimodal question answering (MQA) to offer personalized answers to users' questions. In this research, a query from an end user consists of an image of an artwork and a textual question referring to this image. For this purpose, we built a dataset especially for MQA in the cultural heritage domain (Sheng et al., 2016). In the present study, we give a detailed introduction to this multimodal question answering system and its advances. Three baseline models are implemented for retrieving answers from the documentation in the dataset: a text-matching model, an image-matching model and a multimodal intersection model. The text-matching model ranks the candidate passages purely by their similarity with the textual part of a multimodal query. The image-matching model ranks the candidate passages purely by the similarity between the images around these passages and the visual query. The intersection model performs the ranking by comparing both the textual and the visual part of a multimodal query with the content of the documentation and keeping the shared passages found relevant by both. The mean average precision (MAP) score is adopted as the main evaluation criterion for these three baseline models; it reaches its highest value of 0.2079 with the intersection model. NIL recall and precision are reported instead when no answer exists in the document collection for a particular multimodal query.
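Mean average precision, the criterion used above, can be computed as follows (toy relevance data; the real evaluation runs over the full query set of the dataset).

```python
def average_precision(ranked_relevance):
    # ranked_relevance: 1/0 relevance flags of retrieved passages, best first.
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank   # precision at this recall point
    return total / hits if hits else 0.0

def mean_average_precision(per_query_relevance):
    # MAP: average of the per-query average precision scores.
    return (sum(average_precision(q) for q in per_query_relevance)
            / len(per_query_relevance))

# Two toy queries: correct passage ranked first, and ranked third.
print(mean_average_precision([[1, 0, 0], [0, 0, 1]]))  # ~0.667
```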
A Markov network based passage retrieval method for multimodal question answering in the cultural heritage domain
In this paper, we propose a Markov network based graphical framework to perform passage retrieval for multimodal question answering (MQA) with weak supervision in the cultural heritage domain. The framework encodes the dependencies between a question's feature information and the passage containing its answer, under the assumption that there is a latent alignment between a question and its candidate answer. Experiments on a challenging multimodal dataset show that this framework achieves an improvement of 5% in mean average precision (mAP) over a state-of-the-art method employing the same features, namely (i) image match and (ii) word co-occurrence information of a passage and a question. We additionally construct two extended graphical frameworks that integrate one more feature, a (question type)-(named entity) match, to further boost the performance. One of the extended models improves performance by a further 2% in mAP.
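The flavor of combining the two evidence features can be sketched as a weighted log-linear product of potentials, the standard parameterization of a Markov network; the potential values and weights below are made up, and the real model's structure and training follow the paper.

```python
import math

def combine_evidence(potentials, weights):
    # Log-linear combination of clique potentials: exp(sum_f w_f * log phi_f).
    # The max() guard avoids log(0) for a zero potential.
    return math.exp(sum(w * math.log(max(p, 1e-12))
                        for p, w in zip(potentials, weights)))

# (image match, word co-occurrence) potentials for two candidate passages.
score_a = combine_evidence([0.8, 0.6], [1.0, 1.0])
score_b = combine_evidence([0.4, 0.3], [1.0, 1.0])
print(score_a > score_b)  # True -- passage A has stronger joint evidence
```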
Fine-Grained Cross-Modal Retrieval for Cultural Items with Focal Attention and Hierarchical Encodings
In this paper, we target fine-grained image-text alignment and cross-modal retrieval in the cultural heritage domain as follows: (1) given an image fragment of an artwork, we retrieve the noun phrases that describe it; (2) given a noun phrase describing an artifact attribute, we retrieve the corresponding image fragment it specifies. To this end, we propose a weakly supervised alignment model in which the correspondence between the input training visual and textual fragments is not known, but units that refer to the same artwork are treated as a positive pair. The model exploits the latent alignment between fragments across modalities using attention mechanisms, first projecting them into a shared common semantic space; the model is then trained by increasing the image-text similarity of positive pairs in that space. During this process, we encode the inputs of our model with hierarchical encodings and remove irrelevant fragments with different indicator functions. We also study techniques to augment the limited training data with synthetic relevant textual fragments and transformed image fragments. The model is later fine-tuned on a limited set of small-scale image-text fragment pairs. We rank the test image fragments and noun phrases by their intermodal similarity in the learned common space. Extensive experiments demonstrate that our proposed models outperform two state-of-the-art methods adapted to fine-grained cross-modal retrieval of cultural items on two benchmark datasets.
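The attention-plus-indicator idea can be sketched as follows: fragments scoring below a threshold are masked out by an indicator before softmax pooling. The names and the simple threshold rule are illustrative assumptions, not the exact focal attention of the paper.

```python
import numpy as np

def attend(query, fragments, threshold=0.0):
    # Score each fragment against the query, mask irrelevant fragments with
    # an indicator, then softmax-pool the survivors into one representation.
    scores = fragments @ query
    keep = scores > threshold                 # indicator function
    scores = np.where(keep, scores, -np.inf)  # masked fragments get zero weight
    weights = np.exp(scores - scores[keep].max())
    weights /= weights.sum()
    return weights @ fragments

query = np.ones(4)
fragments = np.array([[1.0, 1.0, 1.0, 1.0],      # strongly relevant (score 4)
                      [-1.0, -1.0, -1.0, -1.0],  # irrelevant (score -4, masked)
                      [0.5, 0.5, 0.5, 0.5]])     # weakly relevant (score 2)
pooled = attend(query, fragments)
print(pooled.shape)  # (4,)
```

The masked second fragment contributes nothing to the pooled vector, which is the point of the indicator step.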
Effect of Land Use Change on Soil Carbon Storage over the Last 40 Years in the Shi Yang River Basin, China
Located in China's arid northwest, a region accounting for one quarter of China's land area, the endorheic Shiyang River basin is a vast semi-arid to arid territory. Exploring the impact of changes in land use on this arid area's carbon budget under global warming is a key component of global climate change research. Variation in the region's soil carbon storage due to land use changes between 1973 and 2012 was estimated. The results show that land use change has a significant impact on the soil carbon budget: soil carbon storage decreased by 3.89 Tg between 1973 and 2012. Grassland stored the greatest amount of soil carbon (114.34 Mg ha−1), whereas considerably lower carbon storage occurred in woodland (58.53 Mg ha−1), cropland (26.75 Mg ha−1) and unused land (13.47 Mg ha−1). The transformation of grassland into cropland and the degradation of woodland into grassland have substantially reduced soil carbon storage, suggesting that measures should be adopted to reverse these trends and improve soil productivity.
The Curcumin Analogs 2-Pyridyl Cyclohexanone Induce Apoptosis via Inhibition of the JAK2–STAT3 Pathway in Human Esophageal Squamous Cell Carcinoma Cells
Multiple modifications to the structure of curcumin have been investigated with the aim of improving its potency and biochemical properties. We previously synthesized a series of curcumin analogs. In the present study, the anticancer effect of 2-pyridyl cyclohexanone, one of these analogs, on the esophageal carcinoma Eca109 and EC9706 cell lines and its molecular mechanisms were investigated. 2-Pyridyl cyclohexanone inhibited the proliferation of Eca109 and EC9706 cells by inducing apoptosis, as indicated by morphological changes, externalization of the membrane phospholipid phosphatidylserine, caspase 3 activation and cleavage of poly(ADP-ribose) polymerase. Mechanistic studies indicated that 2-pyridyl cyclohexanone disrupted the mitochondrial membrane potential, disturbed the balance of the Bcl-2 family proteins and triggered apoptosis via the mitochondria-mediated intrinsic pathway. In 2-pyridyl cyclohexanone-treated cells, the phosphorylation levels of JAK2 and STAT3 decreased dose-dependently, while p38 and p-ERK signals were notably activated in a dose-dependent manner. Moreover, we found that the addition of S3I-201, a STAT3 inhibitor, lowered the expression level of Bcl-2 in Eca109 cells. A chromatin immunoprecipitation assay demonstrated that STAT3 bound to the promoter of Bcl-2 in Eca109 cells. Furthermore, mutation of the four STAT3 binding sites (−1733/−1723, −1627/−1617, −807/−797 and −134/−124) on the promoter of the Bcl-2 gene attenuated the transcriptional activation by STAT3. In addition, down-regulation of STAT3 reduced its transcriptional activity on Bcl-2 expression. These data provide a potential molecular mechanism for the apoptosis-inducing function of 2-pyridyl cyclohexanone and emphasize its potential as a therapeutic agent for esophageal squamous cell carcinoma.