99 research outputs found

    A survey on knowledge-enhanced multimodal learning

    Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation. Especially in the area of visiolinguistic (VL) learning, multiple models and techniques have been developed, targeting a variety of tasks that involve images and text. VL models have reached unprecedented performance by extending the idea of Transformers so that both modalities can learn from each other. Massive pre-training procedures enable VL models to acquire a certain level of real-world understanding, although many gaps remain: their limited comprehension of commonsense, factual, temporal and other everyday knowledge calls into question how far VL tasks can be extended. Knowledge graphs and other knowledge sources can fill those gaps by explicitly providing the missing information, unlocking novel capabilities of VL models. At the same time, knowledge graphs enhance the explainability, fairness and validity of decision making, issues of utmost importance for such complex systems. The current survey aims to unify the fields of VL representation learning and knowledge graphs, and provides a taxonomy and analysis of knowledge-enhanced VL models.
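    To make the idea of knowledge enhancement concrete, the following Python sketch shows one naive way external facts could be attached to entities detected in an image before fusion with a VL representation. The `facts` store, the relation names, and the `enrich` helper are purely hypothetical illustrations, not an interface from any surveyed model.

```python
# A toy knowledge store mapping (subject, relation) -> object.
# Real systems would query a knowledge graph such as ConceptNet or Wikidata.
facts = {
    ("umbrella", "used_for"): "shielding from rain",
    ("rain", "implies"): "wet ground",
}

def enrich(entities):
    """Collect external facts about entities detected in an image,
    to be fused with the visual-linguistic representation downstream."""
    return {e: {r: o for (s, r), o in facts.items() if s == e}
            for e in entities}

context = enrich(["umbrella", "rain"])
assert context["umbrella"]["used_for"] == "shielding from rain"
assert context["rain"]["implies"] == "wet ground"
```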

    Unsupervised Deep Learning of Visual Representations

    Interpreting visual signals from complex imagery and video data with few or no human annotations is challenging yet essential for realising the true value of deep learning in real-world scenarios. During the past decade, deep learning has achieved unprecedented breakthroughs in many computer vision fields. Nonetheless, because optimising the large number of parameters in deep neural networks to derive complex mappings from an input visual space to a discriminative feature space demands extensive supervision, the success of deep learning relies heavily on massive amounts of human-annotated training data. Collecting such manual annotations is labour-intensive, especially at the large scale that has proven critical to learning generalisable models applicable to new and unseen data. This dramatically limits the usability and scalability of deep learning in practice. This thesis aims to reduce the reliance of deep neural networks on exhaustive human annotation by proposing novel algorithms that learn the underlying visual semantics from insufficient or inadequate manual labels, denoted generalised unsupervised learning. Based on different assumptions about the available sources of knowledge used for learning, this thesis studies generalised unsupervised deep learning from four perspectives: learning without any labels by knowledge aggregation from local data structure and knowledge discovery from global data structure, transferring knowledge from relevant labels, and propagating knowledge from incomplete labels. Specifically, novel methods are introduced to address unresolved challenges in these problems as follows: Chapter 3 The first problem is aggregating knowledge from local data structure, which assumes that apparent visual similarities (pixel intensity) among images are encoded in local neighbourhoods in a feature representational space, partially revealing the underlying semantic relationships among samples.
This thesis studies discriminative representation learning in this problem, aiming to derive visual features that are discriminative in terms of images' semantic class memberships. This problem is challenging because, without ground-truth labels, it is scarcely possible to accurately determine reliable neighbourhoods encoding the same underlying class concepts, given the arbitrarily complex appearance patterns and variations both within and across classes. Existing methods that learn from hypothetical inter-sample relationships tend to propagate errors, as incorrect pairwise supervision accumulates across training and degrades the learned representations. To that end, this thesis proposes to progressively discover sample-anchored/centred neighbourhoods to reason about and learn the underlying semantic relationships among samples iteratively and cumulatively. Furthermore, a novel progressive affinity diffusion process is presented to propagate reliable inter-sample relationships across adjacent neighbourhoods, so as to further distinguish within-class visual variation from between-class similarity and bridge the gap between low-level imagery appearance (e.g. pixel intensity) and high-level semantic concepts (e.g. object class memberships). Chapter 4 The second problem is discovering knowledge from global data structure, which assumes that visual similarity among samples of the same semantic class is generally higher than that among samples of different classes. This thesis investigates deep clustering for solving this problem, which simultaneously learns visual features and data grouping without any labels. Existing unsupervised deep learning algorithms fail to benefit from joint representation and partition learning, either overlooking global class memberships (e.g. contrastive representation learning) or relying on unreliable pseudo labels estimated from feature representations that are subject to error propagation during training. To benefit image clustering from discriminative visual features derived by a representation learning process, a Semantic Contrastive Learning method is proposed in this thesis, which concurrently optimises both instance visual similarities and cluster decision boundaries to reason about the hypotheses of semantic classes by their consensus. Moreover, based on the observation that assigning visually similar samples to different clusters implicitly reduces both intra-cluster compactness and inter-cluster diversity and leads to lower partition confidence, this thesis presents an online deep clustering method named PartItion Confidence mAximisation. It is built on the idea of learning the most semantically plausible data separation by maximising the “global” partition confidence of the clustering solution using a novel differentiable partition uncertainty index. Chapter 5 The third problem is transferring knowledge from relevant labels, which assumes the availability of manual labels in relevant domains and the existence of common knowledge shared across domains. This thesis studies transfer clustering for this problem, which aims to learn the semantic class memberships of unlabelled data in a novel (target) domain by knowledge transfer from a labelled source domain. Whilst enormous effort has been put into data annotation during the past decade, accumulating knowledge from existing labelled data to help understand the persistently emerging unlabelled data is intuitively more efficient than exhaustively annotating new data.
However, given the unpredictably changing nature of imagery data distributions, accumulated pre-learned knowledge does not transfer well without strong assumptions about the learned source and the novel target domains, as in approaches ranging from domain adaptation to zero-shot and few-shot learning. To address this problem and effectively transfer knowledge between domains that differ in both data distributions and label spaces, this thesis proposes a self-SUPervised REMEdy method that aligns knowledge across domains by learning jointly from the intrinsically available relative (pairwise) imagery information in the unlabelled target domain and the prior knowledge learned from the labelled source domain, so as to benefit from both transfer and self-supervised learning. Chapter 6 The last problem is propagating knowledge from incomplete labels, with the assumption that incomplete labels (e.g. collective or inexact) are usually easier to collect but tend to be less reliable. This thesis investigates video activity localisation for this problem: locating a short moment (video segment) in an untrimmed and unstructured video according to a natural language query. To derive discriminative representations of video segments that accurately match sentences, temporal annotations of the precise start/end frame indices of each target moment are usually required. However, such temporal labels are not only harder to collect than pairing videos with sentences, as they require carefully going through videos frame by frame, but are also subject to labelling uncertainty due to the intrinsic ambiguity of a video activity's boundary. To reduce the annotation cost of deriving universal visual-textual correlations, a Cross-sentence Relations Mining method is introduced in this thesis to align video segments and query sentences when only a paragraph description of the activities in a video (a collective label) is available, without per-sentence temporal labels.
This is accomplished by exploring cross-sentence relationships within a paragraph as constraints to better interpret and match complex moment-wise temporal and semantic relationships in videos. Moreover, this thesis also studies propagating knowledge so as to avoid the negative impact of inexact labels. To that end, an Elastic Moment Bounding method is proposed, which accommodates flexible and adaptive activity temporal boundaries towards modelling universal video-text correlations with tolerance to the underlying temporal uncertainty in pre-fixed human annotations.
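The partition-confidence idea described above can be illustrated with a toy objective. The numpy sketch below is a minimal stand-in that measures confidence as the mean maximum cluster-assignment probability; it is not the thesis's actual differentiable partition uncertainty index, and all names are illustrative.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over cluster logits."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def partition_confidence(logits):
    """Mean maximum cluster-assignment probability over a batch.

    A crude, illustrative stand-in for a partition confidence measure:
    confident (near one-hot) assignments score close to 1, while
    ambiguous assignments score close to 1/K for K clusters.
    """
    p = softmax(logits)
    return p.max(axis=1).mean()

# Confident assignments: each sample strongly prefers one cluster.
confident = np.array([[8.0, 0.0, 0.0], [0.0, 8.0, 0.0]])
# Ambiguous assignments: samples sit exactly between all three clusters.
ambiguous = np.zeros((2, 3))

assert partition_confidence(confident) > partition_confidence(ambiguous)
```

Maximising such a quantity pushes the model towards near one-hot cluster assignments, which is the intuition behind preferring high-confidence partitions.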

    Computational models for image contour grouping

    Contours are one-dimensional curves which may correspond to meaningful entities such as object boundaries. Accurate contour detection simplifies many vision tasks such as object detection and image recognition. Due to the large variety of image content and contour topology, contours are often first detected as edge fragments, followed by a second step known as "contour grouping" to connect them. Owing to ambiguities in local image patches, contour grouping is essential for constructing a globally coherent contour representation. This thesis aims to group contours so that they are consistent with human perception. We draw inspiration from Gestalt principles, which describe the perceptual grouping ability of the human visual system. In particular, our work is most relevant to the principles of closure, similarity, and past experience. The first part of our contribution is a new computational model for contour closure. Most existing contour grouping methods have focused on pixel-wise detection accuracy and ignored the psychological evidence for topological correctness. This chapter proposes a higher-order CRF model to achieve contour closure in the contour domain. We also propose an efficient inference method which is guaranteed to find integer solutions. Tested on the BSDS benchmark, our method achieves superior contour grouping performance, comparable precision-recall curves, and more visually pleasing results. Our work makes progress towards a better computational model of human perceptual grouping. The second part is an energy minimization framework for the salient contour detection problem. Region cues, such as color/texture homogeneity, and contour cues, such as local contrast, are both useful for this task. In order to capture both kinds of cues in a joint energy function, topological consistency between region and contour labels must be satisfied. Our technique makes use of the topological concept of winding numbers.
By using a fast method for winding number computation, we find that a small number of linear constraints is sufficient for label consistency. Our method is instantiated with ratio-based energy functions. Thanks to cue integration, our method obtains improved results. User interaction can also be incorporated to further improve the results. The third part of our contribution is an efficient category-level image contour detector. The objective is to detect contours which most likely belong to a prescribed category. Our method, which is based on three levels of shape representation and non-parametric Bayesian learning, shows flexibility in learning from either human-labeled edge images or unlabelled raw images. In both cases, our experiments obtain better contour detection results than competing methods. In addition, our training process is robust even with a limited number of training samples. In contrast, state-of-the-art methods require more training samples, and often human intervention is required for new category training. Last but not least, in Chapter 7 we also show how to leverage contour information for symmetry detection. Our method is simple yet effective at detecting the symmetric axes of bilaterally symmetric objects in unsegmented natural scene images. Compared with methods based on feature points, our model can often produce better results for images containing limited texture.
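The winding number this abstract relies on is a standard topological quantity and can be computed directly. The Python sketch below is a simple angle-summation implementation for a closed polygon and a query point, not the fast method the thesis proposes:

```python
import numpy as np

def winding_number(polygon, point):
    """Signed winding number of a closed polygon around a point.

    Sums the signed turning angles between consecutive vertices as
    seen from `point`; for a simple counter-clockwise polygon the
    result is 1 for interior points and 0 for exterior points.
    """
    v = np.asarray(polygon, dtype=float) - np.asarray(point, dtype=float)
    angles = np.arctan2(v[:, 1], v[:, 0])
    steps = np.diff(np.append(angles, angles[0]))  # close the loop
    steps = (steps + np.pi) % (2 * np.pi) - np.pi  # wrap into (-pi, pi]
    return int(round(steps.sum() / (2 * np.pi)))

square = [(0, 0), (2, 0), (2, 2), (0, 2)]  # counter-clockwise unit-style square
assert winding_number(square, (1, 1)) == 1  # interior point
assert winding_number(square, (5, 5)) == 0  # exterior point
```

Because the winding number is an integer that changes only when a point crosses a contour, constraints on it can tie region labels to contour labels, which is the consistency role it plays in the joint energy function above.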

    Statistical Parsing by Machine Learning from a Classical Arabic Treebank

    Research into statistical parsing for English has enjoyed over a decade of successful results. However, adapting these models to other languages has met with difficulties. Previous comparative work has shown that Modern Arabic is one of the most difficult languages to parse, due to its rich morphology and free word order. Classical Arabic is the ancient form of Arabic and is understudied in computational linguistics relative to its worldwide reach as the language of the Quran. The thesis is based on seven publications that make significant contributions to knowledge on annotating and parsing Classical Arabic. Classical Arabic has been studied in depth by grammarians for over a thousand years using a traditional grammar known as i’rāb (إعراب). Using this grammar to develop a representation for parsing is challenging, as it describes syntax using a hybrid of phrase-structure and dependency relations. This work aims to advance the state of the art in hybrid parsing by introducing a formal representation for annotation and a resource for machine learning. The main contributions are the first treebank for Classical Arabic and the first statistical dependency-based parser in any language for ellipsis, dropped pronouns and hybrid representations. A central argument of this thesis is that using a hybrid representation closely aligned with traditional grammar leads to improved parsing for Arabic. To test this hypothesis, two approaches are compared. As a reference, a pure dependency parser is adapted using graph transformations, resulting in an 87.47% F1-score. This is compared to an integrated parsing model with an F1-score of 89.03%, demonstrating that joint dependency-constituency parsing is better suited to Classical Arabic. The Quran was chosen for annotation as a large body of work exists providing detailed syntactic analysis. Volunteer crowdsourcing is used for annotation in combination with expert supervision.
A practical result of the annotation effort is the corpus website, http://corpus.quran.com, an educational resource with over two million users per year.

    The significance of silence. Long gaps attenuate the preference for ‘yes’ responses in conversation.

    In conversation, negative responses to invitations, requests, offers and the like more often occur with a delay; conversation analysts describe them as dispreferred. Here we examine the contrasting cognitive load that ‘yes’ and ‘no’ responses impose when given relatively fast (300 ms) or delayed (1000 ms). Participants heard mini-dialogues, with turns extracted from a spoken corpus, while having their EEG recorded. We find that a fast ‘no’ evokes an N400 effect relative to a fast ‘yes’; however, this contrast is not present for delayed responses. This shows that an immediate response is expected to be positive, but this expectation disappears as the response time lengthens, because in ordinary conversation the probability of a ‘no’ increases with delay. Additionally, ‘no’ responses elicit a late frontal positivity both when they are fast and when they are delayed. Thus, regardless of response latency, a ‘no’ is associated with a late positivity, since a negative response is always dispreferred and may require an account. Together these results show that negative responses to social actions exact a higher cognitive load, especially when least expected, as an immediate response.

    31st International Conference on Information Modelling and Knowledge Bases

    Information modelling is becoming an increasingly important topic for researchers, designers, and users of information systems. The amount and complexity of information itself, the number of abstraction levels of information, and the size of databases and knowledge bases are continuously growing. Conceptual modelling is one of the sub-areas of information modelling. The aim of this conference is to bring together experts from different areas of computer science and other disciplines who have a common interest in understanding and solving problems in information modelling and knowledge bases, as well as applying the results of research to practice. We also aim to recognise and study new areas of modelling and knowledge bases to which more attention should be paid. Therefore philosophy and logic, cognitive science, knowledge management, linguistics and management science are relevant areas, too. The conference features three categories of presentations: full papers, short papers and position papers.

    Graph Neural Networks for Natural Language Processing: A Survey

    Deep learning has become the dominant approach to coping with various tasks in Natural Language Processing (NLP). Although text inputs are typically represented as sequences of tokens, there is a rich variety of NLP problems that are best expressed with a graph structure. As a result, there is a surge of interest in developing new deep learning techniques on graphs for a large number of NLP tasks. In this survey, we present a comprehensive overview of Graph Neural Networks (GNNs) for Natural Language Processing. We propose a new taxonomy of GNNs for NLP, which systematically organizes existing research on GNNs for NLP along three axes: graph construction, graph representation learning, and graph-based encoder-decoder models. We further introduce a large number of NLP applications that exploit the power of GNNs and summarize the corresponding benchmark datasets, evaluation metrics, and open-source code. Finally, we discuss various outstanding challenges in making full use of GNNs for NLP, as well as future research directions. To the best of our knowledge, this is the first comprehensive overview of Graph Neural Networks for Natural Language Processing.
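    As an illustration of graph representation learning, one of the survey's three axes, the following numpy sketch implements a single graph-convolution layer over a toy token graph. The symmetric normalisation follows the common GCN recipe; the toy adjacency, features, and weights are purely illustrative.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution step: aggregate neighbour features
    through a symmetrically normalised adjacency, then project.

    A: (n, n) adjacency matrix, H: (n, d_in) node features,
    W: (d_in, d_out) learnable projection.
    """
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    msg = d_inv_sqrt @ A_hat @ d_inv_sqrt @ H   # normalised aggregation
    return np.maximum(msg @ W, 0.0)             # ReLU non-linearity

# Toy dependency graph over three tokens: edges 0-1 and 1-2.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.eye(3)                                   # one-hot token features
W = np.ones((3, 2))                             # toy weights
out = gcn_layer(A, H, W)
assert out.shape == (3, 2)
assert np.all(out >= 0)                         # ReLU output is non-negative
```

Stacking such layers lets each token's representation absorb information from increasingly distant neighbours in the constructed graph, which is what makes GNNs attractive for graph-shaped NLP problems such as dependency-aware encoding.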