
    NCGNN: Node-level Capsule Graph Neural Network

    Message passing has evolved as an effective tool for designing Graph Neural Networks (GNNs). However, most existing works naively sum or average all the neighboring features to update node representations, which suffers from the following limitations: (1) lack of interpretability to identify node features crucial to the GNN's prediction; (2) the over-smoothing issue, where repeated averaging aggregates excessive noise, making features of nodes in different classes over-mixed and thus indistinguishable. In this paper, we propose the Node-level Capsule Graph Neural Network (NCGNN) to address these issues with an improved message passing scheme. Specifically, NCGNN represents nodes as groups of capsules, in which each capsule extracts distinctive features of its corresponding node. For each node-level capsule, a novel dynamic routing procedure is developed to adaptively select appropriate capsules for aggregation from a subgraph identified by the designed graph filter. Consequently, as only the advantageous capsules are aggregated and harmful noise is restrained, over-mixing of the features of interacting nodes in different classes tends to be avoided, relieving the over-smoothing issue. Furthermore, since the graph filter and the dynamic routing identify the subgraph and the subset of node features most influential for the model's prediction, NCGNN is inherently interpretable and exempt from complex post-hoc explanations. Extensive experiments on six node classification benchmarks demonstrate that NCGNN can well address the over-smoothing issue and outperforms state-of-the-art methods by producing better node embeddings for classification.
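
    The abstract does not spell out NCGNN's routing equations, so the following is a minimal sketch of the classic dynamic routing-by-agreement procedure (Sabour et al., 2017) that node-level capsule routing builds on; the shapes, names, and squash nonlinearity here are illustrative assumptions, not NCGNN's exact design.

        # Minimal routing-by-agreement sketch (generic capsule routing, not NCGNN-specific).
        import torch
        import torch.nn.functional as F

        def squash(s, dim=-1, eps=1e-8):
            """Shrink vectors so their norm lies in [0, 1) while keeping direction."""
            sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
            return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

        def dynamic_routing(u_hat, num_iters=3):
            """u_hat: [n_in, n_out, dim] prediction vectors (votes).
            Returns output capsules of shape [n_out, dim]."""
            n_in, n_out, _ = u_hat.shape
            b = torch.zeros(n_in, n_out)                   # routing logits
            for _ in range(num_iters):
                c = F.softmax(b, dim=1)                    # each input capsule spreads its vote
                s = (c.unsqueeze(-1) * u_hat).sum(0)       # weighted sum of predictions
                v = squash(s)                              # candidate output capsules
                b = b + (u_hat * v.unsqueeze(0)).sum(-1)   # raise logits where votes agree
            return v

        # toy usage: 8 input capsules voting for 4 output capsules of dimension 16
        votes = torch.randn(8, 4, 16)
        print(dynamic_routing(votes).shape)                # torch.Size([4, 16])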

    Object-Centric Learning with Slot Attention

    Learning object-centric representations of complex scenes is a promising step towards enabling efficient abstract reasoning from low-level perceptual features. Yet, most deep learning approaches learn distributed representations that do not capture the compositional properties of natural scenes. In this paper, we present the Slot Attention module, an architectural component that interfaces with perceptual representations, such as the output of a convolutional neural network, and produces a set of task-dependent abstract representations which we call slots. These slots are exchangeable and can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention. We empirically demonstrate that Slot Attention can extract object-centric representations that enable generalization to unseen compositions when trained on unsupervised object discovery and supervised property prediction tasks.
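
    To make the competitive procedure concrete, here is a compact sketch of the Slot Attention iteration: attention is normalized over the slot axis, so slots compete for input features, and slots are then updated with a GRU. The full module also samples slot initializations from a learned Gaussian and applies LayerNorms plus a residual MLP; those details are simplified away here.

        # Simplified Slot Attention iteration (learned-parameter slot init; no LayerNorm/MLP).
        import torch
        import torch.nn as nn

        class SlotAttentionSketch(nn.Module):
            def __init__(self, dim, num_slots=4, iters=3):
                super().__init__()
                self.iters, self.scale = iters, dim ** -0.5
                self.slot_mu = nn.Parameter(torch.randn(1, num_slots, dim))
                self.to_q = nn.Linear(dim, dim)
                self.to_k = nn.Linear(dim, dim)
                self.to_v = nn.Linear(dim, dim)
                self.gru = nn.GRUCell(dim, dim)

            def forward(self, inputs):                     # inputs: [B, N, dim]
                B, N, D = inputs.shape
                k, v = self.to_k(inputs), self.to_v(inputs)
                slots = self.slot_mu.expand(B, -1, -1)
                for _ in range(self.iters):
                    q = self.to_q(slots)
                    logits = torch.einsum('bnd,bsd->bns', k, q) * self.scale
                    attn = logits.softmax(dim=-1)          # softmax over slots: competition
                    attn = attn / attn.sum(dim=1, keepdim=True).clamp(min=1e-8)
                    updates = torch.einsum('bns,bnd->bsd', attn, v)  # weighted mean of inputs
                    slots = self.gru(updates.reshape(-1, D),
                                     slots.reshape(-1, D)).view(B, -1, D)
                return slots                               # [B, num_slots, dim]

        feats = torch.randn(2, 36, 64)                     # e.g., a flattened 6x6 CNN feature map
        print(SlotAttentionSketch(64)(feats).shape)        # torch.Size([2, 4, 64])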

    Algorithms and Applications of Novel Capsule Networks

    Convolutional neural networks, despite their profound impact in countless domains, suffer from significant shortcomings. Linearly-combined scalar feature representations and max pooling operations lead to spatial ambiguities and a lack of robustness to pose variations. Capsule networks can potentially alleviate these issues by storing and routing the pose information of extracted features through their architectures, seeking agreement between the lower-level predictions of higher-level poses at each layer. In this dissertation, we make several key contributions to advance the algorithms of capsule networks in segmentation and classification applications. We create the first capsule-based segmentation network in the literature, SegCaps, by introducing a novel locally-constrained dynamic routing algorithm, transformation matrix sharing, the concept of a deconvolutional capsule, an extension of reconstruction regularization to segmentation, and a new encoder-decoder capsule architecture. Following this, we design a capsule-based diagnosis network, D-Caps, which builds off SegCaps and introduces a novel capsule-average pooling technique to handle larger medical imaging data. Finally, we design an explainable capsule network, X-Caps, which encodes high-level visual object attributes within its capsules by utilizing a multi-task framework and a novel routing sigmoid function that independently routes information from child capsules to parents. Predictions come with human-level explanations, via object attributes, and a confidence score, obtained by training our network directly on the distribution of expert labels, modeling inter-observer agreement and penalizing over- and under-confidence during training. This body of work constitutes significant algorithmic advances in the application of capsule networks, especially to real-world biomedical imaging data.
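
    The capsule-average pooling used by D-Caps is described only at a high level in the abstract; one plausible reading, sketched below, averages each capsule type's pose vectors over spatial positions, shrinking resolution while preserving per-type pose information. The function name and tensor layout are assumptions for illustration.

        # Hypothetical capsule-average pooling: spatial averaging that keeps capsule structure.
        import torch
        import torch.nn.functional as F

        def capsule_average_pool(poses, kernel=2, stride=2):
            """poses: [B, types, dim, H, W] -> pooled poses [B, types, dim, H', W']."""
            B, T, D, H, W = poses.shape
            flat = poses.reshape(B, T * D, H, W)           # fold capsule axes into channels
            pooled = F.avg_pool2d(flat, kernel, stride)    # average over space only
            return pooled.reshape(B, T, D, *pooled.shape[-2:])

        x = torch.randn(1, 8, 16, 32, 32)                  # 8 capsule types with 16-dim poses
        print(capsule_average_pool(x).shape)               # torch.Size([1, 8, 16, 16, 16])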

    Towards unified visual perception

    This thesis explores the frontier of visual perception in computer vision by leveraging the capabilities of Vision Transformers (ViTs) to create a unified framework that addresses cross-task and cross-granularity challenges. Drawing inspiration from the human visual system's ability to process visual information at varying levels of detail and the success of Transformers in Natural Language Processing (NLP), we aim to bridge the gap between broad visual concepts and their fine-grained counterparts. Our investigation is structured into three parts. First, we delve into a range of training methods and architectures for ViTs, with the goal of gathering insights to guide the optimization of ViTs in the subsequent phase of our research, building a strong foundation for enhancing their performance in complex visual tasks. Second, our focus shifts towards the recognition of fine-grained visual concepts, employing precise annotations to delve deeper into the intricate details of visual scenes. Here, we tackle the challenge of discerning and classifying objects and pixels with high accuracy, leveraging the foundational insights gained from our initial explorations of ViTs. In the final part of the thesis, we demonstrate how language can serve as a bridge, enabling vision-language models that are trained only to recognize whole images to handle countless visual concepts on fine-grained entities, such as objects and pixels, without the need for fine-tuning.
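
    The language-as-bridge idea in the final part rests on the now-standard contrastive vision-language recipe, in which recognition reduces to comparing an image embedding against text embeddings of class prompts with no fine-tuning. A minimal sketch follows; encode_text and the prompt template are stand-ins for a pretrained encoder, not an API from the thesis.

        # Zero-shot classification via embedding similarity (generic sketch, assumed names).
        import torch
        import torch.nn.functional as F

        def zero_shot_classify(image_emb, class_names, encode_text):
            """Rank class names by cosine similarity to an image embedding."""
            prompts = [f"a photo of a {name}" for name in class_names]
            text_embs = torch.stack([encode_text(p) for p in prompts])
            sims = F.cosine_similarity(image_emb.unsqueeze(0), text_embs, dim=-1)
            probs = (100.0 * sims).softmax(dim=0)          # temperature-scaled scores
            return dict(zip(class_names, probs.tolist()))

        # toy stand-ins so the sketch runs end to end
        dummy_encode_text = lambda p: torch.randn(512)
        image_emb = torch.randn(512)
        print(zero_shot_classify(image_emb, ["cat", "dog", "car"], dummy_encode_text))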

    Enhanced Capsule-based Networks and Their Applications

    Current deep models have achieved human-like accuracy in many computer vision tasks, sometimes even surpassing humans. However, these deep models still suffer from significant weaknesses. To name a few, it is hard to interpret how they reach decisions, and it is easy to attack them with tiny perturbations. A capsule, usually implemented as a vector, represents an object or object part. Capsule networks and GLOM consist of classic and generalized capsules respectively, where the difference is whether a capsule is limited to representing a fixed thing. Both models are designed to parse their input into a part-whole hierarchy as humans do, where each capsule corresponds to an entity of the hierarchy. That is, the first layer finds the lowest-level vision patterns, and the following layers assemble progressively larger patterns up to the entire object, e.g., from nostril to nose, face, and person. This design gives capsule networks and GLOM the potential to solve the above problems of current deep models by mimicking how humans overcome them with the part-whole hierarchy. However, the current implementations fall short of this potential and require further improvements, including intrinsic interpretability, guaranteed equivariance, robustness to adversarial attacks, a more efficient routing algorithm, compatibility with other models, etc. In this dissertation, I first briefly introduce the motivations, essential ideas, and existing implementations of capsule networks and GLOM, then focus on addressing some limitations of these implementations. The improvements are summarized as follows. First, a fast non-iterative routing algorithm is proposed for capsule networks, which facilitates their application to many tasks such as image classification and segmentation. Second, a new architecture, named Twin-Islands, is proposed based on GLOM, which achieves many desired properties such as equivariance, model interpretability, and adversarial robustness. Lastly, the essential idea of capsule networks and GLOM is re-implemented in a small group ensemble block, which can also be used alongside other types of neural networks, e.g., CNNs, on various tasks such as image classification, segmentation, and retrieval.
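
    The abstract does not specify the proposed fast non-iterative routing algorithm, so the sketch below shows a generic single-pass, attention-style alternative to iterative routing purely for orientation: agreement weights are computed once from vote/estimate similarity rather than refined over several iterations. This is not the dissertation's algorithm.

        # Generic one-shot (non-iterative) routing sketch: a single agreement-weighting pass.
        import torch
        import torch.nn.functional as F

        def one_shot_routing(u_hat):
            """u_hat: [n_in, n_out, dim] votes -> output capsules [n_out, dim]."""
            s = u_hat.mean(dim=0)                          # initial estimate: uniform average
            logits = (u_hat * s.unsqueeze(0)).sum(-1)      # agreement of each vote with estimate
            c = F.softmax(logits, dim=1)                   # single weighting pass, no loop
            return (c.unsqueeze(-1) * u_hat).sum(0)

        votes = torch.randn(8, 4, 16)
        print(one_shot_routing(votes).shape)               # torch.Size([4, 16])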

    Capsule Networks for Video Understanding

    With the increase of videos available online, it is more important than ever to learn how to process and understand video data. Although convolutional neural networks have revolutionized representation learning from images and videos, they do not explicitly model entities within the given input. It would be useful for learned models to be able to represent part-to-whole relationships within a given image or video. To this end, a novel neural network architecture - capsule networks - has been proposed. Capsule networks add extra structure to allow for the modeling of entities and have shown great promise when applied to image data. By grouping neural activations and propagating information from one layer to the next through a routing-by-agreement procedure, capsule networks are able to learn part-to-whole relationships as well as robust object representations. In this dissertation, we explore how capsule networks can be generalized to video and used to effectively solve several video understanding problems. First, we generalize capsule networks from the image domain so that they can process 3-dimensional video data. Our proposed video capsule network (VideoCapsuleNet) tackles the problem of video action detection. We introduce capsule-pooling in the convolutional capsule layer to make the voting algorithm tractable in the 3-dimensional video domain. The network's routing-by-agreement inherently models the action representations, and various action characteristics are captured by the predicted capsules. We show that VideoCapsuleNet is able to successfully produce pixel-wise localizations of actions present in videos. While action detection only requires a coarse localization, we show that video capsule networks can generate fine-grained segmentations. To that end, we propose a capsule-based approach for video object segmentation, CapsuleVOS, which can segment several frames at once conditioned on a reference frame and segmentation mask. This conditioning is performed through a novel routing algorithm for attention-based efficient capsule selection. We address two challenging issues in video object segmentation: segmentation of small objects and occlusion of objects across time. The first issue is addressed with a zooming module; the second is dealt with by a novel memory module based on recurrent neural networks. The above shows that capsule networks can effectively localize actors and objects within videos. Next, we address the problem of integrating video and text for the task of actor and action video segmentation from a sentence. We propose a novel capsule-based approach to perform pixel-level localization based on a natural language query describing the actor of interest. We encode both the video and textual input in the form of capsules, and propose a visual-textual routing mechanism for the fusion of these capsules to successfully localize the actor and action within all frames of a video. The previous works are all fully supervised: they are trained on manually annotated data, which is often time-consuming and costly to acquire. Finally, we propose a novel method for self-supervised learning which does not rely on manually annotated data. We present a capsule network that jointly learns high-level concepts and their relationships across different low-level multimodal (video, audio, and text) input representations. To adapt the capsules to large-scale input data, we propose a routing-by-self-attention mechanism that selects relevant capsules, which are then used to generate a final joint multimodal feature representation. This allows us to learn robust representations from noisy video data and to scale up the size of the capsule network compared to traditional routing methods while remaining computationally efficient.
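
    As a rough illustration of routing by self-attention for capsule selection, the sketch below scores candidate capsules by the attention they receive from their peers and lets only the top-scoring ones form the joint representation. The top-k selection rule and all names here are assumptions, not the dissertation's exact mechanism.

        # Hypothetical self-attention routing: keep only the most-attended capsules.
        import torch
        import torch.nn.functional as F

        def self_attention_routing(capsules, k=4):
            """capsules: [n, dim] -> joint feature [dim] from the k most relevant capsules."""
            n, d = capsules.shape
            attn = F.softmax(capsules @ capsules.T / d ** 0.5, dim=-1)  # pairwise attention
            relevance = attn.sum(dim=0)                    # attention each capsule receives
            idx = relevance.topk(k).indices                # select the most-attended capsules
            weights = F.softmax(relevance[idx], dim=0)
            return (weights.unsqueeze(-1) * capsules[idx]).sum(0)

        caps = torch.randn(12, 64)                         # e.g., pooled video/audio/text capsules
        print(self_attention_routing(caps).shape)          # torch.Size([64])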