2,517 research outputs found

    Attending Category Disentangled Global Context for Image Classification

    Full text link
    In this paper, we propose a general framework for image classification using the attention mechanism and global context, which could incorporate with various network architectures to improve their performance. To investigate the capability of the global context, we compare four mathematical models and observe the global context encoded in the category disentangled conditional generative model could give more guidance as "know what is task irrelevant will also know what is relevant". Based on this observation, we define a novel Category Disentangled Global Context (CDGC) and devise a deep network to obtain it. By attending CDGC, the baseline networks could identify the objects of interest more accurately, thus improving the performance. We apply the framework to many different network architectures and compare with the state-of-the-art on four publicly available datasets. Extensive results validate the effectiveness and superiority of our approach. Code will be made public upon paper acceptance.Comment: Under revie

    Learning with Limited Labeled Data in Biomedical Domain by Disentanglement and Semi-Supervised Learning

    Get PDF
    In this dissertation, we are interested in improving the generalization of deep neural networks for biomedical data (e.g., electrocardiogram signal, x-ray images, etc). Although deep neural networks have attained state-of-the-art performance and, thus, deployment across a variety of domains, similar performance in the clinical setting remains challenging due to its ineptness to generalize across unseen data (e.g., new patient cohort). We address this challenge of generalization in the deep neural network from two perspectives: 1) learning disentangled representations from the deep network, and 2) developing efficient semi-supervised learning (SSL) algorithms using the deep network. In the former, we are interested in designing specific architectures and objective functions to learn representations, where variations in the data are well separated, i.e., disentangled. In the latter, we are interested in designing regularizers that encourage the underlying neural function\u27s behavior toward a common inductive bias to avoid over-fitting the function to small labeled data. Our end goal is to improve the generalization of the deep network for the diagnostic model in both of these approaches. In disentangled representations, this translates to appropriately learning latent representations from the data, capturing the observed input\u27s underlying explanatory factors in an independent and interpretable way. With data\u27s expository factors well separated, such disentangled latent space can then be useful for a large variety of tasks and domains within data distribution even with a small amount of labeled data, thus improving generalization. In developing efficient semi-supervised algorithms, this translates to utilizing a large volume of the unlabelled dataset to assist the learning from the limited labeled dataset, commonly encountered situation in the biomedical domain. By drawing ideas from different areas within deep learning like representation learning (e.g., autoencoder), variational inference (e.g., variational autoencoder), Bayesian nonparametric (e.g., beta-Bernoulli process), learning theory (e.g., analytical learning theory), function smoothing (Lipschitz Smoothness), etc., we propose several leaning algorithms to improve generalization in the associated task. We test our algorithms on real-world clinical data and show that our approach yields significant improvement over existing methods. Moreover, we demonstrate the efficacy of the proposed models in the benchmark data and simulated data to understand different aspects of the proposed learning methods. We conclude by identifying some of the limitations of the proposed methods, areas of further improvement, and broader future directions for the successful adoption of AI models in the clinical environment

    On representation learning for generative models of text

    Full text link
    Cette thèse fait des petits pas dans la construction et la compréhension des systèmes d'apprentissage des représentations neuronales et des modèles génératifs pour le traitement du langage naturel. Il est présenté comme une thèse par article qui contient quatre travaux. Dans le premier article, nous montrons que l'apprentissage multi-tâches peut être utilisé pour combiner les biais inductifs de plusieurs tâches d'apprentissage auto-supervisées et supervisées pour apprendre des représentations de phrases distribuées de longueur fixe à usage général qui obtiennent des résultats solides sur les tâches d'apprentissage par transfert en aval sans tout modèle de réglage fin. Le deuxième article s'appuie sur le premier et présente un modèle génératif en deux étapes pour le texte qui modélise la distribution des représentations de phrases pour produire de nouveaux plongements de phrases qui servent de "contour neuronal" de haut niveau qui est reconstruit en mots avec un récurrent neuronal autorégressif conditionnel décodeur. Le troisième article étudie la nécessité de représentations démêlées pour la génération de texte contrôlable. Une grande partie des systèmes de génération de texte contrôlables reposent sur l'idée que le contrôle d'un attribut (ou d'un style) particulier nécessite la construction de représentations dissociées qui séparent le contenu et le style. Nous démontrons que les représentations produites dans des travaux antérieurs qui utilisent la formation contradictoire du domaine ne sont pas dissociées dans la pratique. Nous présentons ensuite une approche qui ne vise pas à apprendre des représentations démêlées et montrons qu'elle permet d'obtenir des résultats nettement meilleurs que les travaux antérieurs. Dans le quatrième article, nous concevons des modèles de langage de transformateur qui apprennent les représentations à plusieurs échelles de temps et montrent que ceux-ci peuvent aider à réduire l'empreinte mémoire importante de ces modèles. Il présente trois architectures multi-échelles différentes qui présentent des compromis favorables entre la perplexité et l'empreinte mémoire.This thesis takes baby steps in building and understanding neural representation learning systems and generative models for natural language processing. It is presented as a thesis by article that contains four pieces of work. In the first article, we show that multi-task learning can be used to combine the inductive biases of several self-supervised and supervised learning tasks to learn general-purpose fixed-length distributed sentence representations that achieve strong results on downstream transfer learning tasks without any model fine-tuning. The second article builds on the first and presents a two-step generative model for text that models the distribution of sentence representations to produce novel sentence embeddings that serves as a high level ``neural outline'' that is reconstructed to words with a conditional autoregressive RNN decoder. The third article studies the necessity of disentangled representations for controllable text generation. A large fraction of controllable text generation systems rely on the idea that control over a particular attribute (or style) requires building disentangled representations that separate content and style. We demonstrate that representations produced in previous work that uses domain adversarial training are not disentangled in practice. We then present an approach that does not aim to learn disentangled representations and show that it achieves significantly better results than prior work. In the fourth article, we design transformer language models that learn representations at multiple time scales and show that these can help address the large memory footprint these models typically have. It presents three different multi-scale architectures that exhibit favorable perplexity vs memory footprint trade-offs

    Causal Intersectionality and Dual Form of Gradient Descent for Multimodal Analysis: a Case Study on Hateful Memes

    Full text link
    In the wake of the explosive growth of machine learning (ML) usage, particularly within the context of emerging Large Language Models (LLMs), comprehending the semantic significance rooted in their internal workings is crucial. While causal analyses focus on defining semantics and its quantification, the gradient-based approach is central to explainable AI (XAI), tackling the interpretation of the black box. By synergizing these approaches, the exploration of how a model's internal mechanisms illuminate its causal effect has become integral for evidence-based decision-making. A parallel line of research has revealed that intersectionality - the combinatory impact of multiple demographics of an individual - can be structured in the form of an Averaged Treatment Effect (ATE). Initially, this study illustrates that the hateful memes detection problem can be formulated as an ATE, assisted by the principles of intersectionality, and that a modality-wise summarization of gradient-based attention attribution scores can delineate the distinct behaviors of three Transformerbased models concerning ATE. Subsequently, we show that the latest LLM LLaMA2 has the ability to disentangle the intersectional nature of memes detection in an in-context learning setting, with their mechanistic properties elucidated via meta-gradient, a secondary form of gradient. In conclusion, this research contributes to the ongoing dialogue surrounding XAI and the multifaceted nature of ML models

    Attention Mechanisms in Computer Vision: A Survey

    Full text link
    Humans can naturally and effectively find salient regions in complex scenes. Motivated by this observation, attention mechanisms were introduced into computer vision with the aim of imitating this aspect of the human visual system. Such an attention mechanism can be regarded as a dynamic weight adjustment process based on features of the input image. Attention mechanisms have achieved great success in many visual tasks, including image classification, object detection, semantic segmentation, video understanding, image generation, 3D vision, multi-modal tasks and self-supervised learning. In this survey, we provide a comprehensive review of various attention mechanisms in computer vision and categorize them according to approach, such as channel attention, spatial attention, temporal attention and branch attention; a related repository https://github.com/MenghaoGuo/Awesome-Vision-Attentions is dedicated to collecting related work. We also suggest future directions for attention mechanism research.Comment: 27 pages, 9 figure

    Humans in 4D: Reconstructing and Tracking Humans with Transformers

    Full text link
    We present an approach to reconstruct humans and track them over time. At the core of our approach, we propose a fully "transformerized" version of a network for human mesh recovery. This network, HMR 2.0, advances the state of the art and shows the capability to analyze unusual poses that have in the past been difficult to reconstruct from single images. To analyze video, we use 3D reconstructions from HMR 2.0 as input to a tracking system that operates in 3D. This enables us to deal with multiple people and maintain identities through occlusion events. Our complete approach, 4DHumans, achieves state-of-the-art results for tracking people from monocular video. Furthermore, we demonstrate the effectiveness of HMR 2.0 on the downstream task of action recognition, achieving significant improvements over previous pose-based action recognition approaches. Our code and models are available on the project website: https://shubham-goel.github.io/4dhumans/.Comment: Project Webpage: https://shubham-goel.github.io/4dhumans
    • …
    corecore