3,006 research outputs found

    From Word to Sense Embeddings: A Survey on Vector Representations of Meaning

    Get PDF
    Over the past years, distributed semantic representations have proved to be effective and flexible keepers of prior knowledge to be integrated into downstream applications. This survey focuses on the representation of meaning. We start from the theoretical background behind word vector space models and highlight one of their major limitations: the meaning conflation deficiency, which arises from representing a word with all its possible meanings as a single vector. Then, we explain how this deficiency can be addressed through a transition from the word level to the more fine-grained level of word senses (in its broader acceptation) as a method for modelling unambiguous lexical meaning. We present a comprehensive overview of the wide range of techniques in the two main branches of sense representation, i.e., unsupervised and knowledge-based. Finally, this survey covers the main evaluation procedures and applications for this type of representation, and provides an analysis of four of its important aspects: interpretability, sense granularity, adaptability to different domains and compositionality.Comment: 46 pages, 8 figures. Published in Journal of Artificial Intelligence Researc

    ClusterFormer: Clustering As A Universal Visual Learner

    Full text link
    This paper presents CLUSTERFORMER, a universal vision model that is based on the CLUSTERing paradigm with TransFORMER. It comprises two novel designs: 1. recurrent cross-attention clustering, which reformulates the cross-attention mechanism in Transformer and enables recursive updates of cluster centers to facilitate strong representation learning; and 2. feature dispatching, which uses the updated cluster centers to redistribute image features through similarity-based metrics, resulting in a transparent pipeline. This elegant design streamlines an explainable and transferable workflow, capable of tackling heterogeneous vision tasks (i.e., image classification, object detection, and image segmentation) with varying levels of clustering granularity (i.e., image-, box-, and pixel-level). Empirical results demonstrate that CLUSTERFORMER outperforms various well-known specialized architectures, achieving 83.41% top-1 acc. over ImageNet-1K for image classification, 54.2% and 47.0% mAP over MS COCO for object detection and instance segmentation, 52.4% mIoU over ADE20K for semantic segmentation, and 55.8% PQ over COCO Panoptic for panoptic segmentation. For its efficacy, we hope our work can catalyze a paradigm shift in universal models in computer vision

    Viewfinder: final activity report

    Get PDF
    The VIEW-FINDER project (2006-2009) is an 'Advanced Robotics' project that seeks to apply a semi-autonomous robotic system to inspect ground safety in the event of a fire. Its primary aim is to gather data (visual and chemical) in order to assist rescue personnel. A base station combines the gathered information with information retrieved from off-site sources. The project addresses key issues related to map building and reconstruction, interfacing local command information with external sources, human-robot interfaces and semi-autonomous robot navigation. The VIEW-FINDER system is a semi-autonomous; the individual robot-sensors operate autonomously within the limits of the task assigned to them, that is, they will autonomously navigate through and inspect an area. Human operators monitor their operations and send high level task requests as well as low level commands through the interface to any nodes in the entire system. The human interface has to ensure the human supervisor and human interveners are provided a reduced but good and relevant overview of the ground and the robots and human rescue workers therein

    Visual-Semantic Learning

    Get PDF
    Visual-semantic learning is an attractive and challenging research direction aiming to understand complex semantics of heterogeneous data from two domains, i.e., visual signals (i.e., images and videos) and natural language (i.e., captions and questions). It requires memorizing the rich information in a single modality and a joint comprehension of multiple modalities. Artificial intelligence (AI) systems with human-level intelligence are claimed to learn like humans, such as efficiently leveraging brain memory for better comprehension, rationally incorporating common-sense knowledge into reasoning, quickly gaining in-depth understanding given a few samples, and analyzing relationships among abundant and informative events. However, these intelligence capacities are effortless for humans but challenging for machines. To bridge the discrepancy between human-level intelligence and present-day visual-semantic learning, we start from its basic understanding ability by studying the visual question answering (e.g., Image-QA and Video-QA) tasks from the perspectives of memory augmentation and common-sense knowledge incorporation. Furthermore, we stretch it to a more challenging situation with limited and partially unlabeled training data (i.e., Few-shot Visual-Semantic Learning) to imitate the fast learning ability of humans. Finally, to further enhance visual-semantic performance in natural videos with numerous spatio-temporal dynamics, we investigate exploiting event-correlated information for a comprehensive understanding of cross-modal semantics. To study the essential visual-semantic understanding ability of the human brain with memory, we first propose a novel Memory Augmented Deep Recurrent Neural Network (i.e., MA-DRNN) model for Video-QA, which features a new method for encoding videos and questions, and memory augmentation using the emerging Differentiable Neural Computer (i.e., DNC). Specifically, we encode semantic (i.e., questions) information before visual (i.e., videos) information, which leads to better visual-semantic representations. Moreover, we leverage Differentiable Neural Computer (with external memory) to store and retrieve valuable information in questions and videos and model the long-term visual-semantic dependency. In addition to basic understanding, to tackle visual-semantic reasoning that requires external knowledge beyond visible contents (e.g., KB-Image-QA), we propose a novel framework that endows the model with capabilities of answering more general questions and achieves better exploitation of external knowledge through generating Multiple Clues for Reasoning with Memory Neural Networks (i.e., MCR-MemNN). Specifically, a well-defined detector is adopted to predict image-question-related relation phrases, each delivering two complementary clues to retrieve the supporting facts from an external knowledge base (i.e., KB). These facts are encoded into a continuous embedding space using a content-addressable memory. Afterward, mutual interactions between visual-semantic representation and the supporting facts stored in memory are captured to distill the most relevant information in three modalities (i.e., image, question, and KB). Finally, the optimal answer is predicted by choosing the supporting fact with the highest score. Furthermore, to enable a fast, in-depth understanding given a small number of samples, especially with heterogeneity in the multi-modal scenarios such as image question answering (i.e., Image-QA) and image captioning (i.e., IC), we study the few-shot visual-semantic learning and present the Hierarchical Graph ATtention Network (i.e., HGAT). This two-stage network models the intra- and inter-modal relationships with limited image-text samples. The main contributions of HGAT can be summarized as follows: 1) it sheds light on tackling few-shot multi-modal learning problems, which focuses primarily, but not exclusively, on visual and semantic modalities, through better exploitation of the intra-relationship of each modality and an attention-based co-learning framework between modalities using a hierarchical graph-based architecture; 2) it achieves superior performance on both visual question answering and image captioning in the few-shot setting; 3) it can be easily extended to the semi-supervised setting where image-text samples are partially unlabeled. Although various attention mechanisms have been utilized to manage contextualized representations by modeling intra- and inter-modal relationships of the two modalities, one limitation of the predominant visual-semantic methods is the lack of reasoning with event correlation, sensing, and analyzing relationships among abundant and informative events contained in the video. To this end, we introduce the dense caption modality as a new auxiliary and distill event-correlated information to infer the correct answer. We propose a novel end-to-end trainable model, Event-Correlated Graph Neural Networks (EC-GNNs), to perform cross-modal reasoning over information from the three modalities (i.e., caption, video, and question). Besides exploiting a new modality, we employ cross-modal reasoning modules to explicitly model inter-modal relationships and aggregate relevant information across different modalities. We propose a question-guided self-adaptive multi-modal fusion module to collect the question-oriented and event-correlated evidence through multi-step reasoning. To evaluate our proposed models, we conduct extensive experiments on VTW, MSVD-QA, and TGIF-QA datasets for Video-QA task, Toronto COCO-QA, Visual Genome-QA datasets for few-shot Image-QA task, COCO-FITB dataset for few-shot IC task, and FVQA, Visual7W + ConceptNet datasets for KB-Image-QA task. The experimental results justify these models’ effectiveness and superiority over baseline methods

    Validação de heterogeneidade estrutural em dados de Crio-ME por comitês de agrupadores

    Get PDF
    Orientadores: Fernando José Von Zuben, Rodrigo Villares PortugalDissertação (mestrado) - Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de ComputaçãoResumo: Análise de Partículas Isoladas é uma técnica que permite o estudo da estrutura tridimensional de proteínas e outros complexos macromoleculares de interesse biológico. Seus dados primários consistem em imagens de microscopia eletrônica de transmissão de múltiplas cópias da molécula em orientações aleatórias. Tais imagens são bastante ruidosas devido à baixa dose de elétrons utilizada. Reconstruções 3D podem ser obtidas combinando-se muitas imagens de partículas em orientações similares e estimando seus ângulos relativos. Entretanto, estados conformacionais heterogêneos frequentemente coexistem na amostra, porque os complexos moleculares podem ser flexíveis e também interagir com outras partículas. Heterogeneidade representa um desafio na reconstrução de modelos 3D confiáveis e degrada a resolução dos mesmos. Entre os algoritmos mais populares usados para classificação estrutural estão o agrupamento por k-médias, agrupamento hierárquico, mapas autoorganizáveis e estimadores de máxima verossimilhança. Tais abordagens estão geralmente entrelaçadas à reconstrução dos modelos 3D. No entanto, trabalhos recentes indicam ser possível inferir informações a respeito da estrutura das moléculas diretamente do conjunto de projeções 2D. Dentre estas descobertas, está a relação entre a variabilidade estrutural e manifolds em um espaço de atributos multidimensional. Esta dissertação investiga se um comitê de algoritmos de não-supervisionados é capaz de separar tais "manifolds conformacionais". Métodos de "consenso" tendem a fornecer classificação mais precisa e podem alcançar performance satisfatória em uma ampla gama de conjuntos de dados, se comparados a algoritmos individuais. Nós investigamos o comportamento de seis algoritmos de agrupamento, tanto individualmente quanto combinados em comitês, para a tarefa de classificação de heterogeneidade conformacional. A abordagem proposta foi testada em conjuntos sintéticos e reais contendo misturas de imagens de projeção da proteína Mm-cpn nos estados "aberto" e "fechado". Demonstra-se que comitês de agrupadores podem fornecer informações úteis na validação de particionamentos estruturais independetemente de algoritmos de reconstrução 3DAbstract: Single Particle Analysis is a technique that allows the study of the three-dimensional structure of proteins and other macromolecular assemblies of biological interest. Its primary data consists of transmission electron microscopy images from multiple copies of the molecule in random orientations. Such images are very noisy due to the low electron dose employed. Reconstruction of the macromolecule can be obtained by averaging many images of particles in similar orientations and estimating their relative angles. However, heterogeneous conformational states often co-exist in the sample, because the molecular complexes can be flexible and may also interact with other particles. Heterogeneity poses a challenge to the reconstruction of reliable 3D models and degrades their resolution. Among the most popular algorithms used for structural classification are k-means clustering, hierarchical clustering, self-organizing maps and maximum-likelihood estimators. Such approaches are usually interlaced with the reconstructions of the 3D models. Nevertheless, recent works indicate that it is possible to infer information about the structure of the molecules directly from the dataset of 2D projections. Among these findings is the relationship between structural variability and manifolds in a multidimensional feature space. This dissertation investigates whether an ensemble of unsupervised classification algorithms is able to separate these "conformational manifolds". Ensemble or "consensus" methods tend to provide more accurate classification and may achieve satisfactory performance across a wide range of datasets, when compared with individual algorithms. We investigate the behavior of six clustering algorithms both individually and combined in ensembles for the task of structural heterogeneity classification. The approach was tested on synthetic and real datasets containing a mixture of images from the Mm-cpn chaperonin in the "open" and "closed" states. It is shown that cluster ensembles can provide useful information in validating the structural partitionings independently of 3D reconstruction methodsMestradoEngenharia de ComputaçãoMestre em Engenharia Elétric
    corecore