863 research outputs found

    Hyperbolic Interaction Model For Hierarchical Multi-Label Classification

    Unlike traditional classification tasks, which assume that labels are mutually exclusive, hierarchical multi-label classification (HMLC) aims to assign multiple labels to every instance, with the labels organized under hierarchical relations. Besides the labels, since linguistic ontologies are intrinsic hierarchies, the conceptual relations between words can also form hierarchical structures. It can therefore be a challenge to learn mappings from word hierarchies to label hierarchies. We propose to model the word and label hierarchies by embedding them jointly in hyperbolic space. The main reason is that the tree-likeness of hyperbolic space matches the complexity of symbolic data with hierarchical structures. A new Hyperbolic Interaction Model (HyperIM) is designed to learn label-aware document representations and make predictions for HMLC. Extensive experiments are conducted on three benchmark datasets. The results demonstrate that the new model can realistically capture the complex data structures and further improve performance for HMLC compared with state-of-the-art methods. To facilitate future research, our code is publicly available.
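
As a rough illustration of why hyperbolic space suits tree-like data, the sketch below computes the geodesic distance between two points in the Poincaré ball. The abstract does not say which model of hyperbolic space HyperIM uses, so the Poincaré ball and the function name `poincare_distance` are assumptions made for this example, not the authors' implementation.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    Assumption: the Poincare ball model is used only for illustration;
    the abstract does not name a specific model of hyperbolic space.
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


if __name__ == "__main__":
    origin = [0.0, 0.0]
    # The same Euclidean step costs more hyperbolic distance near the boundary.
    print(poincare_distance(origin, [0.1, 0.0]))
    print(poincare_distance([0.8, 0.0], [0.9, 0.0]))
```

Distances grow rapidly toward the boundary of the ball, which leaves room for the exponentially many leaves of a hierarchy to stay well separated.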

    IsoScore: Measuring the Uniformity of Embedding Space Utilization

    The recent success of distributed word representations has led to an increased interest in analyzing the properties of their spatial distribution. Several studies have suggested that contextualized word embedding models do not isotropically project tokens into vector space. However, current methods designed to measure isotropy, such as average random cosine similarity and the partition score, have not been thoroughly analyzed and are not appropriate for measuring isotropy. We propose IsoScore: a novel tool that quantifies the degree to which a point cloud uniformly utilizes the ambient vector space. Using rigorously designed tests, we demonstrate that IsoScore is the only tool available in the literature that accurately measures how uniformly distributed variance is across dimensions in vector space. Additionally, we use IsoScore to challenge a number of recent conclusions in the NLP literature that have been derived using brittle metrics of isotropy. We caution future studies against using existing tools to measure isotropy in contextualized embedding space, as the resulting conclusions will be misleading or altogether inaccurate. Comment: ACL 2022 camera-ready version
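
For context, the sketch below shows one of the existing measures the abstract criticizes, average cosine similarity between embedding pairs. It is not IsoScore. It also averages over all distinct pairs rather than over randomly sampled pairs, a simplification made here to keep the result deterministic; the function name `average_cosine_similarity` is hypothetical.

```python
import math
from itertools import combinations


def average_cosine_similarity(points):
    """Mean cosine similarity over all distinct pairs of non-zero vectors.

    Assumption: exhaustive pairing stands in for the random sampling
    used by "average random cosine similarity" in the literature.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    pairs = list(combinations(points, 2))
    if not pairs:
        raise ValueError("need at least two points")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


if __name__ == "__main__":
    # Vectors that all point the same way score close to 1.
    print(average_cosine_similarity([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]))
    # Two orthogonal vectors score 0.
    print(average_cosine_similarity([[1.0, 0.0], [0.0, 1.0]]))
```

A score near 1 is usually read as a sign that embeddings occupy a narrow cone; the abstract argues that this kind of reading is unreliable as a measure of isotropy.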

    Manifold-based Verbalizer Space Re-embedding for Tuning-free Prompt-based Classification

    Prompt-based classification adapts tasks to a cloze question format using the [MASK] token, and the filled tokens are then mapped to labels through pre-defined verbalizers. Recent studies have explored the use of verbalizer embeddings to reduce labor in this process. However, all existing studies require a tuning process for either the pre-trained models or additional trainable embeddings. Meanwhile, the distance between high-dimensional verbalizer embeddings should not be measured by Euclidean distance, because the representation space may contain non-linear manifolds. In this study, we propose a tuning-free manifold-based space re-embedding method called Locally Linear Embedding with Intra-class Neighborhood Constraint (LLE-INC) for verbalizer embeddings, which preserves local properties within the same class as guidance for classification. Experimental results indicate that even without tuning any parameters, our LLE-INC is on par with automated verbalizers with parameter tuning. With parameter updating, our approach further enhances prompt-based tuning by up to 3.2%. Furthermore, experiments with LLaMA-7B and LLaMA-13B indicate that LLE-INC is an efficient tuning-free classification approach for hyper-scale language models. Comment: 11 pages, 3 figures

    Deep Learning for Embedding and Integrating Multimodal Biomedical Data

    Biomedical data is being generated in extremely high throughput and high dimension by technologies in areas ranging from single-cell genomics, proteomics, and transcriptomics (cytometry, single-cell RNA and ATAC sequencing) to neuroscience and cognition (fMRI and PET) to pharmaceuticals (drug perturbations and interactions). These new and emerging technologies and the datasets they create give an unprecedented view into the workings of their respective biological entities. However, there is a large gap between the information contained in these datasets and the insights that current machine learning methods can extract from them. This is especially the case when multiple technologies can measure the same underlying biological entity or system. When the same system is analyzed separately through the different views gathered by different data modalities, patterns are left unobserved if they only emerge from the multi-dimensional joint representation of all of the modalities together. Through an interdisciplinary approach that emphasizes active collaboration with data domain experts, my research has developed models for data integration, extracting important insights through the joint analysis of varied data sources. In this thesis, I discuss models that address this task of multi-modal data integration, especially generative adversarial networks (GANs) and autoencoders (AEs). My research has focused on using both of these models in a generative way for concrete problems in cutting-edge scientific applications, rather than focusing exclusively on the generation of high-resolution natural images. The research in this thesis is united around ideas of building models that can extract new knowledge from scientific data that is inaccessible to existing methods.

    N2D: (Not Too) Deep Clustering via Clustering the Local Manifold of an Autoencoded Embedding

    Deep clustering has increasingly been demonstrating superiority over conventional shallow clustering algorithms. Deep clustering algorithms usually combine representation learning with deep neural networks to achieve this performance, typically optimizing a clustering and a non-clustering loss. In such cases, an autoencoder is typically connected with a clustering network, and the final clustering is jointly learned by both the autoencoder and the clustering network. Instead, we propose to learn an autoencoded embedding and then search it further for the underlying manifold. For simplicity, we then cluster this with a shallow clustering algorithm rather than a deeper network. We study a number of local and global manifold learning methods on both the raw data and the autoencoded embedding, concluding that UMAP in our framework is best able to find the most clusterable manifold in the embedding, suggesting that local manifold learning on an autoencoded embedding is effective for discovering higher-quality clusters. We quantitatively show across a range of image and time-series datasets that our method has competitive performance against the latest deep clustering algorithms, including outperforming the current state of the art on several. We postulate that these results show a promising research direction for deep clustering. The code can be found at https://github.com/rymc/n2d Comment: Accepted at ICPR 202