Hyperbolic Interaction Model For Hierarchical Multi-Label Classification
Unlike traditional classification tasks, which assume mutual exclusion of
labels, hierarchical multi-label classification (HMLC) aims to assign
multiple labels to every instance, with the labels organized under
hierarchical relations. Besides the labels, since linguistic ontologies are
intrinsic hierarchies, the conceptual relations between words can also form
hierarchical structures. Thus it can be a challenge to learn mappings from word
hierarchies to label hierarchies. We propose to model the word and label
hierarchies by embedding them jointly in the hyperbolic space. The main reason
is that the tree-likeness of the hyperbolic space matches the complexity of
symbolic data with hierarchical structures. A new Hyperbolic Interaction Model
(HyperIM) is designed to learn the label-aware document representations and
make predictions for HMLC. Extensive experiments are conducted on three
benchmark datasets. The results have demonstrated that the new model can
realistically capture the complex data structures and further improve the
performance for HMLC compared with the state-of-the-art methods. To facilitate
future research, our code is publicly available.
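As a rough illustration of why hyperbolic space suits tree-like data, the sketch below computes the geodesic distance in the Poincaré ball, one common model of hyperbolic space. The abstract does not say which model HyperIM uses, so the choice of the Poincaré ball and the function name `poincare_distance` are assumptions made here for illustration only; this is not the authors' code.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    Assumption: points are plain sequences of floats with Euclidean norm < 1.
    Formula: d(u, v) = arcosh(1 + 2*|u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))).
    """
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# The same Euclidean gap costs far more hyperbolic distance near the boundary,
# which leaves room for the exponentially many leaves of a tree.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.85, 0.0], [0.95, 0.0])
```

In this model, general concepts would sit near the origin and specific ones near the boundary, so a word hierarchy and a label hierarchy can share one space.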
IsoScore: Measuring the Uniformity of Embedding Space Utilization
The recent success of distributed word representations has led to an
increased interest in analyzing the properties of their spatial distribution.
Several studies have suggested that contextualized word embedding models do not
isotropically project tokens into vector space. However, current methods
designed to measure isotropy, such as average random cosine similarity and the
partition score, have not been thoroughly analyzed and are not appropriate for
measuring isotropy. We propose IsoScore: a novel tool that quantifies the
degree to which a point cloud uniformly utilizes the ambient vector space.
Using rigorously designed tests, we demonstrate that IsoScore is the only tool
available in the literature that accurately measures how uniformly distributed
variance is across dimensions in vector space. Additionally, we use IsoScore to
challenge a number of recent conclusions in the NLP literature that have been
derived using brittle metrics of isotropy. We caution future studies against
using existing tools to measure isotropy in contextualized embedding space, as
resulting conclusions will be misleading or altogether inaccurate.
Comment: ACL 2022 camera-ready version
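For context, average random cosine similarity, one of the existing measures the abstract criticizes, can be sketched as follows. This is not IsoScore. The function names, the number of sampled pairs, and the fixed seed are illustrative assumptions, and the sketch assumes no point is the zero vector.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_random_cosine_similarity(points, n_pairs=1000, seed=0):
    """Mean cosine similarity over randomly sampled pairs of distinct points."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    total = 0.0
    for _ in range(n_pairs):
        u, v = rng.sample(points, 2)
        total += cosine(u, v)
    return total / n_pairs


# Two point clouds with the same spread but different means.
rng = random.Random(1)
centered = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(500)]
shifted = [[x + 5.0 for x in p] for p in centered]

score_centered = average_random_cosine_similarity(centered)
score_shifted = average_random_cosine_similarity(shifted)
```

The two clouds differ only by a translation, so variance is spread across dimensions identically in both, yet the score moves from near 0 to near 1. That dependence on the mean is one way such a measure can be brittle.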
Manifold-based Verbalizer Space Re-embedding for Tuning-free Prompt-based Classification
Prompt-based classification adapts tasks to a cloze question format utilizing
the [MASK] token and the filled tokens are then mapped to labels through
pre-defined verbalizers. Recent studies have explored the use of verbalizer
embeddings to reduce labor in this process. However, all existing studies
require a tuning process for either the pre-trained models or additional
trainable embeddings. Meanwhile, the distance between high-dimensional
verbalizer embeddings should not be measured by Euclidean distance due to the
potential for non-linear manifolds in the representation space. In this study,
we propose a tuning-free manifold-based space re-embedding method called
Locally Linear Embedding with Intra-class Neighborhood Constraint (LLE-INC) for
verbalizer embeddings, which preserves local properties within the same class
as guidance for classification. Experimental results indicate that even without
tuning any parameters, our LLE-INC is on par with automated verbalizers with
parameter tuning. With parameter updating, our approach further
enhances prompt-based tuning by up to 3.2%. Furthermore, experiments with
LLaMA-7B&13B indicate that LLE-INC is an efficient tuning-free classification
approach for hyper-scale language models.
Comment: 11 pages, 3 figures
Deep Learning for Embedding and Integrating Multimodal Biomedical Data
Biomedical data is being generated in extremely high throughput and high dimension by technologies in areas ranging from single-cell genomics, proteomics, and transcriptomics (cytometry, single-cell RNA and ATAC sequencing) to neuroscience and cognition (fMRI and PET) to pharmaceuticals (drug perturbations and interactions). These new and emerging technologies and the datasets they create give an unprecedented view into the workings of their respective biological entities. However, there is a large gap between the information contained in these datasets and the insights that current machine learning methods can extract from them. This is especially the case when multiple technologies can measure the same underlying biological entity or system. By separately analyzing the same system but from different views gathered by different data modalities, patterns are left unobserved if they only emerge from the multi-dimensional joint representation of all of the modalities together. Through an interdisciplinary approach that emphasizes active collaboration with data domain experts, my research has developed models for data integration, extracting important insights through the joint analysis of varied data sources. In this thesis, I discuss models that address this task of multi-modal data integration, especially generative adversarial networks (GANs) and autoencoders (AEs). My research has been focused on using both of these models in a generative way for concrete problems in cutting-edge scientific applications rather than the exclusive focus on the generation of high-resolution natural images. The research in this thesis is united around ideas of building models that can extract new knowledge from scientific data inaccessible to currently existing methods
N2D: (Not Too) Deep Clustering via Clustering the Local Manifold of an Autoencoded Embedding
Deep clustering has increasingly been demonstrating superiority over
conventional shallow clustering algorithms. Deep clustering algorithms usually
combine representation learning with deep neural networks to achieve this
performance, typically optimizing a clustering and non-clustering loss. In such
cases, an autoencoder is typically connected with a clustering network, and the
final clustering is jointly learned by both the autoencoder and clustering
network. Instead, we propose to learn an autoencoded embedding and then search
this further for the underlying manifold. For simplicity, we then cluster this
with a shallow clustering algorithm, rather than a deeper network. We study a
number of local and global manifold learning methods on both the raw data and
autoencoded embedding, concluding that UMAP in our framework is best able to
find the most clusterable manifold in the embedding, suggesting local manifold
learning on an autoencoded embedding is effective for discovering
higher-quality clusters. We quantitatively show across a range of image
and time-series datasets that our method has competitive performance against
the latest deep clustering algorithms, including out-performing current
state-of-the-art on several. We postulate that these results show a promising
research direction for deep clustering. The code can be found at
https://github.com/rymc/n2d
Comment: Accepted at ICPR 202