769 research outputs found

    Crosslingual Document Embedding as Reduced-Rank Ridge Regression

    There has recently been much interest in extending vector-based word representations to multiple languages, such that words can be compared across languages. In this paper, we shift the focus from words to documents and introduce a method for embedding documents written in any language into a single, language-independent vector space. For training, our approach leverages a multilingual corpus where the same concept is covered in multiple languages (but not necessarily via exact translations), such as Wikipedia. Our method, Cr5 (Crosslingual reduced-rank ridge regression), starts by training a ridge-regression-based classifier that uses language-specific bag-of-words features to predict the concept that a given document is about. We show that, when the learned weight matrix is constrained to be of low rank, it can be factored to obtain the desired mappings from language-specific bags-of-words to language-independent embeddings. Unlike most prior methods, which take pretrained monolingual word vectors, postprocess them to make them crosslingual, and finally average word vectors to obtain document vectors, Cr5 is trained end-to-end and is thus natively crosslingual as well as document-level. Moreover, since our algorithm uses the singular value decomposition as its core operation, it is highly scalable. Experiments show that our method achieves state-of-the-art performance on a crosslingual document retrieval task. Finally, although not trained to embed sentences or words, it also achieves competitive performance on crosslingual sentence and word retrieval tasks.
    Comment: In The Twelfth ACM International Conference on Web Search and Data Mining (WSDM '19).
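    The low-rank factorization at the heart of this abstract is compact enough to sketch. The NumPy snippet below is a minimal illustration, not the authors' implementation (which trains jointly over many languages at Wikipedia scale); the shapes, the regularizer lam, and the rank value are illustrative assumptions. It solves the ridge problem in closed form and truncates the solution with an SVD, yielding a matrix A that maps bag-of-words vectors into a low-dimensional space.

        import numpy as np

        def reduced_rank_ridge(X, Y, lam=1.0, rank=32):
            # X: (n_docs, vocab_size) bag-of-words counts for one language
            # Y: (n_docs, n_concepts) one-hot concept labels
            # (rank must not exceed min(vocab_size, n_concepts))
            d = X.shape[1]
            # Closed-form ridge solution: W = (X^T X + lam*I)^{-1} X^T Y
            W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
            # Rank constraint via truncated SVD: W ~= A @ B, where A maps
            # bag-of-words vectors into the shared low-dimensional space
            U, s, Vt = np.linalg.svd(W, full_matrices=False)
            A = U[:, :rank] * np.sqrt(s[:rank])          # (vocab_size, rank)
            B = np.sqrt(s[:rank])[:, None] * Vt[:rank]   # (rank, n_concepts)
            return A, B

    A document is then embedded as doc_vec = x_bow @ A. In the crosslingual setting described above, each language gets its own mapping trained against the same shared concept labels, which is what makes the resulting embeddings comparable across languages.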

    Rationalizing Text Matching: Learning Sparse Alignments via Optimal Transport

    Selecting input features of top relevance has become a popular method for building self-explaining models. In this work, we extend this selective rationalization approach to text matching, where the goal is to jointly select and align text pieces, such as tokens or sentences, as a justification for the downstream prediction. Our approach employs optimal transport (OT) to find a minimal-cost alignment between the inputs. However, directly applying OT often produces dense and therefore uninterpretable alignments. To overcome this limitation, we introduce novel constrained variants of the OT problem that result in highly sparse alignments with controllable sparsity. Our model is end-to-end differentiable using the Sinkhorn algorithm for OT and can be trained without any alignment annotations. We evaluate our model on the StackExchange, MultiNews, e-SNLI, and MultiRC datasets. Our model achieves very sparse rationale selections with high fidelity while preserving the prediction accuracy of strong attention baseline models.
    Comment: To appear at ACL 2020.
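    For reference, the bare-bones Sinkhorn iteration for entropically regularized OT is short enough to write down. This is a generic sketch, not the paper's constrained, sparsity-inducing variants; the eps and n_iters values are illustrative assumptions.

        import numpy as np

        def sinkhorn(C, a, b, eps=0.1, n_iters=200):
            # C: (n, m) pairwise alignment costs between text pieces
            # a: (n,), b: (m,) marginal weights, each summing to 1
            K = np.exp(-C / eps)   # Gibbs kernel
            u = np.ones_like(a)
            for _ in range(n_iters):
                v = b / (K.T @ u)
                u = a / (K @ v)
            # Transport plan: P[i, j] is the alignment mass between pieces i and j
            return u[:, None] * K * v[None, :]

    Smaller eps brings the plan closer to the exact, sparser OT solution but slows convergence; the dense plans produced at practical eps values are exactly the interpretability problem the constrained variants above are designed to address.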

    Multilingual topic modelling on news data


    Improved Techniques for Adversarial Discriminative Domain Adaptation

    Adversarial discriminative domain adaptation (ADDA) is an efficient framework for unsupervised domain adaptation in image classification, where the source and target domains are assumed to have the same classes, but no labels are available for the target domain. We investigate whether we can improve the performance of ADDA with a new framework and new loss formulations. Following the framework of semi-supervised GANs, we first extend the discriminator output over the source classes in order to model the joint distribution over domain and task. We thus leverage the distribution over the source encoder posteriors (which is fixed during adversarial training) and propose maximum mean discrepancy (MMD) and reconstruction-based loss functions for aligning the target encoder distribution to the source domain. We compare and provide a comprehensive analysis of how our framework and loss formulations extend beyond simple multi-class extensions of ADDA and other discriminative variants of semi-supervised GANs. In addition, we introduce various forms of regularization for stabilizing training, including treating the discriminator as a denoising autoencoder and regularizing the target encoder with source examples to reduce overfitting under a contraction mapping (i.e., when the target per-class distributions are contracting during alignment with the source). Finally, we validate our framework on standard domain adaptation datasets, such as SVHN and MNIST. We also examine how our framework benefits recognition problems based on modalities that lack training data, by introducing and evaluating on a neuromorphic vision sensing (NVS) sign language recognition dataset, where the source and target domains constitute emulated and real neuromorphic spike events, respectively. Our results on all datasets show that our proposal competes with or outperforms the state of the art in unsupervised domain adaptation.
    Comment: To appear in IEEE Transactions on Image Processing.
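    Of the loss terms the abstract mentions, the MMD term is the most self-contained to sketch. Below is a standard biased squared-MMD estimate with an RBF kernel between batches of source and target encoder features; this is the generic textbook formulation with an assumed bandwidth sigma, not necessarily the paper's exact estimator.

        import numpy as np

        def mmd_rbf(x, y, sigma=1.0):
            # x: (n, d) source encoder features, y: (m, d) target encoder features
            def k(p, q):
                d2 = ((p[:, None, :] - q[None, :, :]) ** 2).sum(-1)
                return np.exp(-d2 / (2 * sigma ** 2))
            # Squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]
            return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

    Minimizing this quantity with respect to the target encoder pulls the target feature distribution toward the frozen source one, complementing the adversarial signal from the discriminator.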

    Data Visualization, Dimensionality Reduction, and Data Alignment via Manifold Learning

    Get PDF
    The high dimensionality of modern data introduces significant challenges in descriptive and exploratory data analysis. These challenges gave rise to extensive work on dimensionality reduction and manifold learning, which aims to provide low-dimensional representations that preserve or uncover intrinsic patterns and structures in the data. In this thesis, we expand the current manifold learning literature by developing two methods called DIG (Dynamical Information Geometry) and GRAE (Geometry Regularized Autoencoders). DIG finds low-dimensional representations of high-frequency multivariate time series data and is especially suited for visualization. GRAE is a general framework that splices the well-established machinery of kernel manifold learning methods, which recover a sensible geometry, into the parametric structure of autoencoders. Manifold learning can also be useful for studying data collected from different measurement instruments, conditions, or protocols of the same underlying system. In such cases the data is acquired in a multi-domain representation. The last two chapters of this thesis are devoted to two new methods capable of aligning multi-domain data by leveraging their geometric structure alongside limited common information. First, we present DTA (Diffusion Transport Alignment), a semi-supervised manifold alignment method that exploits prior one-to-one correspondence knowledge between distinct data views and finds an aligned common representation. Finally, we introduce MALI (Manifold Alignment with Label Information), which drops the assumption of prior one-to-one correspondences, since in many scenarios such information cannot be provided, either because of the nature of the experimental design or because it is extremely costly to obtain. Instead, MALI only needs side information in the form of discrete labels/classes present in both domains.
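    As we read this abstract, GRAE augments a plain autoencoder with a geometric pull term toward a precomputed manifold embedding. A minimal PyTorch-style sketch of such an objective follows; the layer sizes, the weight lam, and the embedding e are all illustrative assumptions, not the thesis's actual architecture.

        import torch
        import torch.nn as nn

        # Toy encoder/decoder for 100-dimensional inputs and a 2-D latent space
        enc = nn.Sequential(nn.Linear(100, 32), nn.ReLU(), nn.Linear(32, 2))
        dec = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 100))

        def grae_loss(x, e, lam=0.1):
            # x: (batch, 100) inputs; e: (batch, 2) precomputed manifold
            # embedding of the same points (e.g., from Isomap or a diffusion map)
            z = enc(x)
            recon = ((dec(z) - x) ** 2).mean()   # autoencoder reconstruction
            geometry = ((z - e) ** 2).mean()     # keep latents near the embedding
            return recon + lam * geometry

    The regularizer trades off faithfulness to the nonparametric embedding against the parametric encoder/decoder structure, which is what lets the method embed new points while respecting the recovered geometry.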