    Discriminative Features via Generalized Eigenvectors

    Representing examples in a way that is compatible with the underlying classifier can greatly enhance the performance of a learning system. In this paper we investigate scalable techniques for inducing discriminative features by taking advantage of simple second-order structure in the data. We focus on multiclass classification and show that features extracted from the generalized eigenvectors of the class-conditional second moments lead to classifiers with excellent empirical performance. Moreover, these features have attractive theoretical properties, such as inducing representations that are invariant to linear transformations of the input. We evaluate classifiers built from these features on three different tasks, obtaining state-of-the-art results.

    Cross-lingual Distillation for Text Classification

    Cross-lingual text classification (CLTC) is the task of classifying documents written in different languages into the same taxonomy of categories. This paper presents a novel approach to CLTC that builds on model distillation, which adapts and extends a framework originally proposed for model compression. Using soft probabilistic predictions for the documents in a label-rich language as the (induced) supervisory labels in a parallel corpus of documents, we successfully train classifiers for new languages in which labeled training data are not available. An adversarial feature adaptation technique is also applied during model training to reduce distribution mismatch. We conducted experiments on two benchmark CLTC datasets, treating English as the source language and German, French, Japanese and Chinese as the unlabeled target languages. The proposed approach performed better than or comparably to other state-of-the-art methods. Comment: Accepted at ACL 2017; Code available at https://github.com/xrc10/cross-distil

    Learning Semantic Representations for the Phrase Translation Model

    This paper presents a novel semantic-based phrase translation model. A pair of source and target phrases is projected into continuous-valued vector representations in a low-dimensional latent semantic space, where their translation score is computed from the distance between the pair in this new space. The projection is performed by a multi-layer neural network whose weights are learned on parallel training data. Learning aims to directly optimize the quality of end-to-end machine translation results. Experimental evaluation was performed on two Europarl translation tasks, English-French and German-English. The results show that the new semantic-based phrase translation model significantly improves the performance of a state-of-the-art phrase-based statistical machine translation system, leading to a gain of 0.7-1.0 BLEU points.

    Continuous Space Models for CLIR

    We present and evaluate a novel technique for learning cross-lingual continuous space models to aid cross-language information retrieval (CLIR). Our model, referred to as the external-data composition neural network (XCNN), is based on a composition function implemented on top of a deep neural network that provides a distributed learning framework. Unlike most existing models, which rely only on available parallel data for training, our learning framework provides a natural way to exploit monolingual data and its associated relevance metadata for learning continuous space representations of language. Cross-language extensions of the obtained models can then be trained using a small set of parallel data. This property is very helpful for resource-poor languages; therefore, we carry out experiments on the English-Hindi language pair. In the comparative evaluation, the proposed model is shown to outperform state-of-the-art continuous space models by a statistically significant margin on two different tasks: parallel sentence retrieval and ad-hoc retrieval. We thank German Sanchis Trilles for helping in conducting experiments with machine translation. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GeForce Titan GPU used for this research. The research of the first author was supported by FPI grant of UPV. The research of the third author is supported by the SomEMBED TIN2015-71147-C2-1-P MINECO research project and by the Generalitat Valenciana under the grant ALMAMATER (PrometeolI/2014/030). Gupta, P.; Banchs, R.; Rosso, P. (2017). Continuous Space Models for CLIR. Information Processing & Management. 53(2):359-370. https://doi.org/10.1016/j.ipm.2016.11.002

    Zero-shot language transfer for cross-lingual sentence retrieval using bidirectional attention model

    We present a neural architecture for cross-lingual mate sentence retrieval which encodes sentences in a joint multilingual space and learns to distinguish true translation pairs from semantically related sentences across languages. The proposed model combines a recurrent sequence encoder with a bidirectional attention layer and an intra-sentence attention mechanism. This way, the final fixed-size sentence representations in each training sentence pair depend on the selection of contextualized token representations from the other sentence. The representations of both sentences are then combined using the bilinear product function to predict the relevance score. We show that, coupled with a shared multilingual word embedding space, the proposed model strongly outperforms unsupervised cross-lingual ranking functions, and that further boosts can be achieved by combining the two approaches. Most importantly, we demonstrate the model's effectiveness in zero-shot language transfer settings: our multilingual framework boosts cross-lingual sentence retrieval performance for unseen language pairs without any training examples. This enables robust cross-lingual sentence retrieval even for pairs of resource-lean languages, without any parallel data.

    Cross-language high similarity search using a conceptual thesaurus

    This work addresses the issue of cross-language high similarity and near-duplicates search, where, for a given document, a highly similar one is to be identified from a large cross-language collection of documents. We propose a concept-based similarity model for the problem which is very light in computation and memory. We evaluate the model on three corpora of different nature and two language pairs, English-German and English-Spanish, using the Eurovoc conceptual thesaurus. Our model is compared with two state-of-the-art models and we find that, though the proposed model is very generic, it produces competitive results and is significantly stable and consistent across the corpora. This work was done in the framework of the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems and it has been partially funded by the European Commission as part of the WIQ-EI IRSES project (grant no. 269180) within the FP 7 Marie Curie People Framework, and by the Text-Enterprise 2.0 research project (TIN2009-13391-C04-03). The research work of the second author is supported by the CONACyT 192021/302009 grant. Gupta, P.; Barrón Cedeño, LA.; Rosso, P. (2012). Cross-language high similarity search using a conceptual thesaurus. En Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics. Springer Verlag (Germany). 7488:67-75. https://doi.org/10.1007/978-3-642-33247-0_8

    Squeezing Bottlenecks: Exploring the Limits of Autoencoder Semantic Representation Capabilities

    This is the author’s version of a work that was accepted for publication in Neurocomputing. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Neurocomputing 175 (2016) 1001–1008. DOI 10.1016/j.neucom.2015.06.091. We present a comprehensive study on the use of autoencoders for modelling text data, in which (differently from previous studies) we focus our attention on the following issues. We explore the suitability of two different models, binary deep autoencoders (bDA) and replicated-softmax deep autoencoders (rsDA), for constructing deep autoencoders for text data at the sentence level. We propose and evaluate two novel metrics for better assessing the text-reconstruction capabilities of autoencoders. We propose an automatic method to find the critical bottleneck dimensionality for text representations (below which structural information is lost). Finally, we conduct a comparative evaluation across different languages, exploring the regions of critical bottleneck dimensionality and its relationship to language perplexity. © 2015 Elsevier B.V. All rights reserved. A significant part of this research work was conducted during the first author's attachment to the HLT department of I2R in Singapore. The work of the first and third authors was carried out in the framework of the WIQ-EI IRSES project (Grant no. 269180) within the FP 7 Marie Curie, the DIANA APPLICATIONS Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) project and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems. Gupta, PA.; Banchs, R.; Rosso, P. (2016). Squeezing Bottlenecks: Exploring the Limits of Autoencoder Semantic Representation Capabilities. Neurocomputing. 175:1001-1008. https://doi.org/10.1016/j.neucom.2015.06.091