57 research outputs found

    A Joint Matrix Factorization Analysis of Multilingual Representations

    We present an analysis tool based on joint matrix factorization for comparing the latent representations of multilingual and monolingual models. An alternative to probing, this tool allows us to analyze multiple sets of representations jointly. Using it, we study to what extent, and how, morphosyntactic features are reflected in the representations learned by multilingual pre-trained models. We conduct a large-scale empirical study of over 33 languages and 17 morphosyntactic categories. Our findings demonstrate variation in how morphosyntactic information is encoded across upper and lower layers, with category-specific differences influenced by language properties. Hierarchical clustering of the factorization outputs yields a tree structure that is related to the phylogenetic trees manually crafted by linguists. Moreover, we find that the factorization outputs exhibit strong associations with performance observed across different cross-lingual tasks. We release our code to facilitate future research. Comment: Accepted to Findings of EMNLP 202
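    The core idea of a joint factorization can be sketched as follows: each model's representation matrix X_i is approximated as W @ H_i with a basis W shared across all sets, fit by alternating least squares. This is a hypothetical simplification for illustration; the paper's exact objective, constraints, and solver are not specified in the abstract.

```python
import numpy as np

def joint_factorize(mats, rank, n_iters=50, seed=0):
    """Jointly factorize matrices X_i (d x n_i) as X_i ~= W @ H_i,
    sharing one basis W (d x rank) across all sets, via alternating
    least squares. Illustrative sketch, not the paper's algorithm."""
    rng = np.random.default_rng(seed)
    d = mats[0].shape[0]
    W = rng.standard_normal((d, rank))
    for _ in range(n_iters):
        # Fix W, solve each per-set loading matrix H_i by least squares.
        Hs = [np.linalg.lstsq(W, X, rcond=None)[0] for X in mats]
        # Fix all H_i, refit the shared basis W on the concatenated data.
        H_cat = np.concatenate(Hs, axis=1)
        X_cat = np.concatenate(mats, axis=1)
        W = np.linalg.lstsq(H_cat.T, X_cat.T, rcond=None)[0].T
    return W, Hs
```

    The per-set outputs H_i could then be compared across languages, e.g. fed to a hierarchical clustering routine, in the spirit of the tree analysis the abstract describes.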

    Learning informative priors from heterogeneous domains to improve recommendation in cold-start user domains

    © 2016 ACM. In real-world settings, users have ample experience in the domains they focus on but little experience in others. Recommender systems are very helpful for recommending potentially desirable items to users in unfamiliar domains, making cross-domain collaborative filtering an important emerging research topic. However, the cold-start issue is inevitable in unfamiliar domains due to the lack of feedback data. The Bayesian approach shows that priors play an important role when data are insufficient, which implies that recommendation performance in cold-start domains can be significantly improved if informative priors are available. Based on this idea, we propose a Weighted Irregular Tensor Factorization (WITF) model that leverages multi-domain feedback data across all users to learn cross-domain priors over both users and items. The features learned by WITF serve as informative priors on the latent factors of users and items in weighted matrix factorization models. Moreover, WITF is a unified framework for handling both explicit and implicit feedback. To demonstrate the effectiveness of our approach, we studied three typical real-world cases, conducting empirical evaluations on real-world datasets to compare our model against state-of-the-art approaches. The results show the superiority of our model over the comparison models.
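    The role of informative priors can be illustrated with a weighted matrix factorization whose regularizer pulls the user and item factors toward prior matrices P0 and Q0 (standing in for the cross-domain features WITF would learn) rather than toward zero. This is a minimal sketch under that assumption; the names, learning rate, and update rule are illustrative, not the paper's exact algorithm.

```python
import numpy as np

def weighted_mf_with_priors(R, Wt, P0, Q0, lam=0.1, lr=0.01, n_iters=200):
    """Weighted matrix factorization: minimize the weighted squared error
    ||Wt * (R - P Q^T)||^2 plus a penalty pulling P, Q toward the
    informative priors P0, Q0. Illustrative gradient-descent sketch."""
    P, Q = P0.copy(), Q0.copy()
    for _ in range(n_iters):
        E = Wt * (R - P @ Q.T)            # weighted residual
        gP = E @ Q - lam * (P - P0)       # gradient step for user factors
        gQ = E.T @ P - lam * (Q - Q0)     # gradient step for item factors
        P += lr * gP
        Q += lr * gQ
    return P, Q
```

    With zero weights on missing entries, the prior term is what keeps cold-start users and items anchored to plausible factors instead of collapsing toward zero.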

    Identification of threats using linguistics-based knowledge extraction.


    Cross-Language Information Retrieval Using Two Methods: LSI via SDD and LSI via SVD

    This chapter presents a method for bilingual information retrieval based on semidiscrete matrix decomposition (SDD); that is, it studies the problem of retrieving information in two languages, Spanish and English, when queries are made only in Spanish. Four case studies are presented that demonstrate the performance of latent semantic indexing (LSI) via SDD for cross-language information retrieval (CLIR). These results are compared with those obtained by applying LSI via singular value decomposition (SVD). All experiments were performed on a bilingual database, built from the gospels of the Bible, that combines documents in Spanish and English. For this, a fusion strategy was used that increases the size of the database by 10%. In terms of errors, the methods were found to be comparable, producing identical results for 58.3% of the queries. In addition, both methods achieved a success rate of at least 65% in retrieving relevant information in the two languages considered.
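    The LSI-via-SVD variant can be sketched concretely: build a term-document matrix over the fused bilingual vocabulary, take a truncated SVD, fold a monolingual (Spanish) query into the latent space, and rank documents by cosine similarity. The toy vocabulary and fold-in convention below are illustrative assumptions, not the chapter's actual data or parameters.

```python
import numpy as np

def lsi_fit(A, k):
    """Truncated rank-k SVD of a term-document matrix A (terms x docs).
    Returns the term basis U_k, singular values S_k, and document
    coordinates (one row per document) in the k-dimensional space."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], S[:k], Vt[:k].T

def fold_in(q, Uk, Sk):
    """Project a query term-vector into the latent space: S^-1 U^T q."""
    return (Uk.T @ q) / Sk

def cosine_sims(q_hat, docs_k):
    """Cosine similarity of the folded-in query against each document row."""
    num = docs_k @ q_hat
    den = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_hat) + 1e-12
    return num / den
```

    Because fused bilingual documents make translation-equivalent terms co-occur, a Spanish-only query can land near English-only documents in the latent space, which is the mechanism CLIR via LSI relies on. The SDD variant would replace the SVD above with a semidiscrete decomposition whose factors have entries in {-1, 0, 1}.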