    Transfer Learning for Multi-language Twitter Election Classification

    Both politicians and citizens are increasingly embracing social media as a means to disseminate information and comment on various topics, particularly during significant political events such as elections. Such commentary during elections is also of interest to social scientists and pollsters. To facilitate the study of social media during elections, there is a need to automatically identify posts that are topically related to those elections. However, current studies have focused on elections within English-speaking regions, and hence the resulting election content classifiers are only applicable to elections in countries where the predominant language is English. Meanwhile, as social media becomes more prevalent worldwide, there is an increasing need for election classifiers that can be generalised across different languages without building a training dataset for each election. In this paper, we study the development of effective and reusable election classifiers, based upon transfer learning, for use on social media across multiple languages. We combine transfer learning with different classifiers, such as Support Vector Machines (SVM) and state-of-the-art Convolutional Neural Networks (CNN), which make use of word embedding representations of each social media post. We generalise the learned classifier models for cross-language classification by using a linear translation approach to map word embedding vectors from one language into another. Experiments conducted over two election datasets in different languages show that, without using any training data from the target language, linear translations outperform a classical transfer learning approach, namely Transfer Component Analysis (TCA), by 80% in recall and 25% in F1 measure.
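    The linear translation approach mentioned above is typically a least-squares mapping between embedding spaces. Below is a minimal sketch of that idea, assuming a seed bilingual dictionary whose entries have embeddings stacked row-wise in `src_vecs` and `tgt_vecs`; the random arrays are stand-ins for real embeddings, not the paper's data.

```python
import numpy as np

def learn_linear_map(src_vecs, tgt_vecs):
    """Least-squares matrix W with src_vecs @ W ~= tgt_vecs, where
    row i of each array is the embedding of one seed translation pair."""
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W

# Toy usage: random stand-ins for real source/target embeddings.
rng = np.random.default_rng(0)
dim, n_pairs = 50, 200
src = rng.normal(size=(n_pairs, dim))
true_map = rng.normal(size=(dim, dim))
tgt = src @ true_map + 0.01 * rng.normal(size=(n_pairs, dim))

W = learn_linear_map(src, tgt)
new_word_vec = rng.normal(size=dim)
translated = new_word_vec @ W               # map into the target space
print(np.allclose(W, true_map, atol=0.1))   # True up to the noise level
```

    Once mapped into the target space, a post's word vectors can be fed to a classifier trained only on the source language, which is what makes the classifier reusable across elections.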

    Training vs Post-training Cross-lingual Word Embedding Approaches: A Comparative Study

    This paper provides a comparative analysis of cross-lingual word embeddings by studying the impact of different variables on the quality of the embedding models within the distributional semantics framework. Distributional semantics is a method for the semantic representation of words, phrases, sentences, and documents, which aims to capture as much information as possible from contextual information in a vector space. Early work in this domain focused on monolingual word embeddings; later progress used cross-lingual data to capture contextual semantic information across different languages. The main contribution of this research is a comparative study of how supervised and unsupervised learning methods, applied in training and post-training approaches across different embedding algorithms, capture the semantic properties of words in cross-lingual embedding models, with a view to multilingual tasks such as question retrieval. To this end, we study the cross-lingual embedding models created by the BilBOWA, VecMap, and MUSE embedding algorithms, along with the variables that impact the models' quality, namely the size of the training data and the window size of the local context. We use the unsupervised monolingual Word2Vec embedding model as the baseline and evaluate the quality of the embeddings on three datasets: the Google analogy set and monolingual and cross-lingual word similarity lists. We further investigate the impact of the embedding models on the question retrieval task.
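    For intuition on the post-training side: supervised aligners in the VecMap/MUSE family commonly solve an orthogonal Procrustes problem over a seed dictionary. The sketch below shows that closed-form solution; the random matrices stand in for real embeddings and are not drawn from the paper's experiments.

```python
import numpy as np

def procrustes_map(X, Y):
    """Closed-form orthogonal Procrustes solution used by supervised
    post-training aligners: W = U V^T, where U S V^T = SVD(X^T Y),
    minimises ||X W - Y||_F over orthogonal matrices W."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy check: recover a known rotation from 300 "translation pairs".
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 64))            # source-language seed embeddings
true_rot, _ = np.linalg.qr(rng.normal(size=(64, 64)))
Y = X @ true_rot                          # target-language counterparts
W = procrustes_map(X, Y)
print(np.allclose(W, true_rot, atol=1e-6))  # True: rotation recovered
```

    Training-time approaches such as BilBOWA instead bake the cross-lingual signal into the objective itself, which is the axis of comparison the paper studies.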

    Learning Cross-lingual Word Embeddings via Matrix Co-factorization

    A joint-space model for cross-lingual distributed representations generalizes language-invariant semantic features. In this paper, we present a matrix co-factorization framework for learning cross-lingual word embeddings. We explicitly define monolingual training objectives in the form of matrix decomposition, and induce cross-lingual constraints for simultaneously factorizing the monolingual matrices. The cross-lingual constraints can be derived from parallel corpora, with or without word alignments. Empirical results on a cross-lingual document classification task show that our method effectively encodes cross-lingual knowledge as constraints for cross-lingual word embeddings.
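    As a rough illustration of the co-factorization idea (not the paper's exact objective or solver), the sketch below factorizes two monolingual co-occurrence matrices by gradient descent while a quadratic penalty ties the embeddings of dictionary-aligned word pairs together; the function name, toy sizes, and hyperparameters are all hypothetical.

```python
import numpy as np

def cofactorize(C1, C2, pairs, dim=16, lam=1.0, lr=0.005, epochs=500):
    """Gradient-descent sketch of co-factorizing two monolingual
    co-occurrence matrices with a cross-lingual constraint:
        ||C1 - W1 H1||^2 + ||C2 - W2 H2||^2
          + lam * sum_{(i,j) in pairs} ||W1[i] - W2[j]||^2
    `pairs` plays the role of alignment information from a parallel
    corpus; constant factors are folded into the learning rate."""
    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(C1.shape[0], dim))
    H1 = rng.normal(scale=0.1, size=(dim, C1.shape[1]))
    W2 = rng.normal(scale=0.1, size=(C2.shape[0], dim))
    H2 = rng.normal(scale=0.1, size=(dim, C2.shape[1]))
    i1 = np.array([p[0] for p in pairs])
    i2 = np.array([p[1] for p in pairs])
    for _ in range(epochs):
        R1, R2 = W1 @ H1 - C1, W2 @ H2 - C2    # reconstruction residuals
        gW1, gH1 = R1 @ H1.T, W1.T @ R1
        gW2, gH2 = R2 @ H2.T, W2.T @ R2
        d = W1[i1] - W2[i2]                     # cross-lingual penalty term
        np.add.at(gW1, i1, lam * d)
        np.add.at(gW2, i2, -lam * d)
        W1 -= lr * gW1; H1 -= lr * gH1
        W2 -= lr * gW2; H2 -= lr * gH2
    return W1, W2

# Toy usage: random co-occurrence matrices, a three-pair dictionary.
rng = np.random.default_rng(1)
W1, W2 = cofactorize(rng.random((40, 40)), rng.random((30, 30)),
                     pairs=[(0, 0), (1, 1), (2, 2)])
```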

    Joint Representation Learning of Cross-lingual Words and Entities via Attentive Distant Supervision

    Joint representation learning of words and entities benefits many NLP tasks, but has not been well explored in cross-lingual settings. In this paper, we propose a novel method for joint representation learning of cross-lingual words and entities. It captures mutually complementary knowledge and enables cross-lingual inferences among knowledge bases and texts. Our method does not require parallel corpora, and automatically generates comparable data via distant supervision using multilingual knowledge bases. We utilize two types of regularizers to align cross-lingual words and entities, and design knowledge attention and cross-lingual attention to further reduce noise. We conducted a series of experiments on three tasks: word translation, entity relatedness, and cross-lingual entity linking. The results, both qualitative and quantitative, demonstrate the significance of our method.
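    To make the regularizer idea concrete, here is a hedged sketch of a cross-lingual alignment penalty that pulls the embeddings of counterpart words (or entities) together, with optional per-pair weights standing in for the paper's attention scores; the function name and toy data are illustrative only.

```python
import numpy as np

def alignment_regularizer(E_src, E_tgt, pairs, weights=None):
    """Weighted squared distance between embeddings of cross-lingual
    counterpart pairs; `weights` is a stand-in for attention scores
    that down-weight noisy distantly-supervised pairs."""
    i = np.array([p[0] for p in pairs])
    j = np.array([p[1] for p in pairs])
    sq = ((E_src[i] - E_tgt[j]) ** 2).sum(axis=1)
    if weights is None:
        weights = np.full(len(pairs), 1.0 / len(pairs))
    return float(weights @ sq)

# Toy usage: two 2-dimensional embedding tables, two aligned pairs.
E_src = np.array([[1.0, 0.0], [0.0, 1.0]])
E_tgt = np.array([[0.9, 0.1], [0.2, 0.8]])
print(alignment_regularizer(E_src, E_tgt, [(0, 0), (1, 1)]))  # 0.05
```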

    Concept and entity grounding using indirect supervision

    Extracting and disambiguating entities and concepts is a crucial step toward understanding natural language text. In this thesis, we consider the problem of grounding concepts and entities mentioned in text to one or more knowledge bases (KBs). A well-studied scenario of this problem is the one in which documents are given in English and the goal is to identify concept and entity mentions and find the corresponding Wikipedia entries they refer to. We extend this problem in two directions: first, we study identifying and grounding entities written in any language to the English Wikipedia; second, we investigate using multiple KBs that do not contain the rich textual and structural information Wikipedia does. These more involved settings pose additional challenges beyond those addressed in the standard English Wikification problem. Key among them is that no supervision is available to facilitate training machine learning models. The first extension, cross-lingual Wikification, introduces problems such as recognizing multilingual named entities mentioned in text, translating non-English names into English, and computing word similarity across languages. Since it is impossible to acquire manually annotated examples for all languages, building models for all languages in Wikipedia requires exploiting indirect or incidental supervision signals that already exist in Wikipedia. For the second setting, we need to deal with the fact that most KBs do not contain the rich information Wikipedia has; consequently, the main supervision signal used to train Wikification rankers no longer exists. In this thesis, we show that supervision signals can be obtained by carefully examining the redundancy and relations between multiple KBs. By developing algorithms and models which harvest these incidental signals, we achieve better performance on these tasks.
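    One concrete example of the incidental supervision the thesis refers to is the anchor-text statistics that Wikipedia hyperlinks and inter-language links provide for free. The toy sketch below ranks candidate English titles for a mention by a commonness prior P(title | mention); the dictionary contents are invented for illustration.

```python
# Hypothetical anchor-text statistics, as might be harvested from
# Wikipedia hyperlinks and inter-language links:
# mention string -> {English Wikipedia title: link count}.
ANCHOR_COUNTS = {
    "Chicago": {"Chicago": 9500, "Chicago_(band)": 300},
    "شيكاغو": {"Chicago": 120},  # Arabic mention linked to the English page
}

def ground(mention):
    """Rank candidate English titles for a mention by the
    commonness prior P(title | mention)."""
    cands = ANCHOR_COUNTS.get(mention, {})
    total = sum(cands.values())
    return sorted(((t, c / total) for t, c in cands.items()),
                  key=lambda x: -x[1]) if total else []

print(ground("Chicago"))  # [('Chicago', 0.969...), ('Chicago_(band)', 0.030...)]
print(ground("شيكاغو"))   # [('Chicago', 1.0)]
```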