    Transfer Learning for Multi-language Twitter Election Classification

    Both politicians and citizens are increasingly embracing social media as a means to disseminate information and comment on various topics, particularly during significant political events such as elections. Such commentary during elections is also of interest to social scientists and pollsters. To facilitate the study of social media during elections, there is a need to automatically identify posts that are topically related to those elections. However, current studies have focused on elections within English-speaking regions, and hence the resulting election content classifiers are only applicable to elections in countries where the predominant language is English. Meanwhile, as social media becomes more prevalent worldwide, there is an increasing need for election classifiers that can be generalised across different languages without building a training dataset for each election. In this paper, we study the development of effective and reusable election classifiers, based upon transfer learning, for use on social media across multiple languages. We combine transfer learning with different classifiers, such as Support Vector Machines (SVM) and state-of-the-art Convolutional Neural Networks (CNN), which make use of word embedding representations of each social media post. We generalise the learned classifier models for cross-language classification by using a linear translation approach to map word embedding vectors from one language into another. Experiments conducted over two election datasets in different languages show that, without using any training data from the target language, linear translations outperform a classical transfer learning approach, namely Transfer Component Analysis (TCA), by 80% in recall and 25% in F1 measure.
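    The linear translation approach mentioned above is typically a least-squares mapping between embedding spaces. Below is a minimal sketch of that idea, assuming a seed bilingual dictionary whose entries have embeddings stacked row-wise in `src_vecs` and `tgt_vecs`; the random arrays are stand-ins for real embeddings, not the paper's data.

```python
import numpy as np

def learn_linear_map(src_vecs, tgt_vecs):
    """Least-squares matrix W with src_vecs @ W ~= tgt_vecs, where
    row i of each array is the embedding of one seed translation pair."""
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W

# Toy usage: random stand-ins for real source/target embeddings.
rng = np.random.default_rng(0)
dim, n_pairs = 50, 200
src = rng.normal(size=(n_pairs, dim))
true_map = rng.normal(size=(dim, dim))
tgt = src @ true_map + 0.01 * rng.normal(size=(n_pairs, dim))

W = learn_linear_map(src, tgt)
new_word_vec = rng.normal(size=dim)
translated = new_word_vec @ W               # map into the target space
print(np.allclose(W, true_map, atol=0.1))   # True up to the noise level
```

    Once mapped into the target space, a post's word vectors can be fed to a classifier trained only on the source language, which is what makes the classifier reusable across elections.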

    Training vs Post-training Cross-lingual Word Embedding Approaches: A Comparative Study

    This paper provides a comparative analysis of cross-lingual word embeddings by studying the impact of different variables on the quality of the embedding models within the distributional semantics framework. Distributional semantics is a method for the semantic representation of words, phrases, sentences, and documents, which aims to capture as much information as possible from contextual information in a vector space. Early work in this domain focused on monolingual word embeddings; later progress used cross-lingual data to capture contextual semantic information across different languages. The main contribution of this research is a comparative study of how supervised and unsupervised learning methods, applied in training and post-training approaches across different embedding algorithms, capture the semantic properties of words in cross-lingual embedding models, with a view to multilingual tasks such as question retrieval. To this end, we study the cross-lingual embedding models created by the BilBOWA, VecMap, and MUSE embedding algorithms, along with the variables that impact the models' quality, namely the size of the training data and the window size of the local context. We use the unsupervised monolingual Word2Vec embedding model as the baseline and evaluate the quality of the embeddings on three datasets: the Google analogy set and monolingual and cross-lingual word similarity lists. We further investigate the impact of the embedding models on the question retrieval task.
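    For intuition on the post-training side: supervised aligners in the VecMap/MUSE family commonly solve an orthogonal Procrustes problem over a seed dictionary. The sketch below shows that closed-form solution; the random matrices stand in for real embeddings and are not drawn from the paper's experiments.

```python
import numpy as np

def procrustes_map(X, Y):
    """Closed-form orthogonal Procrustes solution used by supervised
    post-training aligners: W = U V^T, where U S V^T = SVD(X^T Y),
    minimises ||X W - Y||_F over orthogonal matrices W."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy check: recover a known rotation from 300 "translation pairs".
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 64))            # source-language seed embeddings
true_rot, _ = np.linalg.qr(rng.normal(size=(64, 64)))
Y = X @ true_rot                          # target-language counterparts
W = procrustes_map(X, Y)
print(np.allclose(W, true_rot, atol=1e-6))  # True: rotation recovered
```

    Training-time approaches such as BilBOWA instead bake the cross-lingual signal into the objective itself, which is the axis of comparison the paper studies.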

    Learning Cross-lingual Word Embeddings via Matrix Co-factorization

    A joint-space model for cross-lingual distributed representations generalizes language-invariant semantic features. In this paper, we present a matrix co-factorization framework for learning cross-lingual word embeddings. We explicitly define monolingual training objectives in the form of matrix decomposition, and induce cross-lingual constraints for simultaneously factorizing the monolingual matrices. The cross-lingual constraints can be derived from parallel corpora, with or without word alignments. Empirical results on a cross-lingual document classification task show that our method effectively encodes cross-lingual knowledge as constraints for cross-lingual word embeddings.
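    As a rough illustration of the co-factorization idea (not the paper's exact objective or solver), the sketch below factorizes two monolingual co-occurrence matrices by gradient descent while a quadratic penalty ties the embeddings of dictionary-aligned word pairs together; the function name, toy sizes, and hyperparameters are all hypothetical.

```python
import numpy as np

def cofactorize(C1, C2, pairs, dim=16, lam=1.0, lr=0.005, epochs=500):
    """Gradient-descent sketch of co-factorizing two monolingual
    co-occurrence matrices with a cross-lingual constraint:
        ||C1 - W1 H1||^2 + ||C2 - W2 H2||^2
          + lam * sum_{(i,j) in pairs} ||W1[i] - W2[j]||^2
    `pairs` plays the role of alignment information from a parallel
    corpus; constant factors are folded into the learning rate."""
    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(C1.shape[0], dim))
    H1 = rng.normal(scale=0.1, size=(dim, C1.shape[1]))
    W2 = rng.normal(scale=0.1, size=(C2.shape[0], dim))
    H2 = rng.normal(scale=0.1, size=(dim, C2.shape[1]))
    i1 = np.array([p[0] for p in pairs])
    i2 = np.array([p[1] for p in pairs])
    for _ in range(epochs):
        R1, R2 = W1 @ H1 - C1, W2 @ H2 - C2    # reconstruction residuals
        gW1, gH1 = R1 @ H1.T, W1.T @ R1
        gW2, gH2 = R2 @ H2.T, W2.T @ R2
        d = W1[i1] - W2[i2]                     # cross-lingual penalty term
        np.add.at(gW1, i1, lam * d)
        np.add.at(gW2, i2, -lam * d)
        W1 -= lr * gW1; H1 -= lr * gH1
        W2 -= lr * gW2; H2 -= lr * gH2
    return W1, W2

# Toy usage: random co-occurrence matrices, a three-pair dictionary.
rng = np.random.default_rng(1)
W1, W2 = cofactorize(rng.random((40, 40)), rng.random((30, 30)),
                     pairs=[(0, 0), (1, 1), (2, 2)])
```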

    Joint Representation Learning of Cross-lingual Words and Entities via Attentive Distant Supervision

    Joint representation learning of words and entities benefits many NLP tasks, but has not been well explored in cross-lingual settings. In this paper, we propose a novel method for joint representation learning of cross-lingual words and entities. It captures mutually complementary knowledge and enables cross-lingual inferences among knowledge bases and texts. Our method does not require parallel corpora, and automatically generates comparable data via distant supervision using multilingual knowledge bases. We utilize two types of regularizers to align cross-lingual words and entities, and design knowledge attention and cross-lingual attention to further reduce noise. We conducted a series of experiments on three tasks: word translation, entity relatedness, and cross-lingual entity linking. The results, both qualitative and quantitative, demonstrate the significance of our method.
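    To make the regularizer idea concrete, here is a hedged sketch of a cross-lingual alignment penalty that pulls the embeddings of counterpart words (or entities) together, with optional per-pair weights standing in for the paper's attention scores; the function name and toy data are illustrative only.

```python
import numpy as np

def alignment_regularizer(E_src, E_tgt, pairs, weights=None):
    """Weighted squared distance between embeddings of cross-lingual
    counterpart pairs; `weights` is a stand-in for attention scores
    that down-weight noisy distantly-supervised pairs."""
    i = np.array([p[0] for p in pairs])
    j = np.array([p[1] for p in pairs])
    sq = ((E_src[i] - E_tgt[j]) ** 2).sum(axis=1)
    if weights is None:
        weights = np.full(len(pairs), 1.0 / len(pairs))
    return float(weights @ sq)

# Toy usage: two 2-dimensional embedding tables, two aligned pairs.
E_src = np.array([[1.0, 0.0], [0.0, 1.0]])
E_tgt = np.array([[0.9, 0.1], [0.2, 0.8]])
print(alignment_regularizer(E_src, E_tgt, [(0, 0), (1, 1)]))  # 0.05
```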

    Concept and entity grounding using indirect supervision

    Extracting and disambiguating entities and concepts is a crucial step toward understanding natural language text. In this thesis, we consider the problem of grounding concepts and entities mentioned in text to one or more knowledge bases (KBs). A well-studied scenario of this problem is the one in which documents are given in English and the goal is to identify concept and entity mentions and find the corresponding Wikipedia entries they refer to. We extend this problem in two directions: first, we study identifying and grounding entities written in any language to the English Wikipedia; second, we investigate using multiple KBs that do not contain the rich textual and structural information Wikipedia does. These more involved settings pose additional challenges beyond those addressed in the standard English Wikification problem. Key among them is that no supervision is available to facilitate training machine learning models. The first extension, cross-lingual Wikification, introduces problems such as recognizing multilingual named entities mentioned in text, translating non-English names into English, and computing word similarity across languages. Since it is impossible to acquire manually annotated examples for all languages, building models for all languages in Wikipedia requires exploiting indirect or incidental supervision signals that already exist in Wikipedia. For the second setting, we need to deal with the fact that most KBs do not contain the rich information Wikipedia has; consequently, the main supervision signal used to train Wikification rankers no longer exists. In this thesis, we show that supervision signals can be obtained by carefully examining the redundancy and relations between multiple KBs. By developing algorithms and models which harvest these incidental signals, we achieve better performance on these tasks.
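    One concrete example of the incidental supervision the thesis refers to is the anchor-text statistics that Wikipedia hyperlinks and inter-language links provide for free. The toy sketch below ranks candidate English titles for a mention by a commonness prior P(title | mention); the dictionary contents are invented for illustration.

```python
# Hypothetical anchor-text statistics, as might be harvested from
# Wikipedia hyperlinks and inter-language links:
# mention string -> {English Wikipedia title: link count}.
ANCHOR_COUNTS = {
    "Chicago": {"Chicago": 9500, "Chicago_(band)": 300},
    "شيكاغو": {"Chicago": 120},  # Arabic mention linked to the English page
}

def ground(mention):
    """Rank candidate English titles for a mention by the
    commonness prior P(title | mention)."""
    cands = ANCHOR_COUNTS.get(mention, {})
    total = sum(cands.values())
    return sorted(((t, c / total) for t, c in cands.items()),
                  key=lambda x: -x[1]) if total else []

print(ground("Chicago"))  # [('Chicago', 0.969...), ('Chicago_(band)', 0.030...)]
print(ground("شيكاغو"))   # [('Chicago', 1.0)]
```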