37 research outputs found
Summarization Approaches Based on Document Probability Distributions
PACLIC 23 / City University of Hong Kong / 3-5 December 2009
Topic Modeling and Text Analysis for Qualitative Policy Research
This paper contributes to a critical methodological discussion that has direct ramifications for policy studies: how computational methods can be concretely incorporated into existing processes of textual analysis and interpretation without compromising scientific integrity. We focus on the computational method of topic modeling and investigate how it interacts with two larger families of qualitative methods: content and classification methods, characterized by an interest in words as communication units, and discourse and representation methods, characterized by an interest in the meaning of communicative acts. Based on an analysis of recent academic publications that have used topic modeling for textual analysis, our findings show that different mixed-method research designs are appropriate when combining topic modeling with the two groups of methods. Our main concluding argument is that topic modeling enables scholars to apply policy theories and concepts to much larger sets of data. That said, the use of computational methods requires genuine understanding of these techniques to obtain substantially meaningful results. We encourage policy scholars to reflect carefully on methodological issues, and offer a simple heuristic to help identify and address critical points when designing a study using topic modeling.
Peer reviewed
Discriminative Interlingual Representations
The language barrier in many multilingual natural language processing (NLP) tasks can be overcome by mapping objects from different languages ("views") into a common low-dimensional subspace. For example, the name transliteration task involves mapping bilingual names, and word translation mining involves mapping bilingual words, into a common low-dimensional subspace. Multi-view models learn such a low-dimensional subspace from a training corpus of paired objects, e.g., names written in different languages, represented as feature vectors. The central idea of my dissertation is to learn low-dimensional subspaces (or interlingual representations) that are effective for various multilingual and monolingual NLP tasks. First, I demonstrate the effectiveness of interlingual representations in mining bilingual word translations, and then proceed to develop models for diverse situations that often arise in NLP tasks. In particular, I design models for the following problem settings: 1) when there are more than two views but we only have training data from a single pivot view into each of the remaining views; 2) when an object from one view is associated with a ranked list of objects from another view; and finally 3) when the underlying objects have rich structure, such as a tree. These problem settings arise often in real-world applications. I choose a canonical task for each of these settings.
Cross-Lingual Information Retrieval System for Indian Languages
This paper describes our first participation in the Indian language sub-task of the main Ad Hoc monolingual and bilingual track in the CLEF competition. In this track, the task is to retrieve relevant documents from an English corpus in response to a query expressed in one of several Indian languages, including Hindi, Tamil, Telugu, Bengali and Marathi. Groups participating in this track are required to submit an English-to-English monolingual run and a Hindi-to-English bilingual run, with optional runs in the rest of the languages. We submitted a monolingual English run and a Hindi-to-English cross-lingual run. We used a word alignment table, learnt by a Statistical Machine Translation (SMT) system trained on aligned parallel sentences, to map a query in the source language into an equivalent query in the language of the target document collection. The relevant documents are then retrieved using a language-modeling-based retrieval algorithm. On the CLEF 2007 data set, our official cross-lingual performance was 54.4% of the monolingual performance, and in post-submission experiments we found that it could be significantly improved, up to 73.4%.
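The query-mapping step can be sketched as dictionary-based query translation: each source-language term is replaced by its most probable translations from a word alignment table. The table entries, romanized Hindi terms, and probabilities below are invented placeholders, not the paper's actual SMT alignments.

```python
# Hypothetical sketch: translate a query using a word alignment table
# (source word -> weighted target translations). All entries are invented.
alignment_table = {
    "soochana": [("information", 0.7), ("news", 0.3)],
    "punarprapti": [("retrieval", 0.9), ("recovery", 0.1)],
}

def translate_query(query_terms, table, top_k=1):
    """Replace each source term with its top-k most probable translations;
    terms absent from the table are dropped."""
    translated = []
    for term in query_terms:
        candidates = sorted(table.get(term, []), key=lambda x: -x[1])
        translated.extend(word for word, _ in candidates[:top_k])
    return translated

print(translate_query(["soochana", "punarprapti"], alignment_table))
# -> ['information', 'retrieval']
```

The translated term list would then be fed to a monolingual retrieval model over the English collection; raising `top_k` trades precision for recall by admitting alternative translations.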
Regularized Interlingual Projections: Evaluation on Multilingual Transliteration
In this paper, we address the problem of building a multilingual transliteration system using an interlingual representation. Our approach uses the International Phonetic Alphabet (IPA) to learn the interlingual representation, and thus allows us to use any word and its IPA representation as a training example. Our approach therefore requires only monolingual resources: a phoneme dictionary that lists words and their IPA representations. By adding a phoneme dictionary of a new language, we can readily build a transliteration system between that language and any of the existing languages, without the expense of all-pairs data or computation. We also propose a regularization framework for learning the interlingual representation which accounts for language-specific phonemic variability, and thus finds better mappings between languages. Experimental results on the name transliteration task in five diverse languages show a maximum improvement of 29% accuracy and an average improvement of 17% accuracy over a state-of-the-art baseline system.
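The core idea of mapping paired feature vectors into a shared subspace can be illustrated with a simple stand-in: orthogonal Procrustes, which finds the rotation W minimizing ||XW - Y||_F between paired source and target vectors. This is a generic sketch of interlingual mapping on synthetic data, not the paper's regularized IPA-based method.

```python
# Hypothetical sketch: recover a linear mapping between two "views" from
# paired training vectors via orthogonal Procrustes. Data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                    # source-view feature vectors
W_true = np.linalg.qr(rng.normal(size=(5, 5)))[0]  # hidden orthogonal mapping
Y = X @ W_true                                  # paired target-view vectors

# Closed-form Procrustes solution: W = U V^T from the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# Mapping source vectors into the target view recovers the pairs exactly
# in this noise-free setup.
mapped = X @ W
print(np.allclose(mapped, Y, atol=1e-6))
```

In a realistic transliteration setting the pairs would be noisy phonetic feature vectors, and a regularizer, as the paper proposes, would control for language-specific variability rather than assuming an exact orthogonal map.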