Query Expansion with Locally-Trained Word Embeddings
Continuous space word embeddings have received a great deal of attention in
the natural language processing and machine learning communities for their
ability to model term similarity and other relationships. We study the use of
term relatedness in the context of query expansion for ad hoc information
retrieval. We demonstrate that word embeddings such as word2vec and GloVe, when
trained globally, underperform corpus and query specific embeddings for
retrieval tasks. These results suggest that other tasks benefiting from global
embeddings may also benefit from local embeddings.
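The core idea — expanding a query with terms whose embeddings are nearest to the query terms — can be sketched as below. The vectors here are tiny hypothetical 2-d examples for illustration; in the approach the abstract describes, they would instead be word2vec-style embeddings trained locally on the top-retrieved documents for the query.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def expand_query(query_terms, embeddings, k=2):
    """Expand a query with the k vocabulary terms most similar to any query term."""
    scores = {}
    for term, vec in embeddings.items():
        if term in query_terms:
            continue
        scores[term] = max(cosine(vec, embeddings[q])
                           for q in query_terms if q in embeddings)
    extra = sorted(scores, key=scores.get, reverse=True)[:k]
    return list(query_terms) + extra

# Toy "locally trained" vectors (hypothetical values, for illustration only).
local_emb = {
    "cut":    [0.9, 0.1],
    "wound":  [0.85, 0.2],
    "tax":    [0.1, 0.9],
    "budget": [0.15, 0.85],
}
print(expand_query(["cut"], local_emb, k=1))  # → ['cut', 'wound']
```

With embeddings trained on a medical result set, "cut" expands toward "wound"; a financial local corpus would pull the same query toward "budget"/"tax" — which is the corpus-specificity the abstract argues for.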
Semi-Supervised Learning For Identifying Opinions In Web Content
Thesis (Ph.D.) - Indiana University, Information Science, 2011
Opinions published on the World Wide Web (Web) offer opportunities for detecting personal attitudes regarding topics, products, and services. The opinion detection literature indicates that both a large body of opinions and a wide variety of opinion features are essential for capturing subtle opinion information. Although a large amount of opinion-labeled data is preferable for opinion detection systems, opinion-labeled data is often limited, especially at sub-document levels, and manual annotation is tedious, expensive, and error-prone. This shortage of opinion-labeled data is less challenging in some domains (e.g., movie reviews) than in others (e.g., blog posts). While a simple method for improving accuracy in challenging domains is to borrow opinion-labeled data from a non-target domain, this approach often fails because of the domain transfer problem: opinion detection strategies designed for one data domain generally do not perform well in another. However, while it is difficult to obtain opinion-labeled data, unlabeled user-generated opinion data are readily available. Semi-supervised learning (SSL) requires only limited labeled data to automatically label unlabeled data and has achieved promising results in various natural language processing (NLP) tasks, including traditional topic classification; yet SSL has been applied in only a few opinion detection studies. This study investigates the application of four different SSL algorithms to three types of Web content: edited news articles, semi-structured movie reviews, and the informal, unstructured content of the blogosphere. The SSL algorithms are also evaluated for their effectiveness in sparse data situations and in domain adaptation. Research findings suggest that, when labeled data is limited, SSL is a promising approach for opinion detection in Web content.
Although the contributions of SSL varied across data domains, significant improvement was demonstrated for the most challenging data domain, the blogosphere, when a domain transfer-based SSL strategy was implemented.
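The SSL workflow the abstract describes — train on the few labeled examples, pseudo-label confident unlabeled examples, and retrain — can be sketched with self-training, one of the simplest SSL algorithms. The nearest-centroid classifier and the toy sentences are illustrative assumptions, not the thesis's actual models or data:

```python
from collections import Counter

def featurize(text):
    """Bag-of-words features."""
    return Counter(text.lower().split())

def score(feats, centroid):
    """Overlap score between a document and a class centroid."""
    return sum(feats[w] * c for w, c in centroid.items())

def train_centroids(examples):
    """Sum feature counts per class label."""
    cents = {}
    for feats, label in examples:
        cents.setdefault(label, Counter()).update(feats)
    return cents

def predict(feats, cents):
    return max(cents, key=lambda lab: score(feats, cents[lab]))

def self_train(labeled, unlabeled, rounds=2):
    """Self-training: pseudo-label confident unlabeled docs, then retrain."""
    labeled = [(featurize(t), y) for t, y in labeled]
    pool = [featurize(t) for t in unlabeled]
    for _ in range(rounds):
        cents = train_centroids(labeled)
        remaining = []
        for feats in pool:
            ranked = sorted(cents, key=lambda lab: score(feats, cents[lab]))
            margin = (score(feats, cents[ranked[-1]])
                      - score(feats, cents[ranked[-2]]))
            if margin > 0:                      # confident: pseudo-label it
                labeled.append((feats, ranked[-1]))
            else:                               # unconfident: keep for later
                remaining.append(feats)
        pool = remaining
    return train_centroids(labeled)

labeled = [("great wonderful film", "pos"), ("terrible boring plot", "neg")]
unlabeled = ["wonderful acting", "boring dialogue"]
cents = self_train(labeled, unlabeled)
print(predict(featurize("wonderful plot but boring"), cents))  # → neg
```

The pseudo-labeled documents enlarge the effective training set — the same mechanism that lets SSL help in label-scarce domains like the blogosphere.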
Topic Uncovering and Image Annotation via Scalable Probit Normal Correlated Topic Models
Topic modelling, the uncovering of latent topics, has been an active research area for more than a decade and continues to receive contributions from disciplines including computer science, information science, and statistics. Since the introduction of Latent Dirichlet Allocation in 2003, many intriguing extension models have been proposed. One such extension is the logistic normal correlated topic model, which not only uncovers the hidden topics of a document but also extracts meaningful topical relationships among a large number of topics. In this model, the logistic normal distribution is adapted, via a transformation of multivariate Gaussian variables, to model the topical distribution of documents in the presence of correlations among topics. In this thesis, we propose a Probit normal alternative approach to modelling correlated topical structures. Our use of the Probit model in the context of topic discovery is novel, as many authors have so far concentrated solely on the logistic model, partly due to the formidable inefficiency of the multinomial Probit model even for very small topical spaces. We circumvent this inefficiency by adapting the Diagonal Orthant Multinomial Probit (DO-Probit) to the topic model context, enabling our topic modelling scheme to handle corpora with a large number of latent topics. In addition, we extend our model to image annotation by developing an efficient collapsed Gibbs sampling scheme. Furthermore, we employ high-performance computing techniques such as memory-aware MapReduce, a SparseLDA implementation, vectorization, and block sampling, along with other numerical efficiency strategies, to allow fast and efficient sampling.
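Collapsed Gibbs sampling, the inference workhorse mentioned above, is easiest to see on the vanilla LDA model that the thesis's DO-Probit scheme builds on. The sketch below is a minimal toy-scale sampler (not the thesis's correlated-topic implementation): each token's topic is resampled from the collapsed conditional, proportional to (doc-topic count + alpha) x (topic-word count + beta) / (topic total + V*beta).

```python
import random

def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for vanilla LDA (toy scale)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    widx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    ndk = [[0] * n_topics for _ in docs]        # doc-topic counts
    nkw = [[0] * V for _ in range(n_topics)]    # topic-word counts
    nk = [0] * n_topics                         # topic totals
    z = []                                      # topic assignment per token
    for d, doc in enumerate(docs):              # random initialization
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][widx[w]] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                     # remove token from counts
                ndk[d][k] -= 1; nkw[k][widx[w]] -= 1; nk[k] -= 1
                # collapsed conditional p(z = t | all other assignments)
                weights = [(ndk[d][t] + alpha) * (nkw[t][widx[w]] + beta)
                           / (nk[t] + V * beta) for t in range(n_topics)]
                r = rng.uniform(0, sum(weights))
                for t, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        k = t
                        break
                z[d][i] = k                     # add token back under new topic
                ndk[d][k] += 1; nkw[k][widx[w]] += 1; nk[k] += 1
    return ndk, nkw, vocab

docs = [["apple", "fruit", "apple"], ["fruit", "apple"],
        ["stock", "market", "stock"], ["market", "stock"]]
ndk, nkw, vocab = lda_gibbs(docs, n_topics=2)
print(ndk)
```

Because the topic proportions and word distributions are integrated out ("collapsed"), only these count arrays need updating per token — the property that makes sparse, blocked, and memory-aware variants like those listed above possible.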
From Word to Sense Embeddings: A Survey on Vector Representations of Meaning
Over the past years, distributed semantic representations have proved to be
effective and flexible keepers of prior knowledge to be integrated into
downstream applications. This survey focuses on the representation of meaning.
We start from the theoretical background behind word vector space models and
highlight one of their major limitations: the meaning conflation deficiency,
which arises from representing a word with all its possible meanings as a
single vector. Then, we explain how this deficiency can be addressed through a
transition from the word level to the more fine-grained level of word senses
(in its broader acceptation) as a method for modelling unambiguous lexical
meaning. We present a comprehensive overview of the wide range of techniques in
the two main branches of sense representation, i.e., unsupervised and
knowledge-based. Finally, this survey covers the main evaluation procedures and
applications for this type of representation, and provides an analysis of four
of its important aspects: interpretability, sense granularity, adaptability to
different domains, and compositionality.
Comment: 46 pages, 8 figures. Published in the Journal of Artificial Intelligence Research
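The meaning conflation deficiency, and the unsupervised branch of its remedy, can be illustrated with a toy sketch: a single word vector averages over all of a word's occurrence contexts, whereas clustering those context vectors (here with a hand-rolled 2-means) yields one vector per induced sense. The 2-d context vectors below are hypothetical values chosen for illustration:

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def two_means(vectors, n_iter=10):
    """Cluster context vectors into two groups, one per induced word sense.
    Naive seeding (first/last vector); fine for this toy, not in general."""
    c0, c1 = vectors[0], vectors[-1]
    for _ in range(n_iter):
        g0 = [v for v in vectors if dist(v, c0) <= dist(v, c1)]
        g1 = [v for v in vectors if dist(v, c0) > dist(v, c1)]
        c0, c1 = centroid(g0), centroid(g1)
    return c0, c1

# Hypothetical context vectors for four occurrences of an ambiguous word:
# the first two from one sense region, the last two from another.
contexts = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
conflated = centroid(contexts)          # one vector blending both meanings
sense_a, sense_b = two_means(contexts)  # one vector per sense
print(conflated, sense_a, sense_b)
```

The conflated vector lands midway between the two sense regions — close to neither — which is exactly the deficiency that sense-level representations avoid.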
Dynamic topic adaptation for improved contextual modelling in statistical machine translation
In recent years there has been an increased interest in domain adaptation techniques
for statistical machine translation (SMT) to deal with the growing amount of data from
different sources. Topic modelling techniques applied to SMT are closely related to the
field of domain adaptation but more flexible in dealing with unstructured text. Topic
models can capture latent structure in texts and are therefore particularly suitable for
modelling structure in between and beyond corpus boundaries, which are often arbitrary.
In this thesis, the main focus is on dynamic translation model adaptation to texts of
unknown origin, which is a typical scenario for an online MT engine translating web
documents. We introduce a new bilingual topic model for SMT that takes the entire
document context into account and for the first time directly estimates topic-dependent
phrase translation probabilities in a Bayesian fashion. We demonstrate our model’s
ability to improve over several domain adaptation baselines and further provide evidence
for the advantages of bilingual topic modelling for SMT over the more common
monolingual topic modelling. We also show improved performance when deriving further
adapted translation features from the same model which measure different aspects
of topical relatedness.
We introduce another new topic model for SMT which exploits the distributional
nature of phrase pair meaning by modelling topic distributions over phrase pairs using
their distributional profiles. Using this model, we explore combinations of local and
global contextual information and demonstrate the usefulness of different levels of contextual
information, which had not been previously examined for SMT. We also show
that combining this model with a topic model trained at the document-level further improves
performance. Our dynamic topic adaptation approach performs competitively
in comparison with two supervised domain-adapted systems.
Finally, we shed light on the relationship between domain adaptation and topic
adaptation and propose to combine multi-domain adaptation and topic adaptation in a
framework that entails automatic prediction of domain labels at the document level.
We show that while each technique provides complementary benefits to the overall
performance, there is an amount of overlap between domain and topic adaptation. This
can be exploited to build systems that require less adaptation effort at runtime.
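The central mechanism of topic-dependent phrase translation can be sketched as a mixture: a document's inferred topic distribution weights per-topic phrase tables to give an adapted probability. All names and numbers below are hypothetical illustrations, and the mixing shown here is a simplification of the Bayesian estimation the thesis actually performs:

```python
# Hypothetical topic-conditioned phrase tables: p(target | source, topic).
phrase_table = {
    "bank": {
        "finance": {"banque": 0.9, "rive": 0.1},
        "nature":  {"banque": 0.2, "rive": 0.8},
    },
}

def adapted_prob(source, target, topic_dist, table):
    """Mix per-topic translation probabilities by the document's topic weights:
    p(e | f, doc) = sum_k p(k | doc) * p(e | f, k)."""
    return sum(w * table[source][k].get(target, 0.0)
               for k, w in topic_dist.items())

# Topic distribution inferred for one document of unknown origin.
doc_topics = {"finance": 0.75, "nature": 0.25}
p = adapted_prob("bank", "banque", doc_topics, phrase_table)
print(round(p, 3))  # → 0.725
```

A finance-heavy document pulls "bank" toward "banque"; a nature document would favor "rive" — the document-level context sensitivity that a single static phrase table cannot express.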