Finding Relevant Answers in Software Forums
Abstract: Online software forums provide a huge amount of valuable content. Developers and users often ask questions and receive answers from such forums. The availability of a vast amount of thread discussions in forums provides ample opportunities for knowledge acquisition and summarization. For a given search query, current search engines use a traditional information retrieval approach to extract webpages containin
Modeling Dependencies in Natural Languages with Latent Variables
In this thesis, we investigate the use of latent variables to model complex dependencies in natural languages. Traditional models, which have a fixed parameterization, often make strong independence assumptions that lead to poor performance. This problem is often addressed by incorporating additional dependencies into the model (e.g., using higher order N-grams for language modeling). These added dependencies can increase data sparsity and/or require expert knowledge, together with trial and error, in order to identify and incorporate the most important dependencies (as in lexicalized parsing models). Traditional models, when developed for a particular genre, domain, or language, are also often difficult to adapt to another.
In contrast, previous work has shown that latent variable models, which automatically learn dependencies in a data-driven way, are able to flexibly adjust the number of parameters based on the type and the amount of training data available. We have created several different types of latent variable models for a diverse set of natural language processing applications, including novel models for part-of-speech tagging, language modeling, and machine translation, and an improved model for parsing. These models perform significantly better than traditional models. We have also created and evaluated three different methods for improving the performance of latent variable models. While these methods can be applied to any of our applications, we focus our experiments on parsing.
The first method involves self-training, i.e., we train models using a combination of gold standard training data and a large amount of automatically labeled training data. We conclude from a series of experiments that the latent variable models benefit much more from self-training than conventional models, apparently due to their flexibility to adjust their model parameterization to learn more accurate models from the additional automatically labeled training data.
The second method takes advantage of the variability among latent variable models to combine multiple models for enhanced performance. We investigate several different training protocols to combine self-training with model combination. We conclude that these two techniques are complementary to each other and can be effectively combined to train very high quality parsing models.
The third method replaces the generative multinomial lexical model of latent variable grammars with a feature-rich log-linear lexical model to provide a principled solution to address data sparsity, handle out-of-vocabulary words, and exploit overlapping features during model induction. We conclude from experiments that the resulting grammars are able to effectively parse three different languages.
This work contributes to natural language processing by creating flexible and effective latent variable models for several different languages. Our investigation of self-training, model combination, and log-linear models also provides insights into the effective application of these machine learning techniques to other disciplines
Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data
In sequence modeling, we often wish to represent complex interaction between labels, such as when performing multiple, cascaded labeling tasks on the same sequence, or when long-range dependencies exist. We present dynamic conditional random fields (DCRFs), a generalization of linear-chain conditional random fields (CRFs) in which each time slice contains a set of state variables and edges, a distributed state representation as in dynamic Bayesian networks (DBNs), and parameters are tied across slices. Since exact inference can be intractable in such models, we perform approximate inference using several schedules for belief propagation, including tree-based reparameterization (TRP). On a natural-language chunking task, we show that a DCRF performs better than a series of linear-chain CRFs, achieving comparable performance using only half the training data
mARC: Memory by Association and Reinforcement of Contexts
This paper introduces Memory by Association and Reinforcement of Contexts
(mARC). mARC is a novel data modeling technology rooted in the second
quantization formulation of quantum mechanics. It is an all-purpose incremental
and unsupervised data storage and retrieval system which can be applied to all
types of signal or data, structured or unstructured, textual or not. mARC can
be applied to a wide range of information classification and retrieval
problems like e-Discovery or contextual navigation. It can also be formulated in
the artificial life framework, a.k.a. Conway's "Game of Life" theory. In contrast
to Conway's approach, the objects evolve in a massively multidimensional space.
In order to start evaluating the potential of mARC we have built a mARC-based
Internet search engine demonstrator with contextual functionality. We compare
the behavior of the mARC demonstrator with Google search both in terms of
performance and relevance. In the study we find that the mARC search engine
demonstrator outperforms Google search by an order of magnitude in response
time while providing more relevant results for some classes of queries
Knowledge Discovery from Financial Text
The abundance of on-line electronic financial news articles has opened up new possibilities for intelligent systems that could extract and organize relevant knowledge automatically in a usable format. While most typical information extraction systems require a hand-built dictionary of templates and, subsequently, are subject to ceaseless modification to accommodate new patterns that are observed in the text, in this research, we propose a novel text-based decision support system (DSS) that will (i) extract event sequences from shallow text patterns and (ii) predict the likelihood of the occurrence of events using a classifier-based inference engine. We investigated more than 2,000 financial reports with 28,000 sentences. Experiments show the DSS outperforms other similar statistical models
Structure Regularization for Structured Prediction: Theories and Experiments
While there are many studies on weight regularization, studies on structure
regularization are rare. Many existing systems for structured prediction focus on
increasing the level of structural dependencies within the model. However, this
trend could have been misdirected, because our study suggests that complex
structures are actually harmful to generalization ability in structured
prediction. To control structure-based overfitting, we propose a structure
regularization framework via structure decomposition, which decomposes
training samples into mini-samples with simpler structures, deriving a model
with better generalization power. We show both theoretically and empirically
that structure regularization can effectively control overfitting risk and lead
to better accuracy. As a by-product, the proposed method can also substantially
accelerate the training speed. The method and the theoretical results can apply
to general graphical models with arbitrary structures. Experiments on
well-known tasks demonstrate that our method can easily beat the benchmark
systems on those highly-competitive tasks, achieving state-of-the-art
accuracies yet with substantially faster training speed
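
As a rough illustration of the decomposition step described in the abstract above, the sketch below splits one tagged training sequence into contiguous mini-samples. The abstract does not say how the split points are chosen, so the random cut points, the function name `decompose`, and the parameter name `alpha` (the maximum number of mini-samples per training sample) are assumptions made for this example and should not be read as the authors' method.

```python
import random
from typing import List, Sequence, Tuple

Token = Tuple[str, str]  # (word, tag) pair; a hypothetical sample format


def decompose(sample: Sequence[Token], alpha: int,
              rng: random.Random) -> List[List[Token]]:
    """Split one tagged sequence into at most `alpha` contiguous mini-samples.

    Cut points are drawn at random (an assumption for this sketch). Every
    token lands in exactly one mini-sample and the original order is kept.
    """
    if alpha < 1:
        raise ValueError("alpha must be at least 1")
    n = len(sample)
    if n == 0:
        return []
    pieces = min(alpha, n)  # cannot make more non-empty pieces than tokens
    # Choose pieces - 1 distinct interior cut positions, in increasing order.
    cuts = sorted(rng.sample(range(1, n), pieces - 1))
    bounds = [0] + cuts + [n]
    return [list(sample[a:b]) for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    sentence = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"),
                ("down", "RP"), (".", ".")]
    for mini in decompose(sentence, alpha=2, rng=random.Random(0)):
        print(mini)
```

Each mini-sample would then be treated as an independent training instance, so dependencies that cross a cut point are not modelled during training. With `alpha=1` the sample is returned whole, which corresponds to ordinary training without decomposition.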