Bayesian Grammar Induction for Language Modeling
We describe a corpus-based induction algorithm for probabilistic context-free
grammars. The algorithm employs a greedy heuristic search within a Bayesian
framework, and a post-pass using the Inside-Outside algorithm. We compare the
performance of our algorithm to n-gram models and the Inside-Outside algorithm
in three language modeling tasks. In two of the tasks, the training data is
generated by a probabilistic context-free grammar and in both tasks our
algorithm outperforms the other techniques. The third task involves
naturally-occurring data, and in this task our algorithm does not perform as
well as n-gram models but vastly outperforms the Inside-Outside algorithm.
Comment: 8 pages, LaTeX, uses aclap.st
TTS - A Treebank Tool Suite
Treebanks are important resources in descriptive, theoretical and computational linguistic research, development and teaching. This paper presents a treebank tool suite (TTS) for and derived from the Penn-II treebank resource (Marcus et al., 1993). The tools include treebank inspection and viewing options which support search for CF-PSG rule tokens extracted from the treebank, graphical display of complete trees containing the rule instance, display of subtrees rooted by the rule instance, and display of the yield of the subtree (with or without context). The search can be further restricted by constraining the yield to contain particular strings. Rules can be ordered by frequency, and the user can set frequency thresholds. To process new text, the tool suite provides a PCFG chart parser (based on the CYK algorithm) operating on CFGs extracted from the treebank following the method of Charniak (1996), as well as an HMM bi-/trigram tagger trained on the tagged version of the treebank resource. The system is implemented in Java and Perl. We employ the InterArbora module based on the Thistle display engine (LTG, 2001) as our tree grapher.
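The rule extraction and frequency ordering described in this abstract can be illustrated with a minimal Python sketch. It is not the TTS implementation (which the abstract says is written in Java and Perl); the function names `parse_tree`, `rules` and `rule_table`, the `min_count` parameter, and the toy trees are all assumptions made for illustration. It reads Penn-style bracketed trees, counts the non-lexical rules they contain, and lists them by frequency above a threshold.

```python
from collections import Counter


def parse_tree(text):
    """Parse one Penn-style bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())        # nested constituent
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return node()


def rules(tree):
    """Yield (lhs, rhs) for every rule whose right-hand side is made of categories."""
    label, children = tree
    subtrees = [c for c in children if isinstance(c, tuple)]
    if subtrees:
        yield (label, tuple(c[0] for c in subtrees))
        for c in subtrees:
            yield from rules(c)


def rule_table(trees, min_count=1):
    """Return (rule, count) pairs, most frequent first, keeping counts >= min_count."""
    counts = Counter(r for t in trees for r in rules(parse_tree(t)))
    return [(r, n) for r, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    toy_trees = [
        "(S (NP (DT the) (NN dog)) (VP (VBD barked)))",
        "(S (NP (DT a) (NN cat)) (VP (VBD slept)))",
        "(S (NP (NNP Kim)) (VP (VBD left)))",
    ]
    for (lhs, rhs), n in rule_table(toy_trees, min_count=2):
        print(n, lhs, "->", " ".join(rhs))
```

Dividing each count by the total count of rules sharing the same left-hand side would turn this table into relative-frequency PCFG rule probabilities, which is the usual way a treebank grammar is estimated.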
A comparative evaluation of deep and shallow approaches to the automatic detection of common grammatical errors
This paper compares a deep and a shallow processing approach to the problem of classifying a sentence as grammatically well-formed or ill-formed. The deep processing approach uses the XLE LFG parser and English grammar: two versions are presented, one which uses the XLE directly to perform the classification, and another which uses a decision tree trained on features consisting of the XLE's output statistics. The shallow processing approach predicts grammaticality based on n-gram frequency statistics: we present two versions, one which uses frequency thresholds and one which uses a decision tree trained on the frequencies of the rarest n-grams in the input sentence. We find that the use of a decision tree improves on the basic approach only for the deep parser-based approach. We also show that combining the shallow and deep decision tree features is effective. Our evaluation is carried out using a large test set of grammatical and ungrammatical sentences. The ungrammatical test set is generated automatically by inserting grammatical errors into well-formed BNC sentences.
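As a rough illustration of the frequency-threshold variant of the shallow approach, the Python sketch below flags a sentence as ill-formed when its rarest n-gram falls below a count threshold in a reference corpus. The abstract does not give the n-gram order, the threshold, or the tokenisation, so `n=2`, `threshold=1`, whitespace tokenisation, the padding symbols and all function names here are assumptions, and the three-sentence corpus is a toy stand-in for real reference data.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list, padded with sentence boundary symbols."""
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


def train_counts(sentences, n=2):
    """Count n-grams over a reference corpus given as whitespace-tokenised strings."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts


def looks_grammatical(sentence, counts, n=2, threshold=1):
    """Accept the sentence only if its rarest n-gram occurs at least `threshold` times."""
    grams = ngrams(sentence.lower().split(), n)
    rarest = min(counts[g] for g in grams)  # unseen n-grams count as 0
    return rarest >= threshold


if __name__ == "__main__":
    reference = ["the dog barked", "the cat slept", "a dog slept"]
    counts = train_counts(reference)
    print(looks_grammatical("the dog slept", counts))
    print(looks_grammatical("dog the slept", counts))
```

The decision-tree variant mentioned in the abstract would instead pass the frequencies of the rarest n-grams to a learned classifier rather than comparing them against a fixed cut-off.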