73 research outputs found
Automatic Extraction of Subcategorization from Corpora
We describe a novel technique and implemented system for constructing a
subcategorization dictionary from textual corpora. Each dictionary entry
encodes the relative frequency of occurrence of a comprehensive set of
subcategorization classes for English. An initial experiment, on a sample of 14
verbs which exhibit multiple complementation patterns, demonstrates that the
technique achieves accuracy comparable to previous approaches, which are all
limited to a highly restricted set of subcategorization classes. We also
demonstrate that a subcategorization dictionary built with the system improves
the accuracy of a parser by an appreciable amount.
Comment: 8 pages; requires aclap.sty. To appear in ANLP-9
Can Subcategorisation Probabilities Help a Statistical Parser?
Research into the automatic acquisition of lexical information from corpora
is starting to produce large-scale computational lexicons containing data on
the relative frequencies of subcategorisation alternatives for individual
verbal predicates. However, the empirical question of whether this type of
frequency information can in practice improve the accuracy of a statistical
parser has not yet been answered. In this paper we describe an experiment with
a wide-coverage statistical grammar and parser for English and
subcategorisation frequencies acquired from ten million words of text which
shows that this information can significantly improve parse accuracy.
Comment: 9 pages, uses colacl.st
Re-estimation of Lexical Parameters for Treebank PCFGs
We present procedures which pool lexical information estimated from unlabeled data via the Inside-Outside algorithm with lexical information from a treebank PCFG. The procedures produce substantial improvements (up to 31.6% error reduction) on the task of determining subcategorization frames of novel verbs, relative to a smoothed Penn Treebank-trained PCFG. Even with relatively small quantities of unlabeled training data, the re-estimated models show promising improvements in labeled bracketing f-scores on Wall Street Journal parsing, and substantial benefit in acquiring the subcategorization preferences of low-frequency verbs.
Disambiguating Nouns, Verbs, and Adjectives Using Automatically Acquired Selectional Preferences
Selectional preferences have been used by word sense disambiguation (WSD) systems as one source of disambiguating information. We evaluate WSD using selectional preferences acquired for English adjective-noun, subject, and direct object grammatical relationships with respect to a standard test corpus. The selectional preferences are specific to verb or adjective classes, rather than individual word forms, so they can be used to disambiguate the co-occurring adjectives and verbs, rather than just the nominal argument heads. We also investigate use of the one-sense-per-discourse heuristic to propagate a sense tag for a word to other occurrences of the same word within the current document in order to increase coverage. Although the preferences perform well in comparison with other unsupervised WSD systems on the same corpus, the results show that for many applications, further knowledge sources would be required to achieve an adequate level of accuracy and coverage. In addition to quantifying performance, we analyze the results to investigate the situations in which the selectional preferences achieve the best precision and in which the one-sense-per-discourse heuristic increases performance.
Enforcing Subcategorization Constraints in a Parser Using Sub-parses Recombining
Treebanks are not large enough to adequately model the subcategorization frames of predicative lexemes, which are an important source of lexico-syntactic constraints for parsing. As a consequence, parsers trained on such treebanks usually make mistakes when selecting the arguments of predicative lexemes. In this paper, we propose an original way to correct subcategorization errors by combining sub-parses of a sentence S that appear in the list of the n-best parses of S. The subcategorization information comes from three different resources: the first is extracted from a treebank, the second is computed on a large corpus, and the third is an existing syntactic lexicon. Experiments on the French Treebank showed a 15.24% reduction in erroneous subcategorization frame (SF) selections for verbs, as well as a 4% relative decrease in the error rate, measured by Labeled Accuracy Score, for the state-of-the-art parser on this treebank.