A Reranking Approach for Dependency Parsing with Variable-sized Subtree Features
Employing higher-order subtree structures in graph-based dependency parsing has shown substantial improvements in accuracy, but suffers from inefficiency that grows with the order of the subtrees. We present a new reranking approach for dependency parsing that can utilize complex subtree representations by applying efficient subtree selection heuristics. We demonstrate the effectiveness of the approach in experiments conducted on the Penn Treebank and the Chinese Treebank. Our system improves the baseline accuracy from 91.88% to 93.37% for English, and from 87.39% to 89.16% for Chinese.
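The reranking idea summarized above can be sketched generically: combine the base parser's score with a score over subtree features and pick the highest-scoring of the k-best candidates. This is a minimal illustration only; the feature extractor, weights, and interpolation factor are hypothetical stand-ins for the paper's subtree selection heuristics.

```python
# Minimal reranking sketch: choose the best of a parser's k-best analyses
# by combining the base score with scores from (hypothetical) subtree features.

def extract_subtree_features(parse):
    # Placeholder: a real system would enumerate variable-sized subtrees
    # chosen by efficiency heuristics; here each (head, dep) edge stands in.
    return [("edge", h, d) for h, d in parse["edges"]]

def rerank(kbest, feature_weights, alpha=0.5):
    """Return the candidate maximizing interpolated base + feature score."""
    def score(parse):
        feat_score = sum(feature_weights.get(f, 0.0)
                         for f in extract_subtree_features(parse))
        return alpha * parse["base_score"] + (1 - alpha) * feat_score
    return max(kbest, key=score)

kbest = [
    {"edges": [(0, 1), (1, 2)], "base_score": 1.0},
    {"edges": [(0, 2), (2, 1)], "base_score": 0.9},
]
weights = {("edge", 0, 2): 2.0, ("edge", 2, 1): 2.0}
best = rerank(kbest, weights)
print(best["edges"])  # the second candidate wins on feature score
```

The baseline's top analysis is overturned whenever the feature score outweighs the small base-score gap, which is the mechanism the abstract's accuracy gains rely on.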
Parser lexicalisation through self-learning
We describe a new self-learning framework for parser lexicalisation that requires only a plain-text corpus of in-domain text. The method first creates augmented versions of dependency graphs by applying a series of modifications designed to directly capture higher-order lexical path dependencies. Scores are assigned to each edge in the graph using statistics from an automatically parsed background corpus. As bilexical dependencies are sparse, a novel directed distributional word similarity measure is used to smooth edge score estimates. Edge scores are then combined into graph scores and used for reranking the top-n analyses found by the unlexicalised parser. The approach achieves significant improvements on WSJ and biomedical text over the unlexicalised baseline parser, which is originally trained on a subset of the Brown corpus.
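The pipeline this abstract describes — edge scores from corpus statistics, smoothed via word similarity, then combined into a graph score — can be sketched as follows. All names, the back-off scheme, and the interpolation weight are illustrative assumptions, not the paper's actual formulation.

```python
# Illustrative sketch of lexicalised graph scoring for reranking.
# Bilexical counts are sparse, so unseen head->dep pairs back off to
# counts for distributionally similar heads (similarity lists hypothetical).

def edge_score(head, dep, counts, similar, lam=0.7):
    """Smoothed score for a head -> dep bilexical dependency."""
    direct = counts.get((head, dep), 0)
    # Back off to words distributionally similar to the head.
    backoff = sum(counts.get((h2, dep), 0) * w
                  for h2, w in similar.get(head, []))
    return lam * direct + (1 - lam) * backoff

def graph_score(edges, counts, similar):
    """Sum edge scores into a single score for one dependency graph."""
    return sum(edge_score(h, d, counts, similar) for h, d in edges)

counts = {("eat", "pizza"): 5}
similar = {"devour": [("eat", 0.8)]}
# "devour -> pizza" is unseen, yet receives a nonzero smoothed score.
print(edge_score("devour", "pizza", counts, similar))
```

Graph scores computed this way can then rank the parser's top-n analyses, rewarding candidates whose edges are well attested (directly or via similar words) in the background corpus.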
From news to comment: Resources and benchmarks for parsing the language of web 2.0
We investigate the problem of parsing the noisy language of social media. We evaluate four Wall-Street-Journal-trained statistical parsers (Berkeley, Brown, Malt and MST) on a new dataset containing 1,000 phrase structure trees for sentences from microblogs (tweets) and discussion forum posts. We compare the four parsers on their ability to produce Stanford dependencies for these Web 2.0 sentences. We find that the parsers have a particular problem with tweets and that a substantial part of this problem is related to POS tagging accuracy. We attempt three retraining experiments involving Malt, Brown and an in-house Berkeley-style parser and obtain a statistically significant improvement for all three parsers.
Coordinate noun phrase disambiguation in a generative parsing model
In this paper we present methods for improving the disambiguation of noun phrase (NP) coordination within the framework of a lexicalised history-based parsing model. As well as reducing noise in the data, we look at modelling two main sources of information for disambiguation: symmetry in conjunct structure, and the dependency between conjunct lexical heads. Our changes to the baseline model result in an increase in NP coordination dependency f-score from 69.9% to 73.8%, which represents a relative reduction in f-score error of 13%.
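The 13% figure follows directly from the two f-scores: the error falls from 100 − 69.9 = 30.1 points to 100 − 73.8 = 26.2 points. A quick check of the arithmetic:

```python
# Verify the relative f-score error reduction reported above.
baseline_f, improved_f = 69.9, 73.8
baseline_err = 100 - baseline_f   # 30.1 points of error
improved_err = 100 - improved_f   # 26.2 points of error
relative_reduction = (baseline_err - improved_err) / baseline_err
print(f"{relative_reduction:.0%}")  # 13%
```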
Optimizing Spectral Learning for Parsing
We describe a search algorithm for optimizing the number of latent states
when estimating latent-variable PCFGs with spectral methods. Our results show
that contrary to the common belief that the number of latent states for each
nonterminal in an L-PCFG can be decided in isolation with spectral methods,
parsing results significantly improve if the number of latent states for each
nonterminal is globally optimized, while taking into account interactions
between the different nonterminals. In addition, we contribute an empirical
analysis of spectral algorithms on eight morphologically rich languages:
Basque, French, German, Hebrew, Hungarian, Korean, Polish and Swedish. Our
results show that our estimation consistently performs better or close to
coarse-to-fine expectation-maximization techniques for these languages.Comment: 11 pages, ACL 201
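Why per-nonterminal tuning can fail is easy to see with a toy objective in which the best state count for one nonterminal depends on another's. The sketch below uses generic coordinate ascent as a stand-in search; the paper's actual algorithm, the `evaluate` objective, and the interaction term are all hypothetical here (a real objective would be spectral estimation plus development-set parsing accuracy).

```python
# Generic coordinate-ascent sketch for jointly choosing latent-state
# counts, one per nonterminal. `evaluate` stands in for estimating an
# L-PCFG spectrally and scoring it on a development set.

def coordinate_ascent(nonterminals, candidates, evaluate, sweeps=3):
    """Greedily adjust one nonterminal's state count at a time,
    keeping any change that does not hurt the joint objective."""
    config = {nt: candidates[0] for nt in nonterminals}
    best = evaluate(config)
    for _ in range(sweeps):
        for nt in nonterminals:
            for m in candidates:
                trial = dict(config, **{nt: m})
                score = evaluate(trial)
                if score >= best:  # accept ties to cross plateaus
                    config, best = trial, score
    return config, best

# Toy objective with an interaction term between NP and VP: the best
# count for NP depends on VP's count, so isolated tuning goes wrong.
def toy_eval(cfg):
    return -(cfg["NP"] - cfg["VP"]) ** 2 - (cfg["VP"] - 8) ** 2

cfg, score = coordinate_ascent(["NP", "VP"], [2, 4, 8, 16], toy_eval)
print(cfg)  # {'NP': 8, 'VP': 8}
```

Because the NP term is tied to VP's value, scoring NP's candidates with VP frozen at an arbitrary count would pick the wrong maximum; only the joint search reaches the optimum — the same interaction effect the abstract reports for real L-PCFGs.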
- …