Wide-coverage deep statistical parsing using automatic dependency structure annotation
A number of researchers (Lin 1995; Carroll, Briscoe, and Sanfilippo 1998; Carroll et al. 2002; Clark and Hockenmaier 2002; King et al. 2003; Preiss 2003; Kaplan et al. 2004; Miyao and Tsujii 2004) have convincingly argued for the use of dependency (rather than CFG-tree) representations for parser evaluation. Preiss (2003) and Kaplan et al. (2004) conducted a number of experiments comparing "deep" hand-crafted wide-coverage parsers with "shallow" treebank- and machine-learning-based parsers at the level of dependencies, using simple and automatic methods to convert the tree output generated by the shallow parsers into dependencies. In this article, we revisit the experiments in Preiss (2003) and Kaplan et al. (2004), this time using the sophisticated automatic LFG f-structure annotation methodologies of Cahill et al. (2002b, 2004) and Burke (2006), with surprising results. We compare various PCFG and history-based parsers (based on Collins, 1999; Charniak, 2000; Bikel, 2002) to find the baseline parsing system that fits best into our automatic dependency structure annotation technique. This combined system of syntactic parser and dependency structure annotation is compared to two hand-crafted, deep constraint-based parsers (Carroll and Briscoe 2002; Riezler et al. 2002). We evaluate using dependency-based gold standards (DCU 105, PARC 700, CBS 500, and dependencies for WSJ Section 22) and use the Approximate Randomization Test (Noreen 1989) to test the statistical significance of the results. Our experiments show that machine-learning-based shallow grammars augmented with sophisticated automatic dependency annotation technology outperform hand-crafted, deep, wide-coverage constraint grammars. Currently our best system achieves an f-score of 82.73% against the PARC 700 Dependency Bank (King et al. 2003), a statistically significant improvement of 2.18% over the most recent results of 80.55% for the hand-crafted LFG grammar and XLE parsing system of Riezler et al. (2002), and an f-score of 80.23% against the CBS 500 Dependency Bank (Carroll, Briscoe, and Sanfilippo 1998), a statistically significant 3.66% improvement over the 76.57% achieved by the hand-crafted RASP grammar and parsing system of Carroll and Briscoe (2002).
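The Approximate Randomization Test (Noreen 1989) mentioned above can be sketched as follows. This is an illustrative implementation, not the authors' evaluation code; the per-sentence score lists, trial count, and add-one smoothing of the p-value are assumptions.

```python
import random

def approximate_randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided approximate randomization test (after Noreen 1989).

    scores_a / scores_b: per-sentence scores (e.g. dependency f-scores)
    for two parsers on the same test set. Estimates the probability that
    a mean difference at least as large as the observed one arises by
    chance under random reassignment of the two systems' outputs.
    """
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        shuf_a, shuf_b = [], []
        # Randomly swap the two systems' scores on each sentence.
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                shuf_a.append(a); shuf_b.append(b)
            else:
                shuf_a.append(b); shuf_b.append(a)
        diff = abs(sum(shuf_a) / len(shuf_a) - sum(shuf_b) / len(shuf_b))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate conservative.
    return (at_least_as_extreme + 1) / (trials + 1)
```

Identical score lists give a p-value of 1.0, while a consistent gap across many sentences drives the p-value toward zero.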
An Empirical Study of Compound PCFGs
Compound probabilistic context-free grammars (C-PCFGs) have recently
established a new state of the art for phrase-structure grammar induction.
However, due to the high time-complexity of chart-based representation and
inference, it is difficult to investigate them comprehensively. In this work,
we rely on a fast implementation of C-PCFGs to conduct evaluation complementary
to that of~\citet{kim-etal-2019-compound}. We highlight three key findings: (1)
C-PCFGs are data-efficient, (2) C-PCFGs make the best use of global
sentence-level information in preterminal rule probabilities, and (3) the best
configurations of C-PCFGs on English do not always generalize to
morphology-rich languages.

Comment: Accepted to Adapt-NLP at EACL 2021. Our code is available at
https://github.com/zhaoyanpeng/cpcf
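The chart-based inference the abstract identifies as the bottleneck is the inside (sum-product CKY) algorithm, which is cubic in sentence length. A minimal sketch for a PCFG in Chomsky normal form follows; the dictionary-based grammar encoding is an assumption, not the paper's tensorized implementation.

```python
from collections import defaultdict

def inside(words, lexical, binary, start="S"):
    """Inside algorithm for a PCFG in CNF.

    lexical: {(A, word): prob} for rules A -> word;
    binary:  {(A, B, C): prob} for rules A -> B C.
    Returns the total probability of `words` under the grammar.
    Runtime is O(n^3 * |G|), the cost that makes chart-based
    grammar induction expensive to study at scale.
    """
    n = len(words)
    chart = defaultdict(float)  # chart[i, j, A] = inside prob of A over words[i:j]
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                chart[i, i + 1, A] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):        # split point
                for (A, B, C), p in binary.items():
                    chart[i, j, A] += p * chart[i, k, B] * chart[k, j, C]
    return chart[0, n, start]
```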
Paraphrase Generation from Latent-Variable PCFGs for Semantic Parsing
One of the limitations of semantic parsing approaches to open-domain question
answering is the lexicosyntactic gap between natural language questions and
knowledge base entries -- there are many ways to ask a question, all with the
same answer. In this paper we propose to bridge this gap by generating
paraphrases of the input question with the goal that at least one of them will
be correctly mapped to a knowledge-base query. We introduce a novel grammar
model for paraphrase generation that does not require any sentence-aligned
paraphrase corpus. Our key idea is to leverage the flexibility and scalability
of latent-variable probabilistic context-free grammars to sample paraphrases.
We do an extrinsic evaluation of our paraphrases by plugging them into a
semantic parser for Freebase. Our evaluation experiments on the WebQuestions
benchmark dataset show that the performance of the semantic parser
significantly improves over strong baselines.

Comment: 10 pages, INLG 201
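Sampling paraphrases from a PCFG amounts to top-down ancestral sampling: repeatedly expand each nonterminal by a rule drawn from its probability distribution. The sketch below is a generic sampler under an assumed dictionary grammar format, not the paper's latent-variable model; a latent-variable PCFG would simply use split symbols (e.g. "NP_3") as the nonterminals.

```python
import random

def sample_sentence(grammar, symbol="S", rng=None, max_depth=50):
    """Ancestral sampling of one sentence from a PCFG.

    grammar: {nonterminal: [(rhs_tuple, prob), ...]}; any symbol that
    is not a grammar key is treated as a terminal word.
    """
    rng = rng or random.Random(0)
    if symbol not in grammar:        # terminal: emit the word itself
        return [symbol]
    if max_depth == 0:               # crude guard against runaway recursion
        raise RecursionError("max expansion depth exceeded")
    rules, probs = zip(*grammar[symbol])
    rhs = rng.choices(rules, weights=probs, k=1)[0]
    words = []
    for sym in rhs:
        words.extend(sample_sentence(grammar, sym, rng, max_depth - 1))
    return words
```

Repeated calls with different random states yield distinct derivations, i.e. candidate paraphrases, which can then be scored by a downstream semantic parser.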
Unsupervised syntactic chunking with acoustic cues: Computational models for prosodic bootstrapping
Learning to group words into phrases without supervision is a hard task for NLP systems, but infants routinely accomplish it. We hypothesize that infants use acoustic cues to prosody, which NLP systems typically ignore. To evaluate the utility of prosodic information for phrase discovery, we present an HMM-based unsupervised chunker that learns from only transcribed words and raw acoustic correlates of prosody. Unlike previous work on unsupervised parsing and chunking, we use neither gold standard part-of-speech tags nor punctuation in the input. Evaluated on the Switchboard corpus, our model outperforms several baselines that exploit either lexical or prosodic information alone, and, despite producing a flat structure, performs competitively with a state-of-the-art unsupervised lexicalized parser, with a substantial advantage in precision. Our results support the hypothesis that acoustic-prosodic cues provide useful evidence about syntactic phrases for language-learning infants.
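Once an HMM chunker's parameters are learned, decoding the most likely chunk-tag sequence is done with the Viterbi algorithm. The sketch below is a generic log-space Viterbi decoder under assumed dictionary-based parameters, not the paper's model; the BIO-style tag set and feature-id observations are illustrative.

```python
def viterbi(obs, states, log_trans, log_emit, log_init):
    """Viterbi decoding for an HMM chunker.

    states: hidden chunk tags (e.g. BIO-style "B", "I", "O");
    obs: observation symbols (e.g. word/prosody feature ids).
    All parameters are log-probabilities; returns the best tag sequence.
    """
    # V[t][s]: best log-score of any path ending in state s at step t.
    V = [{s: log_init[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        scores, ptrs = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: V[-1][p] + log_trans[p][s])
            ptrs[s] = best_prev
            scores[s] = V[-1][best_prev] + log_trans[best_prev][s] + log_emit[s][o]
        V.append(scores)
        back.append(ptrs)
    # Backtrace from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return list(reversed(path))
```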
Discovering latent structures in syntax trees and mixed-type data
Gibbs sampling is a widely applied algorithm for estimating parameters in statistical models. This thesis uses Gibbs sampling to solve practical problems, especially in natural language processing and mixed-type data, and comprises three independent studies. The first study presents a Bayesian model for learning latent annotations. The technique is capable of parsing sentences in a wide variety of languages, producing results that are on par with or surpass previous approaches in accuracy, and shows promising potential for parsing low-resource languages. The second study presents a method to automatically complete annotations from partially annotated sentence data, with the help of Gibbs sampling. The algorithm significantly reduces the time required to annotate sentences for natural language processing, without a significant drop in annotation accuracy. The last study proposes a novel factor model for uncovering latent factors and exploring covariation among multiple outcomes of mixed types, including binary, count, and continuous data. Gibbs sampling is used to estimate the model parameters. The algorithm successfully discovers correlation structures of mixed-type data in both simulated and real-world data.
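The alternating-update scheme at the heart of Gibbs sampling can be illustrated on the smallest non-trivial case, a standard bivariate normal with correlation rho, where each full conditional is a univariate normal: x | y ~ N(rho*y, 1 - rho^2) and symmetrically for y | x. This toy sampler is a sketch of the general technique, not any model from the thesis.

```python
import random

def gibbs_bivariate_normal(rho, n_samples=5000, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Alternates draws from the two full conditionals:
      x | y ~ N(rho * y, 1 - rho^2),   y | x ~ N(rho * x, 1 - rho^2).
    Returns (x, y) samples collected after the burn-in period.
    """
    rng = random.Random(seed)
    sd = (1 - rho ** 2) ** 0.5       # conditional standard deviation
    x = y = 0.0
    samples = []
    for t in range(burn_in + n_samples):
        x = rng.gauss(rho * y, sd)   # draw x from p(x | y)
        y = rng.gauss(rho * x, sd)   # draw y from p(y | x)
        if t >= burn_in:
            samples.append((x, y))
    return samples
```

After burn-in, the empirical correlation of the collected samples converges to rho, the hallmark that the chain is sampling from the intended joint distribution.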