7,587 research outputs found
Automating Grammar Comparison
We consider from a practical perspective the problem of checking equivalence of context-free grammars. We present techniques for proving equivalence, as well as techniques for finding counter-examples that establish non-equivalence. Among the key building blocks of our approach is a novel algorithm for efficiently enumerating and sampling words and parse trees from arbitrary context-free grammars; the algorithm supports polynomial time random access to words belonging to the grammar. Furthermore, we propose an algorithm for proving equivalence of context-free grammars that is complete for LL grammars, yet can be invoked on any context-free grammar, including ambiguous grammars. Our techniques successfully find discrepancies between different syntax specifications of several real-world languages, and are capable of detecting fine-grained incremental modifications performed on grammars. Our evaluation shows that our tool improves significantly on the existing available state of the art tools. In addition, we used these algorithms to develop an online tutoring system for grammars that we then used in an undergraduate course on computer language processing. On questions involving grammar constructions, our system was able to automatically evaluate the correctness of 95% of the solutions submitted by students: it disproved 74% of cases and proved 21% of the
Active learning and the Irish treebank
We report on our ongoing work in developing the Irish Dependency Treebank, describe the results of two Inter annotator Agreement (IAA) studies, demonstrate improvements in annotation consistency which have a knock-on effect on parsing accuracy, and present the final set of dependency labels. We then go on to investigate the extent to which active learning can play a role in treebank and parser development by comparing an active learning bootstrapping approach to a passive approach in which sentences are chosen at random for manual revision. We show that active learning outperforms passive learning, but when annotation effort is taken into account, it is not clear how much of an advantage the active learning approach has. Finally, we present results which suggest that adding automatic parses to the training data along with manually revised parses in an active learning setup does not greatly affect parsing accuracy
Automating Metadata Extraction: Genre Classification
A problem that frequently arises in the management and integration of scientific data is the lack of context and semantics that would link data encoded in disparate ways. To bridge the discrepancy, it often helps to mine scientific texts to aid the understanding of the database. Mining relevant text can be significantly aided by the availability of descriptive and semantic metadata. The Digital Curation Centre (DCC) has undertaken research to automate the extraction of metadata from documents in PDF([22]). Documents may include scientific journal papers, lab notes or even emails. We suggest genre classification as a first step toward automating metadata extraction. The classification method will be built on looking at the documents from five directions; as an object of specific visual format, a layout of strings with characteristic grammar, an object with stylo-metric signatures, an object with meaning and purpose, and an object linked to previously classified objects and external sources. Some results of experiments in relation to the first two directions are described here; they are meant to be indicative of the promise underlying this multi-faceted approach.
Cue Phrase Classification Using Machine Learning
Cue phrases may be used in a discourse sense to explicitly signal discourse
structure, but also in a sentential sense to convey semantic rather than
structural information. Correctly classifying cue phrases as discourse or
sentential is critical in natural language processing systems that exploit
discourse structure, e.g., for performing tasks such as anaphora resolution and
plan recognition. This paper explores the use of machine learning for
classifying cue phrases as discourse or sentential. Two machine learning
programs (Cgrendel and C4.5) are used to induce classification models from sets
of pre-classified cue phrases and their features in text and speech. Machine
learning is shown to be an effective technique for not only automating the
generation of classification models, but also for improving upon previous
results. When compared to manually derived classification models already in the
literature, the learned models often perform with higher accuracy and contain
new linguistic insights into the data. In addition, the ability to
automatically construct classification models makes it easier to comparatively
analyze the utility of alternative feature representations of the data.
Finally, the ease of retraining makes the learning approach more scalable and
flexible than manual methods.Comment: 42 pages, uses jair.sty, theapa.bst, theapa.st
- âŠ