1,930 research outputs found

    Memory-Based Shallow Parsing

    We present a memory-based learning (MBL) approach to shallow parsing in which POS tagging, chunking, and identification of syntactic relations are formulated as memory-based modules. The experiments reported in this paper show competitive results: the F-values on the Wall Street Journal (WSJ) treebank are 93.8% for NP chunking, 94.7% for VP chunking, 77.1% for subject detection and 79.0% for object detection. Comment: 8 pages, to appear in: Proceedings of the EACL'99 workshop on Computational Natural Language Learning (CoNLL-99), Bergen, Norway, June 199

    A comparative evaluation of deep and shallow approaches to the automatic detection of common grammatical errors

    This paper compares a deep and a shallow processing approach to the problem of classifying a sentence as grammatically well-formed or ill-formed. The deep processing approach uses the XLE LFG parser and English grammar: two versions are presented, one which uses the XLE directly to perform the classification, and another which uses a decision tree trained on features consisting of the XLE's output statistics. The shallow processing approach predicts grammaticality based on n-gram frequency statistics: we present two versions, one which uses frequency thresholds and one which uses a decision tree trained on the frequencies of the rarest n-grams in the input sentence. We find that the use of a decision tree improves on the basic approach only for the deep parser-based approach. We also show that combining both the shallow and deep decision tree features is effective. Our evaluation is carried out using a large test set of grammatical and ungrammatical sentences. The ungrammatical test set is generated automatically by inserting grammatical errors into well-formed BNC sentences.
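The frequency-threshold variant of the shallow approach described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the bigram order, the whitespace tokenisation, the threshold value and all function names (`train_counts`, `rarest_ngram_frequency`, `is_grammatical`) are assumptions made for illustration, and the two-sentence corpus stands in for a large reference corpus such as the BNC.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def train_counts(sentences, n=2):
    """Count n-gram frequencies over a reference corpus (hypothetical helper)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts


def rarest_ngram_frequency(sentence, counts, n=2):
    """Frequency of the least frequent n-gram in the sentence.

    Sentences shorter than n tokens have no n-grams and get 0 here,
    which is an arbitrary choice for this sketch.
    """
    grams = ngrams(sentence.lower().split(), n)
    if not grams:
        return 0
    return min(counts[g] for g in grams)


def is_grammatical(sentence, counts, n=2, threshold=1):
    """Accept a sentence if its rarest n-gram occurs at least `threshold` times."""
    return rarest_ngram_frequency(sentence, counts, n) >= threshold


# Toy reference corpus; a real system would use a large corpus.
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
counts = train_counts(corpus)
print(is_grammatical("the cat sat on the rug", counts))
print(is_grammatical("cat the sat on mat", counts))
```

The decision-tree variant mentioned in the abstract would replace the single threshold with a classifier trained on the frequencies of the rarest n-grams, which this sketch does not attempt.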

    Memory-Based Shallow Parsing

    We present memory-based learning approaches to shallow parsing and apply these to five tasks: base noun phrase identification, arbitrary base phrase recognition, clause detection, noun phrase parsing and full parsing. We use feature selection techniques and system combination methods for improving the performance of the memory-based learner. Our approach is evaluated on standard data sets and the results are compared with those of other systems. This reveals that our approach works well for base phrase identification, while its application to recognizing embedded structures leaves some room for improvement.

    Similarity rules! Exploring methods for ad-hoc rule detection

    We examine the role of similarity in ad hoc rule detection and show how previous methods can be made more corpus independent and more generally applicable. Specifically, we show that the similarity of a rule to others in the grammar is a crucial factor in determining the reliability of a rule, providing information unavailable in frequency. We also include a way to score rules which are not in the training data, thereby providing a platform for grammar generalization

    Is writing style predictive of scientific fraud?

    The problem of detecting scientific fraud using machine learning was recently introduced, with initial, positive results from a model taking into account various general indicators. The results seem to suggest that writing style is predictive of scientific fraud. We revisit these initial experiments, and show that the leave-one-out testing procedure they used likely leads to a slight over-estimate of the predictability, but also that simple models can outperform their proposed model by some margin. We go on to explore more abstract linguistic features, such as linguistic complexity and discourse structure, only to obtain negative results. Upon analyzing our models, we do see some interesting patterns, though: scientific fraud, for example, contains less comparison, as well as different types of hedging and ways of presenting logical reasoning. Comment: To appear in the Proceedings of the Workshop on Stylistic Variation 2017 (EMNLP), 6 page

    Redefining part-of-speech classes with distributional semantic models

    This paper studies how word embeddings trained on the British National Corpus interact with part of speech boundaries. Our work targets the Universal PoS tag set, which is currently in active use for annotating a range of languages. We experiment with training classifiers for predicting PoS tags for words based on their embeddings. The results show that the information about PoS affiliation contained in the distributional vectors allows us to discover groups of words with distributional patterns that differ from other words of the same part of speech. This data often reveals hidden inconsistencies of the annotation process or guidelines. At the same time, it supports the notion of 'soft' or 'graded' part of speech affiliations. Finally, we show that information about PoS is distributed among dozens of vector components, not limited to only one or two features.

    Using foreign inclusion detection to improve parsing performance

    Inclusions from other languages can be a significant source of errors for monolingual parsers. We show this for English inclusions, which are sufficiently frequent to present a problem when parsing German. We describe an annotation-free approach for accurately detecting such inclusions, and develop two methods for interfacing this approach with a state-of-the-art parser for German. An evaluation on the TIGER corpus shows that our inclusion entity model achieves a performance gain of 4.3 points in F-score over a baseline of no inclusion detection, and even outperforms a parser with access to gold standard part-of-speech tags.