Memory-Based Shallow Parsing
We present a memory-based learning (MBL) approach to shallow parsing in which
POS tagging, chunking, and identification of syntactic relations are formulated
as memory-based modules. The experiments reported in this paper show
competitive results: on the Wall Street Journal (WSJ) treebank, F-values are
93.8% for NP chunking, 94.7% for VP chunking, 77.1% for subject detection, and
79.0% for object detection.
Comment: 8 pages, to appear in: Proceedings of the EACL'99 workshop on
Computational Natural Language Learning (CoNLL-99), Bergen, Norway, June 1999
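Memory-based learning classifies each new instance by retrieving the most similar stored training instances, typically with an overlap metric over symbolic features. The following is a minimal hypothetical sketch of that idea for NP chunking: the POS-window features, the toy training "memory", and 1-NN retrieval are illustrative assumptions, not the paper's actual modules or data.

```python
# Toy memory-based (1-NN) NP chunker with the overlap metric.
# Features, training data, and tag set are illustrative assumptions.

def features(tags, i):
    """Feature vector for position i: POS window with 2 tags of context."""
    pad = ["_"] * 2
    t = pad + tags + pad
    return tuple(t[i:i + 5])

def overlap(a, b):
    """Overlap similarity: number of matching feature positions."""
    return sum(x == y for x, y in zip(a, b))

# Tiny instance base of (feature vector, IOB chunk tag) pairs.
train_tags = ["DT", "JJ", "NN", "VBZ", "DT", "NN"]
train_chunks = ["B-NP", "I-NP", "I-NP", "O", "B-NP", "I-NP"]
memory = [(features(train_tags, i), c) for i, c in enumerate(train_chunks)]

def classify(tags, i):
    """Return the chunk tag of the most similar stored instance."""
    vec = features(tags, i)
    return max(memory, key=lambda m: overlap(vec, m[0]))[1]

test_tags = ["DT", "NN", "VBZ"]
print([classify(test_tags, i) for i in range(len(test_tags))])
# ['B-NP', 'I-NP', 'O']
```

Real memory-based learners additionally weight features by informativeness and use k > 1 neighbors; the retrieval principle is the same.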
A comparative evaluation of deep and shallow approaches to the automatic detection of common grammatical errors
This paper compares a deep and a shallow processing approach to the problem of classifying a sentence as grammatically well-formed or ill-formed. The deep processing
approach uses the XLE LFG parser and English grammar: we present two versions, one which uses the XLE directly to perform the classification, and one which uses a decision tree trained on features consisting of the XLE's output statistics. The shallow processing approach predicts grammaticality from n-gram frequency statistics:
again we present two versions, one which uses frequency thresholds and one which uses a decision tree trained on the frequencies of the rarest n-grams in the input sentence.
We find that the use of a decision tree improves only on the basic deep, parser-based approach. We also show that combining the shallow and deep
decision tree features is effective. Our evaluation
is carried out on a large test set of grammatical and ungrammatical sentences. The ungrammatical test set is generated automatically by inserting grammatical errors
into well-formed BNC sentences.
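The frequency-threshold variant of the shallow approach can be sketched as follows: flag a sentence as likely ill-formed when its rarest n-gram falls below a frequency threshold in a reference corpus. The tiny corpus, bigram order, and threshold below are toy stand-ins, not the paper's BNC-based setup.

```python
# Hypothetical frequency-threshold grammaticality sketch over bigrams.
from collections import Counter

reference = "the cat sat on the mat . the dog sat on the rug .".split()
bigram_counts = Counter(zip(reference, reference[1:]))

def rarest_bigram_count(sentence, counts):
    """Frequency of the rarest bigram in the sentence (0 if unseen)."""
    toks = sentence.split()
    return min(counts[bg] for bg in zip(toks, toks[1:]))

def looks_grammatical(sentence, counts, threshold=1):
    """Predict well-formed iff every bigram meets the threshold."""
    return rarest_bigram_count(sentence, counts) >= threshold

print(looks_grammatical("the cat sat on the rug", bigram_counts))  # True
print(looks_grammatical("cat the sat", bigram_counts))             # False
```

The decision-tree variant replaces the hard threshold with a tree trained on the rarest-n-gram frequencies as features.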
Memory-Based Shallow Parsing
We present memory-based learning approaches to shallow parsing and apply
these to five tasks: base noun phrase identification, arbitrary base phrase
recognition, clause detection, noun phrase parsing and full parsing. We use
feature selection techniques and system combination methods for improving the
performance of the memory-based learner. Our approach is evaluated on standard
data sets and the results are compared with those of other systems. This reveals
that our approach works well for base phrase identification, while its
application to recognizing embedded structures leaves some room for
improvement.
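One of the system-combination methods that can be applied to such a setup is per-token majority voting over the outputs of several learners. The sketch below is a generic illustration under that assumption; the three hard-coded "systems" and their tags are invented, not the paper's components.

```python
# Hedged sketch of system combination by per-token majority voting.
# Ties fall to the tag seen first (Counter preserves insertion order),
# i.e. effectively to the first system's output.
from collections import Counter

def combine(predictions):
    """Majority vote across systems' tag sequences, token by token."""
    combined = []
    for token_tags in zip(*predictions):
        vote = Counter(token_tags).most_common(1)[0][0]
        combined.append(vote)
    return combined

sys_a = ["B-NP", "I-NP", "O"]
sys_b = ["B-NP", "O",    "O"]
sys_c = ["B-NP", "I-NP", "B-NP"]
print(combine([sys_a, sys_b, sys_c]))  # ['B-NP', 'I-NP', 'O']
```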
Similarity rules! Exploring methods for ad-hoc rule detection
We examine the role of similarity in ad hoc rule detection and show how previous methods can be made more corpus-independent and more generally applicable. Specifically, we show that the similarity of a rule to others in the grammar is a crucial factor in determining the reliability of a rule, providing information that frequency alone cannot. We also include a way to score rules which are not in the training data, thereby providing a platform for grammar generalization.
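One way such a similarity score could be realized is to compare a candidate rule's right-hand side against the attested rules with the same left-hand side and take the best match. The toy grammar and the Dice coefficient below are illustrative assumptions, not the paper's actual formulation.

```python
# Hedged sketch: score a CFG rule by its closest right-hand-side
# similarity to same-LHS rules in the grammar. Grammar and metric
# are toy assumptions for illustration.

def dice(a, b):
    """Dice coefficient between two right-hand-side symbol sets."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

grammar = {
    ("NP", ("DT", "NN")),
    ("NP", ("DT", "JJ", "NN")),
    ("VP", ("VBZ", "NP")),
}

def rule_score(lhs, rhs):
    """Best similarity to any same-LHS rule; unseen rules with a
    familiar-looking RHS still score well, supporting generalization."""
    same = [r for l, r in grammar if l == lhs]
    return max((dice(rhs, r) for r in same), default=0.0)

print(rule_score("NP", ("DT", "JJ", "JJ", "NN")))  # 1.0 (unseen, similar)
print(rule_score("NP", ("VBZ", "RB")))             # 0.0 (unseen, dissimilar)
```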
Is writing style predictive of scientific fraud?
The problem of detecting scientific fraud using machine learning was recently
introduced, with initial, positive results from a model taking into account
various general indicators. The results seem to suggest that writing style is
predictive of scientific fraud. We revisit these initial experiments, and show
that the leave-one-out testing procedure they used likely leads to a slight
over-estimate of the predictability, but also that simple models can outperform
their proposed model by some margin. We go on to explore more abstract
linguistic features, such as linguistic complexity and discourse structure,
only to obtain negative results. Upon analyzing our models, we do see some
interesting patterns, though: scientific fraud, for example, contains less
comparison, as well as different types of hedging and ways of presenting
logical reasoning.
Comment: To appear in the Proceedings of the Workshop on Stylistic Variation
2017 (EMNLP), 6 pages
Redefining part-of-speech classes with distributional semantic models
This paper studies how word embeddings trained on the British National Corpus
interact with part of speech boundaries. Our work targets the Universal PoS tag
set, which is currently actively being used for annotation of a range of
languages. We experiment with training classifiers for predicting PoS tags for
words based on their embeddings. The results show that the information about
PoS affiliation contained in the distributional vectors allows us to discover
groups of words with distributional patterns that differ from other words of
the same part of speech.
This data often reveals hidden inconsistencies of the annotation process or
guidelines. At the same time, it supports the notion of `soft' or `graded' part
of speech affiliations. Finally, we show that information about PoS is
distributed among dozens of vector components, not limited to only one or two
features.
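A minimal way to picture predicting PoS tags from embeddings is a nearest-centroid classifier: average each tag's word vectors and assign new vectors to the closest centroid; words landing near another tag's centroid are candidates for "graded" PoS membership. The two-dimensional vectors below are toy values, not BNC-trained embeddings.

```python
# Hypothetical nearest-centroid PoS classifier over toy embeddings.
import math

embeddings = {
    "run":  [0.9, 0.1], "jump": [0.8, 0.2],   # verb-like region
    "cat":  [0.1, 0.9], "tree": [0.2, 0.8],   # noun-like region
}
tags = {"run": "VERB", "jump": "VERB", "cat": "NOUN", "tree": "NOUN"}

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(c) / len(vectors) for c in zip(*vectors)]

centroids = {
    tag: centroid([embeddings[w] for w in embeddings if tags[w] == tag])
    for tag in set(tags.values())
}

def predict(vec):
    """Assign the tag whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda t: math.dist(vec, centroids[t]))

print(predict([0.85, 0.15]))  # VERB
print(predict([0.15, 0.85]))  # NOUN
```

The paper's finding that PoS information is spread over dozens of vector components suggests real classifiers need the full embedding dimensionality, not two components as here.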
Using foreign inclusion detection to improve parsing performance
Inclusions from other languages can be a significant source of errors for monolingual parsers. We show this for English inclusions, which are sufficiently frequent to present a problem when parsing German. We describe an annotation-free approach for accurately detecting such inclusions, and develop two methods for interfacing this approach with a state-of-the-art parser for German. An evaluation on the TIGER corpus shows that our inclusion entity model achieves a performance gain of 4.3 points in F-score over a baseline of no inclusion detection, and even outperforms a parser with access to gold standard part-of-speech tags.
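In its simplest form, annotation-free inclusion detection can be sketched as lexicon lookup: flag a token as an English inclusion in German text when it is attested in an English word list but not a German one. The word lists below are toy stand-ins for the lexicons such an approach would consult, not the paper's actual resources.

```python
# Hypothetical lexicon-lookup sketch of English-inclusion detection
# in German text; word lists are toy assumptions.
english = {"computer", "update", "software", "meeting"}
german = {"der", "termin", "für", "das", "ist", "morgen", "computer"}

def find_inclusions(tokens):
    """Flag tokens attested only in the English lexicon.

    Words in both lexicons (e.g. "computer", a common loanword) are
    deliberately not flagged."""
    return [t for t in tokens if t in english and t not in german]

sentence = "das software update ist morgen".split()
print(find_inclusions(sentence))  # ['software', 'update']
```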