All mixed up? Finding the optimal feature set for general readability prediction and its application to English and Dutch
Readability research has a long and rich tradition, but there has been too little focus on general readability prediction without targeting a specific audience or text genre. Moreover, though NLP-inspired research has focused on adding more complex readability features, there is still no consensus on which features contribute most to the prediction. In this article, we investigate in close detail the feasibility of constructing a readability prediction system for English and Dutch generic text using supervised machine learning. Based on readability assessments by both experts and a crowd, we implement different types of text characteristics, ranging from easy-to-compute superficial text characteristics to features requiring deep linguistic processing, resulting in ten different feature groups. Both a regression and a classification setup are investigated, reflecting the two possible readability prediction tasks: scoring individual texts or comparing two texts. We show that going beyond correlation calculations for readability optimization using a wrapper-based genetic algorithm optimization approach is a promising task which provides considerable insights into which feature combinations contribute to the overall readability prediction. Since we also have gold-standard information available for those features requiring deep processing, we are able to investigate the true upper bound of our Dutch system. Interestingly, we observe that the performance of our fully automatic readability prediction pipeline is on par with the pipeline using gold-standard deep syntactic and semantic information.
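As a rough illustration of what a wrapper-based genetic algorithm over feature groups can look like, the sketch below evolves bit masks that switch groups on or off. It is a minimal sketch under stated assumptions, not the authors' setup: the group names, the `toy_fitness` function (a stand-in for training and cross-validating a learner on the selected groups), and all hyperparameters are hypothetical.

```python
import random

# Hypothetical feature groups; placeholders, not the ten groups from the article.
GROUPS = ["length", "lexical", "syntactic", "semantic", "discourse"]

def toy_fitness(mask):
    """Stand-in for the wrapper step (train and cross-validate on selected groups).

    Assumption for illustration: groups 0 and 2 help, every selected group costs a bit.
    """
    useful = {0: 0.4, 2: 0.5}
    gain = sum(w for i, w in useful.items() if mask[i])
    return gain - 0.05 * sum(mask)

def evolve(fitness, n_bits, pop_size=20, generations=30, p_mut=0.1, seed=0):
    """Evolve bit masks with truncation selection, one-point crossover, bit-flip mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]            # best half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)        # one-point crossover
            child = a[:cut] + b[cut:]
            # flip each bit independently with probability p_mut
            child = [bit ^ (rng.random() < p_mut) for bit in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve(toy_fitness, len(GROUPS))
selected = [name for name, bit in zip(GROUPS, best) if bit]
```

In a real wrapper setup the fitness call would be the expensive part, since each candidate mask requires fitting and evaluating a model; the selection loop itself stays the same.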
A Joint Model for Definition Extraction with Syntactic Connection and Semantic Consistency
Definition Extraction (DE) is one of the well-known topics in Information
Extraction that aims to identify terms and their corresponding definitions in
unstructured texts. This task can be formalized either as a sentence
classification task (i.e., containing term-definition pairs or not) or a
sequential labeling task (i.e., identifying the boundaries of the terms and
definitions). Previous work on DE has focused on only one of the two
approaches, failing to model the inter-dependencies between the two tasks. In
this work, we propose a novel model for DE that simultaneously performs the two
tasks in a single framework to benefit from their inter-dependencies. Our model
features deep learning architectures to exploit the global structures of the
input sentences as well as the semantic consistencies between the terms and the
definitions, thereby improving the quality of the representation vectors for
DE. Besides the joint inference between sentence classification and sequential
labeling, the proposed model differs fundamentally from prior work on DE, which
has employed only the local structures of the input sentences (i.e.,
word-to-word relations) and has not yet considered the semantic consistencies
between terms and definitions. In order to implement these novel
ideas, our model presents a multi-task learning framework that employs graph
convolutional neural networks and predicts the dependency paths between the
terms and the definitions. We also seek to enforce the consistency between the
representations of the terms and definitions both globally (i.e., increasing
semantic consistency between the representations of the entire sentences and
the terms/definitions) and locally (i.e., promoting the similarity between the
representations of the terms and the definitions).
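To make the two task formulations concrete, the sketch below shows how a sequence-labeling output relates to the sentence-level label. It is only an illustration of the task framing, not the proposed model: the BIO tag names (`B-Term`, `I-Def`, and so on), the helper functions, and the example sentence are assumptions introduced here.

```python
def spans(tokens, tags):
    """Collect (kind, text) spans from BIO tags such as B-Term, I-Def, O."""
    out, kind, words = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if kind:
                out.append((kind, " ".join(words)))
            kind, words = tag[2:], [tok]
        elif tag.startswith("I-") and kind == tag[2:]:
            words.append(tok)
        else:
            # "O" or an inconsistent I- tag closes any open span
            if kind:
                out.append((kind, " ".join(words)))
            kind, words = None, []
    if kind:
        out.append((kind, " ".join(words)))
    return out

def is_definitional(tags):
    """Sentence-level label implied by the tags: both a term and a definition occur."""
    kinds = {t[2:] for t in tags if t != "O"}
    return {"Term", "Def"} <= kinds

# Hypothetical example sentence with hand-assigned tags.
tokens = ["A", "graph", "is", "a", "set", "of", "nodes", "and", "edges", "."]
tags = ["O", "B-Term", "O", "B-Def", "I-Def", "I-Def", "I-Def", "I-Def", "I-Def", "O"]

found = spans(tokens, tags)
label = is_definitional(tags)
```

The sentence label is fully determined by the token tags here, which is one way to see the inter-dependency between the two tasks that a joint model can exploit.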