Evaluation of features for predicting document difficulty

Abstract

Knowing the difficulty of a text document, in particular of learning materials, has many benefits. For example, it allows recommending documents tailored to a specific target group so that readers understand as much as possible of what they read. Several factors affect document difficulty, and each captures a different aspect of it. One is readability, which captures syntactic and lexical text properties and relates to linguistic difficulty. Another is the background knowledge readers need to understand a given document, because the concepts it contains may be more or less complex. Although both factors have been analyzed in isolation, their interplay is unknown. Likewise, the relative importance of the two factors has not been examined. Addressing either problem could improve the understanding of document difficulty and thus pave the way towards more reliable models for predicting it. Hence, this work investigates both problems by proposing a supervised model that extracts 20 features related to the background knowledge and readability of a document to predict its difficulty. This model serves as the basis for analyzing the importance of these features and the interplay between background knowledge and readability in estimating document difficulty. We find that linguistic difficulty is more important than background knowledge across all datasets. To the best of our knowledge, no datasets for predicting document difficulty are available in the educational domain, so we created one about biological concepts. We release this dataset to the research community in the hope of stimulating more research and providing more data to assess the reliability of methods for predicting document difficulty across different domains.
