An Investigation into the Pedagogical Features of Documents
Characterizing the content of a technical document in terms of its learning
utility can be useful for applications related to education, such as generating
reading lists from large collections of documents. We refer to this learning
utility as the "pedagogical value" of the document to the learner. While
pedagogical value is an important concept that has been studied extensively
within the education domain, there has been little work exploring it from a
computational, i.e., natural language processing (NLP), perspective. To allow a
computational exploration of this concept, we introduce the notion of
"pedagogical roles" of documents (e.g., Tutorial and Survey) as an intermediary
component for the study of pedagogical value. Given the lack of available
corpora for our exploration, we create the first annotated corpus of
pedagogical roles and use it to test baseline techniques for automatic
prediction of such roles.
Comment: 12th Workshop on Innovative Use of NLP for Building Educational
Applications (BEA) at EMNLP 2017; 12 pages
Keeping the data lake in form: DS-kNN datasets categorization using proximity mining
As the number of datasets stored in data repositories grows, there has been a trend toward using Data Lakes (DLs) to store such data. DLs store datasets in their raw formats without any transformation or preprocessing, and make them accessible via schema-on-read. This makes it difficult for analysts to find datasets that can be crossed and that belong to the same topic. To support them in this DL governance challenge, we propose an algorithm for categorizing datasets in the DL into pre-defined topic-wise categories of interest. We use a k-NN approach for this task, which computes similarities between datasets with a proximity score based on metadata. We test our algorithm on a real-life DL with a known ground-truth categorization. In specific settings, our approach detects the correct categories for datasets, as well as outliers, with a precision of more than 90% and recall rates exceeding 75%.
Peer Reviewed. Postprint (author's final draft)
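The k-NN categorization idea in this abstract can be illustrated with a minimal sketch. It is not the authors' DS-kNN implementation: the abstract does not define the proximity score, so the Jaccard similarity over attribute names used here is an assumption, as are the function names `proximity` and `categorize`, the `min_proximity` threshold, and the toy `LABELED` data. A dataset whose nearest neighbours all fall below the threshold is treated as an outlier.

```python
from __future__ import annotations

from collections import Counter


def proximity(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two attribute-name sets.

    Assumption: a stand-in for the metadata-based proximity score,
    which the abstract does not specify.
    """
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def categorize(query, labeled, k=3, min_proximity=0.2):
    """Return the majority category among the k nearest labeled datasets.

    Neighbours scoring below min_proximity (a hypothetical threshold) are
    ignored; if none remain, the query is reported as an outlier (None).
    """
    scored = sorted(
        ((proximity(query, attrs), category) for attrs, category in labeled),
        key=lambda pair: pair[0],
        reverse=True,
    )
    neighbours = [(s, c) for s, c in scored[:k] if s >= min_proximity]
    if not neighbours:
        return None  # outlier: no sufficiently similar dataset
    votes = Counter(c for _, c in neighbours)
    return votes.most_common(1)[0][0]


# Toy labeled datasets (hypothetical): attribute names plus a known category.
LABELED = [
    ({"player", "team", "goals", "season"}, "sports"),
    ({"team", "season", "wins", "losses"}, "sports"),
    ({"patient", "age", "diagnosis"}, "health"),
    ({"patient", "age", "treatment", "outcome"}, "health"),
]

print(categorize({"team", "season", "goals"}, LABELED))   # sports
print(categorize({"latitude", "longitude"}, LABELED))     # None (outlier)
```

Raising `min_proximity` makes the sketch flag more datasets as outliers and assign fewer categories, while lowering it does the opposite; the abstract's phrase "in specific settings" suggests the reported precision and recall depend on parameter choices of this kind.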