4,730 research outputs found

    Reviews Matter: How Distributed Mentoring Predicts Lexical Diversity on Fanfiction.net

    Full text link
    Fanfiction.net provides an informal learning space for young writers through distributed mentoring, the networked giving and receiving of feedback. In this paper, we quantify the cumulative effect of feedback on lexical diversity for 1.5 million authors. Comment: Connected Learning Summit 201
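The paper's outcome measure, lexical diversity, can be illustrated with the simplest such metric, the type-token ratio; this is a sketch only, and the study's actual measure may be a length-corrected variant:

```python
# Minimal sketch of lexical diversity as the type-token ratio:
# the number of distinct words divided by the total number of words.
def type_token_ratio(text):
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# "the" repeats, so 5 distinct words over 6 tokens.
print(type_token_ratio("the cat sat on the mat"))
```

Raw TTR falls as texts get longer, which is why large-scale studies usually prefer length-normalized variants.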

    New tools for teaching writing

    Get PDF

    Adding dimensions to the analysis of the quality of health information of websites returned by Google. Cluster analysis identifies patterns of websites according to their classification and the type of intervention described.

    Get PDF
    Background and aims: Most of the instruments used to assess the quality of health information on the Web (e.g. the JAMA criteria) analyze only one dimension of information quality, trustworthiness. We compare these characteristics with the type of treatments the websites describe, whether evidence-based or not, and correlate this with the established criteria. Methods: We searched Google for “migraine cure” and analyzed the first 200 websites for: 1) JAMA criteria (authorship, attribution, disclosure, currency); 2) class of website (commercial, health portal, professional, patient group, non-profit); and 3) type of intervention described (approved drugs, alternative medicine, food, procedures, lifestyle, drugs still at the research stage). We used hierarchical cluster analysis to assess associations between classes of websites and types of intervention described. A subgroup analysis of the first 10 websites returned was performed. Results: Google returned health portals (44%), followed by commercial websites (31%) and journalism websites (11%). The type of intervention mentioned most often was alternative medicine (55%), followed by procedures (49%), lifestyle (42%), food (41%) and approved drugs (35%). Cluster analysis indicated that health portals are more likely to describe more than one type of treatment, while commercial websites most often describe only one. The average JAMA score of commercial websites was significantly lower than that of health portals or journalism websites, mainly due to missing information on the authors of the text and the date the information was written. Among the first 10 websites returned by Google, commercial websites are under-represented and approved drugs over-represented. Conclusions: This approach allows the appraisal of the quality of health-related information on the Internet with a focus on the type of therapies/prevention methods that are shown to the patient.
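The hierarchical clustering step can be sketched as stdlib-only single-linkage agglomerative clustering over binary intervention vectors; the data, feature columns, and linkage choice here are illustrative, not the study's:

```python
# Hedged sketch: agglomerative (hierarchical) clustering of websites
# represented as binary vectors of intervention types they describe.
def jaccard_dist(a, b):
    # Jaccard distance between two binary feature vectors.
    inter = sum(x and y for x, y in zip(a, b))
    union = sum(x or y for x, y in zip(a, b))
    return 1 - inter / union if union else 0.0

def single_linkage(points, k):
    # Start with each point as its own cluster; repeatedly merge the
    # pair of clusters with the smallest minimum distance until k remain.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(jaccard_dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

# Columns (assumed): approved drugs, alt. medicine, food, procedures, lifestyle
sites = [(1, 0, 0, 0, 0),  # drug-only (commercial-like)
         (1, 0, 0, 1, 0),
         (0, 1, 1, 0, 1),  # multi-treatment (portal-like)
         (0, 1, 0, 0, 1)]
print(single_linkage(sites, 2))  # groups 0,1 together and 2,3 together
```

The study likely used a standard statistics package for this; the point of the sketch is only the shape of the analysis: sites clustering by which interventions they mention.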

    Automatic coding of short text responses via clustering in educational assessment

    Full text link
    Automatic coding of short text responses opens new doors in assessment. We implemented and integrated baseline methods of natural language processing and statistical modelling by means of software components that are available under open licenses. The accuracy of automatic text coding is demonstrated using data collected in the Programme for International Student Assessment (PISA) 2012 in Germany. Free-text responses to 10 items, Formula responses in total, were analyzed. We further examined the effect of different methods, parameter values, and sample sizes on the performance of the implemented system. The system reached fair to good, up to excellent, agreement with human codings (Formula). In particular, items that are solved by naming specific semantic concepts appeared to be coded properly. The system performed equally well with Formula, and somewhat poorer but still acceptably down to Formula. Based on our findings, we discuss potential innovations for assessment that are enabled by automatic coding of short text responses. (DIPF/Orig.)
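"Fair to good up to excellent agreement" with human codings is conventionally quantified with Cohen's kappa; a minimal sketch of that statistic follows (an assumption — the paper may report a weighted or multi-coder variant):

```python
# Hedged sketch: Cohen's kappa, chance-corrected agreement between
# automatic codes and human codes for the same set of responses.
from collections import Counter

def cohens_kappa(auto, human):
    n = len(auto)
    # Observed agreement: fraction of responses coded identically.
    p_obs = sum(a == h for a, h in zip(auto, human)) / n
    # Expected agreement by chance, from each coder's marginal distribution.
    ca, ch = Counter(auto), Counter(human)
    p_exp = sum(ca[c] * ch[c] for c in ca) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

# Toy codings: 4 of 5 responses agree.
print(cohens_kappa([1, 1, 0, 0, 1], [1, 0, 0, 0, 1]))
```

Common rules of thumb read kappa above 0.6 as good and above 0.8 as excellent, which matches the way the abstract reports its range of results.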

    Computational Approaches to Measuring the Similarity of Short Contexts: A Review of Applications and Methods

    Full text link
    Measuring the similarity of short written contexts is a fundamental problem in Natural Language Processing. This article provides a unifying framework by which short context problems can be categorized both by their intended application and proposed solution. The goal is to show that various problems and methodologies that appear quite different on the surface are in fact very closely related. The axes by which these categorizations are made include the format of the contexts (headed versus headless), the way in which the contexts are to be measured (first-order versus second-order similarity), and the information used to represent the features in the contexts (micro versus macro views). The unifying thread that binds together many short context applications and methods is the fact that similarity decisions must be made between contexts that share few (if any) words in common. Comment: 23 pages
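The review's central first-order versus second-order distinction can be illustrated concretely: first-order similarity compares the words two contexts share directly, while second-order similarity compares the words those words tend to co-occur with, so contexts with no content words in common can still match. A toy sketch, with an invented co-occurrence table:

```python
# Hedged illustration of first-order vs. second-order context similarity.
from collections import Counter
import math

def cosine(u, v):
    # Cosine similarity between two sparse count vectors (dicts).
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def first_order(ctx1, ctx2):
    # Direct word overlap between the two contexts.
    return cosine(Counter(ctx1.split()), Counter(ctx2.split()))

def second_order(ctx1, ctx2, cooc):
    # Represent each context as the sum of its words' co-occurrence
    # vectors, then compare those representations.
    def vec(ctx):
        v = Counter()
        for w in ctx.split():
            v.update(cooc.get(w, {}))
        return v
    return cosine(vec(ctx1), vec(ctx2))

# Toy co-occurrence counts (illustrative only).
cooc = {"doctor":    {"hospital": 2, "patient": 3},
        "physician": {"hospital": 2, "patient": 2}}
a, b = "the doctor arrived", "a physician arrived"
print(first_order(a, b))         # low: only one shared word
print(second_order(a, b, cooc))  # high: shared co-occurrence profile
```

"doctor" and "physician" never co-occur here, yet the second-order score is high because their co-occurrence profiles nearly coincide — exactly the situation the review identifies as the unifying thread.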

    The Effect of Code Obfuscation on Authorship Attribution of Binary Computer Files

    Get PDF
    In many forensic investigations, questions linger regarding the identity of the authors of a software specimen. Research has identified methods for the attribution of binary files that have not been obfuscated, but a significant percentage of malicious software has been obfuscated in an effort to hide both the details of its origin and its true intent. Little research has been done on analyzing obfuscated code for attribution. In part, the reason for this gap is that deobfuscation of an unknown program is a challenging task. Further, the transformation of the executable file introduced by the obfuscator modifies or removes features of the original executable that would have been used in the author attribution process. Existing research has demonstrated good success in attributing the authorship of an executable file of unknown provenance using methods based on static analysis of the specimen file. With the addition of file obfuscation, static analysis of files becomes difficult, time consuming, and in some cases may lead to inaccurate findings. This paper presents a novel process for authorship attribution using dynamic analysis methods. A software-emulated system was fully instrumented to become a test harness for a specimen of unknown provenance, allowing for supervised control, monitoring, and trace data collection during execution. This trace data was used as input to a supervised machine learning algorithm trained to identify stylometric differences in the specimen under test and predict who wrote it. The specimen files were also analyzed for authorship using static analysis methods, to compare prediction accuracies with those of this new dynamic-analysis-based method. Experiments indicate that the new method can provide better accuracy of author attribution for files of unknown provenance, especially where the specimen file has been obfuscated.
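The shape of this pipeline — execution trace in, author prediction out — can be sketched with invented, heavily simplified stand-ins: instruction-bigram counts as the stylometric features and a nearest-profile rule in place of the paper's (unspecified) supervised learner:

```python
# Hedged sketch of trace-based attribution: all names and data invented.
from collections import Counter
import math

def bigram_profile(trace):
    # Count consecutive instruction pairs in an execution trace.
    return Counter(zip(trace, trace[1:]))

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def attribute(specimen_trace, author_traces):
    # Predict the author whose training profile is closest to the specimen.
    spec = bigram_profile(specimen_trace)
    return max(author_traces,
               key=lambda a: cosine(spec, bigram_profile(author_traces[a])))

# Toy per-author training traces (illustrative only).
authors = {"alice": ["mov", "add", "mov", "add", "jmp"],
           "bob":   ["push", "call", "ret", "push", "call"]}
print(attribute(["mov", "add", "mov", "jmp"], authors))  # "alice"
```

The key idea the sketch preserves is that the features come from runtime behavior rather than the static binary, so obfuscation of the file on disk matters less.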

    Text Classification for Authorship Attribution Using Naive Bayes Classifier with Limited Training Data

    Get PDF
    Authorship attribution (AA) is the task of identifying the authors of disputed or anonymous texts. It can be seen as a single-label, multi-class text classification task, concerned with writing style rather than topic. The scalability issue in traditional AA studies concerns the effect of data size, i.e. the amount of data per candidate author. This has not yet been probed in much depth, since most stylometry research tends to focus on long texts or multiple short texts per author, because stylistic choices occur less frequently in very short texts. This paper investigates authorship attribution on short historical Arabic texts written by 10 different authors. Several experiments are conducted on these texts, extracting various lexical and character features of each author's writing style, using word-level (1, 2, 3, and 4) and character-level (1, 2, 3, and 4) n-grams as the text representation. A Naive Bayes (NB) classifier is then employed to classify the texts by author, demonstrating the robustness of the NB classifier for AA on very short texts when compared to Support Vector Machines (SVMs). Using a dataset (called AAAT) consisting of 3 short texts per author's book, our method is shown to be at least as effective as Information Gain (IG) for the selection of the most significant n-grams. Moreover, the significance of punctuation marks for distinguishing between authors is explored, showing that an increase in performance can be achieved. The NB classifier achieved high accuracy: the experiments on the AAAT dataset give a best classification accuracy of 96%, obtained using word-level 1-grams. Keywords: Authorship attribution, Text classification, Naive Bayes classifier, Character n-grams features, Word n-grams features
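The core of this approach — multinomial Naive Bayes over character n-gram counts — can be sketched in a few lines on toy English data (not the AAAT corpus; a uniform author prior and Laplace smoothing are assumed):

```python
# Hedged sketch: Naive Bayes authorship attribution on character 3-grams.
from collections import Counter
import math

def char_ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(docs_by_author, n=3):
    # Per author: n-gram counts over all training docs, plus their total.
    model = {}
    for author, docs in docs_by_author.items():
        counts = Counter(g for d in docs for g in char_ngrams(d, n))
        model[author] = (counts, sum(counts.values()))
    return model

def predict(model, text, n=3, alpha=1.0):
    # Laplace-smoothed log-likelihood per author; uniform prior assumed.
    vocab = {g for counts, _ in model.values() for g in counts}
    best, best_lp = None, -math.inf
    for author, (counts, total) in model.items():
        lp = sum(math.log((counts[g] + alpha) / (total + alpha * len(vocab)))
                 for g in char_ngrams(text, n))
        if lp > best_lp:
            best, best_lp = author, lp
    return best

model = train({"A": ["the quick brown fox", "the lazy dog"],
               "B": ["lorem ipsum dolor", "ipsum lorem sit"]})
print(predict(model, "the quick dog"))  # "A"
```

Character n-grams capture sub-word habits (affixes, function words, punctuation adjacency), which is why they remain informative even when each text is only a few sentences long, as in the paper's 3-texts-per-book setting.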