44 research outputs found

    A Framework for Stylometric Similarity Detection in Online Settings

    The Stylometric Processing of Sensory Open Source Data

    The end goal of this research project is the Lone Wolf Terrorist. The project takes an exploratory approach to the self-radicalisation problem by creating a stylistic fingerprint of a person's personality, or self, from subtle characteristics hidden in that person's writing style. It separates the identity of one person from another based on their writing style. It also separates the writings of suicide attackers from 'normal' bloggers by means of critical slowing down, a dynamical property used to develop early warning signs of tipping points. It identifies changes in a person's moods, or shifts from one state to another, that might indicate a tipping point for self-radicalisation. Research into authorship identity using personality is a relatively new area in the field of neurolinguistics. There are very few methods that model how an individual's cognitive functions present themselves in writing. Here, we develop a novel algorithm, RPAS, which draws on cognitive functions such as aging, sensory processing, abstract or concrete thinking through referential activity emotional experiences, and a person's internal gender for identity. We use well-known techniques such as Principal Component Analysis, Linear Discriminant Analysis, and the Vector Space Method to cluster multiple anonymous-authored works. We also take a new approach, using seriation with noise to separate subtle features in individuals. We conduct time series analysis using modified variants of 1-lag autocorrelation and the coefficient of skewness, two statistical metrics that change near a tipping point, to track serious life events in an individual through cognitive linguistic markers. In our journey of discovery, we uncover secrets about the Elizabethan playwrights hidden for over 400 years. We uncover markers for depression and anxiety in modern-day writers and, using sensory processing, identify linguistic cues for Alzheimer's disease much earlier than other studies. In using these techniques on the Lone Wolf, we can separate their writing style used before their attacks that differs from other writing
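As a rough illustration of the two early-warning metrics named in this abstract, the sketch below computes the standard textbook forms of lag-1 autocorrelation and the coefficient of skewness over a sliding window. It is not the authors' modified variants, which the abstract does not specify. The function names and the idea of feeding in one numeric linguistic marker per document are assumptions made for this example.

```python
from statistics import fmean


def lag1_autocorrelation(xs):
    """Standard lag-1 autocorrelation estimate of a numeric sequence."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean = fmean(xs)
    denom = sum((x - mean) ** 2 for x in xs)
    if denom == 0:
        return 0.0  # constant series: no variation to correlate
    num = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return num / denom


def skewness(xs):
    """Coefficient of skewness (third standardized moment, population form)."""
    n = len(xs)
    if n == 0:
        return 0.0
    mean = fmean(xs)
    m2 = sum((x - mean) ** 2 for x in xs) / n
    if m2 == 0:
        return 0.0
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def rolling(metric, xs, window):
    """Apply `metric` to each consecutive window of length `window`."""
    return [metric(xs[i:i + window]) for i in range(len(xs) - window + 1)]


# Hypothetical marker values, one per dated document by the same writer.
marker = [0.10, 0.12, 0.11, 0.15, 0.18, 0.22, 0.27, 0.35]
ac_trend = rolling(lag1_autocorrelation, marker, window=5)
skew_trend = rolling(skewness, marker, window=5)
```

In the critical-slowing-down literature, a sustained rise in the windowed autocorrelation is read as a possible warning sign; the window length here is arbitrary.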

    Two-layer classification and distinguished representations of users and documents for grouping and authorship identification

    Most studies on authorship identification report a drop in identification accuracy when the number of authors exceeds 20-25. In this paper, we introduce a new user representation to address this problem and split classification across two layers. The paper makes at least three novel contributions. First, the two-layer approach allows authorship identification to be applied to a larger number of authors (tested over 100 authors), and it is extensible. The authors are divided into groups, each containing a smaller number of authors. Given an anonymous document, the primary layer detects the group to which the document belongs. Then, the secondary layer determines the particular author inside the selected group. To extract groups that link similar authors, clustering is applied over users rather than documents. Hence, the second novelty of this paper is a new user representation that differs from the document representation. Without the proposed user representation, clustering over documents would scatter an author's documents across several clusters instead of giving each author a single cluster membership. Third, the extracted clusters are descriptive and meaningful for their users, as the dimensions have psychological backgrounds. For authorship identification, the documents are labelled with the extracted groups and fed into machine learning to build classification models that predict the group and author of a given document. The results show that the documents are highly correlated with their corresponding extracted groups, and that the proposed model can be accurately trained to determine the group and the author identity.
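The two-layer routing described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's model: it uses nearest-centroid matching in place of the unspecified machine-learning classifiers, it takes the author-to-group mapping as given rather than deriving it by clustering user representations, and the class name, feature vectors, and labels are all hypothetical.

```python
from math import dist


def centroid(vectors):
    """Component-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def nearest(query, centroids):
    """Label whose centroid is closest to `query` (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(query, centroids[label]))


class TwoLayerAttributor:
    def __init__(self, docs_by_author, group_of_author):
        # docs_by_author: {author: [feature vectors]}  (hypothetical features)
        # group_of_author: {author: group label}       (assumed precomputed)
        self.author_centroids = {a: centroid(v) for a, v in docs_by_author.items()}
        self.members = {}
        for author, group in group_of_author.items():
            self.members.setdefault(group, []).append(author)
        self.group_centroids = {
            g: centroid([vec for a in authors for vec in docs_by_author[a]])
            for g, authors in self.members.items()
        }

    def predict(self, doc_vector):
        # Primary layer: pick the group the document most resembles.
        group = nearest(doc_vector, self.group_centroids)
        # Secondary layer: choose only among authors inside that group.
        candidates = {a: self.author_centroids[a] for a in self.members[group]}
        return group, nearest(doc_vector, candidates)


docs = {
    "a1": [(0.0, 0.0), (0.2, 0.0)],
    "a2": [(1.0, 0.0), (1.2, 0.0)],
    "b1": [(10.0, 10.0)],
    "b2": [(11.0, 10.0)],
}
groups = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
model = TwoLayerAttributor(docs, groups)
```

The point of the structure is that the secondary layer only ever discriminates among the few authors in one group, which is how the approach sidesteps the accuracy drop reported for large candidate sets.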

    Authorship attribution of late 19th century novels using GAN-BERT

    Authorship attribution aims to identify the author of an anonymous text. The task becomes even more worthwhile when it comes to literary works. For example, pen names were commonly used by female authors in the 19th century, resulting in some literary works being incorrectly attributed or claimed. With this motivation, we collated a dataset of late 19th century novels in English. Because the dataset is imbalanced and lacks enough data per author, we employed the GAN-BERT model along with data sampling strategies to fine-tune a transformer-based model for authorship attribution. Unlike earlier studies on the GAN-BERT model, we conducted transfer learning on comparatively smaller author subsets to train more focused author-specific models, yielding accuracy and F1 scores above 0.88. Furthermore, we observed that increasing the sample size has a negative impact on the model's performance. Our research mainly contributes to the ongoing authorship attribution research using the GAN-BERT architecture, especially in attributing disputed novelists in the late 19th century.

    Deception in Authorship Attribution

    In digital forensics, questions often arise about the authors of documents: their identity, demographic background, and whether they can be linked to other documents. The field of stylometry uses linguistic features and machine learning techniques to answer these questions. While stylometry techniques can identify authors with high accuracy in non-adversarial scenarios, their accuracy is reduced to random guessing when faced with authors who intentionally obfuscate their writing style or attempt to imitate that of another author. Most authorship attribution methods have not been evaluated on challenging real-world datasets with foreign language and unconventional spelling (e.g. l33tsp3ak). In this thesis we explore the performance of authorship attribution methods in adversarial settings where authors take measures to hide their identity by changing their writing style and by creating multiple identities. We show that, using a large feature set, it is possible to distinguish regular documents from deceptive documents with high accuracy, and we present an analysis of linguistic features that can be modified to hide writing style. We show how to adapt regular authorship attribution to difficult datasets such as a leaked underground forum and present a method for detecting multiple identities of authors. We demonstrate the utility of our approach with a case study that includes applying our technique to an underground forum and manual analysis to validate the results, enabling the discovery of previously undetected multiple accounts. Ph.D., Computer Science -- Drexel University, 201

    A systematic survey of online data mining technology intended for law enforcement

    As an increasing amount of crime takes on a digital aspect, law enforcement bodies must tackle an online environment generating huge volumes of data. With manual inspections becoming increasingly infeasible, law enforcement bodies are optimising online investigations through data-mining technologies. Such technologies must be well designed and rigorously grounded, yet no survey of the online data-mining literature exists which examines their techniques, applications and rigour. This article fills this gap through a systematic mapping study describing online data-mining literature which visibly targets law enforcement applications, using evidence-based practices in survey making to produce a replicable analysis which can be methodologically examined for deficiencies.

    Author Identification of Thai Online Messages Using Support Vector Machine and Decision Tree

    One problem that comes with the use of online social media in Thailand is the posting of deceptive, abusive, or hoax messages. The authors of such messages may use fake accounts or impersonate innocent persons. However, some aspects of their writing style that stem from individual preferences or habits, such as the use of first-person pronouns, sentence-ending words, or punctuation, can still be traced and detected. In this research, fifty-three writing attributes of Thai online messages were selected and used to identify the authors of anonymous messages. The identification methods were based on classification by support vector machine and decision tree. When tested on short messages (average length of 144 words), support vector machine yielded an average accuracy of 79%, whereas decision tree yielded an average accuracy of 75%. When tested on longer messages (average length of 312 words), the two methods yielded average accuracies of 88% and 82%, respectively. Keywords: Online Messages, Author Identification, Classification, Support Vector Machine, Decision Tree

    Authorship Verification

    In recent years, stylometry, the study of linguistic style, has become more prominent in security and privacy applications involving written language, mostly in digital and online domains. Although the literature abounds with computational stylometry research, the field of authorship verification is relatively unexplored. Authorship verification is the binary semi-open-world problem of determining whether a document is written by a given author or not. A key component in authorship verification techniques is confidence measurement, on which verification decisions are based, expressed by acceptance thresholds selected and tuned as needed. This thesis demonstrates how the use of confidence-based approaches in stylometric applications, and their combination with traditional approaches, can benefit classification accuracy and allow new domains and problems to be analyzed. We start by motivating the use of authorship verification approaches with two stylometric applications: native-language identification from non-native text and active linguistic user authentication. Next, we introduce the Classify-Verify algorithm, which integrates classification with binary verification, applied to several stylometric problems. Classify-Verify is proposed as an open-world alternative to restricted closed-world attribution methods, and is shown to be effective in dealing with possibly missing candidate authors by thwarting misclassifications, coping with various domains and scales, and even with adversarial authors who try to fool the classifier. Ph.D., Computer Science -- Drexel University, 201
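The general idea of pairing a closed-world classifier with a thresholded verification step can be sketched as below. This is a simplified illustration under stated assumptions, not the Classify-Verify algorithm from the thesis: the nearest-centroid classifier, the distance-based confidence measure, the function name, and the example profiles are all hypothetical stand-ins.

```python
from math import dist


def classify_then_verify(doc_vector, author_centroids, threshold):
    """Return the best-matching author, or None if verification rejects it.

    author_centroids: {author: feature-vector centroid} (hypothetical profiles)
    threshold: largest distance still accepted as "written by this author"
    """
    # Classification step: closed-world choice among known candidates.
    best = min(author_centroids, key=lambda a: dist(doc_vector, author_centroids[a]))
    # Verification step: accept only if the match clears the threshold;
    # otherwise report that the true author may not be among the candidates.
    if dist(doc_vector, author_centroids[best]) <= threshold:
        return best
    return None


profiles = {"alice": (0.0, 0.0), "bob": (5.0, 5.0)}
close_match = classify_then_verify((0.3, 0.4), profiles, threshold=1.0)
far_match = classify_then_verify((20.0, 20.0), profiles, threshold=1.0)
```

A plain classifier would assign the second document to "bob" simply because he is the nearest candidate; the threshold is what lets the system answer "none of the above", and tuning it trades missed attributions against false ones.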