337 research outputs found

    A Framework for Stylometric Similarity Detection in Online Settings

    Get PDF

    A framework for securing email entrances and mitigating phishing impersonation attacks

    Full text link
    Emails are used every day for communication, and many countries and organisations mostly use email for official communications. It is highly valued and recognised for confidential conversations and transactions in day-to-day business. The Often use of this channel and the quality of information it carries attracted cyber attackers to it. There are many existing techniques to mitigate attacks on email, however, the systems are more focused on email content and behaviour and not securing entrances to email boxes, composition, and settings. This work intends to protect users' email composition and settings to prevent attackers from using an account when it gets hacked or hijacked and stop them from setting forwarding on the victim's email account to a different account which automatically stops the user from receiving emails. A secure code is applied to the composition send button to curtail insider impersonation attack. Also, to secure open applications on public and private devices

    Fighting Authorship Linkability with Crowdsourcing

    Full text link
    Massive amounts of contributed content -- including traditional literature, blogs, music, videos, reviews and tweets -- are available on the Internet today, with authors numbering in many millions. Textual information, such as product or service reviews, is an important and increasingly popular type of content that is being used as a foundation of many trendy community-based reviewing sites, such as TripAdvisor and Yelp. Some recent results have shown that, due partly to their specialized/topical nature, sets of reviews authored by the same person are readily linkable based on simple stylometric features. In practice, this means that individuals who author more than a few reviews under different accounts (whether within one site or across multiple sites) can be linked, which represents a significant loss of privacy. In this paper, we start by showing that the problem is actually worse than previously believed. We then explore ways to mitigate authorship linkability in community-based reviewing. We first attempt to harness the global power of crowdsourcing by engaging random strangers into the process of re-writing reviews. As our empirical results (obtained from Amazon Mechanical Turk) clearly demonstrate, crowdsourcing yields impressively sensible reviews that reflect sufficiently different stylometric characteristics such that prior stylometric linkability techniques become largely ineffective. We also consider using machine translation to automatically re-write reviews. Contrary to what was previously believed, our results show that translation decreases authorship linkability as the number of intermediate languages grows. Finally, we explore the combination of crowdsourcing and machine translation and report on the results

    The Effect of Code Obfuscation on Authorship Attribution of Binary Computer Files

    Get PDF
    In many forensic investigations, questions linger regarding the identity of the authors of the software specimen. Research has identified methods for the attribution of binary files that have not been obfuscated, but a significant percentage of malicious software has been obfuscated in an effort to hide both the details of its origin and its true intent. Little research has been done around analyzing obfuscated code for attribution. In part, the reason for this gap in the research is that deobfuscation of an unknown program is a challenging task. Further, the additional transformation of the executable file introduced by the obfuscator modifies or removes features from the original executable that would have been used in the author attribution process. Existing research has demonstrated good success in attributing the authorship of an executable file of unknown provenance using methods based on static analysis of the specimen file. With the addition of file obfuscation, static analysis of files becomes difficult, time consuming, and in some cases, may lead to inaccurate findings. This paper presents a novel process for authorship attribution using dynamic analysis methods. A software emulated system was fully instrumented to become a test harness for a specimen of unknown provenance, allowing for supervised control, monitoring, and trace data collection during execution. This trace data was used as input into a supervised machine learning algorithm trained to identify stylometric differences in the specimen under test and provide predictions on who wrote the specimen. The specimen files were also analyzed for authorship using static analysis methods to compare prediction accuracies with prediction accuracies gathered from this new, dynamic analysis based method. Experiments indicate that this new method can provide better accuracy of author attribution for files of unknown provenance, especially in the case where the specimen file has been obfuscated

    StyloThai: A scalable framework for stylometric authorship identification of Thai documents

    Get PDF
    This is an accepted manuscript of an article published by ACM in ACM Transactions on Asian and Low-Resource Language Information Processing in January 2020, available online: https://doi.org/10.1145/3365832 The accepted version of the publication may differ from the final published version.© 2020 Association for Computing Machinery. All rights reserved. Authorship identification helps to identify the true author of a given anonymous document from a set of candidate authors. The applications of this task can be found in several domains, such as law enforcement agencies and information retrieval. These application domains are not limited to a specific language, community, or ethnicity. However, most of the existing solutions are designed for English, and a little attention has been paid to Thai. These existing solutions are not directly applicable to Thai due to the linguistic differences between these two languages. Moreover, the existing solution designed for Thai is unable to (i) handle outliers in the dataset, (ii) scale when the size of the candidate authors set increases, and (iii) perform well when the number of writing samples for each candidate author is low.We identify a stylometric feature space for the Thai authorship identification task. Based on our feature space, we present an authorship identification solution that uses the probabilistic k nearest neighbors classifier by transforming each document into a collection of point sets. Specifically, this document transformation allows us to (i) use set distance measures associated with an outlier handling mechanism, (ii) capture stylistic variations within a document, and (iii) produce multiple predictions for a query document. We create a new Thai authorship identification corpus containing 547 documents from 200 authors, which is significantly larger than the corpus used by the existing study (an increase of 32 folds in terms of the number of candidate authors). The experimental results show that our solution can overcome the limitations of the existing solution and outperforms all competitors with an accuracy level of 91.02%. Moreover, we investigate the effectiveness of each stylometric features category with the help of an ablation study. We found that combining all categories of the stylometric features outperforms the other combinations. Finally, we cross compare the feature spaces and classification methods of all solutions. We found that (i) our solution can scale as the number of candidate authors increases, (ii) our method outperforms all the competitors, and (iii) our feature space provides better performance than the feature space used by the existing study.The research was partially supported by the Digital Economy Promotion Agency (project# MP-62- 0003); and Thailand Research Fund and Office of the Higher Education Commission (MRG6180266).Published versio

    A scalable framework for stylometric analysis query processing

    Get PDF
    This is an accepted manuscript of an article published by IEEE in 2016 IEEE 16th International Conference on Data Mining (ICDM) on 02/02/2017, available online: https://ieeexplore.ieee.org/document/7837960 The accepted version of the publication may differ from the final published version.Stylometry is the statistical analyses of variationsin the author's literary style. The technique has been used inmany linguistic analysis applications, such as, author profiling, authorship identification, and authorship verification. Over thepast two decades, authorship identification has been extensivelystudied by researchers in the area of natural language processing. However, these studies are generally limited to (i) a small number of candidate authors, and (ii) documents with similar lengths. In this paper, we propose a novel solution by modeling authorship attribution as a set similarity problem to overcome the two stated limitations. We conducted extensive experimental studies on a real dataset collected from an online book archive, Project Gutenberg. Experimental results show that in comparison to existing stylometry studies, our proposed solution can handlea larger number of documents of different lengths written by alarger pool of candidate authors with a high accuracy.Published versio
    • …
    corecore