14 research outputs found

    Computational approaches for verbal deception detection.

    Deception exists in all aspects of life and is particularly evident on the Web. Deception includes child sexual predators grooming victims online, medical news headlines with little medical evidence or scientific rigour, individuals claiming others’ work as their own, and the systematic deception of company shareholders and institutional investors that leads to corporate collapses. This thesis explores the potential for automatic detection of deception. We investigate the nature of deception and the related cues, focusing in particular on verbal cues, and conclude that they cannot be readily generalised. We demonstrate how deception-specific features, based on sound hypotheses, can overcome related limitations by presenting approaches for three different examples of deception: Child Sexual Predator Detection (SPD), Authorship Identification (AI) and Intrinsic Plagiarism Detection (IPD). We further show how our approaches result in competitive levels of reliability. For SPD we develop our approach largely around the commonality of requests for key personal information. To address AI, we introduce approaches based on a frequency-mean-variance and a frequency-only framework in order to detect strong associations between co-occurring patterns of a limited number of stopwords. Our IPD approaches are based on simple commonality of words at document level and usage of proper nouns; document sections lacking commonality can be identified as plagiarised. The frameworks of the International Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN) competitions provided an independent evaluation of the approaches. The SPD approach obtained an F1 score of 0.48. F1 scores of 0.47, 0.53 and 0.57 were achieved in the AI tasks for PAN2012, 2013 and 2014, respectively. IPD yielded an overall accuracy of 91%. Through post-competition adaptations, we also show how the approaches and their scores can be improved, demonstrate the importance of suitable datasets, and show that most approaches are not easily transferable between different types of deception.
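
    The frequency-only stopword co-occurrence idea above can be illustrated with a minimal sketch. The stopword list, window size and scoring below are illustrative assumptions, not the parameters used in the thesis: each document is reduced to counts of stopword pairs seen within a short token window, and two documents are compared by the cosine similarity of those counts.

```python
from collections import Counter
from itertools import combinations

# Illustrative stopword list; the thesis's actual list is not reproduced here.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "was", "it", "for"}

def cooccurrence_profile(text, window=5):
    """Count co-occurring stopword pairs within a sliding token window."""
    tokens = text.lower().split()
    profile = Counter()
    for i in range(len(tokens)):
        in_window = sorted({t for t in tokens[i:i + window] if t in STOPWORDS})
        for a, b in combinations(in_window, 2):
            profile[(a, b)] += 1
    return profile

def profile_similarity(p1, p2):
    """Cosine similarity between two co-occurrence profiles (Counters)."""
    keys = set(p1) | set(p2)
    dot = sum(p1[k] * p2[k] for k in keys)
    norm1 = sum(v * v for v in p1.values()) ** 0.5
    norm2 = sum(v * v for v in p2.values()) ** 0.5
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0
```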

    A Trinity of Trials: Surrey's 2014 Attempts at Author Verification - Notebook for PAN at CLEF 2014

    Encouraged by results from our approaches in previous PAN workshops, we explore three different approaches using stopword co-occurrence. High-frequency patterns of co-occurrence can be used to some extent as identifiers of an author’s style, and have been demonstrated to operate similarly across certain languages, without requiring deeper linguistic knowledge. However, how best to use such information remains unresolved. We compare results from applying three approaches over such patterns: a frequency-mean-variance framework, a positional-frequency cosine comparison approach, and a cosine distance-based approach. A clearly advantageous approach across all languages and genres is yet to emerge.
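
    As a rough illustration of the first of these three approaches, the sketch below applies a frequency-mean-variance style check: for each tracked stopword pair, the rate observed in the unknown document is compared against the mean and standard deviation of the rates in the known documents. The per-pair statistic, the tolerance k and the majority threshold are assumptions for illustration, not the values used in the notebook.

```python
import statistics

def pair_rate(profile, pair, n_tokens):
    """Occurrences of a stopword pair per 1,000 tokens (profile is a dict/Counter)."""
    return 1000.0 * profile.get(pair, 0) / max(n_tokens, 1)

def consistent_with_author(known_profiles, known_lengths, unknown_profile,
                           unknown_length, pairs, k=2.0):
    """Accept the unknown document as same-author if, for most tracked pairs,
    its rate lies within k standard deviations of the known-document mean."""
    hits = 0
    for pair in pairs:
        rates = [pair_rate(p, pair, n) for p, n in zip(known_profiles, known_lengths)]
        mean = statistics.mean(rates)
        sd = statistics.pstdev(rates)
        rate = pair_rate(unknown_profile, pair, unknown_length)
        if abs(rate - mean) <= k * sd + 1e-9:
            hits += 1
    return hits / len(pairs) >= 0.5  # assumed majority-vote threshold
```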

    From English to Persian: Conversion of Text Alignment for Plagiarism Detection

    This paper briefly describes the approach taken to Persian Plagiarism Detection, based on modifications made to the approach used for PAN between 2011 and 2014 in order to adapt it to Persian. This effort has offered us the opportunity to evaluate detection performance for the same approach in another language. A key part of the motivation remains that of undertaking plagiarism detection in such a way as to make it highly unlikely that the content being matched against could be determined from the matches made, and hence to allow for privacy.
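
    The paper does not spell out here how that privacy property is obtained; purely as an assumption, one common way to match passages without exposing the reference text is to compare salted hashes of word n-grams, so that only hash values of the reference collection are ever shared. The sketch below illustrates that idea and is not the authors' method.

```python
import hashlib

def ngram_hashes(text, n=5, salt=b"shared-secret"):
    """Salted SHA-256 hashes of all word n-grams in a text."""
    tokens = text.lower().split()
    hashes = set()
    for i in range(len(tokens) - n + 1):
        gram = " ".join(tokens[i:i + n]).encode("utf-8")
        hashes.add(hashlib.sha256(salt + gram).hexdigest())
    return hashes

def overlap_score(suspicious_text, reference_hashes, n=5, salt=b"shared-secret"):
    """Fraction of the suspicious document's n-grams whose hash appears in the
    (undisclosed) reference collection, built with the same n and salt."""
    suspect = ngram_hashes(suspicious_text, n, salt)
    if not suspect:
        return 0.0
    return len(suspect & reference_hashes) / len(suspect)
```

    In such a scheme the reference holder publishes only ngram_hashes(reference_text), so matches can be counted without the matched content being recoverable.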

    A Big Increase in Known Unknowns: from Author Verification to Author Clustering - Notebook for PAN at CLEF 2016

    Previous PAN workshops have afforded evaluation of our approaches to author verification/identification based on stopword co-occurrence patterns. Problems have tended to involve comparing one document to a small set of documents (n<=5) of known authorship. This paper discusses the adaptation of one of our approaches to the PAN 2016 problem of author clustering, which involves generating clusters within larger sets of documents (n<=100) for an unknown number of distinct authors, where each set is in English, Dutch or Greek. We describe our previous approaches as the background to the approach taken to this task and briefly overview the results that were achieved, which are not expected to be particularly remarkable due to substantial limitations on the time we had available for the task.
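
    A minimal sketch of the clustering step is given below, assuming a pairwise similarity function and a simple threshold-plus-connected-components rule; the actual similarity measure, threshold and clustering method used for PAN 2016 are not reproduced here.

```python
def cluster_by_threshold(similarity, n_docs, threshold=0.8):
    """similarity(i, j) -> float in [0, 1]; returns clusters as sets of document indices.
    Documents joined by any similarity >= threshold end up in the same cluster."""
    parent = list(range(n_docs))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for i in range(n_docs):
        for j in range(i + 1, n_docs):
            if similarity(i, j) >= threshold:
                union(i, j)

    clusters = {}
    for i in range(n_docs):
        clusters.setdefault(find(i), set()).add(i)
    return list(clusters.values())
```

    For example, given a precomputed matrix sims, clusters = cluster_by_threshold(lambda i, j: sims[i][j], len(sims)).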

    Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection

    Tasks such as Authorship Attribution, Intrinsic Plagiarism Detection and Sexual Predator Identification address representative attempts to deceive. In the first two, authors try to convince others that the presented work is theirs, and in the third there is an attempt to convince readers to take actions based on false beliefs or ill-perceived risks. In this paper, we discuss our approaches to these tasks in the Author Identification track at PAN2012, which represents our first proper attempt at any of them. Our initial intention was to determine whether cues of deception, documented in the literature, might be relevant to such tasks. However, it quickly became apparent that such cues would not be readily useful, and we discuss the results achieved using some simple but relatively novel approaches: for the Traditional Authorship Attribution task, we show how a mean-variance framework using just 10 stopwords detects 42.8% and could obtain 52.12% using fewer; for Intrinsic Plagiarism Detection, frequent words achieved 91.1% overall; and for Sexual Predator Identification, we used just a few features covering requests for personal information, with mixed results.
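
    The frequent-words idea for Intrinsic Plagiarism Detection can be sketched as follows: sections that share few of the document's most frequent words are flagged as potentially written by someone else. The section length, vocabulary size and threshold below are illustrative assumptions rather than the parameters behind the reported 91.1%.

```python
from collections import Counter

def flag_outlier_sections(text, section_len=200, top_k=50, min_overlap=0.15):
    """Return token-offset ranges of sections lacking commonality with the
    document's top_k most frequent words."""
    tokens = text.lower().split()
    frequent = {w for w, _ in Counter(tokens).most_common(top_k)}
    flagged = []
    for start in range(0, len(tokens), section_len):
        section = tokens[start:start + section_len]
        if not section:
            continue
        overlap = sum(1 for t in section if t in frequent) / len(section)
        if overlap < min_overlap:
            flagged.append((start, start + len(section)))
    return flagged
```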
