6 research outputs found

    A systematic survey of online data mining technology intended for law enforcement

    Get PDF
    As an increasing amount of crime takes on a digital aspect, law enforcement bodies must tackle an online environment generating huge volumes of data. With manual inspections becoming increasingly infeasible, law enforcement bodies are optimising online investigations through data-mining technologies. Such technologies must be well designed and rigorously grounded, yet no survey of the online data-mining literature exists which examines their techniques, applications and rigour. This article remedies this gap through a systematic mapping study describing online data-mining literature which visibly targets law enforcement applications, using evidence-based practices in survey making to produce a replicable analysis which can be methodologically examined for deficiencies

    Data quality measures for identity resolution

    Get PDF
    The explosion in popularity of online social networks has led to increased interest in identity resolution from security practitioners. Being able to connect together the multiple online accounts of a user can be of use in verifying identity attributes and in tracking the activity of malicious users. At the same time, privacy researchers are exploring the same phenomenon with interest in identifying privacy risks caused by re-identification attacks. Existing literature has explored how particular components of an online identity may be used to connect profiles, but few if any studies have attempted to assess the comparative value of information attributes. In addition, few of the methods being reported are easily comparable, due to difficulties with obtaining and sharing ground- truth data. Attempts to gain a comprehensive understanding of the identifiability of profile attributes are hindered by these issues. With a focus on overcoming these hurdles to effective research, this thesis first develops a methodology for sampling ground-truth data from online social networks. Building on this with reference to both existing literature and samples of real profile data, this thesis describes and grounds a comprehensive matching schema of profile attributes. The work then defines data quality measures which are important for identity resolution, and measures the availability, consistency and uniqueness of the schema’s contents. The developed measurements are then applied in a feature selection scheme to reduce the impact of missing data issues common in identity resolution. Finally, this thesis addresses the purposes to which identity resolution may be applied, defining the further application-oriented data quality measurements of novelty, veracity and relevance, and demonstrating their calculation and application for a particular use case: evaluating the social engineering vulnerability of an organisation

    A scalable framework for cross-lingual authorship identification

    Get PDF
    This is an accepted manuscript of an article published by Elsevier in Information Sciences on 10/07/2018, available online: https://doi.org/10.1016/j.ins.2018.07.009 The accepted version of the publication may differ from the final published version.© 2018 Elsevier Inc. Cross-lingual authorship identification aims at finding the author of an anonymous document written in one language by using labeled documents written in other languages. The main challenge of cross-lingual authorship identification is that the stylistic markers (features) used in one language may not be applicable to other languages in the corpus. Existing methods overcome this challenge by using external resources such as machine translation and part-of-speech tagging. However, such solutions are not applicable to languages with poor external resources (known as low resource languages). They also fail to scale as the number of candidate authors and/or the number of languages in the corpus increases. In this investigation, we analyze different types of stylometric features and identify 10 high-performance language-independent features for cross-lingual stylometric analysis tasks. Based on these stylometric features, we propose a cross-lingual authorship identification solution that can accurately handle a large number of authors. Specifically, we partition the documents into fragments where each fragment is further decomposed into fixed size chunks. Using a multilingual corpus of 400 authors with 825 documents written in 6 different languages, we show that our method can achieve an accuracy level of 96.66%. Our solution also outperforms the best existing solution that does not rely on external resources.Published versio

    Urdu AI: writeprints for Urdu authorship identification

    Get PDF
    This is an accepted manuscript of an article published by ACM in ACM Transactions on Asian and Low-Resource Language Information Processing on 31/10/2021, available online at: https://doi.org/10.1145/3476467 The accepted version of the publication may differ from the final published version.The authorship identification task aims at identifying the original author of an anonymous text sample from a set of candidate authors. It has several application domains such as digital text forensics and information retrieval. These application domains are not limited to a specific language. However, most of the authorship identification studies are focused on English and limited attention has been paid to Urdu. On the other hand, existing Urdu authorship identification solutions drop accuracy as the number of training samples per candidate author reduces, and when the number of candidate author increases. Consequently, these solutions are inapplicable to real-world cases. To overcome these limitations, we formulate a stylometric feature space. Based on this feature space we use an authorship identification solution that transforms each text sample into point set, retrieves candidate text samples, and relies the nearest neighbour classifier to predict the original author of the anonymous text sample. To evaluate our method, we create a significantly larger corpus than existing studies and conduct several experimental studies which show that our solution can overcome the limitations of existing studies and report an accuracy level of 94.03%, which is higher than all previous authorship identification works

    AUTHOR VERIFICATION OF ELECTRONIC MESSAGING SYSTEMS

    Get PDF
    Messaging systems have become a hugely popular new paradigm for sending and delivering text messages; however, online messaging platforms have also become an ideal place for criminals due to their anonymity, ease of use and low cost. Therefore, the ability to verify the identity of individuals involved in criminal activity is becoming increasingly important. The majority of research in this area has focused on traditional authorship problems that deal with single-domain datasets and large bodies of text. Few research studies have sought to explore multi-platform author verification as a possible solution to problems around forensics and security. Therefore, this research has investigated the ability to identify individuals on messaging systems, and has applied this to the modern messaging platforms of Email, Twitter, Facebook and Text messages, using different single-domain datasets for population-based and user-based verification approaches. Through a novel technique of cross-domain research using real scenarios, the domain incompatibilities of profiles from different distributions has been assessed, based on real-life corpora using data from 50 authors who use each of the aforementioned domains. The results show that the use of linguistics is likely be similar between platforms, on average, for a population-based approach. The best corpus experimental result achieved a low EER of 7.97% for Text messages, showing the usefulness of single-domain platforms where the use of linguistics is likely be similar, such as Text messages and Emails. For the user-based approach, there is very little evidence of a strong correlation of stylometry between platforms. It has been shown that linguistic features on some individual platforms have features in common with other platforms, and lexical features play a crucial role in the similarities between users’ modern platforms. Therefore, this research shows that the ability to identify individuals on messaging platforms may provide a viable solution to problems around forensics and security, and help against a range of criminal activities, such as sending spam texts, grooming children, and encouraging violence and terrorism.Royal Embassy of Saudi Arabia, Londo

    Tune your brown clustering, please

    Get PDF
    Brown clustering, an unsupervised hierarchical clustering technique based on ngram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration; the appropriateness of this configuration has gone predominantly unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyper-parametre tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the dynamic between the input corpus size, chosen number of classes, and quality of the resulting clusters, which has an impact for any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal
    corecore