219 research outputs found

    Protectbot: A Chatbot to Protect Children on Gaming Platforms

    Access to online gaming is no longer limited; in recent years it has become available to a high percentage of children. Consequently, children are exposed to multifaceted threats, such as cyberbullying, grooming, and sexting. Although the online gaming industry is taking concerted measures to create a safe environment for children to play and interact in, such efforts remain inadequate and fragmented. Different approaches utilizing machine learning (ML) techniques to detect child predatory behavior have been designed to provide potential detection and protection in this context. An analysis of the available AI tools and solutions showed that they are limited to identifying predatory behavior in chat logs, which is not enough to avert the multifaceted threats. In this thesis, we developed a chatbot, Protectbot, to interact with the suspect on the gaming platform. Protectbot leverages the dialogue generative pre-trained transformer (DialoGPT) model, which is based on Generative Pre-trained Transformer 2 (GPT-2). To analyze the suspect's behavior, we developed a text classifier based on natural language processing that classifies chats as predatory or non-predatory. The classifier is trained and tested on the PAN 12 dataset. To convert the text into numerical vectors, we utilized fastText. The best results are obtained by using a non-linear SVM on sentence vectors obtained from fastText. We obtained a recall of 0.99 and an F_0.5-score of 0.99, which is better than the state-of-the-art methods. We also built a new dataset containing 71 full predatory chats retrieved from Perverted Justice. Using sentence vectors generated by fastText and a KNN classifier, 66 chats out of 71 were correctly classified as predatory chats

    Identifying Online Sexual Predators Using Support Vector Machine

    A two-stage classification model is built in this research for online sexual predator identification. The first stage identifies the suspicious conversations that have predator participants. The second stage identifies the predators in suspicious conversations. Support vector machines are used with word and character n-grams, combined with behavioural features of the authors, to train the final classifier. The unbalanced dataset is downsampled to test the effect of re-balancing on performance. An age group classification model is also constructed to test the feasibility of extracting the age profile of the authors, which can be used as features for classifier training. Re-balancing the unbalanced dataset resulted in better performance of the classifier. When the two-stage classification model is tested on the unseen test set, 171 out of 254 predators are successfully identified, giving a precision of 0.85, recall of 0.67 and f-score of 0.807. Comparing the classification performance with and without the behavioural features shows that the n-grams contributed the most to the performance of the classifier, while the behavioural features do not contribute significantly to the performance
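    The reported figures can be checked with a few lines of arithmetic. The sketch below is not taken from the paper: the function name `f_beta` and the choice of beta values are assumptions made for illustration. It recomputes recall from the 171 of 254 predators identified and evaluates the standard F-beta formula with the stated precision of 0.85. Under these assumptions, beta = 0.5 gives a value close to the reported 0.807, while beta = 1 gives roughly 0.75, which suggests the abstract's "f-score" is an F0.5 score (the metric commonly used with the PAN 2012 task).

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: weighted harmonic mean of precision and recall.

    beta < 1 weights precision more heavily; beta = 1 is the usual F1.
    """
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


# Figures quoted in the abstract.
identified = 171      # predators correctly identified
total = 254           # predators in the unseen test set
precision = 0.85      # as reported (already rounded in the abstract)

recall = identified / total           # fraction of predators found
f1 = f_beta(precision, recall, 1.0)   # balanced F-score
f05 = f_beta(precision, recall, 0.5)  # precision-weighted F-score

print(f"recall = {recall:.3f}")
print(f"F1     = {f1:.3f}")
print(f"F0.5   = {f05:.3f}")
```

    Because the abstract's precision is itself rounded, the recomputed value can differ from the reported one in the third decimal place.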

    Methodologies for the Management, Normalization and Identification of Sexual Predation of Minors in Cyber Chat Logs

    Neural networks based on the Transformer architecture have shown great results in tasks such as machine translation and text generation. Our contribution provides a methodology for an AI agent capable of Sexual Predator Identification (SPI) based on the classification capabilities of models built on the Transformer architecture. Results are comparable to existing state-of-the-art methods, with an F0.5 score of 92.5% for predator identification on the PAN2012 test dataset consisting of 2,004,235 lines of text. Practical considerations require an AI agent that can evaluate large numbers of chats quickly. In that regard, the Transformer-based AI agent is able to evaluate over 2 million lines of text in under 6 minutes on a modestly configured workstation. An AI agent by itself does not provide a complete solution to sexual predator identification. In an effort to give practical value to an AI agent, we address the vitally important but often overlooked issues of chat management and normalization. Our contribution provides a methodology for efficiently transforming raw chats from a native format into a consistent 'normalized' format suitable for analysis. We define a methodology for managing large numbers of chats, converting/normalizing 10,000 documents in a dataset in under 3 minutes on a modestly configured workstation. We present a software-based solution that, among other things, brings together chat management, normalization, and AI-based analysis into a cohesive, productive environment that law enforcement can use to identify and build a case against suspected predators

    Online Sexual Predator Detection

    Online sexual abuse is a concerning yet severely overlooked vice of modern society. With more children being on the Internet and with the ever-increasing advent of web applications such as online chatrooms and multiplayer games, preying on vulnerable users has become more accessible for predators. In recent years, there has been work on detecting online sexual predators using machine learning and deep learning techniques. Such work has trained on severely imbalanced datasets, and imbalance is handled via manual trimming of over-represented labels. In this work, we propose an approach that first tackles the problem of imbalance and then improves the effectiveness of the underlying classifiers. Our evaluation of the proposed sampling approach on the PAN benchmark dataset shows performance improvements on several classification metrics, compared to prior methods that otherwise require hand-crafted sampling of the data

    A systematic survey of online data mining technology intended for law enforcement

    As an increasing amount of crime takes on a digital aspect, law enforcement bodies must tackle an online environment generating huge volumes of data. With manual inspections becoming increasingly infeasible, law enforcement bodies are optimising online investigations through data-mining technologies. Such technologies must be well designed and rigorously grounded, yet no survey of the online data-mining literature exists which examines their techniques, applications and rigour. This article remedies this gap through a systematic mapping study describing online data-mining literature which visibly targets law enforcement applications, using evidence-based practices in survey making to produce a replicable analysis which can be methodologically examined for deficiencies

    Exploration of Misogyny in Spanish and English tweets


    A human-centered systematic literature review of the computational approaches for online sexual risk detection

    In the era of big data and artificial intelligence, online risk detection has become a popular research topic. From detecting online harassment to the sexual predation of youth, the state-of-the-art in computational risk detection has the potential to protect particularly vulnerable populations from online victimization. Yet, this is a high-risk, high-reward endeavor that requires a systematic and human-centered approach to synthesize disparate bodies of research across different application domains, so that we can identify best practices, potential gaps, and set a strategic research agenda for leveraging these approaches in a way that betters society. Therefore, we conducted a comprehensive literature review to analyze 73 peer-reviewed articles on computational approaches utilizing text or meta-data/multimedia for online sexual risk detection. We identified sexual grooming (75%), sex trafficking (12%), and sexual harassment and/or abuse (12%) as the three types of sexual risk detection present in the extant literature. Furthermore, we found that the majority (93%) of this work has focused on identifying sexual predators after-the-fact, rather than taking more nuanced approaches to identify potential victims and problematic patterns that could be used to prevent victimization before it occurs. Many studies rely on public datasets (82%) and third-party annotators (33%) to establish ground truth and train their algorithms. Finally, the majority of this work (78%) focused on algorithmic performance evaluation of the model and rarely (4%) evaluated these systems with real users. Thus, we urge computational risk detection researchers to integrate more human-centered approaches to both developing and evaluating sexual risk detection algorithms to ensure the broader societal impacts of this important work.

    Statistical models for the analysis of short user-generated documents: author identification for conversational documents

    In recent years, short user-generated documents have been gaining popularity on the Internet and attention in the research communities. Documents of this kind are generated by users of various online services: platforms for instant messaging, for real-time status posting, for discussion and for writing reviews. Each of these services allows users to generate written texts with particular properties, which might require specific algorithms to be analysed. In this dissertation we present our work, which aims at analysing this kind of document. We conducted qualitative and quantitative studies to identify the properties that might allow for characterising them. We compared the properties of these documents with the properties of standard documents employed in the literature, such as newspaper articles, and defined a set of characteristics that are distinctive of the documents generated online. We also observed two classes within the online user-generated documents: the conversational documents and those involving group discussions. We later focused on the class of conversational documents, which are short and spontaneous. We created a novel collection of real conversational documents retrieved online (e.g. Internet Relay Chat) and distributed it as part of an international competition (PAN @ CLEF'12). The competition was about author characterisation, which is one of the possible studies of authorship attribution documented in the literature. Another field of study is authorship identification, which became our main topic of research. We approached the authorship identification problem in its closed-class variant. For each problem we employed documents from the collection we released and from a collection of Twitter messages, as representative of conversational or short user-generated documents. We proved the unsuitability of standard authorship identification techniques for conversational documents and proposed novel methods capable of reaching better accuracy rates. As opposed to standard methods that worked well only for a few authors, the proposed technique allowed for reaching significant results even for hundreds of users

    Automatic Identification of Online Predators in Chat Logs by Anomaly Detection and Deep Learning

    Providing a safe environment for juveniles and children in online social networks is considered a major factor in improving public safety. Due to the prevalence of online conversations, mitigating the undesirable effects of juvenile abuse in cyberspace has become inevitable. Using automatic ways to address this kind of crime is challenging and demands efficient and scalable data mining techniques. The problem can be cast as a combination of textual preprocessing in data/text mining and binary classification in machine learning. This thesis proposes two machine learning approaches to deal with the following two issues in the domain of online predator identification: 1) The first problem is that gathering a comprehensive set of negative training samples is unrealistic due to the nature of the problem. This problem is addressed by applying an existing method for semi-supervised anomaly detection that allows training based on only one class label. The method was tested on two datasets; 2) The second issue is improving the performance of current binary classification methods in terms of classification accuracy and F1-score. In this regard, we have customized a deep learning approach, the Convolutional Neural Network, to be used in this domain. Using this approach, we show that the classification performance (F1-score) is improved by almost 1.7% compared to the Support Vector Machine classification method. Two different datasets were used in the empirical experiments: PAN-2012 and SQ (Sûreté du Québec). The former is a large public dataset that has been used extensively in the literature and the latter is a small dataset collected from the Sûreté du Québec

    A Human-Centered Approach to Improving Adolescent Online Sexual Risk Detection Algorithms

    Computational risk detection has the potential to protect especially vulnerable populations from online victimization. A comprehensive literature review of computational approaches for online sexual risk detection showed that the majority of this work has focused on identifying sexual predators after-the-fact. Also, many studies rely on public datasets and third-party annotators to establish ground truth and train their algorithms, which do not accurately represent young social media users and their perspectives to prevent victimization. To address these gaps, this dissertation integrated human-centered approaches to both creating representative datasets and developing sexual risk detection machine learning models to ensure the broader societal impacts of this important work. In order to understand what and how adolescents talk about their online sexual interactions to inform study designs, a thematic content analysis of posts by adolescents on an online peer-support mental health platform was conducted. Then, a user study and web-based platform, Instagram Data Donation (IGDD), was designed to create an ecologically valid dataset. Youth could donate and annotate their Instagram data for online risks. After youth participated in the study, an interview study was conducted to understand how they felt about annotating data for online risks. Based on private conversations annotated by participants, sexual risk detection classifiers were created. The results indicated that Convolutional Neural Network (CNN) and Random Forest models outperformed the other models in identifying sexual risks at the conversation level. Our experiments showed that classifiers trained on entire conversations performed better than message-level classifiers. We also trained classifiers to detect the severity risk level of a given message, with CNN outperforming other models. We found that contextual (e.g., age, gender, and relationship type) and psycho-linguistic features contributed the most to accurately detecting sexual conversations. Our analysis provides insights into the important factors that enhance automated detection of sexual risks within youths' private conversations