118 research outputs found

    Online Sexual Predator Detection

    Online sexual abuse is a concerning yet severely overlooked vice of modern society. With more children on the Internet and the ever-increasing availability of web applications such as online chatrooms and multiplayer games, preying on vulnerable users has become easier for predators. In recent years, there has been work on detecting online sexual predators using machine learning and deep learning techniques. Such work has trained on severely imbalanced datasets, with the imbalance handled by manually trimming over-represented labels. In this work, we propose an approach that first tackles the problem of imbalance and then improves the effectiveness of the underlying classifiers. Our evaluation of the proposed sampling approach on the PAN benchmark dataset shows performance improvements on several classification metrics compared to prior methods, which require hand-crafted sampling of the data.
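The abstract does not say which sampling method the authors propose. As a generic illustration of automated class balancing only, the sketch below uses plain random oversampling, a swapped-in technique and not the paper's method. The function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)

    # Every class is grown to the size of the largest one.
    target = max(len(group) for group in by_label.values())

    out_samples, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy, made-up data: label 1 is the minority class, label 0 the majority.
texts = ["a", "b", "c", "d", "e", "f"]
labels = [0, 0, 0, 0, 0, 1]

balanced_texts, balanced_labels = random_oversample(texts, labels)
print(Counter(balanced_labels))
```

Resampling of this kind is normally applied to the training split only, after the train/test split has been made, so that duplicated samples cannot leak into the evaluation data.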

    Automatic Identification of Online Predators in Chat Logs by Anomaly Detection and Deep Learning

    Providing a safe environment for juveniles and children in online social networks is considered a major factor in improving public safety. Due to the prevalence of online conversations, mitigating the undesirable effects of juvenile abuse in cyberspace has become imperative. Using automatic ways to address this kind of crime is challenging and demands efficient and scalable data mining techniques. The problem can be cast as a combination of textual preprocessing in data/text mining and binary classification in machine learning. This thesis proposes two machine learning approaches to deal with the following two issues in the domain of online predator identification: 1) The first problem is that gathering a comprehensive set of negative training samples is unrealistic due to the nature of the problem. This problem is addressed by applying an existing method for semi-supervised anomaly detection that allows training based on only one class label. The method was tested on two datasets; 2) The second issue is improving the performance of current binary classification methods in terms of classification accuracy and F1-score. In this regard, we have customized a deep learning approach, the Convolutional Neural Network, for use in this domain. Using this approach, we show that the classification performance (F1-score) is improved by almost 1.7% compared to a Support Vector Machine classifier. Two different datasets were used in the empirical experiments: PAN-2012 and SQ (Sûreté du Québec). The former is a large public dataset that has been used extensively in the literature and the latter is a small dataset collected from the Sûreté du Québec.

    A human-centered systematic literature review of the computational approaches for online sexual risk detection

    In the era of big data and artificial intelligence, online risk detection has become a popular research topic. From detecting online harassment to the sexual predation of youth, the state-of-the-art in computational risk detection has the potential to protect particularly vulnerable populations from online victimization. Yet, this is a high-risk, high-reward endeavor that requires a systematic and human-centered approach to synthesize disparate bodies of research across different application domains, so that we can identify best practices, potential gaps, and set a strategic research agenda for leveraging these approaches in a way that betters society. Therefore, we conducted a comprehensive literature review to analyze 73 peer-reviewed articles on computational approaches utilizing text or meta-data/multimedia for online sexual risk detection. We identified sexual grooming (75%), sex trafficking (12%), and sexual harassment and/or abuse (12%) as the three types of sexual risk detection present in the extant literature. Furthermore, we found that the majority (93%) of this work has focused on identifying sexual predators after-the-fact, rather than taking more nuanced approaches to identify potential victims and problematic patterns that could be used to prevent victimization before it occurs. Many studies rely on public datasets (82%) and third-party annotators (33%) to establish ground truth and train their algorithms. Finally, the majority of this work (78%) focused on evaluating the algorithmic performance of its models and rarely (4%) evaluated these systems with real users. Thus, we urge computational risk detection researchers to integrate more human-centered approaches to both developing and evaluating sexual risk detection algorithms to ensure the broader societal impacts of this important work.

    Protectbot: A Chatbot to Protect Children on Gaming Platforms

    Online gaming no longer has limited access, as it has become available to a high percentage of children in recent years. Consequently, children are exposed to multifaceted threats, such as cyberbullying, grooming, and sexting. Although the online gaming industry is taking concerted measures to create a safe environment for children to play and interact in, such efforts remain inadequate and fragmented. Different approaches utilizing machine learning (ML) techniques to detect child predatory behavior have been designed to provide potential detection and protection in this context. After analyzing the available AI tools and solutions, it was observed that they are limited to identifying predatory behavior in chat logs, which is not enough to avert the multifaceted threats. In this thesis, we developed a chatbot, Protectbot, to interact with the suspect on the gaming platform. Protectbot leverages the dialogue generative pre-trained transformer (DialoGPT) model, which is based on Generative Pre-trained Transformer 2 (GPT-2). To analyze the suspect's behavior, we developed a text classifier based on natural language processing that can classify chats as predatory or non-predatory. The classifier is trained and tested on the PAN-12 dataset. To convert the text into numerical vectors, we utilized fastText. The best results are obtained by using a non-linear SVM on sentence vectors obtained from fastText. We obtained a recall of 0.99 and an F_0.5-score of 0.99, which is better than the state-of-the-art methods. We also built a new dataset containing 71 full predatory chats retrieved from Perverted Justice. Using sentence vectors generated by fastText and a KNN classifier, 66 chats out of 71 were correctly classified as predatory chats.
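For readers unfamiliar with the F_0.5-score quoted above, the following minimal sketch implements the standard F-beta formula, in which a beta below 1 weights precision more heavily than recall. The helper name `f_beta` and the 0.9/0.6 precision and recall pair are hypothetical; only the 66-out-of-71 figure comes from the abstract.

```python
def f_beta(precision, recall, beta):
    """Standard F-beta score:
    (1 + beta^2) * P * R / (beta^2 * P + R).
    With beta < 1, precision counts for more than recall."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # defined as 0 when precision and recall are both 0
    return (1 + beta ** 2) * precision * recall / denominator


# Recall on the new dataset from the abstract: every one of the 71 chats
# is predatory, so recall is simply the fraction that was detected.
recall_new_dataset = 66 / 71
print(round(recall_new_dataset, 4))

# Hypothetical precision/recall pairs, only to show the asymmetric weighting.
print(f_beta(0.9, 0.6, beta=0.5))  # high precision, lower recall
print(f_beta(0.6, 0.9, beta=0.5))  # same values swapped scores lower
```

When precision and recall are equal, the score equals that common value for any beta, so equal precision and recall of 0.99 would give an F_0.5-score of 0.99; the abstract itself reports only the recall and the F_0.5-score.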

    Predicting US Elections with Social Media and Neural Networks

    Increasingly, politicians and political parties are engaging their electors using social media. In the US Federal Election of 2016, candidates from both parties made heavy use of social media, particularly Twitter. It is then reasonable to attempt to find a correlation between popularity on Twitter and eventual popular vote in the election. In this thesis, we focus on using the ‘location’ field in the profile of each candidate's subscribers to estimate support in each state. A major challenge is that the Twitter location field in a user profile is not constrained, requiring the application of machine learning techniques to cluster users according to state. In this thesis, we train a Deep Convolutional Neural Network (CNN) to classify place names by state. Then we apply the model to the ‘location’ field of Twitter subscribers collected from each of the two candidates, Hillary Clinton (D) and Donald Trump (R). Finally, we compare predicted popular votes in each state to the actual results from the 2016 Presidential Election. The hypothesis is that a city name has a strong correlation to the people who founded it and then incorporated it. Further, it’s hypothesized that the original settlers were mostly homogeneous relative to the country of origin and shared a common language, thus resulting in place names using the language of their origin. In addition to learning the pattern related to the state names, this additional information may help a machine learning model learn to classify locations by state. The results from our experiments are very promising. Using a dataset containing 695,389 cities, correctly labelled with their state, we partitioned the cities into a training dataset containing 556,311 cities, a validation dataset containing 111,262, and a test dataset containing 27,816. After the trained model was applied to the test dataset, we achieved a correct prediction rate of 84.4365%, a false negative rate of 1.6106%, and a false positive rate of 1.0697%. Applying the trained model to Twitter location data of subscribers of the two candidates, the model achieved an accuracy of 90%. The trained model was able to correctly pick the winner, by popular vote, in 45 out of the 50 states. With Canadian and US elections coming up in 2019 and 2020, respectively, it would be interesting to test the model on those as well.
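The dataset partition quoted above can be checked with a few lines of arithmetic. This is a sketch using only the counts stated in the abstract; the variable names are illustrative. The last line shows that 45 of 50 states corresponds to 90%, which matches the reported accuracy, although the abstract does not say that the 90% figure was derived this way.

```python
# Counts as stated in the abstract.
total_cities = 695_389
splits = {"train": 556_311, "validation": 111_262, "test": 27_816}

# The three partitions should account for every city exactly once.
assert sum(splits.values()) == total_cities

# Share of the full dataset in each partition, in percent.
shares = {name: round(100 * count / total_cities, 2)
          for name, count in splits.items()}
print(shares)

# Share of states in which the predicted winner matched the actual winner.
state_share = 100 * 45 / 50
print(state_share)
```

The counts correspond to a split of roughly 80% training, 16% validation, and 4% test data.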

    A Human-Centered Approach to Improving Adolescent Online Sexual Risk Detection Algorithms

    Computational risk detection has the potential to protect especially vulnerable populations from online victimization. A comprehensive literature review of computational approaches for online sexual risk detection found that the majority of this work has focused on identifying sexual predators after-the-fact. Also, many studies rely on public datasets and third-party annotators to establish ground truth and train their algorithms, which do not accurately represent young social media users and their perspectives to prevent victimization. To address these gaps, this dissertation integrated human-centered approaches to both creating representative datasets and developing sexual risk detection machine learning models to ensure the broader societal impacts of this important work. In order to understand what and how adolescents talk about their online sexual interactions to inform study designs, a thematic content analysis of posts by adolescents on an online peer support mental health platform was conducted. Then, a user study and web-based platform, Instagram Data Donation (IGDD), was designed to create an ecologically valid dataset. Youth could donate and annotate their Instagram data for online risks. After participation in the study, an interview study was conducted to understand how youth felt about annotating data for online risks. Based on private conversations annotated by participants, sexual risk detection classifiers were created. The results indicated that Convolutional Neural Network (CNN) and Random Forest models outperformed other models in identifying sexual risks at the conversation level. Our experiments showed that classifiers trained on entire conversations performed better than message-level classifiers. We also trained classifiers to detect the severity risk level of a given message, with CNN outperforming other models. We found that contextual (e.g., age, gender, and relationship type) and psycho-linguistic features contributed the most to accurately detecting sexual conversations. Our analysis provides insights into the important factors that enhance automated detection of sexual risks within youths' private conversations.

    Malicious Interlocutor Detection Using Forensic Analysis of Historic Data

    The on-going problem of child grooming online grows year on year, and whilst government legislation looks to combat the issue by levying heavier penalties on perpetrators of online grooming, crime figures still increase. Government guidance directed towards digital platforms and social media providers places emphasis on child safety online. As this research shows, government initiatives have proved somewhat ineffective. Therefore, the aim of this research is to investigate the scale of the problem and test a variety of machine learning and deep learning techniques that could be used in a novel intelligent solution to protect children from online predation. The heterogeneity of online platforms means that a one-size-fits-all solution presents a complex problem that needs to be solved. The maturity of intelligent approaches to Natural Language Processing makes it possible to analyse and process text data in a wide variety of ways. Pre-processing data enables the preparation of text data in a format that machines can understand and reason about without the need for human interaction. The on-going development of Machine Learning and Deep Learning architectures enables the construction of intelligent solutions that can classify text data in ways never imagined. This thesis presents research that tests the application of potential intelligent solutions such as Artificial Neural Networks and Machine Learning algorithms applied in Natural Language Processing. The research also tests the performance of pre-processing workflows and the impact of pre-processing of both online grooming and more general chat corpora. The storage and processing of data via a traditional relational database management system has also been tested for suitability when looking to detect grooming conversation in historical data. 
Document similarity measures such as Cosine Similarity, and classifiers such as Support Vector Machines, have displayed positive results in identifying grooming conversation; however, a more intelligent solution may prove to have better currency in developing a smart autonomous solution given the ever-evolving lexicon used by participants in online chat conversations.
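As a reminder of what the cosine similarity measure mentioned above computes, here is a minimal sketch over raw word counts. The whitespace tokenisation, the function name `cosine_similarity`, and the toy sentences are assumptions made for illustration; the abstract does not describe the thesis's actual pre-processing or vectorisation.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts:
    close to 1.0 for identical word distributions, 0.0 for no shared words."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    # Only words present in both texts contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)

    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Toy, made-up sentences.
print(cosine_similarity("see you at the game", "see you at the match"))
print(cosine_similarity("hello there", "completely different words"))
```

Because the measure depends only on word overlap, it is sensitive to vocabulary drift, which is consistent with the abstract's remark about the ever-evolving lexicon of online chat.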

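    Identification of Brazilian sexual predators in textual conversations on the Internet using machine learning (original title: Identificação de predadores sexuais brasileiros em conversas textuais na internet por meio de aprendizagem de máquina)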

    Nowadays, a large number of children and adolescents use social applications. Easily accessible, these applications offer benefits and opportunities. At the same time, however, they expose users to various risks, among them sexual predatory activity. Sexual predatory activity has several aims, such as obtaining child pornography, extortion, and sexual abuse. This work has three main objectives: (i) to create a dataset of textual conversations containing real sexual predatory activity in Brazilian Portuguese; (ii) to perform a statistical analysis of the textual conversations in this dataset; (iii) to carry out an experimental evaluation of the most popular machine learning algorithms in this research domain using the constructed dataset. This evaluation uses the F1 measure as its basis. The results achieved with contributions (i) and (ii) enable new studies to focus on the problem of identifying sexual predators in textual conversations in Brazilian Portuguese. The results obtained with contribution (iii) show that Support Vector Machines performed best, with a result of 89.87%.