4 research outputs found

    Text Classification Using Association Rules, Dependency Pruning and Hyperonymization

    We present new methods for pruning and enhancing itemsets for text classification via association rule mining. Pruning methods are based on dependency syntax, and enhancing methods are based on replacing words with their hyperonyms of various orders. We discuss the impact of these methods compared to pruning based on the tf-idf rank of words. Comment: 16 pages, 2 figures, presented at DMNLP 201
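    The tf-idf baseline mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each document is already tokenized, scores words with a plain tf × log(N/df) formula, and keeps the top-ranked words as that document's itemset. The function name `tfidf_prune`, the `keep` parameter, and the exact weighting are assumptions for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def tfidf_prune(docs, keep=3):
    """Keep the `keep` highest tf-idf words of each tokenized document.

    Hypothetical helper: the weighting (tf * log(N / df)) is one common
    tf-idf variant, assumed here because the abstract does not specify one.
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    pruned = []
    for doc in docs:
        tf = Counter(doc)
        score = {w: (tf[w] / len(doc)) * math.log(n_docs / df[w]) for w in tf}
        # Rank by descending score; break ties alphabetically for determinism.
        ranked = sorted(score, key=lambda w: (-score[w], w))
        pruned.append(set(ranked[:keep]))
    return pruned


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
itemsets = tfidf_prune(docs, keep=2)
```

    In this toy corpus the word "the" occurs in every document, so its idf is zero and it is pruned from every itemset before any rule mining would take place.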

    Automated Detection of Bilingual Obfuscated Abusive Words on Social Media Forums: A Case of Swahili and English Texts

    Social media use has grown exponentially in recent years, and users face few limits on misusing the platforms to post abusive content. This increases the exposure of innocent users, including children, to abusive words, especially on social media forums. To curb the proliferation of abusive words on social media, researchers have proposed different methods for handling variants of abusive words; however, detecting obfuscated abusive words remains challenging. A method that combines a rule-based approach with character percentage matching techniques is proposed to improve the detection rate for obfuscated abusive words. In the evaluation, the method achieved an F1 score of 0.97 and an accuracy of 0.96, both above the significance ratio of 0.5. Hence, the proposed approach is highly effective for detecting and preventing obfuscated abusive words. Keywords: Rule based approach, Character percentage matching techniques, Obfuscated abuse, Abuse detection, Abusive words, Social medi
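    One way to picture the combination of rules and character percentage matching described above is the following sketch. It is not the authors' implementation: the substitution table, the 80% threshold, the use of `difflib.SequenceMatcher` as the similarity measure, and the two-word lexicon are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical substitution rules; the abstract does not list the actual rules.
SUBSTITUTIONS = str.maketrans(
    {"@": "a", "4": "a", "3": "e", "1": "i", "!": "i", "0": "o", "$": "s", "5": "s"}
)


def normalize(token):
    """Rule-based step: lowercase, undo character swaps, drop non-letters."""
    token = token.lower().translate(SUBSTITUTIONS)
    return "".join(ch for ch in token if ch.isalpha())


def match_percentage(candidate, target):
    """Character-level similarity between two strings, as a percentage."""
    return 100.0 * SequenceMatcher(None, candidate, target).ratio()


def detect(token, lexicon, threshold=80.0):
    """Return the lexicon word that `token` most likely obfuscates, or None."""
    cleaned = normalize(token)
    if not cleaned:
        return None
    best = max(lexicon, key=lambda word: match_percentage(cleaned, word))
    return best if match_percentage(cleaned, best) >= threshold else None


# Illustrative bilingual lexicon (one English and one Swahili entry).
lexicon = ["idiot", "mjinga"]
```

    With this lexicon, `detect("1d!0t", lexicon)` and `detect("i.d.i.o.t", lexicon)` both resolve to "idiot" through the rules alone, while a misspelling such as "idiiot" is caught by the percentage match. A real system would need a tuned threshold to limit false positives on ordinary words.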

    Automatic detection of hate speech in text: an overview of the topic and dataset annotation with hierarchical classes

    People increasingly use social networks to communicate their opinions and to share information and experiences. On social networks people feel deindividualized and may engage in aggressive communication more frequently. In this context, it is important that governments and social network platforms have tools to detect hate speech, because it is harmful to its targets. In our work we investigate the problem of detecting hate speech online. Our first goal is to give a complete overview of the topic. However, describing the state of the art in the area of hate speech is not simple, because the topic is studied in different fields, such as text mining, social sciences, and law. Our literature review focuses on the perspective of computer science and engineering, which distinguishes it from other works we found. We adopted an exhaustive and methodical approach, which we call a Systematic Literature Review. As a result, we concluded that the majority of studies tackle this problem as a machine learning classification task and use either general text mining features (e.g., n-grams, word2vec) or hate speech specific features (e.g., othering discourse). In the majority of studies new datasets are collected, but they remain private, which makes it more difficult to compare results across studies. We also concluded that this field is still at an early stage, with several open research opportunities. As we found no research on the topic in Portuguese, the second goal of this work was to annotate a dataset for this language. For the dataset annotation, we built a classification using a hierarchical structure. This is an innovative way of approaching the problem of automatic hate speech classification. Its main advantage is that it allows nuances in the hate speech concepts to be considered better. We collected a dataset of 5,668 messages from 1,156 distinct users, annotated not only for hate speech but also for 83 further subtypes of hate. Finally, we also try to show that the hierarchical structure of classes improves the performance of the classification models, since it is better suited to considering the different subtypes of hate speech and the intersections between those classes
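    As a rough illustration of annotating with a hierarchy of classes, the sketch below expands a message's labels so that every subtype also implies its ancestor classes. The class names and the child-to-parent table are hypothetical and are not the 83 subtypes used in the dataset; a message can carry several labels, which is how intersections between classes are represented here.

```python
# Hypothetical hierarchy (child -> parent); class names are illustrative only.
PARENT = {
    "hate_speech": None,
    "sexism": "hate_speech",
    "racism": "hate_speech",
    "women": "sexism",
    "immigrants": "racism",
}


def with_ancestors(labels):
    """Expand a set of labels with every ancestor in the hierarchy."""
    expanded = set()
    for label in labels:
        node = label
        while node is not None:
            expanded.add(node)
            node = PARENT[node]
    return expanded


# A message annotated with two leaf classes also counts for their parents.
labels = with_ancestors({"women", "immigrants"})
```

    Expanding labels this way lets a classifier be trained or evaluated at any level of the hierarchy, from the generic hate speech class down to a specific subtype.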

    Generación no supervisada de datos para la clasificación de queries en un sistema de diálogo (Unsupervised data generation for query classification in a dialogue system)

    In this work we experiment with three data augmentation techniques to help a text classifier improve its performance. For the experiments we selected a large number of corpora, most of which belong to the intent detection domain, because the aim of this research is to apply the knowledge gained to customer service chatbots. These chatbots must detect the intent of each user query before responding accordingly. The two models used in the experiments are different in nature. The first is XGBoost, a classical machine learning model; the second uses the pre-trained version of RoBERTa, a deep learning model that is currently the state of the art in text classification. Finally, we find that these data augmentation techniques do not considerably improve classifier performance compared with not using them. Valero Antón, FDB. (2020). Generación no supervisada de datos para la clasificación de queries en un sistema de diálogo. http://hdl.handle.net/10251/151669 TFG