21 research outputs found

    Web classification using Support Vector Machine

    Get PDF

    PENGGUNAAN METODE SUPPORT VECTOR MACHINE UNTUK MENGKLASIFIKASI DAN MEMPREDIKSI ANGKUTAN UDARA JENIS PENERBANGAN DOMESTIK DAN PENERBANGAN INTERNASIONAL DI BANDA ACEH

    Get PDF
    Penelitian ini menyajikan analisis performansi Support Vector Machine(SVM) dengan 11 variabel bebas dan 1 variabel terikat. Metode SVMdengan data training (75%) dan data testing (25%) yang digunakan pada pengklasifikasian data Penerbangan domestic dan data penerbanganinternasional untuk menemukan hyperplane terbaik yang memisahkandua buah kelas. Hasilnya terdapat 4 support vector memberikan informasiyang dibutuhkan untuk menyakinkan bahwa metode SVM bias sebagaiclassifier dan dapat memprediksi keakuratan model dengan menggunakankurva Receiver Operating Characteristic (ROC) untuk melihat akurasi modelterbaik. mencapai 84,31%. Kata Kunci: Klasifikasi, Metode Support Vector Machine (SVM), Receiver Operating Characteristic (ROC)

    A Classifier to Detect Profit and Non Profit Websites Upon Textual Metrics for Security Purposes

    Get PDF
    Currently, most organizations have a defense system to protect their digital communication network against cyberattacks. However, these defense systems deal with all network traffic regardless if it is from profit or non-profit websites. This leads to enforcing more security policies, which negatively affects network speed. Since most dangerous cyberattacks are aimed at commercial websites, because they contain more critical data such as credit card numbers, it is better to set up the defense system priorities towards actual attacks that come from profit websites. This study evaluated the effect of textual website metrics in determining the type of website as profit or nonprofit for security purposes. Classifiers were built to predict the type of website as profit or non-profit by applying machine learning techniques on a dataset. The corpus used for this research included profit and non-profit websites. Both traditional and deep machine learning techniques were applied. The results showed that J48 performed best in terms of accuracy according to its outcomes in all cases. The newly built models can be a significant tool for defense systems of organizations, as they will help them to implement the necessary security policies associated with attacks that come from both profit and non-profit websites. This will have a positive impact on the security and efficiency of the network

    Intelligent Fusion of Structural and Citation-Based Evidence for Text Classification

    Get PDF
    This paper investigates how citation-based information and structural content (e.g., title, abstract) can be combined to improve classification of text documents into predefined categories. We evaluate different measures of similarity, five derived from the citation structure of the collection, and three measures derived from the structural content, and determine how they can be fused to improve classification effectiveness. To discover the best fusion framework, we apply Genetic Programming (GP) techniques. Our empirical experiments using documents from the ACM digital library and the ACM classification scheme show that we can discover similarity functions that work better than any evidence in isolation and whose combined performance through a simple majority voting is comparable to that of Support Vector Machine classifiers

    An unsupervised perplexity-based method for boilerplate removal

    Get PDF
    The availability of large web-based corpora has led to significant advances in a wide range of technologies, including massive retrieval systems or deep neural networks. However, leveraging this data is challenging, since web content is plagued by the so-called boilerplate: ads, incomplete or noisy text and rests of the navigation structure, such as menus or navigation bars. In this work, we present a novel and efficient approach to extract useful and well-formed content from web-scraped data. Our approach takes advantage of Language Models and their implicit knowledge about correctly formed text, and we demonstrate here that perplexity is a valuable artefact that can contribute in terms of effectiveness and efficiency. As a matter of fact, the removal of noisy parts leads to lighter AI or search solutions that are effective and entail important reductions in resources spent. We exemplify here the usefulness of our method with two downstream tasks, search and classification, and a cleaning task. We also provide a Python package with pre-trained models and a web demo demonstrating the capabilities of our approachS

    Web Unit Mining: Finding and classifying subgraphs of web pages

    Get PDF
    In web classification, most researchers assume that the ob-jects to classify are individual web pages from one or more web sites. In practice, the assumption is too restrictive since a web page itself may not always correspond to a concept instance of some semantic concept (or category) given to the classification task. In this paper, we want to relax this as-sumption and allow a concept instance to be represented by a subgraph of web pages or a set of web pages. We identify several new issues to be addressed when the assumption is removed, and formulate theweb unit mining problem. We also propose an iterative web unit mining (iWUM) method that first finds subgraphs of web pages using some knowledge about web site structure. From these web subgraphs, web units are constructed and classified into semantic concepts (or categories) in an iterative manner. Our experiments us-ing the WebKB dataset showed that iWUM improves the overall classification performance and works very well on the more structured parts of a web site

    Detección de discurso de odio online utilizando Machine Learning

    Get PDF
    Trabajo de Fin de Grado en Ingeniería informática, Facultad de Informática UCM, Departamento de Ingeniería del Software e Inteligencia Artificial, Curso 2021/2022. Enlace al repositorio público del proyecto: https://github.com/NILGroup/TFG-2122HateSpeechDetectionHate speech directed towards marginalized people is a very common problem online, especially in social media such as Twitter or Reddit. Automatically detecting hate speech in such spaces can help mend the Internet and transform it into a safer environment for everybody. Hate speech detection fits into text classification, a series of tasks where text is organized into categories. This project2 proposes using Machine Learning algorithms to detect hate speech in online text in four languages: English, Spanish, Italian and Portuguese. The data to train the models was obtained from online, publicly available datasets. Three different algorithms with varying parameters have been used in order to compare their performance. The experiments show that the best results reach an 82.51% accuracy and around an 83% F1-score, for Italian text. Each language has different results depending on distinct factors.El discurso de odio dirigido a personas marginadas es un problema muy común en línea, especialmente en redes sociales como Twitter o Reddit. La detección automática del discurso de odio en dichos espacios puede ayudar a reparar Internet y a transformarlo en un entorno más seguro para todos. La detección del discurso de odio encaja en la clasificación de texto, donde se organiza en categorías. Este proyecto1 propone el uso de algoritmos de Machine Learning para localizar discurso de odio en textos online en cuatro idiomas: inglés, español, italiano y portugués. Los datos para entrenar los modelos se obtuvieron de datasets disponibles públicamente en línea. Se han utilizado tres algoritmos diferentes con distintos parámetros para comparar su rendimiento. Los experimentos muestran que los mejores resultados alcanzan una precisión del 82,51 % y un valor F1 de alrededor del 83 % en italiano. Los resultados para cada idioma varían dependiendo de distintos factores.Depto. de Ingeniería de Software e Inteligencia Artificial (ISIA)Fac. de InformáticaTRUEunpu
    corecore