23 research outputs found

    An Information Theoretic Approach to Quantify the Stability of Feature Selection and Ranking Algorithms

    Feature selection is a key step when dealing with high-dimensional data. In particular, these techniques simplify the process of knowledge discovery from data in fields like biomedicine, bioinformatics, genetics or chemometrics by selecting the most relevant features out of the noisy, redundant and irrelevant ones. A problem that arises in many of these applications is that the outcome of the feature selection algorithm is not stable: small variations in the data may yield very different feature rankings. Assessing the stability of these methods is therefore an important issue in such settings, but it has long been overlooked in the literature. We propose an information-theoretic approach based on the Jensen-Shannon divergence to quantify this robustness. Unlike other stability measures, this metric is suitable for different algorithm outcomes: full ranked lists, top-k lists (feature subsets) and the lesser-studied partial ranked lists that keep the k best-ranked elements. This generalized metric quantifies the difference among a whole set of lists of the same size, following a probabilistic approach that can give more importance to disagreements at the top of the list. Moreover, it possesses desirable properties for a stability metric, including correction for chance, upper and lower bounds, and conditions for a deterministic selection. We illustrate the use of this stability metric with data generated in a fully controlled way and compare it with popular metrics, namely Spearman's rank correlation on feature ranking outcomes and Kuncheva's index on feature selection outcomes.
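    As a rough illustration of two quantities named in this abstract, the sketch below computes the equally weighted Jensen-Shannon divergence among a set of probability distributions, and Kuncheva's index between two feature subsets. It does not reproduce the paper's metric: how ranked lists are mapped to probability distributions is not stated in the abstract, so the distributions here are assumed to be given, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(distributions):
    """Equally weighted Jensen-Shannon divergence among several distributions.

    Assumption: each distribution is a list of probabilities over the same
    outcomes. The mapping from rankings to distributions is NOT taken from
    the paper.
    """
    m = len(distributions)
    mixture = [sum(column) / m for column in zip(*distributions)]
    return sum(kl_divergence(p, mixture) for p in distributions) / m


def kuncheva_index(subset_a, subset_b, n_features):
    """Kuncheva's consistency index for two feature subsets of equal size k."""
    a, b = set(subset_a), set(subset_b)
    k = len(a)
    if len(b) != k or not 0 < k < n_features:
        raise ValueError("subsets must share a size k with 0 < k < n_features")
    r = len(a & b)  # number of features the two subsets have in common
    return (r * n_features - k * k) / (k * (n_features - k))


if __name__ == "__main__":
    # Identical distributions give zero divergence; disjoint ones give 1 bit.
    print(js_divergence([[0.5, 0.5], [0.5, 0.5]]))
    print(js_divergence([[1.0, 0.0], [0.0, 1.0]]))
    # Identical subsets give the maximum index of 1.
    print(kuncheva_index([1, 2, 3], [1, 2, 3], n_features=10))
```

    Kuncheva's index subtracts the overlap expected by chance (k²/n), which is the kind of correction for chance the abstract lists as a desirable property.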

    Class Distribution Estimation in Imprecise Domains Based on Supervised Learning

    Chap. 9, pp. 187-202. Quantification, or the estimation of proportions, plays an important role in many practical classification problems. On the one hand, a machine that automatically assigns an item to one of a set of predefined classes will make suboptimal decisions if the class distribution in the (real) test domain differs from the one assumed during training. Estimating the new class distribution is necessary to adapt the classifier to the new operating conditions. On the other hand, there are real domains where the quantification task itself is the main goal. Some fields, such as quality control, direct marketing, trend analysis or certain text recognition tasks, require methods that can reliably estimate the proportion of items in each category, with no concern for how each individual item has been classified. We describe several quantification techniques that are based on supervised learning and derive these estimates from: a) the classifier's confusion matrix, b) posterior probability estimates and c) distributional divergence measures. We illustrate these techniques, as well as their robustness to the performance of the base classifier, in a practical semen quality control setting where the final goal is to quantify the proportion of sperm cells with damaged/intact acrosomes.
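    To make option a) concrete, here is a minimal sketch of one well-known confusion-matrix correction for binary quantification, often called "adjusted count". The abstract does not say which exact variant the chapter uses, so this is an assumed stand-in. The function name and the `tpr`/`fpr` parameters (true and false positive rates, taken from the confusion matrix on held-out training data) are illustrative.

```python
def adjusted_count(predicted_labels, tpr, fpr):
    """Estimate the positive-class proportion in a test set.

    predicted_labels: iterable of 0/1 classifier outputs on the test set.
    tpr, fpr: true/false positive rates of the classifier (assumed to be
    estimated beforehand, e.g. by cross-validation on the training data).
    """
    labels = list(predicted_labels)
    if not labels:
        raise ValueError("predicted_labels must not be empty")
    if tpr <= fpr:
        raise ValueError("the correction needs tpr > fpr")
    # Naive "classify and count" estimate: fraction predicted positive.
    observed = sum(1 for y in labels if y == 1) / len(labels)
    # Invert observed = p * tpr + (1 - p) * fpr for the true proportion p.
    estimate = (observed - fpr) / (tpr - fpr)
    # Clip, because sampling noise can push the estimate outside [0, 1].
    return min(1.0, max(0.0, estimate))


if __name__ == "__main__":
    predictions = [1] * 38 + [0] * 62
    print(adjusted_count(predictions, tpr=0.9, fpr=0.1))
```

    The naive count (0.38 in the usage example) is biased by the classifier's errors, and the correction recovers a proportion of about 0.35 under the stated rates.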

    Phishing websites detection using a novel multipurpose dataset and web technologies features

    Phishing attacks are one of the most challenging social engineering cyberattacks due to the large number of entities involved in online transactions and services. In these attacks, criminals deceive users to hijack their credentials or sensitive data through a login form which replicates the original website and submits the data to a malicious server. Many anti-phishing techniques have been developed in recent years, using different resources such as the URL and HTML code from legitimate index websites and phishing ones. These techniques have some limitations when predicting legitimate login websites since, usually, no login forms are present in the legitimate class used for training the proposed model. Hence, in this work we present a methodology for phishing website detection in real scenarios, which uses URL, HTML, and web technology features. Since there is no up-to-date, multipurpose dataset for this task, we crafted the Phishing Index Login Websites Dataset (PILWD), an offline phishing dataset composed of 134,000 verified samples, which offers researchers a wide variety of data to test and compare their approaches. Since approximately three-quarters of the collected phishing samples request the introduction of credentials, we decided to crawl legitimate login websites to match the phishing standpoint. The developed approach is independent of third-party services, and the method relies on a new set of features used for the very first time in this problem, some of them extracted from the web technologies used by each specific website. Experimental results show that phishing websites can be detected with 97.95% accuracy using a LightGBM classifier and the complete set of the 54 selected features, when evaluated on the PILWD dataset. INCIBE; Universidad de Leó

    Tool wear monitoring using an online, automatic and low cost system based on local texture

    In this work we propose a new online, low-cost and fast approach based on computer vision and machine learning to determine whether cutting tools used in edge profile milling processes are serviceable or disposable based on their wear level. We created a new dataset of 254 images of edge profile cutting heads which is, to the best of our knowledge, the first publicly available dataset with enough quality for this purpose. All the inserts were segmented and their cutting edges were cropped, obtaining 577 images of cutting edges: 301 functional and 276 disposable. The proposed method is based on (1) dividing the cutting edge image into different regions, called Wear Patches (WP), (2) characterising each one as worn or serviceable using texture descriptors based on different variants of Local Binary Patterns (LBP) and (3) determining, based on the state of these WP, whether the cutting edge (and, therefore, the tool) is serviceable or disposable. We proposed and assessed five different patch division configurations. The individual WP were classified by a Support Vector Machine (SVM) with an intersection kernel. The best patch division configuration and texture descriptor for the WP achieve an accuracy of 90.26% in the detection of disposable cutting edges. These results show a very promising opportunity for automatic wear monitoring in edge profile milling processes. Keywords: Tool wear, texture description
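    Two building blocks mentioned in this abstract can be sketched in a few lines: a texture histogram built from Local Binary Patterns, and the histogram intersection kernel used to compare two such histograms. The abstract says only that "different variants" of LBP were used, so the basic 3x3, 8-neighbour LBP below is an assumption chosen for simplicity, and the function names are hypothetical. A patch is represented as a list of rows of grey values.

```python
def lbp_histogram(patch):
    """Normalised 256-bin histogram of basic 3x3 LBP codes for a grey-level patch.

    Assumption: plain 8-neighbour LBP; the variants used in the paper are
    not specified in the abstract. Border pixels are skipped.
    """
    # Neighbour offsets in clockwise order, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    histogram = [0] * 256
    rows, cols = len(patch), len(patch[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            center = patch[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Set the bit when the neighbour is at least as bright as the centre.
                if patch[r + dr][c + dc] >= center:
                    code |= 1 << bit
            histogram[code] += 1
    total = sum(histogram)
    return [count / total for count in histogram] if total else histogram


def intersection_kernel(hist_a, hist_b):
    """Histogram intersection: sum of bin-wise minima of two histograms."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


if __name__ == "__main__":
    flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]   # uniform texture
    peak = [[1, 1, 1], [1, 9, 1], [1, 1, 1]]   # bright centre pixel
    h_flat, h_peak = lbp_histogram(flat), lbp_histogram(peak)
    print(intersection_kernel(h_flat, h_flat))  # similarity of a patch with itself
    print(intersection_kernel(h_flat, h_peak))  # similarity of two different textures
```

    In an SVM setting, the matrix of pairwise `intersection_kernel` values over the training patches would serve as a precomputed kernel.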

    Combining shape and contour features to improve tool wear monitoring in milling processes

    In this paper, we propose a new system that combines a shape descriptor and a contour descriptor to classify inserts in milling processes according to their wear level, following a computer vision-based approach. To describe the shape of the wear region we propose a new descriptor called ShapeFeat, and its contour is characterized using the B-ORCHIZ method which, to the best of our knowledge, achieves the best performance for tool wear monitoring following a computer vision-based approach. Results show that combining B-ORCHIZ with ShapeFeat using a late fusion method improves the classification performance significantly, obtaining an accuracy of 91.44% in binary classification (i.e. classifying the wear as high or low) and 82.90% using three target classes (i.e. classifying the wear as high, medium or low). These results outperform those obtained by each descriptor on its own: ShapeFeat achieves accuracies of 88.70% and 80.67% for two and three classes, respectively, and B-ORCHIZ achieves 87.06% and 80.24%. This study yields encouraging results for the manufacturing community towards automatically classifying inserts by their wear in milling processes.

    Classification and pattern recognition (Clasificación y reconocimiento de patrones)

    Chap. 9, pp. 159-179. This chapter presents the basic ideas of the classification stage in a pattern recognition system. It begins by reviewing the fundamentals of learning from examples and then surveys the most common metrics and methods for evaluating classifier performance. It goes on to show the complete design cycle of a classifier and finally describes, by way of illustration, three learning models corresponding to the supervised classification, regression and unsupervised classification approaches.

    A review of spam email detection: analysis of spammer strategies and the dataset shift problem

    Spam emails have traditionally been seen as just annoying and unsolicited emails containing advertisements, but they increasingly include scams, malware or phishing. In order to ensure the security and integrity of users, organisations and researchers aim to develop robust filters for spam email detection. Recently, most spam filters based on machine learning algorithms published in academic journals report very high performance, but users are still reporting a rising number of frauds and attacks via spam emails. Two main challenges can be found in this field: (a) it is a very dynamic environment prone to the dataset shift problem and (b) it suffers from the presence of an adversarial figure, i.e. the spammer. Unlike classical spam email reviews, this one focuses particularly on the problems that this constantly changing environment poses. Moreover, we analyse the different spammer strategies used for contaminating the emails, and we review the state-of-the-art techniques to develop filters based on machine learning. Finally, we empirically evaluate and present the consequences of ignoring dataset shift in this practical field. Experimental results show that this shift may lead to severe degradation in the estimated generalisation performance, with error rates reaching values up to 48.81%. Open-access publication funded by the Consorcio de Bibliotecas Universitarias de Castilla y León (BUCLE), under Operational Programme 2014ES16RFOP009 FEDER 2014-2020 de Castilla y León, Action: 20007-CL - Apoyo Consorcio BUCL

    Classifying spam emails using agglomerative hierarchical clustering and a topic-based approach

    Spam emails are unsolicited, annoying and sometimes harmful messages which may contain malware, phishing or hoaxes. Unlike most studies that address the design of efficient anti-spam filters, we approach the spam email problem from a different and novel perspective. Focusing on the needs of cybersecurity units, we follow a topic-based approach for addressing the classification of spam email into multiple categories. We propose SPEMC-15K-E and SPEMC-15K-S, two novel datasets with approximately 15K emails each in English and Spanish, respectively, and we label them using agglomerative hierarchical clustering into 11 classes. We evaluate 16 pipelines, combining four text representation techniques (Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words, Word2Vec and BERT) and four classifiers: Support Vector Machine, Naïve Bayes (NB), Random Forest and Logistic Regression (LR). Experimental results show that the highest performance is achieved with TF-IDF and LR for the English dataset, with an F1 score of 0.953 and an accuracy of 94.6%, while for the Spanish dataset, TF-IDF with NB yields an F1 score of 0.945 and 98.5% accuracy. Regarding processing time, TF-IDF with LR leads to the fastest classification, processing an English and a Spanish spam email in 2 ms and 2.2 ms on average, respectively.