8 research outputs found

    Embedded Multi-label Feature Selection via Orthogonal Regression

    In the last decade, embedded multi-label feature selection methods, which incorporate the search for feature subsets into model optimization, have attracted considerable attention for accurately evaluating the importance of features in multi-label classification tasks. Nevertheless, state-of-the-art embedded multi-label feature selection algorithms based on least squares regression usually cannot preserve sufficient discriminative information in multi-label data. To tackle this challenge, a novel embedded multi-label feature selection method, termed global redundancy and relevance optimization in orthogonal regression (GRROOR), is proposed. The method employs orthogonal regression with feature weighting to retain sufficient statistical and structural information related to local label correlations of the multi-label data in the feature learning process. Additionally, both global feature redundancy and global label relevance information are considered in the orthogonal regression model, which can contribute to the search for discriminative and non-redundant feature subsets in the multi-label data. The cost function of GRROOR is an unbalanced orthogonal Procrustes problem on the Stiefel manifold. A simple yet effective scheme is utilized to obtain an optimal solution. Extensive experimental results on ten multi-label data sets demonstrate the effectiveness of GRROOR.

    Feature Selection and Overlapping Clustering-Based Multilabel Classification Model

    Multilabel classification (MLC) learning, which is widely applied in real-world applications, is a very important problem in machine learning. Some studies show that a clustering-based MLC framework performs effectively compared to a nonclustering framework. In this paper, we explore the clustering-based MLC problem. Multilabel feature selection also plays an important role in classification learning because many redundant and irrelevant features can degrade performance, and a good feature selection algorithm can reduce computational complexity and improve classification accuracy. In this study, we consider feature dependence and feature interaction simultaneously, and we propose a multilabel feature selection algorithm as a preprocessing stage before MLC. Typically, existing clustering-based MLC frameworks employ a hard clustering method, so each instance of a multilabel dataset is assigned to a single cluster; however, multilabel instances overlap by nature, and in real-life applications an instance may not belong to only a single class. Therefore, we propose an MLC model that combines feature selection with an overlapping clustering algorithm. Experimental results demonstrate that various clustering algorithms show different performance for MLC, and the proposed overlapping clustering-based MLC model may be more suitable.

    Hybrid Email Spam Detection Model Using Artificial Intelligence

    The growing volume of spam emails has generated the need for a more precise anti-spam filter to detect unsolicited emails. One of the most common representations used in spam filters is the Bag-of-Words (BOW). Although BOW is very effective in the classification of emails, it has a number of weaknesses. In this paper, we present a hybrid approach to spam filtering based on the neural network model Paragraph Vector-Distributed Memory (PV-DM). We use PV-DM to build up a compact representation of the context of an email and also of its pertinent features. This methodology represents a more comprehensive filter for classifying emails. Furthermore, we have conducted an empirical experiment using the Enron spam and Ling spam datasets, the results of which indicate that our proposed filter outperforms the PV-DM and the BOW email classification methods.

    A Kernel Partial Least Square Based Feature Selection Method

    Maximum relevance and minimum redundancy (mRMR) has been well recognised as one of the best feature selection methods. This paper proposes a Kernel Partial Least Square (KPLS) based mRMR method, aiming to ease computation and improve classification accuracy for high-dimensional data. Experiments with this approach have been conducted on seven real-life datasets of varied dimensionality and number of instances, with performance measured on four different classifiers: Naive Bayes, Linear Discriminant Analysis, Random Forest and Support Vector Machine. Experimental results show the advantage of the proposed method over several competing feature selection techniques.
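
For orientation, the classic mRMR criterion that the abstract above builds on can be sketched as a greedy loop: pick the feature most relevant to the target, then repeatedly add the feature whose relevance minus its average redundancy with the already selected features is largest. This is a minimal sketch of plain mRMR with mutual information on discrete features, not the KPLS-based variant the paper proposes; the function names `mutual_information` and `mrmr_select` and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences of equal length."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # Sum p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


def mrmr_select(features, target, k):
    """Greedy mRMR (difference form): relevance minus mean redundancy."""
    names = list(features)
    if k <= 0 or not names:
        return []
    relevance = {f: mutual_information(features[f], target) for f in names}
    # Start with the single most relevant feature.
    selected = [max(names, key=relevance.get)]
    while len(selected) < min(k, len(names)):
        def score(f):
            redundancy = sum(
                mutual_information(features[f], features[s]) for s in selected
            ) / len(selected)
            return relevance[f] - redundancy

        candidates = [f for f in names if f not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Hypothetical toy data: f2 is an exact copy of f1, f3 is weaker but not redundant.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "f1": [0, 0, 0, 0, 1, 1, 1, 0],
    "f2": [0, 0, 0, 0, 1, 1, 1, 0],
    "f3": [0, 1, 0, 0, 1, 1, 0, 1],
}
selected = mrmr_select(features, target, 2)
print(selected)
```

On this toy data the duplicate `f2` is skipped in favour of `f3`, even though `f2` is individually more relevant, which is the behaviour the redundancy term is meant to produce.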

    Distributed multi-label learning on Apache Spark

    This thesis proposes a series of multi-label learning algorithms for classification and feature selection implemented on the Apache Spark distributed computing model. Five approaches for determining the optimal architecture to speed up multi-label learning methods are presented. These approaches range from local parallelization using threads to distributed computing using independent or shared memory spaces. It is shown that the optimal approach performs hundreds of times faster than the baseline method. Three distributed multi-label k nearest neighbors methods built on top of the Spark architecture are proposed: an exact iterative method that computes pair-wise distances, an approximate tree-based method that indexes the instances across multiple nodes, and an approximate locality-sensitive hashing method that builds multiple hash tables to index the data. The results indicate that the predictions of the tree-based method are on par with those of an exact method while reducing the execution times in all the scenarios. The tree-based method is then used to evaluate the quality of a selected feature subset. The optimal adaptation for a multi-label feature selection criterion is discussed and two distributed feature selection methods for multi-label problems are proposed: a method that selects the feature subset that maximizes the Euclidean norm of the individual information measures, and a method that selects the subset of features maximizing the geometric mean. The results indicate that each method excels in different scenarios depending on the type of features and the number of labels. Rigorous experimental studies and statistical analyses over many multi-label metrics and datasets confirm that the proposals achieve better performance and provide better scalability to bigger data than the methods compared in the state of the art.
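
The two selection criteria named in the abstract above differ in how they combine per-label scores, which a small single-machine example can make concrete. The sketch below assumes the "individual information measures" are mutual information values between one discrete feature and each label; that choice, the helper names, and the toy data are assumptions for illustration, and nothing here reproduces the thesis's distributed Spark implementation.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences of equal length."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


def euclidean_norm_score(values):
    """Euclidean norm of the per-label scores: one strong label can dominate."""
    return math.sqrt(sum(v * v for v in values))


def geometric_mean_score(values):
    """Geometric mean of the per-label scores: zero if any label is uninformed."""
    if any(v <= 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


def rank_features(features, labels, aggregate):
    """Rank feature names by the aggregated per-label mutual information."""
    scores = {
        name: aggregate([mutual_information(column, label) for label in labels])
        for name, column in features.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data with two labels.
labels = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
]
features = {
    "f_a": [0, 0, 0, 0, 1, 1, 1, 1],  # determines label 1, independent of label 2
    "f_b": [0, 0, 0, 1, 0, 1, 1, 1],  # weakly informative about both labels
}
by_norm = rank_features(features, labels, euclidean_norm_score)
by_geo = rank_features(features, labels, geometric_mean_score)
print(by_norm, by_geo)
```

Here the Euclidean norm prefers `f_a`, which is very informative about a single label, while the geometric mean prefers `f_b`, which carries some information about every label. This is consistent with the abstract's remark that each criterion excels in different scenarios.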

    Aprendizaje multi-etiqueta distribuido en Apache Spark

    This thesis proposes a series of multi-label learning algorithms for classification and feature selection implemented on the Apache Spark distributed computing model. Five approaches for determining the optimal architecture to speed up the multi-label learning methods are presented. These approaches range from local parallelization using threads to distributed computing using independent or shared memory spaces. It is shown that the optimal approach performs hundreds of times faster than the baseline method. Three distributed multi-label k nearest neighbors methods built on top of the Spark architecture are proposed: an exact iterative method that computes pair-wise distances, an approximate tree-based method that indexes the instances across multiple nodes, and an approximate locality-sensitive hashing method that builds multiple hash tables to index the data. The results indicate that the predictions of the tree-based method are on par with those of an exact method while reducing the execution times in all the scenarios. The tree-based method is then used to evaluate the quality of a selected feature subset. The optimal adaptation for a multi-label feature selection criterion is discussed and two distributed feature selection methods for multi-label problems are proposed: a method that selects the feature subset that maximizes the Euclidean norm of the individual information measures, and a method that selects the subset of features that maximizes the geometric mean. The results indicate that each method excels in different scenarios depending on the type of features and the number of labels. Rigorous experimental studies and statistical analyses over many multi-label metrics and datasets confirm that the proposals achieve better performance and provide better scalability to bigger data than the methods compared in the state of the art.