
    Robust classification of high dimensional unbalanced single and multi-label datasets

    University of Technology Sydney. Faculty of Engineering and Information Technology.

    Single and multi-label classification are arguably two of the most important topics within the field of machine learning. Single-label classification refers to the case where each sample is assigned to one class, and multi-label classification is where instances are associated with multiple labels simultaneously. Research into robust single and multi-label classification models remains active in the data analytics community because of the emerging complexities in real-world data and the increasing interest in applying data analytics techniques in many fields, including biomedicine, finance, text mining, text categorization, and images. Real-world datasets contain complexities that degrade the performance of classifiers. These complexities, or open challenges, are: imbalanced data, low numbers of samples, high dimensionality, highly correlated features, label correlations, and missing labels in the multi-label space. Several research gaps are identified and motivate this thesis.

    Class imbalance occurs when the distribution of classes is not uniform among samples. Feature extraction is used to reduce the dimensionality of data; however, highly imbalanced data in single-label classification misleads existing unsupervised and supervised feature extraction techniques. It produces features biased towards the class with the majority of samples, resulting in poor classification performance, especially for the minority class. Furthermore, imbalance is even more pervasive in multi-labeled data than in single-labeled data because of several issues, including label correlation, incomplete multi-label matrices, and noisy and irrelevant features. High-dimensional, highly correlated data exist in several domains such as genomics. Many feature selection techniques treat correlated features as redundant and therefore remove them. Several studies investigate the interpretation of correlated features in domains such as genomics, but the classification capabilities of correlated feature groups in single-labeled data remain a point of interest in several domains. Moreover, high-dimensional multi-labeled data is more challenging than single-labeled data. Relatively few feature selection methods have been proposed to select the discriminative features among multiple labels, due to issues including interdependent labels, different instances sharing different label correlations, correlated features, and missing and noisy labels.

    This thesis proposes a series of novel machine learning algorithms that handle the negative effects of the above problems and improve the performance of classifiers on single and multi-labeled data. There are seven contributions in this thesis. Contribution 1 proposes novel cost-sensitive principal component analysis (CSPCA) and cost-sensitive non-negative matrix factorization (CSNMF) methods for feature extraction from imbalanced single-labeled data. Contribution 2 extends standard non-negative matrix factorization to a balanced supervised non-negative matrix factorization (BSNMF) to handle the class imbalance problem in supervised non-negative matrix factorization. Contribution 3 introduces an ABC-Sampling algorithm for balancing imbalanced datasets based on the Artificial Bee Colony algorithm. Contribution 4 develops a novel supervised feature selection algorithm (SCANMF) that jointly integrates a correlation network and structural analysis of the balanced supervised non-negative matrix factorization to handle high-dimensional, highly correlated single-labeled data. Contribution 5 proposes an ensemble feature ranking method using co-expression networks to select optimal features for classification. Contribution 6 proposes a Correlated- and Multi-label Feature Selection method (CMFS), based on NMF, that simultaneously performs multi-label feature selection and addresses the following challenges: interdependent labels, different instances sharing different label correlations, correlated features, and missing and flawed labels. Contribution 7 presents an integrated multi-label approach (ML-CIB) that simultaneously trains the multi-label classification model and addresses the following challenges: class imbalance, label correlation, incomplete multi-label matrices, and noisy and irrelevant features.

    The performance of all novel algorithms in this thesis is evaluated in terms of single and multi-label classification accuracy. The proposed algorithms are evaluated in the context of a childhood leukaemia dataset from The Children's Hospital at Westmead, and on public datasets from different fields, including genomics, finance, text mining, and images, drawn from online repositories. Moreover, all results of the proposed algorithms are compared to state-of-the-art methods, and the experimental results indicate that the proposed algorithms outperform them. Further, several statistical tests, including the t-test and the Friedman test, are applied to demonstrate the statistical significance of the proposed methods.
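    The abstract does not spell out the CSPCA formulation, but a common way to make PCA cost-sensitive is to weight samples by a class-dependent cost (for instance, inverse class frequency) before computing the covariance, so that the principal directions are not dominated by the majority class. The sketch below illustrates that generic idea; the weighting scheme and the function name are illustrative assumptions, not the thesis's actual algorithm.

```python
import numpy as np

def cost_sensitive_pca(X, y, n_components=2):
    """Illustrative class-weighted PCA: samples from rare classes receive
    larger weights so the learned projection is not dominated by the
    majority class. A generic sketch, not the thesis's exact CSPCA."""
    classes, counts = np.unique(y, return_counts=True)
    # inverse-frequency cost per class, normalised over all samples
    cost = {c: 1.0 / n for c, n in zip(classes, counts)}
    w = np.array([cost[c] for c in y])
    w /= w.sum()
    mu = w @ X                       # weighted mean
    Xc = X - mu
    C = (Xc * w[:, None]).T @ Xc     # weighted covariance matrix
    vals, vecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_components]
    return Xc @ vecs[:, order]       # projected data, shape (n, k)

# usage: Z = cost_sensitive_pca(X, y, n_components=10)
```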

    Learning Deep Latent Spaces for Multi-Label Classification

    Multi-label classification is a practical yet challenging task in machine learning related fields, since it requires the prediction of more than one label category for each input instance. We propose a novel deep neural network (DNN)-based model, Canonical Correlated AutoEncoder (C2AE), for solving this task. Aiming to better relate feature and label domain data for improved classification, we uniquely perform joint feature and label embedding by deriving a deep latent space, followed by the introduction of a label-correlation sensitive loss function for recovering the predicted label outputs. Our C2AE is achieved by integrating the DNN architectures of canonical correlation analysis and autoencoder, which allows end-to-end learning and prediction with the ability to exploit label dependency. Moreover, our C2AE can be easily extended to address the learning problem with missing labels. Our experiments on multiple datasets with different scales confirm the effectiveness and robustness of our proposed method, which is shown to perform favorably against state-of-the-art methods for multi-label classification.
    Comment: published in AAAI-201
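    As a rough illustration of the C2AE design, the PyTorch sketch below pairs a feature encoder and a label encoder mapping into a shared latent space, with a decoder that recovers labels from it. The paper's CCA constraint and label-correlation sensitive loss are simplified here to an L2 latent alignment plus binary cross-entropy, and the layer sizes and module names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class C2AESketch(nn.Module):
    """Minimal sketch of the C2AE idea: embed features and labels into a
    shared latent space and decode labels from it. Not the paper's exact
    architecture or losses."""
    def __init__(self, d_feat, d_label, d_latent=64):
        super().__init__()
        self.Fx = nn.Sequential(nn.Linear(d_feat, 128), nn.ReLU(),
                                nn.Linear(128, d_latent))   # feature encoder
        self.Fe = nn.Sequential(nn.Linear(d_label, 128), nn.ReLU(),
                                nn.Linear(128, d_latent))   # label encoder
        self.Fd = nn.Sequential(nn.Linear(d_latent, 128), nn.ReLU(),
                                nn.Linear(128, d_label))    # label decoder

    def forward(self, x, y):
        zx, zy = self.Fx(x), self.Fe(y)
        return zx, zy, self.Fd(zy)   # latent codes + label logits

    def predict(self, x):
        # at test time labels are unknown: decode from the feature embedding
        return torch.sigmoid(self.Fd(self.Fx(x)))

def c2ae_loss(zx, zy, logits, y, alpha=1.0):
    align = ((zx - zy) ** 2).mean()  # couples the two latent spaces
    recon = nn.functional.binary_cross_entropy_with_logits(logits, y)
    return recon + alpha * align
```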

    Multi-Target Prediction: A Unifying View on Problems and Methods

    Multi-target prediction (MTP) is concerned with the simultaneous prediction of multiple target variables of diverse type. Due to its enormous application potential, it has developed into an active and rapidly expanding research field that combines several subfields of machine learning, including multivariate regression, multi-label classification, multi-task learning, dyadic prediction, zero-shot learning, network inference, and matrix completion. In this paper, we present a unifying view on MTP problems and methods. First, we formally discuss commonalities and differences between existing MTP problems. To this end, we introduce a general framework that covers the above subfields as special cases. As a second contribution, we provide a structured overview of MTP methods. This is accomplished by identifying a number of key properties, which distinguish such methods and determine their suitability for different types of problems. Finally, we also discuss a few challenges for future research.

    A two-step learning approach for solving full and almost full cold start problems in dyadic prediction

    Dyadic prediction methods operate on pairs of objects (dyads), aiming to infer labels for out-of-sample dyads. We consider the full and almost full cold start problem in dyadic prediction, a setting that occurs when both objects in an out-of-sample dyad have not been observed during training, or when one of them has been observed, but very few times. A popular approach for addressing this problem is to train a model that makes predictions based on a pairwise feature representation of the dyads, or, in the case of kernel methods, based on a tensor product pairwise kernel. As an alternative to such a kernel approach, we introduce a novel two-step learning algorithm that borrows ideas from the fields of pairwise learning and spectral filtering. We show theoretically that the two-step method is very closely related to the tensor product kernel approach, and experimentally that it yields a slightly better predictive performance. Moreover, unlike existing tensor product kernel methods, the two-step method allows closed-form solutions for training and parameter selection via cross-validation estimates both in the full and almost full cold start settings, making the approach much more efficient and straightforward to implement.
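    To make the closed-form flavour of the two-step approach concrete, here is a minimal numpy sketch of two-step kernel ridge regression for the full cold start setting: step one solves a ridge problem over the row objects, step two over the column objects, and a new dyad is scored from its two kernel vectors. This is a sketch under standard KRR assumptions; the regularisation choices and exact estimator may differ from the paper's formulation.

```python
import numpy as np

def two_step_krr_fit(Ku, Kv, Y, lam_u=1.0, lam_v=1.0):
    """Two-step kernel ridge regression sketch for dyadic prediction.
    Ku: (m, m) kernel over row objects, Kv: (n, n) kernel over column
    objects, Y: (m, n) observed dyad labels. Both steps are closed-form
    linear solves, which is what makes the approach efficient."""
    m, n = Y.shape
    Au = np.linalg.solve(Ku + lam_u * np.eye(m), Y)       # step 1: rows
    A = np.linalg.solve(Kv + lam_v * np.eye(n), Au.T).T   # step 2: columns
    return A

def two_step_krr_predict(ku, kv, A):
    """Score a fully cold-start dyad (u*, v*): ku holds kernel values
    between u* and the training rows, kv likewise for v* and columns."""
    return ku @ A @ kv
```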