60 research outputs found

    Exploiting Universum data in AdaBoost using gradient descent

    Full text link
    Recently, Universum data that does not belong to any class of the training data, has been applied for training better classifiers. In this paper, we address a novel boosting algorithm called UAdaBoost that can improve the classification performance of AdaBoost with Universum data. UAdaBoost chooses a function by minimizing the loss for labeled data and Universum data. The cost function is minimized by a greedy, stagewise, functional gradient procedure. Each training stage of UAdaBoost is fast and efficient. The standard AdaBoost weights labeled samples during training iterations while UAdaBoost gives an explicit weighting scheme for Universum samples as well. In addition, this paper describes the practical conditions for the effectiveness of Universum learning. These conditions are based on the analysis of the distribution of ensemble predictions over training samples. Experiments on handwritten digits classification and gender classification problems are presented. As exhibited by our experimental results, the proposed method can obtain superior performances over the standard AdaBoost by selecting proper Universum data. © 2014 Elsevier B.V

    Breast Cancer Classification by Gene Expression Analysis using Hybrid Feature Selection and Hyper-heuristic Adaptive Universum Support Vector Machine

    Get PDF
    Comprehensive assessments of the molecular characteristics of breast cancer from gene expression patterns can aid in the early identification and treatment of tumor patients. The enormous scale of gene expression data obtained through microarray sequencing increases the difficulty of training the classifier due to large-scale features. Selecting pivotal gene features can minimize high dimensionality and the classifier complexity with improved breast cancer detection accuracy. However, traditional filter and wrapper-based selection methods have scalability and adaptability issues in handling complex gene features. This paper presents a hybrid feature selection method of Mutual Information Maximization - Improved Moth Flame Optimization (MIM-IMFO) for gene selection along with an advanced Hyper-heuristic Adaptive Universum Support classification model Vector Machine (HH-AUSVM) to improve cancer detection rates. The hybrid gene selection method is developed by performing filter-based selection using MIM in the first stage followed by the wrapper method in the second stage, to obtain the pivotal features and remove the inappropriate ones. This method improves standard MFO by a hybrid exploration/exploitation phase to accomplish a better trade-off between exploration and exploitation phases. The classifier HH-AUSVM is formulated by integrating the Adaptive Universum learning approach to the hyper- heuristics-based parameter optimized SVM to tackle the class samples imbalance problem. Evaluated on breast cancer gene expression datasets from Mendeley Data Repository, this proposed MIM-IMFO gene selection-based HH-AUSVM classification approach provided better breast cancer detection with high accuracies of 95.67%, 96.52%, 97.97% and 95.5% and less processing time of 4.28, 3.17, 9.45 and 6.31 seconds, respectively

    Applicability of semi-supervised learning assumptions for gene ontology terms prediction

    Get PDF
    Gene Ontology (GO) is one of the most important resources in bioinformatics, aiming to provide a unified framework for the biological annotation of genes and proteins across all species. Predicting GO terms is an essential task for bioinformatics, but the number of available labelled proteins is in several cases insufficient for training reliable machine learning classifiers. Semi-supervised learning methods arise as a powerful solution that explodes the information contained in unlabelled data in order to improve the estimations of traditional supervised approaches. However, semi-supervised learning methods have to make strong assumptions about the nature of the training data and thus, the performance of the predictor is highly dependent on these assumptions. This paper presents an analysis of the applicability of semi-supervised learning assumptions over the specific task of GO terms prediction, focused on providing judgment elements that allow choosing the most suitable tools for specific GO terms. The results show that semi-supervised approaches significantly outperform the traditional supervised methods and that the highest performances are reached when applying the cluster assumption. Besides, it is experimentally demonstrated that cluster and manifold assumptions are complimentary to each other and an analysis of which GO terms can be more prone to be correctly predicted with each assumption, is provided.Postprint (published version

    Evolving GANs: When Contradictions Turn into Compliance

    Full text link
    Limited availability of labeled-data makes any supervised learning problem challenging. Alternative learning settings like semi-supervised and universum learning alleviate the dependency on labeled data, but still require a large amount of unlabeled data, which may be unavailable or expensive to acquire. GAN-based synthetic data generation methods have recently shown promise by generating synthetic samples to improve task at hand. However, these samples cannot be used for other purposes. In this paper, we propose a GAN game which provides improved discriminator accuracy under limited data settings, while generating realistic synthetic data. This provides the added advantage that now the generated data can be used for other similar tasks. We provide the theoretical guarantees and empirical results in support of our approach.Comment: Generative Adversarial Networks, Universum Learning, Semi-Supervised Learnin

    An efficiency curve for evaluating imbalanced classifiers considering intrinsic data characteristics: Experimental analysis

    Get PDF
    Balancing the accuracy rates of the majority and minority classes is challenging in imbalanced classification. Furthermore, data characteristics have a significant impact on the performance of imbalanced classifiers, which are generally neglected by existing evaluation methods. The objective of this study is to introduce a new criterion to comprehensively evaluate imbalanced classifiers. Specifically, we introduce an efficiency curve that is established using data envelopment analysis without explicit inputs (DEA-WEI), to determine the trade-off between the benefits of improved minority class accuracy and the cost of reduced majority class accuracy. In sequence, we analyze the impact of the imbalanced ratio and typical imbalanced data characteristics on the efficiency of the classifiers. Empirical analyses using 68 imbalanced data reveal that traditional classifiers such as C4.5 and the k-nearest neighbor are more effective on disjunct data, whereas ensemble and undersampling techniques are more effective for overlapping and noisy data. The efficiency of cost-sensitive classifiers decreases dramatically when the imbalanced ratio increases. Finally, we investigate the reasons for the different efficiencies of classifiers on imbalanced data and recommend steps to select appropriate classifiers for imbalanced data based on data characteristics.National Natural Science Foundation of China (NSFC) 71874023 71725001 71771037 7197104

    Support matrix machine: A review

    Full text link
    Support vector machine (SVM) is one of the most studied paradigms in the realm of machine learning for classification and regression problems. It relies on vectorized input data. However, a significant portion of the real-world data exists in matrix format, which is given as input to SVM by reshaping the matrices into vectors. The process of reshaping disrupts the spatial correlations inherent in the matrix data. Also, converting matrices into vectors results in input data with a high dimensionality, which introduces significant computational complexity. To overcome these issues in classifying matrix input data, support matrix machine (SMM) is proposed. It represents one of the emerging methodologies tailored for handling matrix input data. The SMM method preserves the structural information of the matrix data by using the spectral elastic net property which is a combination of the nuclear norm and Frobenius norm. This article provides the first in-depth analysis of the development of the SMM model, which can be used as a thorough summary by both novices and experts. We discuss numerous SMM variants, such as robust, sparse, class imbalance, and multi-class classification models. We also analyze the applications of the SMM model and conclude the article by outlining potential future research avenues and possibilities that may motivate academics to advance the SMM algorithm

    Semi-supervised machine learning techniques for classification of evolving data in pattern recognition

    Get PDF
    The amount of data recorded and processed over recent years has increased exponentially. To create intelligent systems that can learn from this data, we need to be able to identify patterns hidden in the data itself, learn these pattern and predict future results based on our current observations. If we think about this system in the context of time, the data itself evolves and so does the nature of the classification problem. As more data become available, different classification algorithms are suitable for a particular setting. At the beginning of the learning cycle when we have a limited amount of data, online learning algorithms are more suitable. When truly large amounts of data become available, we need algorithms that can handle large amounts of data that might be only partially labeled as a result of the bottleneck in the learning pipeline from human labeling of the data. An excellent example of evolving data is gesture recognition, and it is present throughout our work. We need a gesture recognition system to work fast and with very few examples at the beginning. Over time, we are able to collect more data and the system can improve. As the system evolves, the user expects it to work better and not to have to become involved when the classifier is unsure about decisions. This latter situation produces additional unlabeled data. Another example of an application is medical classification, where experts’ time is a rare resource and the amount of received and labeled data disproportionately increases over time. Although the process of data evolution is continuous, we identify three main discrete areas of contribution in different scenarios. When the system is very new and not enough data are available, online learning is used to learn after every single example and to capture the knowledge very fast. With increasing amounts of data, offline learning techniques are applicable. Once the amount of data is overwhelming and the teacher cannot provide labels for all the data, we have another setup that combines labeled and unlabeled data. These three setups define our areas of contribution; and our techniques contribute in each of them with applications to pattern recognition scenarios, such as gesture recognition and sketch recognition. An online learning setup significantly restricts the range of techniques that can be used. In our case, the selected baseline technique is the Evolving TS-Fuzzy Model. The semi-supervised aspect we use is a relation between rules created by this model. Specifically, we propose a transductive similarity model that utilizes the relationship between generated rules based on their decisions about a query sample during the inference time. The activation of each of these rules is adjusted according to the transductive similarity, and the new decision is obtained using the adjusted activation. We also propose several new variations to the transductive similarity itself. Once the amount of data increases, we are not limited to the online learning setup, and we can take advantage of the offline learning scenario, which normally performs better than the online one because of the independence of sample ordering and global optimization with respect to all samples. We use generative methods to obtain data outside of the training set. Specifically, we aim to improve the previously mentioned TS Fuzzy Model by incorporating semi-supervised learning in the offline learning setup without unlabeled data. We use the Universum learning approach and have developed a method called UFuzzy. This method relies on artificially generated examples with high uncertainty (Universum set), and it adjusts the cost function of the algorithm to force the decision boundary to be close to the Universum data. We were able to prove the hypothesis behind the design of the UFuzzy classifier that Universum learning can improve the TS Fuzzy Model and have achieved improved performance on more than two dozen datasets and applications. With increasing amounts of data, we use the last scenario, in which the data comprises both labeled data and additional non-labeled data. This setting is one of the most common ones for semi-supervised learning problems. In this part of our work, we aim to improve the widely popular tecjniques of self-training (and its successor help-training) that are both meta-frameworks over regular classifier methods but require probabilistic representation of output, which can be hard to obtain in the case of discriminative classifiers. Therefore, we develop a new algorithm that uses the modified active learning technique Query-by-Committee (QbC) to sample data with high certainty from the unlabeled set and subsequently embed them into the original training set. Our new method allows us to achieve increased performance over both a range of datasets and a range of classifiers. These three works are connected by gradually relaxing the constraints on the learning setting in which we operate. Although our main motivation behind the development was to increase performance in various real-world tasks (gesture recognition, sketch recognition), we formulated our work as general methods in such a way that they can be used outside a specific application setup, the only restriction being that the underlying data evolve over time. Each of these methods can successfully exist on its own. The best setting in which they can be used is a learning problem where the data evolve over time and it is possible to discretize the evolutionary process. Overall, this work represents a significant contribution to the area of both semi-supervised learning and pattern recognition. It presents new state-of-the-art techniques that overperform baseline solutions, and it opens up new possibilities for future research

    Exploiting separability in large-scale linear support vector machine training

    Get PDF
    Linear support vector machine training can be represented as a large quadratic program. We present an efficient and numerically stable algorithm for this problem using interior point methods, which requires only O(n) operations per iteration. Through exploiting the separability of the Hessian, we provide a unified approach, from an optimization perspective, to 1-norm classification, 2-norm classification, universum classification, ordinal regression and ɛ-insensitive regression. Our approach has the added advantage of obtaining the hyperplane weights and bias directly from the solver. Numerical experiments indicate that, in contrast to existing methods, the algorithm is largely unaffected by noisy data, and they show training times for our implementation are consistent and highly competitive. We discuss the effect of using multiple correctors, and monitoring the angle of the normal to the hyperplane to determine termination

    Class-Imbalanced Complementary-Label Learning via Weighted Loss

    Full text link
    Complementary-label learning (CLL) is widely used in weakly supervised classification, but it faces a significant challenge in real-world datasets when confronted with class-imbalanced training samples. In such scenarios, the number of samples in one class is considerably lower than in other classes, which consequently leads to a decline in the accuracy of predictions. Unfortunately, existing CLL approaches have not investigate this problem. To alleviate this challenge, we propose a novel problem setting that enables learning from class-imbalanced complementary labels for multi-class classification. To tackle this problem, we propose a novel CLL approach called Weighted Complementary-Label Learning (WCLL). The proposed method models a weighted empirical risk minimization loss by utilizing the class-imbalanced complementary labels, which is also applicable to multi-class imbalanced training samples. Furthermore, we derive an estimation error bound to provide theoretical assurance. To evaluate our approach, we conduct extensive experiments on several widely-used benchmark datasets and a real-world dataset, and compare our method with existing state-of-the-art methods. The proposed approach shows significant improvement in these datasets, even in the case of multiple class-imbalanced scenarios. Notably, the proposed method not only utilizes complementary labels to train a classifier but also solves the problem of class imbalance.Comment: 9 pages, 9 figures, 3 table
    • 

    corecore