47 research outputs found
Exploiting Universum data in AdaBoost using gradient descent
Universum data, which belongs to none of the classes of the training data, has recently been used to train better classifiers. In this paper, we propose a novel boosting algorithm, UAdaBoost, that improves the classification performance of AdaBoost using Universum data. UAdaBoost chooses a function by minimizing a loss over both labeled data and Universum data. The cost function is minimized by a greedy, stagewise, functional gradient procedure, so each training stage of UAdaBoost is fast and efficient. Standard AdaBoost weights labeled samples during training iterations; UAdaBoost additionally provides an explicit weighting scheme for Universum samples. In addition, this paper describes practical conditions for the effectiveness of Universum learning, based on an analysis of the distribution of ensemble predictions over the training samples. Experiments on handwritten-digit classification and gender classification problems are presented. As our experimental results show, the proposed method can achieve superior performance over standard AdaBoost when proper Universum data are selected. © 2014 Elsevier B.V.
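The weighting scheme described above can be sketched as follows. This is a hedged reconstruction, not the authors' exact algorithm: labeled samples keep the usual AdaBoost weights exp(-y*F(x)), while each Universum point enters twice, once with each label, so its paired weights exp(-F(x)) and exp(+F(x)) pull the ensemble output toward zero on Universum data. The decision-stump learner and the constant `C_u` are illustrative choices.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: returns ((feature, threshold, sign), weighted error)."""
    best, best_err = (0, 0.0, 1.0), np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1.0, -1.0):
                pred = s * np.where(X[:, j] <= t, 1.0, -1.0)
                err = w[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (j, t, s)
    return best, best_err

def stump_predict(X, stump):
    j, t, s = stump
    return s * np.where(X[:, j] <= t, 1.0, -1.0)

def uadaboost(X, y, Xu, n_rounds=20, C_u=0.5):
    """Sketch of AdaBoost with Universum samples (hypothetical variant)."""
    # Each Universum point enters twice, once with each label.
    Xa = np.vstack([X, Xu, Xu])
    ya = np.concatenate([y, np.ones(len(Xu)), -np.ones(len(Xu))])
    F = np.zeros(len(Xa))              # ensemble margin on all samples
    ensemble = []
    for _ in range(n_rounds):
        w = np.exp(-ya * F)            # exp(-yF); for Universum: exp(-F) and exp(+F)
        w[len(X):] *= C_u              # down-weight the Universum contribution
        w /= w.sum()
        stump, err = fit_stump(Xa, ya, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        F += alpha * stump_predict(Xa, stump)
        ensemble.append((alpha, stump))
    return ensemble

def predict(X, ensemble):
    return np.sign(sum(a * stump_predict(X, s) for a, s in ensemble))
```

Because the two copies of a Universum point carry opposite labels, minimizing the exponential loss on them is equivalent to penalizing large |F| there, which is what keeps the decision boundary close to the Universum set.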
An efficiency curve for evaluating imbalanced classifiers considering intrinsic data characteristics: Experimental analysis
Balancing the accuracy rates of the majority and minority classes is challenging in imbalanced classification. Furthermore, data characteristics have a significant impact on the performance of imbalanced classifiers, an impact that is generally neglected by existing evaluation methods. The objective of this study is to introduce a new criterion to comprehensively evaluate imbalanced classifiers. Specifically, we introduce an efficiency curve, established using data envelopment analysis without explicit inputs (DEA-WEI), to determine the trade-off between the benefit of improved minority-class accuracy and the cost of reduced majority-class accuracy. Subsequently, we analyze the impact of the imbalance ratio and of typical imbalanced data characteristics on the efficiency of the classifiers. Empirical analyses using 68 imbalanced datasets reveal that traditional classifiers such as C4.5 and the k-nearest neighbor rule are more effective on disjunct data, whereas ensemble and undersampling techniques are more effective for overlapping and noisy data. The efficiency of cost-sensitive classifiers decreases dramatically as the imbalance ratio increases. Finally, we investigate the reasons for the different efficiencies of classifiers on imbalanced data and recommend steps for selecting appropriate classifiers for imbalanced data based on data characteristics.
Funding: National Natural Science Foundation of China (NSFC) grants 71874023, 71725001, 71771037, 7197104
Semi-supervised machine learning techniques for classification of evolving data in pattern recognition
The amount of data recorded and processed in recent years has increased exponentially. To create intelligent systems that can learn from these data, we need to be able to identify patterns hidden in the data itself, learn these patterns, and predict future results based on our current observations. If we think about such a system in the context of time, the data itself evolves and so does the nature of the classification problem, and different classification algorithms become suitable at different stages. At the beginning of the learning cycle, when we have a limited amount of data, online learning algorithms are more suitable. When truly large amounts of data become available, we need algorithms that can handle data that might be only partially labeled, a consequence of the bottleneck that human labeling creates in the learning pipeline.
An excellent example of evolving data is gesture recognition, and it is present throughout our work. We need a gesture recognition system to work fast and with very few examples at the beginning. Over time, we are able to collect more data and the system can improve. As the system evolves, the user expects it to work better and not to have to become involved when the classifier is unsure about decisions. This latter situation produces additional unlabeled data. Another example of an application is medical classification, where experts' time is a scarce resource and the amount of received data grows disproportionately to the amount of labeled data over time.
Although the process of data evolution is continuous, we identify three main discrete areas of contribution in different scenarios. When the system is very new and not enough data are available, online learning is used to learn after every single example and to capture the knowledge very fast. With increasing amounts of data, offline learning techniques become applicable. Once the amount of data is overwhelming and the teacher cannot provide labels for all of it, we have another setup that combines labeled and unlabeled data. These three setups define our areas of contribution, and our techniques contribute to each of them with applications to pattern recognition scenarios such as gesture recognition and sketch recognition.
An online learning setup significantly restricts the range of techniques that can be used. In our case, the selected baseline technique is the Evolving TS-Fuzzy Model. The semi-supervised aspect we use is a relation between rules created by this model. Specifically, we propose a transductive similarity model that utilizes the relationship between generated rules based on their decisions about a query sample during the inference time. The activation of each of these rules is adjusted according to the transductive similarity, and the new decision is obtained using the adjusted activation. We also propose several new variations to the transductive similarity itself.
Once the amount of data increases, we are no longer limited to the online learning setup, and we can take advantage of the offline learning scenario, which normally performs better than the online one owing to its independence from sample ordering and its global optimization over all samples. We use generative methods to obtain data outside of the training set. Specifically, we aim to improve the previously mentioned TS Fuzzy Model by incorporating semi-supervised learning into the offline setup without requiring unlabeled data. We follow the Universum learning approach and have developed a method called UFuzzy. This method relies on artificially generated examples with high uncertainty (the Universum set), and it adjusts the cost function of the algorithm to force the decision boundary to lie close to the Universum data. We were able to confirm the hypothesis behind the design of the UFuzzy classifier, namely that Universum learning can improve the TS Fuzzy Model, and achieved improved performance on more than two dozen datasets and applications.
With increasing amounts of data, we use the last scenario, in which the data comprise both labeled and additional unlabeled data. This setting is one of the most common ones for semi-supervised learning problems. In this part of our work, we aim to improve the widely popular techniques of self-training and its successor help-training, both meta-frameworks over regular classifiers that require a probabilistic representation of the output, which can be hard to obtain from discriminative classifiers. Therefore, we develop a new algorithm that uses a modified version of the active learning technique Query-by-Committee (QbC) to sample data with high certainty from the unlabeled set and subsequently embed them into the original training set. Our new method achieves increased performance over both a range of datasets and a range of classifiers.
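A minimal sketch of the idea in the preceding paragraph, assuming a committee of bootstrap-trained decision trees and unanimity as the certainty criterion (the thesis's actual QbC modification may differ):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def qbc_self_training(X_l, y_l, X_u, n_committee=5, n_rounds=3, seed=0):
    """Sketch: Query-by-Committee used in reverse. Instead of querying the
    most disputed samples (active learning), unlabeled samples on which the
    committee agrees unanimously are pseudo-labeled and absorbed into the
    training set."""
    rng = np.random.default_rng(seed)
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    for _ in range(n_rounds):
        if len(X_u) == 0:
            break
        votes = []
        for _ in range(n_committee):
            idx = rng.integers(0, len(X_l), len(X_l))   # bootstrap resample
            clf = DecisionTreeClassifier(random_state=0).fit(X_l[idx], y_l[idx])
            votes.append(clf.predict(X_u))
        votes = np.array(votes)
        unanimous = (votes == votes[0]).all(axis=0)     # full committee agreement
        if not unanimous.any():
            break
        X_l = np.vstack([X_l, X_u[unanimous]])
        y_l = np.concatenate([y_l, votes[0][unanimous]])
        X_u = X_u[~unanimous]
    # final classifier trained on the enlarged labeled set
    return DecisionTreeClassifier(random_state=0).fit(X_l, y_l), X_u
```

Any base classifier with a `fit`/`predict` interface could replace the decision tree; the point of the committee is precisely that no probabilistic output is needed, only agreement counts.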
These three works are connected by gradually relaxing the constraints on the learning setting in which we operate. Although our main motivation behind the development was to increase performance in various real-world tasks (gesture recognition, sketch recognition), we formulated our work as general methods in such a way that they can be used outside a specific application setup, the only restriction being that the underlying data evolve over time. Each of these methods can successfully exist on its own. The best setting in which they can be used is a learning problem where the data evolve over time and it is possible to discretize the evolutionary process.
Overall, this work represents a significant contribution to the areas of both semi-supervised learning and pattern recognition. It presents new state-of-the-art techniques that outperform baseline solutions, and it opens up new possibilities for future research.
Support matrix machine: A review
Support vector machine (SVM) is one of the most studied paradigms in the
realm of machine learning for classification and regression problems. It relies
on vectorized input data. However, a significant portion of the real-world data
exists in matrix format, which is given as input to SVM by reshaping the
matrices into vectors. The process of reshaping disrupts the spatial
correlations inherent in the matrix data. Also, converting matrices into
vectors results in input data with a high dimensionality, which introduces
significant computational complexity. To overcome these issues in classifying
matrix input data, the support matrix machine (SMM) was proposed; it is one of
the emerging methodologies tailored for handling matrix input data. The SMM
method preserves the structural information of the matrix data by using the
spectral elastic net penalty, a combination of the nuclear norm and the
Frobenius norm. This article provides the first in-depth analysis of the
development of the SMM model, which can be used as a thorough summary by both
novices and experts. We discuss numerous SMM variants, such as robust, sparse,
class imbalance, and multi-class classification models. We also analyze the
applications of the SMM model and conclude the article by outlining potential
future research avenues and possibilities that may motivate academics to
advance the SMM algorithm.
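The spectral elastic net mentioned above has a convenient closed form: the proximal operator of tau*||W||_* + (gamma/2)*||W||_F^2 soft-thresholds the singular values by tau and then shrinks them by 1/(1+gamma). A minimal numpy sketch, where tau and gamma are the two regularization weights (our symbols, not necessarily the review's notation):

```python
import numpy as np

def spectral_elastic_net_prox(Z, tau, gamma):
    """Proximal operator of tau*||W||_* + (gamma/2)*||W||_F^2 at Z:
    soft-threshold the singular values by tau (nuclear-norm part),
    then shrink by 1/(1+gamma) (Frobenius-norm part)."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s = np.maximum(s - tau, 0.0) / (1.0 + gamma)
    return (U * s) @ Vt
```

The thresholding step is what encourages a low-rank (structure-preserving) weight matrix, which is the key difference from vectorizing the input and applying a plain SVM penalty.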
Drawing, Handwriting Processing Analysis: New Advances and Challenges
Drawing and handwriting are communication skills that have been fundamental to geopolitical, ideological, and technological evolution throughout history. Drawing and handwriting remain useful for defining innovative applications in numerous fields. In this regard, researchers have to solve new problems, such as those related to the way drawing and handwriting can become an efficient means of commanding various connected objects, or to validating graphomotor skills as evident and objective sources of data for studying human beings, their capabilities, and their limits from birth to decline.
A New Under-Sampling Method to Face Class Overlap and Imbalance
Class overlap and class imbalance are two data complexities that challenge the design of effective classifiers in pattern recognition and data mining, as they may cause a significant loss in performance. Several solutions have been proposed to address both data difficulties, but most of these approaches tackle each problem separately. In this paper, we propose a two-stage under-sampling technique that combines the DBSCAN clustering algorithm, to remove noisy samples and clean the decision boundary, with a minimum spanning tree algorithm to address the class imbalance, thus handling class overlap and imbalance simultaneously with the aim of improving classifier performance. An extensive experimental study shows significantly better behavior of the new algorithm compared to 12 state-of-the-art under-sampling methods using three standard classification models (the nearest neighbor rule, the J48 decision tree, and a support vector machine with a linear kernel) on both real-life and synthetic databases.
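One plausible reading of the two-stage procedure can be sketched as follows. The second stage here, dropping the majority samples with the shortest incident MST edges (those in the densest regions) until the classes balance, is an assumption, since the paper's exact MST criterion is not reproduced in the abstract:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def two_stage_undersample(X, y, maj, eps=0.8, min_samples=5):
    """Sketch of a DBSCAN + MST under-sampler (stage-2 criterion assumed)."""
    # Stage 1: drop DBSCAN noise points (label -1) to clean the boundary.
    keep = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_ != -1
    X, y = X[keep], y[keep]
    # Stage 2: MST over the majority class; remove the samples whose
    # shortest incident MST edge is smallest (densest regions first).
    maj_idx = np.flatnonzero(y == maj)
    n_min = int((y != maj).sum())
    T = minimum_spanning_tree(squareform(pdist(X[maj_idx]))).toarray()
    T[T == 0] = np.inf                      # non-edges -> infinity
    incid = np.minimum(T.min(axis=0), T.min(axis=1))
    n_drop = max(len(maj_idx) - n_min, 0)
    drop = np.argsort(incid)[:n_drop]
    keep_idx = np.sort(np.concatenate([np.delete(maj_idx, drop),
                                       np.flatnonzero(y != maj)]))
    return X[keep_idx], y[keep_idx]
```

Note that both stages act only by deletion, so the minority class is never altered; only noisy points and redundant majority points are removed.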
Using Interior Point Methods for Large-scale Support Vector Machine training
Support Vector Machines (SVMs) are powerful machine learning techniques for classification
and regression, but the training stage involves a convex quadratic optimization program
that is most often computationally expensive. Traditionally, active-set methods have been
used rather than interior point methods, due to the Hessian in the standard dual formulation
being completely dense. But as active-set methods are essentially sequential, they may not
be adequate for machine learning challenges of the future. Additionally, training time may be
limited, or data may grow so large that cluster-computing approaches need to be considered.
Interior point methods have the potential to answer these concerns directly. They scale
efficiently, they can provide good early approximations, and they are suitable for parallel
and multi-core environments. To apply them to SVM training, it is necessary to address
directly the most computationally expensive aspect of the algorithm. We therefore present an
exact reformulation of the standard linear SVM training optimization problem that exploits
separability of terms in the objective. By so doing, the per-iteration computational
complexity is reduced from O(n³) to O(n). We show how this reformulation can be
applied to many machine learning problems in the SVM family.
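The thesis's exact reformulation is not reproduced in this abstract, but the generic mechanism behind such an O(n³) to O(n) reduction can be illustrated: when the dense dual Hessian has the form diag(d) + V Vᵀ with V an n x m factor (m features, m much smaller than n), the per-iteration linear solve reduces to a small m x m system via the Sherman-Morrison-Woodbury identity:

```python
import numpy as np

def solve_diag_plus_lowrank(d, V, r):
    """Solve (diag(d) + V V^T) x = r in O(n m^2) rather than O(n^3),
    using the Sherman-Morrison-Woodbury identity.
    d: (n,) positive diagonal, V: (n, m) with m << n, r: (n,)."""
    Dinv_r = r / d
    Dinv_V = V / d[:, None]
    # small (m x m) capacitance matrix
    S = np.eye(V.shape[1]) + V.T @ Dinv_V
    return Dinv_r - Dinv_V @ np.linalg.solve(S, V.T @ Dinv_r)
```

Because only the diagonal d changes between interior-point iterations, each iteration costs O(n m²) in this form, i.e. linear in the number of training samples.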
Implementation issues relating to specializing the algorithm are explored through extensive
numerical experiments. They show that the performance of our algorithm on large dense
or noisy data sets is consistent and highly competitive, and in some cases it can outperform
all other approaches by a large margin. Unlike active-set methods, its performance is largely
unaffected by noisy data. We also show how, by exploiting the block structure of the augmented
system matrix, a hybrid MPI/OpenMP implementation of the algorithm enables data and
linear algebra computations to be efficiently partitioned amongst parallel processing nodes
in a clustered computing environment.
The applicability of our technique is extended to nonlinear SVMs by low-rank approximation
of the kernel matrix. We develop a heuristic designed to represent clusters using a
small number of features. Additionally, an early approximation scheme reduces the number of samples that need to be considered. Both elements improve the computational efficiency
of the training phase.
Taken as a whole, this thesis shows that with suitable problem formulation and efficient
implementation techniques, interior point methods are a viable optimization technology to
apply to large-scale SVM training, and are able to provide state-of-the-art performance.
Selection of contributing factors for predicting landslide susceptibility using machine learning and deep learning models
Landslides are a common natural disaster that can cause casualties, threats to
property, and economic losses. Therefore, it is important to understand or
predict the probability of landslide occurrence at potentially risky sites. A
commonly used means is to carry out a landslide susceptibility assessment based
on a landslide inventory and a set of landslide contributing factors. This can
be readily achieved using machine learning (ML) models such as logistic
regression (LR), support vector machine (SVM), random forest (RF), extreme
gradient boosting (XGBoost), or deep learning (DL) models such as convolutional
neural networks (CNN) and long short-term memory (LSTM) networks. As the input data for
these models, landslide contributing factors have varying influences on
landslide occurrence. Therefore, it is logically feasible to select more
important contributing factors and eliminate less relevant ones, with the aim
of increasing the prediction accuracy of these models. However, selecting more
important factors is still a challenging task and there is no generally
accepted method. Furthermore, the effects of factor selection using various
methods on the prediction accuracy of ML and DL models are unclear. In this
study, the impact of the selection of contributing factors on the accuracy of
landslide susceptibility predictions using ML and DL models was investigated.
Five methods for selecting contributing factors were considered for all the
aforementioned ML and DL models: Information Gain Ratio (IGR), Recursive
Feature Elimination (RFE), Particle Swarm Optimization (PSO), the Least
Absolute Shrinkage and Selection Operator (LASSO), and Harris Hawks Optimization
(HHO). In addition, autoencoder-based factor selection methods for DL models
were also investigated. To assess their performances, an exhaustive approach
was adopted, ...
Comment: Stochastic Environmental Research and Risk Assessment
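As an illustration of one of the listed selectors, RFE is available off the shelf in scikit-learn. The data below are a synthetic stand-in, not a landslide inventory: a hypothetical set of 12 candidate contributing factors of which 4 are informative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a landslide inventory with 12 candidate factors.
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           n_redundant=0, random_state=0)

# Recursively drop the weakest factor (by model coefficient) until 6 remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=6)
selector.fit(X, y)
selected = np.flatnonzero(selector.support_)   # indices of retained factors
```

The same `fit` call works with any estimator exposing `coef_` or `feature_importances_`, so the ML models named in the abstract (LR, SVM with a linear kernel, RF, XGBoost) can all serve as the base model for RFE.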
IRS-BAG-Integrated Radius-SMOTE Algorithm with Bagging Ensemble Learning Model for Imbalanced Data Set Classification
Imbalanced learning problems are a challenge faced by classifiers when data samples have an unbalanced distribution among classes. The Synthetic Minority Over-sampling Technique (SMOTE) is one of the most well-known data pre-processing methods. Problems that arise when oversampling with SMOTE are noise, small disjunct samples, and overfitting due to a high imbalance ratio in the dataset. A high imbalance ratio combined with low variance causes the generated synthetic data to be concentrated in narrow and conflicting regions among classes, making it susceptible to overfitting during the learning process. Therefore, this research proposes a combination of Radius-SMOTE and the bagging algorithm, called the IRS-BAG model. For each sub-sample generated by bootstrapping, oversampling is performed using Radius-SMOTE; oversampling the sub-samples is likely to mitigate the overfitting problems that might otherwise occur. Experiments were carried out comparing the performance of the IRS-BAG model with various previous oversampling methods on imbalanced public datasets. The experimental results using three different classifiers show that all classifiers gained a notable improvement when combined with the proposed IRS-BAG model compared with previous state-of-the-art oversampling methods. Doi: 10.28991/ESJ-2023-07-05-04
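The bootstrap-then-oversample pipeline can be sketched as follows. The radius cap and nearest-neighbor interpolation below are simplifications of Radius-SMOTE, not the authors' exact formulation, and the base classifiers trained on each bag are omitted:

```python
import numpy as np

def radius_smote(X_min, n_new, radius, rng):
    """Hypothetical minimal Radius-SMOTE: interpolate toward a nearest
    minority neighbor, capping the synthetic step length at `radius`."""
    idx = rng.integers(0, len(X_min), n_new)
    out = []
    for i in idx:
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf
        nn = np.argmin(d)
        step = (X_min[nn] - X_min[i]) * rng.random()
        norm = np.linalg.norm(step)
        if norm > radius:                     # keep synthesis in a safe radius
            step *= radius / norm
        out.append(X_min[i] + step)
    return np.array(out)

def irs_bag_subsamples(X, y, minority, n_bags=5, radius=1.0, seed=0):
    """Sketch of the IRS-BAG data pipeline: bootstrap each bag, then
    balance the bag by oversampling its minority class with Radius-SMOTE."""
    rng = np.random.default_rng(seed)
    bags = []
    for _ in range(n_bags):
        b = rng.integers(0, len(X), len(X))   # bootstrap resample
        Xb, yb = X[b], y[b]
        X_min = Xb[yb == minority]
        n_new = int((yb != minority).sum()) - len(X_min)
        if n_new > 0 and len(X_min) > 0:
            Xs = radius_smote(X_min, n_new, radius, rng)
            Xb = np.vstack([Xb, Xs])
            yb = np.concatenate([yb, np.full(n_new, minority)])
        bags.append((Xb, yb))
    return bags
```

Because each bag is oversampled independently, the synthetic points differ across bags, which is what gives the ensemble its diversity and reduces the overfitting risk of oversampling the full dataset once.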