Coupling different methods for overcoming the class imbalance problem
Many classification problems must deal with imbalanced datasets where one class – the majority class – outnumbers the other classes. Standard classification methods do not provide accurate predictions in this setting since classification is generally biased towards the majority class. The minority classes are oftentimes the ones of interest (e.g., when they are associated with pathological conditions in patients), so methods for handling imbalanced datasets are critical.
Using several different datasets, this paper evaluates the performance of state-of-the-art classification methods for handling the imbalance problem in both binary and multi-class datasets. Different strategies are considered, including the one-class and dimension reduction approaches, as well as their fusions. Moreover, some ensembles of classifiers are tested, in addition to stand-alone classifiers, to assess the effectiveness of ensembles in the presence of imbalance. Finally, a novel ensemble of ensembles is designed specifically to tackle the problem of class imbalance: the proposed ensemble does not need to be tuned separately for each dataset and outperforms all the other tested approaches.
To validate our classifiers we resort to the KEEL-dataset repository, whose data partitions (training/test) are publicly available and have already been used in the open literature; as a consequence, a fair comparison among different approaches can be reported.
The MATLAB code of our best approach, together with datasets not easily accessible elsewhere, will be made available at https://www.dei.unipd.it/node/2357
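As background to the imbalance-handling strategies discussed above, a minimal balanced-undersampling ensemble can be sketched as follows. This is an illustrative sketch in the spirit of EasyEnsemble, not the paper's proposed ensemble of ensembles; the nearest-centroid base learner and all parameters are placeholder choices.

```python
import numpy as np

def centroid_classifier(X, y):
    """Nearest-centroid base learner; returns a predict function."""
    classes = np.unique(y)
    centroids = {c: X[y == c].mean(axis=0) for c in classes}
    def predict(Xq):
        d = np.stack([np.linalg.norm(Xq - centroids[c], axis=1) for c in classes])
        return classes[np.argmin(d, axis=0)]
    return predict

def easy_ensemble(X, y, n_members=5, seed=0):
    """Balanced-undersampling ensemble: each member is trained on all
    minority samples plus an equally sized random subset of the majority
    class; the final prediction is a majority vote across members."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority, majority = classes[np.argmin(counts)], classes[np.argmax(counts)]
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y == majority)
    members = []
    for _ in range(n_members):
        # undersample the majority class down to the minority-class size
        sub = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([min_idx, sub])
        members.append(centroid_classifier(X[idx], y[idx]))
    def predict(Xq):
        votes = np.stack([m(Xq) for m in members])
        # per-sample majority vote over member predictions
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
    return predict
```

Because every member sees a balanced training sample, no single member is biased towards the majority class, while the vote recovers some of the information discarded by undersampling.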
Scalable and fast heterogeneous molecular simulation with predictive parallelization schemes
Multiscale and inhomogeneous molecular systems are challenging topics in the field of molecular simulation. In particular, modeling biological systems in the context of multiscale simulations and exploring material properties are driving a permanent development of new simulation methods and optimization algorithms. In computational terms, those methods require parallelization schemes that make productive use of computational resources for each simulation from its genesis. Here, we introduce the heterogeneous domain decomposition approach, which is a combination of a heterogeneity-sensitive spatial domain decomposition with an \textit{a priori} rearrangement of subdomain walls. Within this approach, theoretical models and scaling laws for the force computation time are proposed and studied as a function of the number of particles and the spatial resolution ratio. We also show the capabilities of the new approach by comparing it to both static domain decomposition algorithms and dynamic load balancing schemes. Specifically, two representative molecular systems have been simulated and compared to the heterogeneous domain decomposition proposed in this work: an adaptive resolution simulation of a biomolecule solvated in water and a phase-separated binary Lennard-Jones fluid.
Comment: 14 pages, 12 figures
Applications of lattice QCD techniques for condensed matter systems
We review the application of lattice QCD techniques, most notably Hybrid Monte-Carlo (HMC) simulations, to the first-principles study of tight-binding models of crystalline solids with strong inter-electron interactions. After providing a basic introduction to the HMC algorithm as applied to condensed matter systems, we review HMC simulations of graphene, which in recent years have helped to understand the semi-metal behavior of clean suspended graphene at the quantitative level. We also briefly summarize other novel physical results obtained in these simulations. Then we comment on the applicability of Hybrid Monte-Carlo to topological insulators and to Dirac and Weyl semi-metals, and highlight some of the relevant open physical problems. Finally, we also touch upon the lattice strong-coupling expansion technique as applied to condensed matter systems.
Comment: 20 pages, 5 figures. Contribution to the IJMPA special issue "Lattice gauge theory beyond QCD". List of references updated.
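For readers unfamiliar with the algorithm, HMC combines leapfrog integration of fictitious Hamiltonian dynamics with a Metropolis accept/reject step that keeps the sampler exact despite integration error. A minimal sketch for a one-dimensional toy distribution (vastly simpler than a lattice field theory, but with the same structure) might look like:

```python
import numpy as np

def hmc_sample(logp, grad_logp, x0, n_samples=3000, eps=0.2, n_leap=10, seed=0):
    """Minimal 1D Hybrid Monte-Carlo sampler for a target with
    log-density `logp` and gradient `grad_logp`."""
    rng = np.random.default_rng(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        p = rng.normal()                        # resample momentum
        x_new, p_new = x, p
        p_new += 0.5 * eps * grad_logp(x_new)   # initial half kick
        for _ in range(n_leap - 1):
            x_new += eps * p_new                # drift
            p_new += eps * grad_logp(x_new)     # full kick
        x_new += eps * p_new
        p_new += 0.5 * eps * grad_logp(x_new)   # final half kick
        # log acceptance ratio = -(H_new - H_old), with H = -logp + p^2/2
        d = (logp(x_new) - 0.5 * p_new**2) - (logp(x) - 0.5 * p**2)
        if np.log(rng.uniform()) < d:
            x = x_new                           # Metropolis accept
        samples.append(x)
    return np.array(samples)
```

In lattice simulations the scalar `x` becomes the whole field configuration and `grad_logp` involves the fermion determinant, but the leapfrog-plus-Metropolis skeleton is the same.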
Negative Correlation Learning for Customer Churn Prediction: A Comparison Study
Recently, telecommunication companies have been paying more attention to the problem of identifying customer churn behavior. In business, it is well known to service providers that attracting new customers is much more expensive than retaining existing ones. Therefore, adopting accurate models that are able to predict customer churn can effectively help in customer retention campaigns and in maximizing profit. In this paper we utilize an ensemble of multilayer perceptrons (MLPs) trained using negative correlation learning (NCL) to predict customer churn in a telecommunication company. Experimental results confirm that the NCL-based MLP ensemble achieves better generalization performance (a higher detected churn rate) than an ensemble of MLPs trained without NCL (a flat ensemble) and other common data mining techniques used for churn analysis.
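The NCL idea can be sketched with linear models standing in for the MLPs: each ensemble member minimizes its own squared error plus a penalty term that rewards disagreement with the ensemble mean, so errors of the members become negatively correlated. The linear base learner and all hyperparameters below are illustrative choices, not the paper's setup.

```python
import numpy as np

def train_ncl_ensemble(X, y, n_members=4, lam=0.5, lr=0.1, epochs=3000, seed=0):
    """Negative correlation learning sketch: member i minimizes
    (f_i - y)^2 + lam * (f_i - fbar) * sum_{j != i} (f_j - fbar),
    whose gradient w.r.t. f_i reduces to (f_i - y) - lam * (f_i - fbar)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, 0.1, size=(n_members, d))
    b = np.zeros(n_members)
    for _ in range(epochs):
        F = X @ W.T + b                        # (n, members) member outputs
        fbar = F.mean(axis=1, keepdims=True)   # ensemble output
        # NCL error signal: accuracy term minus diversity-encouraging term
        err = (F - y[:, None]) - lam * (F - fbar)
        W -= lr * (err.T @ X) / len(X)
        b -= lr * err.mean(axis=0)
    return W, b

def ensemble_predict(W, b, X):
    """Ensemble output is the simple average of the members."""
    return (X @ W.T + b).mean(axis=1)
```

Setting `lam=0` recovers independent training (the "flat" ensemble); larger `lam` trades individual accuracy for diversity.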
Software defect prediction framework based on hybrid metaheuristic optimization methods
A software defect is an error, failure, or fault in software that produces an incorrect or unexpected result. Software defects are expensive in terms of quality and cost. Accurate prediction of defect-prone software modules can assist the testing effort, reduce costs, and improve software quality. Classification algorithms are a popular machine learning approach to software defect prediction. Unfortunately, software defect prediction remains a largely unsolved problem: comparison and benchmarking results of defect prediction using machine learning classifiers indicate that poor accuracy levels are dominant and that no particular classifier performs best for all datasets. There are two main problems that affect classification performance in software defect prediction: noisy attributes and imbalanced class distributions of datasets, and the difficulty of selecting optimal parameters for the classifiers. In this study, a software defect prediction framework has been proposed that combines metaheuristic optimization methods for feature selection and parameter optimization with meta-learning methods for solving the imbalanced class problem, aiming to improve the accuracy of classification models. The proposed framework and models that are considered the specific research contributions of this thesis are: 1) a comparison framework of classification models for software defect prediction, known as CF-SDP; 2) a hybrid genetic-algorithm-based feature selection and bagging technique for software defect prediction, known as GAFS+B; 3) a hybrid particle-swarm-optimization-based feature selection and bagging technique for software defect prediction, known as PSOFS+B; and 4) a hybrid genetic-algorithm-based neural network parameter optimization and bagging technique for software defect prediction, known as NN-GAPO+B. For the purposes of this study, ten classification algorithms have been selected, aiming at a balance among established classification algorithms used in software defect prediction. The proposed framework and methods are evaluated using state-of-the-art datasets from the NASA metric data repository. The results indicate that the proposed methods (GAFS+B, PSOFS+B and NN-GAPO+B) make an impressive improvement in the performance of software defect prediction. GAFS+B and PSOFS+B significantly improved the performance of classifiers that suffer from class imbalance, such as C4.5 and CART, and also outperformed existing software defect prediction frameworks on most datasets. Based on the conducted experiments, logistic regression performs best on most of the NASA MDP datasets, with or without a feature selection method. The proposed methods also identified the most relevant features for software defect prediction. The top ten most relevant features include branch count metrics, decision density, the Halstead level metric of a module, the number of operands contained in a module, maintenance severity, the number of blank LOC, Halstead volume, the number of unique operands contained in a module, the total number of LOC, and design density.
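The genetic-algorithm feature selection at the heart of approaches like GAFS+B can be sketched as follows: individuals are binary masks over the feature set, evolved by tournament selection, uniform crossover and bit-flip mutation, with a wrapper fitness scoring each mask. This toy sketch omits the bagging stage; the fitness function and all parameters are illustrative assumptions.

```python
import numpy as np

def ga_feature_selection(X, y, fitness, n_gen=60, pop_size=20, p_mut=0.1, seed=0):
    """GA wrapper feature selection: evolve binary feature masks,
    scoring each mask with `fitness(X_selected, y)`."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, d))
    def score(mask):
        m = mask.astype(bool)
        return fitness(X[:, m], y) if m.any() else -np.inf
    for _ in range(n_gen):
        fit = np.array([score(m) for m in pop])
        new = [pop[fit.argmax()].copy()]                 # elitism: keep best mask
        while len(new) < pop_size:
            parents = []
            for _ in range(2):                           # tournament selection
                i, j = rng.integers(0, pop_size, 2)
                parents.append(pop[i] if fit[i] >= fit[j] else pop[j])
            cross = rng.integers(0, 2, d).astype(bool)   # uniform crossover
            child = np.where(cross, parents[0], parents[1])
            flip = rng.random(d) < p_mut                 # bit-flip mutation
            new.append(np.where(flip, 1 - child, child))
        pop = np.array(new)
    fit = np.array([score(m) for m in pop])
    return pop[fit.argmax()].astype(bool)
```

In a full GAFS+B-style pipeline the fitness would be the validation performance of a bagged classifier trained on the selected features; here any cheap surrogate works.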
mldr.resampling: Efficient Reference Implementations of Multilabel Resampling Algorithms
Resampling algorithms are a useful approach to deal with imbalanced learning in multilabel scenarios. These methods have to deal with singularities in the multilabel data, such as the occurrence of frequent and infrequent labels in the same instance. Implementations of these methods are sometimes limited to the pseudocode provided by their authors in a paper. This Original Software Publication presents mldr.resampling, a software package that provides reference implementations for eleven multilabel resampling methods, with an emphasis on efficiency, since these algorithms are usually time-consuming.
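To illustrate the family of algorithms such a package implements, here is a minimal multilabel random oversampling routine in the spirit of ML-ROS. The minority-label threshold and the cloning policy are simplifications of the published algorithm, not the package's implementation.

```python
import numpy as np

def ml_ros(X, Y, pct=25, seed=0):
    """Multilabel random oversampling sketch: labels rarer than the mean
    label frequency are treated as minority labels, and instances that
    carry at least one of them are cloned at random until the dataset
    has grown by `pct` percent. Y is a binary indicator matrix
    of shape (n_instances, n_labels)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    freq = Y.sum(axis=0)
    minority = np.flatnonzero(freq < freq.mean())          # rare labels
    pool = np.flatnonzero(Y[:, minority].any(axis=1))      # clonable instances
    n_clones = n * pct // 100
    clones = rng.choice(pool, size=n_clones, replace=True)
    return np.vstack([X, X[clones]]), np.vstack([Y, Y[clones]])
```

Note the multilabel singularity the abstract mentions: a cloned instance may carry both frequent and infrequent labels, so oversampling a rare label unavoidably also inflates any co-occurring frequent ones.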