
    Coupling different methods for overcoming the class imbalance problem

    Many classification problems must deal with imbalanced datasets where one class – the majority class – outnumbers the other classes. Standard classification methods do not provide accurate predictions in this setting since classification is generally biased towards the majority class. The minority classes are oftentimes the ones of interest (e.g., when they are associated with pathological conditions in patients), so methods for handling imbalanced datasets are critical. Using several different datasets, this paper evaluates the performance of state-of-the-art classification methods for handling the imbalance problem in both binary and multi-class datasets. Different strategies are considered, including the one-class and dimension-reduction approaches, as well as their fusions. Moreover, some ensembles of classifiers are tested, in addition to stand-alone classifiers, to assess the effectiveness of ensembles in the presence of imbalance. Finally, a novel ensemble of ensembles is designed specifically to tackle the problem of class imbalance: the proposed ensemble does not need to be tuned separately for each dataset and outperforms all the other tested approaches. To validate our classifiers we use the KEEL-dataset repository, whose data partitions (training/test) are publicly available and have already been used in the open literature; as a consequence, a fair comparison among different approaches in the literature can be reported. Our best approach (MATLAB code and datasets not easily accessible elsewhere) will be available at https://www.dei.unipd.it/node/2357
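
    The exact composition of the ensemble of ensembles is not spelled out in the abstract; purely as an illustration of one common building block such methods combine, the following sketch (NumPy and scikit-learn assumed, function names hypothetical) trains each ensemble member on the full minority class plus an equal-sized random undersample of the majority class, then aggregates by majority vote.

        import numpy as np
        from sklearn.tree import DecisionTreeClassifier

        def undersampled_ensemble(X, y, n_members=10, seed=0):
            """Train one tree per balanced subsample: all minority instances
            plus an equal-sized random draw from the majority class."""
            rng = np.random.default_rng(seed)
            classes, counts = np.unique(y, return_counts=True)
            minority = classes[np.argmin(counts)]
            min_idx = np.flatnonzero(y == minority)
            maj_idx = np.flatnonzero(y != minority)
            members = []
            for _ in range(n_members):
                sampled_maj = rng.choice(maj_idx, size=min_idx.size, replace=False)
                idx = np.concatenate([min_idx, sampled_maj])
                members.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
            return members

        def vote(members, X):
            # Majority vote across the balanced members (assumes
            # integer-encoded class labels).
            preds = np.stack([m.predict(X) for m in members])
            return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, preds)

    Combining several such ensembles built with different base learners would approximate an ensemble of ensembles in spirit, though the paper's actual design may differ.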

    Scalable and fast heterogeneous molecular simulation with predictive parallelization schemes

    Multiscale and inhomogeneous molecular systems are challenging topics in the field of molecular simulation. In particular, modeling biological systems in the context of multiscale simulations and exploring material properties are driving a permanent development of new simulation methods and optimization algorithms. In computational terms, those methods require parallelization schemes that make productive use of computational resources for each simulation from its outset. Here, we introduce the heterogeneous domain decomposition approach, which combines a heterogeneity-sensitive spatial domain decomposition with an a priori rearrangement of subdomain walls. Within this approach, theoretical models and scaling laws for the force computation time are proposed and studied as a function of the number of particles and the spatial resolution ratio. We also demonstrate the capabilities of the new approach by comparing it to both static domain decomposition algorithms and dynamic load balancing schemes. Specifically, two representative molecular systems have been simulated to compare these schemes with the heterogeneous domain decomposition proposed in this work: an adaptive resolution simulation of a biomolecule solvated in water, and a phase-separated binary Lennard-Jones fluid.
    Comment: 14 pages, 12 figures
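
    The abstract does not give the wall-placement rule; as a rough one-dimensional illustration of an a priori rearrangement of subdomain walls, the sketch below (NumPy only; the per-particle cost weights are an assumption) places walls so that every subdomain carries approximately the same estimated force-computation cost rather than the same volume.

        import numpy as np

        def place_walls(positions, weights, n_domains):
            """Choose wall coordinates along one axis so each subdomain holds
            roughly equal estimated cost (weights), not equal volume."""
            order = np.argsort(positions)
            pos, w = positions[order], weights[order]
            cum = np.cumsum(w)
            targets = cum[-1] * np.arange(1, n_domains) / n_domains
            # A wall sits where the cumulative cost crosses each target share.
            return pos[np.searchsorted(cum, targets)]

        # Example: cheap coarse-grained region next to a costly fine-grained one.
        rng = np.random.default_rng(1)
        x = rng.uniform(0.0, 10.0, 100_000)
        cost = np.where(x < 5.0, 1.0, 4.0)   # assumed 4x cost in the fine region
        print(place_walls(x, cost, 4))       # walls crowd into the expensive half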

    Applications of lattice QCD techniques for condensed matter systems

    We review the application of lattice QCD techniques, most notably Hybrid Monte Carlo (HMC) simulations, to first-principles studies of tight-binding models of crystalline solids with strong inter-electron interactions. After providing a basic introduction to the HMC algorithm as applied to condensed matter systems, we review HMC simulations of graphene, which in recent years have helped to understand the semi-metal behavior of clean suspended graphene at the quantitative level. We also briefly summarize other novel physical results obtained in these simulations. Then we comment on the applicability of Hybrid Monte Carlo to topological insulators and Dirac and Weyl semi-metals and highlight some of the relevant open physical problems. Finally, we also touch upon the lattice strong-coupling expansion technique as applied to condensed matter systems.
    Comment: 20 pages, 5 figures; contribution to the IJMPA special issue "Lattice gauge theory beyond QCD"; list of references updated
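
    As a minimal illustration of the HMC algorithm the review builds on (a toy quadratic action rather than the fermionic tight-binding actions actually simulated), the following NumPy sketch performs the standard momentum refresh, leapfrog integration, and Metropolis accept/reject step.

        import numpy as np

        rng = np.random.default_rng(0)

        def action(phi):                 # toy quadratic action, S = sum(phi^2)/2
            return 0.5 * np.sum(phi**2)

        def grad_action(phi):
            return phi

        def hmc_step(phi, n_leapfrog=20, eps=0.1):
            """One HMC update: refresh momenta, integrate Hamiltonian dynamics
            with a leapfrog scheme, then Metropolis accept/reject."""
            p = rng.standard_normal(phi.shape)
            h_old = action(phi) + 0.5 * np.sum(p**2)
            q, p_new = phi.copy(), p.copy()
            p_new -= 0.5 * eps * grad_action(q)       # initial half kick
            for _ in range(n_leapfrog - 1):
                q += eps * p_new                      # drift
                p_new -= eps * grad_action(q)         # full kick
            q += eps * p_new
            p_new -= 0.5 * eps * grad_action(q)       # final half kick
            h_new = action(q) + 0.5 * np.sum(p_new**2)
            if rng.random() < np.exp(h_old - h_new):  # Metropolis test
                return q, True
            return phi, False

        phi, accepted = np.zeros(64), 0
        for _ in range(1000):
            phi, ok = hmc_step(phi)
            accepted += ok
        print("acceptance rate:", accepted / 1000)

    In lattice QCD-style condensed matter simulations the gradient of the action additionally involves the fermionic determinant via pseudofermion fields, which this toy sketch omits.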

    Negative Correlation Learning for Customer Churn Prediction: A Comparison Study

    Recently, telecommunication companies have been paying more attention to the problem of identifying customer churn behavior. In business, it is well known among service providers that attracting new customers is much more expensive than retaining existing ones. Therefore, adopting accurate models that are able to predict customer churn can effectively help in customer retention campaigns and maximize profit. In this paper we utilize an ensemble of multilayer perceptrons (MLPs) trained with negative correlation learning (NCL) for predicting customer churn in a telecommunication company. Experimental results confirm that the NCL-based MLP ensemble can achieve better generalization performance (a higher churn detection rate) compared with an ensemble of MLPs trained without NCL (a flat ensemble) and other common data mining techniques used for churn analysis.
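
    The defining ingredient of NCL is a penalty that decorrelates each member's errors from the rest of the ensemble: for member i with output f_i and ensemble mean f̄, the squared-error output gradient (f_i - y) becomes (f_i - y) - λ(f_i - f̄). The sketch below (NumPy, synthetic data; all sizes and rates are assumptions) trains a small ensemble of one-hidden-layer MLPs with this modified gradient.

        import numpy as np

        rng = np.random.default_rng(0)

        class TinyMLP:
            def __init__(self, n_in, n_hidden=8):
                self.W1 = rng.normal(0, 0.5, (n_in, n_hidden))
                self.W2 = rng.normal(0, 0.5, (n_hidden, 1))

            def forward(self, X):
                self.h = np.tanh(X @ self.W1)
                self.out = 1.0 / (1.0 + np.exp(-(self.h @ self.W2)))  # sigmoid
                return self.out

            def backward(self, X, delta, lr=0.1):
                # delta: error at the pre-sigmoid output, shape (n, 1)
                gW2 = self.h.T @ delta / len(X)
                dh = (delta @ self.W2.T) * (1 - self.h**2)
                gW1 = X.T @ dh / len(X)
                self.W2 -= lr * gW2
                self.W1 -= lr * gW1

        def ncl_epoch(nets, X, y, lam=0.5):
            """One NCL pass: the penalty term -lam * (f_i - f_bar) pushes each
            member to disagree with the ensemble mean (negative correlation)."""
            outs = [net.forward(X) for net in nets]
            f_bar = np.mean(outs, axis=0)
            for net, f in zip(nets, outs):
                err = (f - y) - lam * (f - f_bar)   # NCL-modified output error
                delta = err * f * (1 - f)           # through the sigmoid
                net.backward(X, delta)

        # Synthetic stand-in for churn data (illustration only).
        X = rng.normal(size=(500, 10))
        y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)
        nets = [TinyMLP(10) for _ in range(5)]
        for _ in range(200):
            ncl_epoch(nets, X, y)
        pred = np.mean([n.forward(X) for n in nets], axis=0) > 0.5
        print("train accuracy:", (pred == (y > 0.5)).mean())

    With λ = 0 this reduces to independently trained members (the flat ensemble of the comparison); larger λ trades individual accuracy for ensemble diversity.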

    Software defect prediction framework based on hybrid metaheuristic optimization methods

    A software defect is an error, failure, or fault in software that produces an incorrect or unexpected result. Software defects are expensive in terms of both quality and cost. Accurate prediction of defect-prone software modules can assist testing effort, reduce costs, and improve software quality. Classification is a popular machine learning approach for software defect prediction. Unfortunately, software defect prediction remains a largely unsolved problem: comparison and benchmarking results of defect prediction using machine learning classifiers indicate that poor accuracy levels are dominant and that no particular classifier performs best for all datasets. Two main problems affect classification performance in software defect prediction: the noisy attributes and imbalanced class distributions of the datasets, and the difficulty of selecting optimal classifier parameters. In this study, a software defect prediction framework is proposed that combines metaheuristic optimization methods for feature selection and parameter optimization with meta-learning methods for solving the imbalanced class problem, with the aim of improving the accuracy of classification models. The proposed framework and models that constitute the specific research contributions of this thesis are: 1) a comparison framework of classification models for software defect prediction, known as CF-SDP; 2) a hybrid genetic-algorithm-based feature selection and bagging technique for software defect prediction, known as GAFS+B; 3) a hybrid particle-swarm-optimization-based feature selection and bagging technique for software defect prediction, known as PSOFS+B; and 4) a hybrid genetic-algorithm-based neural network parameter optimization and bagging technique for software defect prediction, known as NN-GAPO+B. For the purpose of this study, ten classification algorithms have been selected, aiming at a balanced representation of established classification algorithms used in software defect prediction. The proposed framework and methods are evaluated using state-of-the-art datasets from the NASA metric data repository. The results indicate that the proposed methods (GAFS+B, PSOFS+B and NN-GAPO+B) yield an impressive improvement in the performance of software defect prediction. GAFS+B and PSOFS+B significantly improved the performance of classifiers that suffer from class imbalance, such as C4.5 and CART, and also outperformed the existing software defect prediction frameworks on most datasets. Based on the conducted experiments, logistic regression performs best on most of the NASA MDP datasets, with or without feature selection. The proposed methods also identified the most relevant features for software defect prediction. The top ten most relevant features include branch count metrics, decision density, the Halstead level metric of a module, the number of operands contained in a module, maintenance severity, the number of blank LOC, Halstead volume, the number of unique operands contained in a module, the total number of LOC, and design density.
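
    As a sketch of the general GAFS+B idea (not the thesis implementation), the following code (scikit-learn; population size, generation count, and mutation rate are assumptions) evolves feature bitmasks with a genetic algorithm whose fitness is the cross-validated accuracy of a bagged decision-tree classifier restricted to the selected features.

        import numpy as np
        from sklearn.ensemble import BaggingClassifier
        from sklearn.model_selection import cross_val_score
        from sklearn.tree import DecisionTreeClassifier

        rng = np.random.default_rng(0)

        def fitness(mask, X, y):
            """Cross-validated accuracy of a bagged tree on the selected features."""
            if not mask.any():
                return 0.0
            clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                                    random_state=0)
            return cross_val_score(clf, X[:, mask], y, cv=3).mean()

        def ga_feature_selection(X, y, pop_size=20, generations=15, p_mut=0.05):
            n = X.shape[1]
            pop = rng.random((pop_size, n)) < 0.5        # random feature bitmasks
            for _ in range(generations):
                scores = np.array([fitness(ind, X, y) for ind in pop])
                # Keep the fittest half, then refill the population with
                # single-point crossover and bit-flip mutation.
                parents = pop[np.argsort(scores)[-(pop_size // 2):]]
                children = []
                while len(children) < pop_size:
                    a, b = parents[rng.integers(len(parents), size=2)]
                    cut = rng.integers(1, n)
                    child = np.concatenate([a[:cut], b[cut:]])
                    child ^= rng.random(n) < p_mut
                    children.append(child)
                pop = np.array(children)
            scores = np.array([fitness(ind, X, y) for ind in pop])
            return pop[scores.argmax()]                  # best feature mask found

    PSOFS+B follows the same wrapper pattern with particle swarm optimization in place of the genetic algorithm.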

    mldr.resampling: Efficient Reference Implementations of Multilabel Resampling Algorithms

    Resampling algorithms are a useful approach to deal with imbalanced learning in multilabel scenarios. These methods have to deal with singularities of multilabel data, such as the co-occurrence of frequent and infrequent labels in the same instance. Implementations of these methods are sometimes limited to the pseudocode provided by their authors in a paper. This Original Software Publication presents mldr.resampling, a software package that provides reference implementations for eleven multilabel resampling methods, with an emphasis on efficiency, since these algorithms are usually time-consuming.
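
    mldr.resampling itself is an R package; purely as a language-agnostic sketch of one of the simpler methods it covers (ML-ROS-style random oversampling), the snippet below clones instances carrying minority labels, using the usual IRLbl/MeanIR imbalance measures from the multilabel literature.

        import numpy as np

        def ml_ros(Y, oversample_pct=25, seed=0):
            """ML-ROS-style sketch: clone instances containing minority labels
            (IRLbl above MeanIR) until the clone budget is spent.
            Y is a binary indicator matrix of shape (n_instances, n_labels);
            returns row indices of the resampled dataset (originals + clones)."""
            rng = np.random.default_rng(seed)
            counts = Y.sum(axis=0).astype(float)
            irlbl = counts.max() / np.maximum(counts, 1.0)  # per-label imbalance
            minority = np.flatnonzero(irlbl > irlbl.mean()) # IRLbl > MeanIR
            budget = int(len(Y) * oversample_pct / 100)
            clones = []
            for label in minority:
                bag = np.flatnonzero(Y[:, label] == 1)
                share = budget // max(minority.size, 1)
                if bag.size:
                    clones.extend(rng.choice(bag, size=share, replace=True))
            return np.concatenate([np.arange(len(Y)),
                                   np.asarray(clones, dtype=int)])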