16,991 research outputs found
Coupling different methods for overcoming the class imbalance problem
Many classification problems must deal with imbalanced datasets where one class \u2013 the majority class \u2013 outnumbers the other classes. Standard classification methods do not provide accurate predictions in this setting since classification is generally biased towards the majority class. The minority classes are oftentimes the ones of interest (e.g., when they are associated with pathological conditions in patients), so methods for handling imbalanced datasets are critical.
Using several different datasets, this paper evaluates the performance of state-of-the-art classification methods for handling the imbalance problem in both binary and multi-class datasets. Different strategies are considered, including the one-class and dimension reduction approaches, as well as their fusions. Moreover, some ensembles of classifiers are tested, in addition to stand-alone classifiers, to assess the effectiveness of ensembles in the presence of imbalance. Finally, a novel ensemble of ensembles is designed specifically to tackle the problem of class imbalance: the proposed ensemble does not need to be tuned separately for each dataset and outperforms all the other tested approaches.
To validate our classifiers we resort to the KEEL-dataset repository, whose data partitions (training/test) are publicly available and have already been used in the open literature: as a consequence, it is possible to report a fair comparison among different approaches in the literature.
Our best approach (MATLAB code and datasets not easily accessible elsewhere) will be available at https://www.dei.unipd.it/node/2357
Spiking neurons with short-term synaptic plasticity form superior generative networks
Spiking networks that perform probabilistic inference have been proposed both
as models of cortical computation and as candidates for solving problems in
machine learning. However, the evidence for spike-based computation being in
any way superior to non-spiking alternatives remains scarce. We propose that
short-term plasticity can provide spiking networks with distinct computational
advantages compared to their classical counterparts. In this work, we use
networks of leaky integrate-and-fire neurons that are trained to perform both
discriminative and generative tasks in their forward and backward information
processing paths, respectively. During training, the energy landscape
associated with their dynamics becomes highly diverse, with deep attractor
basins separated by high barriers. Classical algorithms solve this problem by
employing various tempering techniques, which are both computationally
demanding and require global state updates. We demonstrate how similar results
can be achieved in spiking networks endowed with local short-term synaptic
plasticity. Additionally, we discuss how these networks can even outperform
tempering-based approaches when the training data is imbalanced. We thereby
show how biologically inspired, local, spike-triggered synaptic dynamics based
simply on a limited pool of synaptic resources can allow spiking networks to
outperform their non-spiking relatives.Comment: corrected typo in abstrac
Is "Better Data" Better than "Better Data Miners"? (On the Benefits of Tuning SMOTE for Defect Prediction)
We report and fix an important systematic error in prior studies that ranked
classifiers for software analytics. Those studies did not (a) assess
classifiers on multiple criteria and they did not (b) study how variations in
the data affect the results. Hence, this paper applies (a) multi-criteria tests
while (b) fixing the weaker regions of the training data (using SMOTUNED, which
is a self-tuning version of SMOTE). This approach leads to dramatically large
increases in software defect predictions. When applied in a 5*5
cross-validation study for 3,681 JAVA classes (containing over a million lines
of code) from open source systems, SMOTUNED increased AUC and recall by 60% and
20% respectively. These improvements are independent of the classifier used to
predict for quality. Same kind of pattern (improvement) was observed when a
comparative analysis of SMOTE and SMOTUNED was done against the most recent
class imbalance technique. In conclusion, for software analytic tasks like
defect prediction, (1) data pre-processing can be more important than
classifier choice, (2) ranking studies are incomplete without such
pre-processing, and (3) SMOTUNED is a promising candidate for pre-processing.Comment: 10 pages + 2 references. Accepted to International Conference of
Software Engineering (ICSE), 201
Is "Better Data" Better than "Better Data Miners"? (On the Benefits of Tuning SMOTE for Defect Prediction)
We report and fix an important systematic error in prior studies that ranked
classifiers for software analytics. Those studies did not (a) assess
classifiers on multiple criteria and they did not (b) study how variations in
the data affect the results. Hence, this paper applies (a) multi-criteria tests
while (b) fixing the weaker regions of the training data (using SMOTUNED, which
is a self-tuning version of SMOTE). This approach leads to dramatically large
increases in software defect predictions. When applied in a 5*5
cross-validation study for 3,681 JAVA classes (containing over a million lines
of code) from open source systems, SMOTUNED increased AUC and recall by 60% and
20% respectively. These improvements are independent of the classifier used to
predict for quality. Same kind of pattern (improvement) was observed when a
comparative analysis of SMOTE and SMOTUNED was done against the most recent
class imbalance technique. In conclusion, for software analytic tasks like
defect prediction, (1) data pre-processing can be more important than
classifier choice, (2) ranking studies are incomplete without such
pre-processing, and (3) SMOTUNED is a promising candidate for pre-processing.Comment: 10 pages + 2 references. Accepted to International Conference of
Software Engineering (ICSE), 201
Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm
We show that matrix completion with trace-norm regularization can be
significantly hurt when entries of the matrix are sampled non-uniformly. We
introduce a weighted version of the trace-norm regularizer that works well also
with non-uniform sampling. Our experimental results demonstrate that the
weighted trace-norm regularization indeed yields significant gains on the
(highly non-uniformly sampled) Netflix dataset.Comment: 9 page
A Bayesian phylogenetic hidden Markov model for B cell receptor sequence analysis.
The human body generates a diverse set of high affinity antibodies, the soluble form of B cell receptors (BCRs), that bind to and neutralize invading pathogens. The natural development of BCRs must be understood in order to design vaccines for highly mutable pathogens such as influenza and HIV. BCR diversity is induced by naturally occurring combinatorial "V(D)J" rearrangement, mutation, and selection processes. Most current methods for BCR sequence analysis focus on separately modeling the above processes. Statistical phylogenetic methods are often used to model the mutational dynamics of BCR sequence data, but these techniques do not consider all the complexities associated with B cell diversification such as the V(D)J rearrangement process. In particular, standard phylogenetic approaches assume the DNA bases of the progenitor (or "naive") sequence arise independently and according to the same distribution, ignoring the complexities of V(D)J rearrangement. In this paper, we introduce a novel approach to Bayesian phylogenetic inference for BCR sequences that is based on a phylogenetic hidden Markov model (phylo-HMM). This technique not only integrates a naive rearrangement model with a phylogenetic model for BCR sequence evolution but also naturally accounts for uncertainty in all unobserved variables, including the phylogenetic tree, via posterior distribution sampling
Customer profile classification using transactional data
Customer profiles are by definition made up of factual and transactional data. It is often the case that due to reasons such as high cost of data acquisition and/or protection, only the transactional data are available for data mining operations. Transactional data, however, tend to be highly sparse and skewed due to a large proportion of customers engaging in very few transactions. This can result in a bias in the prediction accuracy of classifiers built using them towards the larger proportion of customers with fewer transactions. This paper investigates an approach for accurately and confidently grouping and classifying customers in bins on the basis of the number of their transactions. The experiments we conducted on a highly sparse and skewed real-world transactional data show that our proposed approach can be used to identify a critical point at which customer profiles can be more confidently distinguished
Surmounting the sign problem in non-relativistic calculations: a case study with mass-imbalanced fermions
The calculation of the ground state and thermodynamics of mass-imbalanced
Fermi systems is a challenging many-body problem. Even in one spatial
dimension, analytic solutions are limited to special configurations and
numerical progress with standard Monte Carlo approaches is hindered by the sign
problem. The focus of the present work is on the further development of methods
to study imbalanced systems in a fully non-perturbative fashion. We report our
calculations of the ground-state energy of mass-imbalanced fermions using two
different approaches which are also very popular in the context of the theory
of the strong interaction (Quantum Chromodynamics, QCD): (a) the hybrid Monte
Carlo algorithm with imaginary mass imbalance, followed by an analytic
continuation to the real axis; and (b) the Complex Langevin algorithm. We cover
a range of on-site interaction strengths that includes strongly attractive as
well as strongly repulsive cases which we verify with non-perturbative
renormalization group methods and perturbation theory. Our findings indicate
that, for strong repulsive couplings, the energy starts to flatten out,
implying interesting consequences for short-range and high-frequency
correlation functions. Overall, our results clearly indicate that the Complex
Langevin approach is very versatile and works very well for imbalanced Fermi
gases with both attractive and repulsive interactions.Comment: 11 pages, 5 figure
- …