12,330 research outputs found
Pairwise meta-rules for better meta-learning-based algorithm ranking
In this paper, we present a novel meta-feature generation method in the context of meta-learning, which is based on rules that compare the performance of individual base learners in a one-against-one manner. In addition to these new meta-features, we also introduce a new meta-learner called Approximate Ranking Tree Forests (ART Forests) that performs very competitively when compared with several state-of-the-art meta-learners. Our experimental results are based on a large collection of datasets and show that the proposed new techniques can improve the overall performance of meta-learning for algorithm ranking significantly. A key point in our approach is that each performance figure of any base learner for any specific dataset is generated by optimising the parameters of the base learner separately for each dataset.
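The one-against-one comparison of base learners can be illustrated with a minimal sketch. The learner names, the scores, and the helpers `pairwise_outcomes` and `rank_by_wins` are hypothetical. The sketch only shows how per-dataset performance figures turn into pairwise "A beats B" labels and a ranking by win count; it does not reproduce the rule-learning step or the ART Forests meta-learner described in the abstract.

```python
from itertools import combinations


def pairwise_outcomes(scores):
    """Compare every pair of base learners on one dataset.

    scores maps learner name -> performance (higher is better).
    Returns {(a, b): 1 if a beats b, -1 if b beats a, 0 on a tie}.
    """
    outcomes = {}
    for a, b in combinations(sorted(scores), 2):
        diff = scores[a] - scores[b]
        outcomes[(a, b)] = (diff > 0) - (diff < 0)
    return outcomes


def rank_by_wins(scores):
    """Rank learners by number of pairwise wins (ties broken by name)."""
    wins = {name: 0 for name in scores}
    for (a, b), result in pairwise_outcomes(scores).items():
        if result > 0:
            wins[a] += 1
        elif result < 0:
            wins[b] += 1
    return sorted(wins, key=lambda name: (-wins[name], name))


# Hypothetical accuracies of three base learners on a single dataset.
example_scores = {"knn": 0.81, "svm": 0.88, "tree": 0.79}
example_pairs = pairwise_outcomes(example_scores)
example_ranking = rank_by_wins(example_scores)
```

Each pairwise label could serve as a target or a feature at the meta-level; which of the two the paper uses is not stated in the abstract.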
Ensemble Sales Forecasting Study in Semiconductor Industry
Sales forecasting plays a prominent role in business planning and business
strategy. The value and importance of advance information is a cornerstone of
planning activity, and a well-set forecast goal can guide the sales force more
efficiently. This paper considers CPU sales forecasting at Intel Corporation, a
multinational semiconductor company. Past sales, future bookings, exchange
rates, gross domestic product (GDP) forecasts, seasonality
and other indicators were innovatively incorporated into the quantitative
modeling. Benefiting from recent advances in computing power and software
development, millions of models built upon multiple regression, time series
analysis, random forests and boosted trees were executed in parallel. The models
with smaller validation errors were selected to form the ensemble model. To
better capture the distinct characteristics, forecasting models were
implemented at the lead-time and line-of-business level. The moving-window
validation process automatically selected the models that closely represent
current market conditions. The weekly-cadence forecasting scheme allowed the
model to respond effectively to market fluctuations. Generic variable
importance analysis was also developed to increase the model interpretability.
Rather than assuming a fixed distribution, this non-parametric permutation
variable importance analysis provided a general framework across methods to
evaluate the variable importance. This variable importance framework can
be further extended to classification problems by replacing the mean absolute
percentage error (MAPE) with the misclassification error. Demo code is available at:
https://github.com/qx0731/ensemble_forecast_methods
Comment: 14 pages, Industrial Conference on Data Mining 2017 (ICDM 2017)
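The permutation variable importance idea from this abstract can be sketched as follows. This is not the authors' demo code: `toy_model`, the synthetic data, and the parameter `n_repeats` are assumptions made for illustration. The sketch shuffles one feature column at a time and records how much the MAPE grows, which only requires a prediction function and therefore works across model families.

```python
import random


def mape(y_true, y_pred):
    """Mean absolute percentage error (assumes no zero targets)."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y_true, n_repeats=20, seed=0):
    """Average increase in MAPE when a single column is shuffled.

    predict is any function mapping a feature row (list) to a prediction,
    so the procedure does not depend on the model family.
    """
    rng = random.Random(seed)
    baseline = mape(y_true, [predict(r) for r in rows])
    importances = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + [v] + r[col + 1:]
                        for r, v in zip(rows, shuffled)]
            score = mape(y_true, [predict(r) for r in permuted])
            increases.append(score - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical stand-in for a fitted model: relies heavily on feature 0,
# slightly on feature 1, and ignores feature 2.
def toy_model(row):
    return 10.0 + 5.0 * row[0] + 0.5 * row[1]


data_rng = random.Random(42)
X = [[data_rng.uniform(0, 10) for _ in range(3)] for _ in range(200)]
y = [toy_model(r) for r in X]
scores = permutation_importance(toy_model, X, y)
```

For a classification problem, swapping `mape` for a misclassification rate would follow the extension the abstract mentions.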
Reconstructing dynamical networks via feature ranking
Empirical data on real complex systems are becoming increasingly available.
Parallel to this is the need for new methods of reconstructing (inferring) the
topology of networks from time-resolved observations of their node-dynamics.
The methods based on physical insights often rely on strong assumptions about
the properties and dynamics of the scrutinized network. Here, we use the
insights from machine learning to design a new method of network reconstruction
that essentially makes no such assumptions. Specifically, we interpret the
available trajectories (data) as features, and use two independent feature
ranking approaches -- Random forest and RReliefF -- to rank the importance of
each node for predicting the value of each other node, which yields the
reconstructed adjacency matrix. We show that our method is fairly robust to
coupling strength, system size, trajectory length and noise. We also find that
the reconstruction quality strongly depends on the dynamical regime.
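A minimal sketch of reconstruction by feature ranking is given below, under stated assumptions. The abstract uses Random forest and RReliefF as rankers; here the absolute lagged Pearson correlation is swapped in as a much simpler scorer so that the example stays dependency-free. The one-step lag, the three-node chain, and the function `reconstruct` are all hypothetical choices for illustration.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def reconstruct(trajectories):
    """Score how well node j at time t predicts node i at time t + 1.

    trajectories[j] is the time series of node j. Returns a matrix where
    scores[i][j] is the importance of j for i (diagonal left at 0).
    """
    n = len(trajectories)
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        target = trajectories[i][1:]
        for j in range(n):
            if i != j:
                scores[i][j] = abs(pearson(trajectories[j][:-1], target))
    return scores


# Hypothetical chain 0 -> 1 -> 2: node 0 is pure noise, each other node
# follows its parent with a one-step delay plus small noise.
rng = random.Random(1)
x = [[0.0], [0.0], [0.0]]
for _ in range(500):
    prev = [series[-1] for series in x]
    x[0].append(rng.gauss(0, 1))
    x[1].append(0.9 * prev[0] + rng.gauss(0, 0.1))
    x[2].append(0.9 * prev[1] + rng.gauss(0, 0.1))
scores = reconstruct(x)
```

Thresholding the score matrix would give an estimated adjacency matrix; a nonlinear ranker such as a random forest would be needed for dynamics that correlation cannot capture.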
Classifying pairs with trees for supervised biological network inference
Networks are ubiquitous in biology and computational approaches have been
largely investigated for their inference. In particular, supervised machine
learning methods can be used to complete a partially known network by
integrating various measurements. Two main supervised frameworks have been
proposed: the local approach, which trains a separate model for each network
node, and the global approach, which trains a single model over pairs of nodes.
Here, we systematically investigate, theoretically and empirically, the
exploitation of tree-based ensemble methods in the context of these two
approaches for biological network inference. We first formalize the problem of
network inference as classification of pairs, unifying in the process
homogeneous and bipartite graphs and discussing two main sampling schemes. We
then present the global and the local approaches, extending the latter for the
prediction of interactions between two unseen network nodes, and discuss their
specializations to tree-based ensemble methods, highlighting their
interpretability and drawing links with clustering techniques. Extensive
computational experiments carried out with these methods on various
biological networks clearly show that they are competitive
with existing methods.
Comment: 22 pages
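To make the "classification of pairs" formulation concrete, the following sketch builds a training set for the global approach from a partially known undirected network. Several choices are assumptions not taken from the abstract: pair features are formed by concatenating node features, each pair is emitted in both orders, and every pair not listed in `known_edges` is treated as a negative example. The function `build_pair_dataset` and the toy gene names are hypothetical.

```python
from itertools import combinations


def build_pair_dataset(features, known_edges):
    """Turn a partially known undirected graph into labelled pair samples.

    features: dict mapping node -> list of measurements.
    known_edges: set of frozensets {u, v} known to interact.
    Each pair appears in both orders so a learner sees symmetric inputs.
    """
    samples, labels = [], []
    for u, v in combinations(sorted(features), 2):
        label = 1 if frozenset((u, v)) in known_edges else 0
        for a, b in ((u, v), (v, u)):
            samples.append(features[a] + features[b])
            labels.append(label)
    return samples, labels


# Hypothetical measurements for three genes and one known interaction.
node_features = {"g1": [0.2, 1.0], "g2": [0.3, 0.9], "g3": [2.5, 0.1]}
edges = {frozenset(("g1", "g2"))}
pair_samples, pair_labels = build_pair_dataset(node_features, edges)
```

A single tree-based ensemble could then be trained on `pair_samples` and `pair_labels`, whereas the local approach would instead fit one model per node on that node's known neighbours.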
Separation of pulsar signals from noise with supervised machine learning algorithms
We evaluate the performance of four different machine learning (ML)
algorithms: an Artificial Neural Network Multi-Layer Perceptron (ANN MLP),
Adaboost, Gradient Boosting Classifier (GBC), and XGBoost, for the separation of
pulsars from radio frequency interference (RFI) and other sources of noise,
using a dataset obtained from the post-processing of a pulsar search pipeline.
This dataset was previously used for cross-validation of the SPINN-based
machine learning engine, used for the reprocessing of HTRU-S survey data
(arXiv:1406.3627). We have used the Synthetic Minority Over-sampling Technique
(SMOTE) to deal with high class imbalance in the dataset. We report a variety
of quality scores from all four of these algorithms on both the non-SMOTE and
SMOTE datasets. For all the above ML methods, we report high accuracy and
G-mean in both the non-SMOTE and SMOTE cases. We study the feature importances
using Adaboost, GBC, and XGBoost and also from the minimum Redundancy Maximum
Relevance approach to report algorithm-agnostic feature ranking. From these
methods, we find the signal-to-noise ratio of the folded profile to be the best
feature. We find that all the ML algorithms report false positive rates (FPRs) about an order of
magnitude lower than the corresponding FPRs obtained in arXiv:1406.3627, for
the same recall value.
Comment: 14 pages, 2 figures. Accepted for publication in Astronomy and
Computing
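The quality scores named in this abstract (accuracy, recall, FPR, G-mean) can be computed from a confusion matrix as sketched below. The labels are made-up toy data, not results from the paper, and the helper names are hypothetical. G-mean is taken here as the geometric mean of recall and specificity, which is the usual definition for imbalanced binary problems.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 means pulsar."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def quality_scores(y_true, y_pred):
    """Accuracy, recall, false positive rate and G-mean."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "recall": recall,
        "fpr": fp / (fp + tn),
        "g_mean": math.sqrt(recall * specificity),
    }


# Toy imbalanced example: 2 pulsars among 10 candidates.
truth = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
report = quality_scores(truth, predicted)
```

In this toy case the accuracy is 0.8 even though half of the pulsars are missed, which is why G-mean and FPR at fixed recall are more informative than accuracy on heavily imbalanced data.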