Ensemble Committees for Stock Return Classification and Prediction
This paper considers a portfolio trading strategy formulated by algorithms in
the field of machine learning. The profitability of the strategy is measured by
the algorithm's capability to consistently and accurately identify stock
indices with positive or negative returns, and to generate a preferred
portfolio allocation on the basis of a learned model. Stocks are characterized
by time series data sets consisting of technical variables that reflect market
conditions in a previous time interval, which are utilized to produce binary
classification decisions in subsequent intervals. The learned model is
constructed as a committee of random forest classifiers, a non-linear support
vector machine classifier, a relevance vector machine classifier, and a
constituent ensemble of k-nearest neighbors classifiers. The Global Industry
Classification Standard (GICS) is used to explore the ensemble model's efficacy
within the context of various fields of investment including Energy, Materials,
Financials, and Information Technology. Data from 2006 to 2012, inclusive, are
considered, which are chosen for providing a range of market circumstances for
evaluating the model. The model is observed to achieve an accuracy of
approximately 70% when predicting stock price returns three months in advance.
Comment: 15 pages, 4 figures, Neukom Institute Computational Undergraduate
Research prize - second place
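The abstract above does not say how the committee members' outputs are combined. As a minimal sketch, assuming a simple unweighted majority vote over binary labels (+1 for a positive return, -1 for a negative one), the combination step might look like the following. The function name and the tie-breaking rule are illustrative assumptions, not the paper's method.

```python
def committee_vote(predictions):
    """Combine per-classifier binary labels (+1 / -1) by unweighted majority.

    `predictions` holds one list of labels per classifier, all the same length.
    Assumption: ties are resolved as -1, an arbitrary conservative choice.
    """
    combined = []
    for labels in zip(*predictions):  # one tuple of votes per sample
        combined.append(1 if sum(labels) > 0 else -1)
    return combined


# Three hypothetical classifiers voting on three samples.
votes = [
    [1, -1, 1],
    [1, 1, -1],
    [-1, -1, 1],
]
print(committee_vote(votes))
```

A weighted vote, for example one that weights each member by its validation accuracy, would be a natural variation on the same loop.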
An adaptive nearest neighbor rule for classification
We introduce a variant of the k-nearest neighbor classifier in which k is
chosen adaptively for each query, rather than supplied as a parameter. The
choice of k depends on properties of each neighborhood, and therefore may
vary significantly between different points. (For example, the algorithm will
use larger k for predicting the labels of points in noisy regions.)
We provide theory and experiments that demonstrate that the algorithm
performs comparably to, and sometimes better than, k-NN with an optimal
choice of k. In particular, we derive bounds on the convergence rates of our
classifier that depend on a local quantity we call the 'advantage', which is
significantly weaker than the Lipschitz conditions used in previous convergence
rate proofs. These generalization bounds hinge on a variant of the seminal
Uniform Convergence Theorem due to Vapnik and Chervonenkis; this variant
concerns conditional probabilities and may be of independent interest.
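The idea of choosing the neighborhood size per query can be illustrated with a small sketch. The abstract does not state the stopping rule, so the one used here is an assumption: grow the neighborhood one point at a time and stop at the smallest size whose mean label (labels are +1 / -1) exceeds c / sqrt(k) in absolute value. The function name and the constant c are hypothetical.

```python
import math


def adaptive_knn_predict(train_points, train_labels, query, c=1.0):
    """Predict a +1/-1 label, growing k until the neighborhood is decisive.

    Assumed stopping rule (not taken from the abstract): stop at the smallest
    k for which |mean label of the k nearest points| > c / sqrt(k).
    Returns (label, k); label is 0 if no neighborhood size is decisive.
    """
    # Indices of training points ordered by Euclidean distance to the query.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    total = 0
    for k, i in enumerate(order, start=1):
        total += train_labels[i]
        if abs(total / k) > c / math.sqrt(k):
            return (1 if total > 0 else -1), k
    return 0, len(order)


# Clean region: the two nearest neighbors agree, so a small k suffices.
clean = adaptive_knn_predict(
    [(0.0,), (0.1,), (0.2,), (1.0,), (1.1,)], [1, 1, 1, -1, -1], (0.05,))

# Noisy region: a conflicting nearby label forces a larger neighborhood.
noisy = adaptive_knn_predict(
    [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)], [1, -1, 1, 1, 1], (0.4,))
print(clean, noisy)
```

In the second call the nearby disagreeing label keeps the mean label below the threshold until all five points are included, which mirrors the abstract's remark that larger neighborhoods are used in noisy regions.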
Prediction of water retention of soils from the humid tropics by the nonparametric k-nearest neighbor approach
Nonparametric approaches such as the k-nearest neighbor (k-NN) approach are considered attractive for pedotransfer modeling in hydrology; however, they have not been applied to predict water retention of highly weathered soils in the humid tropics. Therefore, the objectives of this study were: to apply the k-NN approach to predict soil water retention in a humid tropical region; to test its ability to predict soil water content at eight different matric potentials; to test the benefit of using more input attributes than most previous studies and their combinations; to discuss the importance of particular input attributes in the prediction of soil water retention at low, intermediate, and high matric potentials; and to compare this approach with two published tropical pedotransfer functions (PTFs) based on multiple linear regression (MLR). The overall estimation error ranges generated by the k-NN approach were statistically different from, but comparable to, those of the two examined MLR PTFs. When the best combination of input variables (sand + silt + clay + bulk density + cation exchange capacity) was used, the overall error was remarkably low: 0.0360 to 0.0390 m³ m⁻³ in the dry and very wet ranges and 0.0490 to 0.0510 m³ m⁻³ in the intermediate range (i.e., -3 to -50 kPa) of the soil water retention curve. This k-NN variant can be considered a competitive alternative to more classical, equation-based PTFs because of the accuracy of the water retention estimation and, as an added benefit, its flexibility to incorporate new data without the need to redevelop new equations. This is highly beneficial in developing countries where soil databases for agricultural planning are at present sparse, though slowly developing.
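The flexibility noted above (new samples can be added without refitting an equation) follows from how k-NN regression works: the estimate is computed directly from stored reference samples. A minimal sketch, assuming an unweighted mean over the k nearest samples and features already scaled to comparable ranges, is given below. The study's actual k-NN variant, weighting scheme, and data are not reproduced here; the numbers are purely synthetic.

```python
import math


def knn_regress(train_features, train_targets, query, k=3):
    """Estimate a continuous target as the unweighted mean of the targets of
    the k nearest reference samples (Euclidean distance, pre-scaled features).
    """
    nearest = sorted(range(len(train_features)),
                     key=lambda i: math.dist(train_features[i], query))[:k]
    return sum(train_targets[i] for i in nearest) / len(nearest)


# Synthetic, illustrative reference set: two scaled input attributes per
# sample and a made-up water content value for each.
features = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)]
targets = [0.10, 0.20, 0.40, 0.50]

estimate = knn_regress(features, targets, (0.05, 0.0), k=2)
print(round(estimate, 3))

# Incorporating a new sample only means appending it to the reference set.
features.append((0.05, 0.05))
targets.append(0.18)
```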
Most Ligand-Based Classification Benchmarks Reward Memorization Rather than Generalization
Undetected overfitting can occur when there are significant redundancies
between training and validation data. We describe AVE, a new measure of
training-validation redundancy for ligand-based classification problems that
accounts for the similarity amongst inactive molecules as well as active. We
investigated seven widely-used benchmarks for virtual screening and
classification, and show that the amount of AVE bias strongly correlates with
the performance of ligand-based predictive methods irrespective of the
predicted property, chemical fingerprint, similarity measure, or
previously-applied unbiasing techniques. Therefore, it may be that the
previously-reported performance of most ligand-based methods can be explained
by overfitting to benchmarks rather than good prospective accuracy.
A Non-Sequential Representation of Sequential Data for Churn Prediction
We investigate the length of event sequence that gives the best predictions
when using a continuous HMM approach to churn prediction from sequential
data. Motivated by observations that predictions based on only the few most recent
events seem to be the most accurate, a non-sequential dataset is constructed
from customer event histories by averaging features of the last few events. A simple
K-nearest neighbor algorithm on this dataset is found to give significantly
improved performance. It is quite intuitive to think that most people will react
only to events in the fairly recent past. Events related to telecommunications occurring
months or years ago are unlikely to have a large impact on a customer’s
future behaviour, and these results bear this out. Methods that deal with sequential
data also tend to be much more complex than those dealing with simple nontemporal
data, giving an added benefit to expressing the recent information in a
non-sequential manner.
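The construction described above, averaging the features of the last few events and then classifying with k-nearest neighbors, can be sketched as follows. This is a minimal illustration under assumptions: the function names, the window size, the value of k, and the toy data are all hypothetical, and the paper's actual features and settings are not given in the abstract.

```python
import math
from collections import Counter


def recent_average(events, n_recent=3):
    """Collapse an event history (feature vectors, oldest first) into one
    vector by averaging the last `n_recent` events, feature by feature."""
    recent = events[-n_recent:]
    return [sum(column) / len(recent) for column in zip(*recent)]


def knn_classify(train_vectors, train_labels, query, k=3):
    """Return the majority label among the k nearest training vectors."""
    nearest = sorted(range(len(train_vectors)),
                     key=lambda i: math.dist(train_vectors[i], query))[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy customer histories; older events are ignored by the averaging window.
histories = [
    [[0, 0], [1, 1], [1, 1]],
    [[5, 5], [0, 0], [2, 2]],
    [[9, 9], [8, 8]],
    [[0, 0], [10, 10], [8, 8]],
]
labels = ["stay", "stay", "churn", "churn"]

train = [recent_average(h, n_recent=2) for h in histories]
query = recent_average([[9, 9], [9, 9], [7, 7], [9, 9]], n_recent=2)
print(knn_classify(train, labels, query, k=3))
```

Because each history is reduced to a fixed-length vector, any standard non-temporal classifier could be substituted for the k-NN step.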