9,763 research outputs found

Ensemble Committees for Stock Return Classification and Prediction

Author: Brofos James
Publication date: 05/04/2014

This paper considers a portfolio trading strategy formulated by algorithms in the field of machine learning. The profitability of the strategy is measured by the algorithm's capability to consistently and accurately identify stock indices with positive or negative returns, and to generate a preferred portfolio allocation on the basis of a learned model. Stocks are characterized by time series data sets consisting of technical variables that reflect market conditions in a previous time interval, which are used to produce binary classification decisions in subsequent intervals. The learned model is constructed as a committee of random forest classifiers, a non-linear support vector machine classifier, a relevance vector machine classifier, and a constituent ensemble of k-nearest neighbors classifiers. The Global Industry Classification Standard (GICS) is used to explore the ensemble model's efficacy within the context of various fields of investment including Energy, Materials, Financials, and Information Technology. Data from 2006 to 2012, inclusive, are considered; this period was chosen because it provides a range of market circumstances for evaluating the model. The model is observed to achieve an accuracy of approximately 70% when predicting stock price returns three months in advance.

Comment: 15 pages, 4 figures, Neukom Institute Computational Undergraduate Research prize - second place

arXiv.org e-Print Archive

CiteSeerX
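
The committee idea in the abstract above (Brofos) can be illustrated with a minimal sketch. The abstract does not say how the members' decisions are combined, so the unweighted majority vote below is an assumption, and the three rule functions are hypothetical stand-ins for trained classifiers (random forest, SVM, k-NN and so on), not the paper's models.

```python
from collections import Counter
from typing import Callable, Sequence

# A committee member maps a feature vector to +1 (positive return) or -1.
Classifier = Callable[[Sequence[float]], int]


def committee_predict(members: Sequence[Classifier], x: Sequence[float]) -> int:
    """Unweighted majority vote; ties fall to -1 (an assumed tie-break)."""
    votes = Counter(member(x) for member in members)
    return 1 if votes[1] > votes[-1] else -1


# Hypothetical stand-ins for trained models, each reading one technical variable.
def momentum_rule(x: Sequence[float]) -> int:
    return 1 if x[0] > 0 else -1


def volatility_rule(x: Sequence[float]) -> int:
    return 1 if x[1] < 0.2 else -1


def volume_rule(x: Sequence[float]) -> int:
    return 1 if x[2] > 1.0 else -1


members = [momentum_rule, volatility_rule, volume_rule]
print(committee_predict(members, [0.03, 0.5, 1.4]))  # two of three members vote +1
```

A weighted vote or a learned combiner would be equally plausible readings of "committee"; the sketch only shows the general shape of the technique.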

An adaptive nearest neighbor rule for classification

Author: Balsubramani Akshay
Dasgupta Sanjoy
Freund Yoav
Moran Shay
Publication date: 01/01/2019

We introduce a variant of the k-nearest neighbor classifier in which k is chosen adaptively for each query, rather than supplied as a parameter. The choice of k depends on properties of each neighborhood, and therefore may significantly vary between different points. (For example, the algorithm will use larger k for predicting the labels of points in noisy regions.) We provide theory and experiments that demonstrate that the algorithm performs comparably to, and sometimes better than, k-NN with an optimal choice of k. In particular, we derive bounds on the convergence rates of our classifier that depend on a local quantity we call the 'advantage', which is significantly weaker than the Lipschitz conditions used in previous convergence rate proofs. These generalization bounds hinge on a variant of the seminal Uniform Convergence Theorem due to Vapnik and Chervonenkis; this variant concerns conditional probabilities and may be of independent interest.

arXiv.org e-Print Archive

eScholarship - University of California
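
The adaptive choice of k described in the abstract above (Balsubramani et al.) can be sketched roughly as follows. The stopping rule used here, growing k until the label majority exceeds a margin that shrinks like c / sqrt(k), is an assumed illustrative form and not the paper's exact threshold; the constant `c` and the function name `adaptive_knn` are likewise hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[Sequence[float], int]  # (features, label in {-1, +1})


def adaptive_knn(train: List[Point], query: Sequence[float],
                 c: float = 1.0) -> Tuple[int, int]:
    """Return (label, k_used); label 0 means no k gave a confident majority."""
    ordered = sorted(train, key=lambda p: math.dist(p[0], query))
    running_sum = 0
    for k, (_, label) in enumerate(ordered, start=1):
        running_sum += label
        # Assumed confidence margin: stop once |mean label| > c / sqrt(k).
        if abs(running_sum) / k > c / math.sqrt(k):
            return (1 if running_sum > 0 else -1), k
    return 0, len(ordered)


# Toy 1-D data: a clean cluster near 0 and a noisy cluster near 10.
train = [([0], 1), ([1], 1), ([2], 1),
         ([10], -1), ([11], 1), ([12], -1),
         ([13], -1), ([14], -1), ([15], -1)]

print(adaptive_knn(train, [1]))   # clean region: stops at a small k
print(adaptive_knn(train, [10]))  # noisy region: needs a larger k
```

On this toy data the query in the clean cluster stops at k = 2, while the query in the noisy cluster continues to k = 5, which matches the qualitative behaviour the abstract describes.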

Prediction of water retention of soils from the humid tropics by the nonparametric k-nearest neighbor approach

Author: Attila Nemes
Eric Van Ranst
Paul Mafuka
Wim M. Cornelis
Yves-Dady Botula
Publication venue: 'Soil Science Society of America'
Publication date: 01/01/2013

Nonparametric approaches such as the k-nearest neighbor (k-NN) approach are considered attractive for pedotransfer modeling in hydrology; however, they have not been applied to predict water retention of highly weathered soils in the humid tropics. Therefore, the objectives of this study were: to apply the k-NN approach to predict soil water retention in a humid tropical region; to test its ability to predict soil water content at eight different matric potentials; to test the benefit of using more input attributes than most previous studies and their combinations; to discuss the importance of particular input attributes in the prediction of soil water retention at low, intermediate, and high matric potentials; and to compare this approach with two published tropical pedotransfer functions (PTFs) based on multiple linear regression (MLR). The overall estimation error ranges generated by the k-NN approach were statistically different from, but comparable to, those of the two examined MLR PTFs. When the best combination of input variables (sand + silt + clay + bulk density + cation exchange capacity) was used, the overall error was remarkably low: 0.0360 to 0.0390 m³ m⁻³ in the dry and very wet ranges and 0.0490 to 0.0510 m³ m⁻³ in the intermediate range (i.e., -3 to -50 kPa) of the soil water retention curve. This k-NN variant can be considered a competitive alternative to more classical, equation-based PTFs because of the accuracy of the water retention estimation and, as an added benefit, its flexibility to incorporate new data without the need to redevelop equations. This is highly beneficial in developing countries where soil databases for agricultural planning are at present sparse, though slowly developing.

Crossref

Ghent University Academic Bibliography

Most Ligand-Based Classification Benchmarks Reward Memorization Rather than Generalization

Author: Heifets Abraham
Wallach Izhar
Publication venue: 'American Chemical Society (ACS)'
Publication date: 09/05/2018

Undetected overfitting can occur when there are significant redundancies between training and validation data. We describe AVE, a new measure of training-validation redundancy for ligand-based classification problems that accounts for the similarity amongst inactive molecules as well as active ones. We investigate seven widely used benchmarks for virtual screening and classification, and show that the amount of AVE bias strongly correlates with the performance of ligand-based predictive methods irrespective of the predicted property, chemical fingerprint, similarity measure, or previously applied unbiasing techniques. Therefore, it may be that the previously reported performance of most ligand-based methods can be explained by overfitting to benchmarks rather than good prospective accuracy.

arXiv.org e-Print Archive

FigShare

A Non-Sequential Representation of Sequential Data for Churn Prediction

Author: A. Lemmens
C.-P. Wei
D. Ruta
J.R. Quinlan
L. Breiman
R. Duda
T.G. Dietterich
Y. Freund
Y.-S. Chen
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009

We investigate the length of event sequence that gives the best predictions when using a continuous HMM approach to churn prediction from sequential data. Motivated by observations that predictions based on only the few most recent events seem to be the most accurate, a non-sequential dataset is constructed from customer event histories by averaging features of the last few events. A simple K-nearest neighbor algorithm on this dataset is found to give significantly improved performance. It is quite intuitive to think that most people will react only to events in the fairly recent past. Events related to telecommunications occurring months or years ago are unlikely to have a large impact on a customer’s future behaviour, and these results bear this out. Methods that deal with sequential data also tend to be much more complex than those dealing with simple nontemporal data, giving an added benefit to expressing the recent information in a non-sequential manner.

Crossref

Bournemouth University Research Online

Coventry University Pure Portal
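
The construction described in the churn-prediction abstract above (average the features of the last few events, then apply a plain K-nearest neighbor classifier) can be sketched as below. The window size `m`, the Euclidean distance, the majority vote, and the function names are illustrative assumptions; the abstract does not specify them.

```python
import math
from collections import Counter
from typing import List, Sequence


def summarize_recent(events: List[Sequence[float]], m: int = 3) -> List[float]:
    """Average each feature over the last m events of one customer history."""
    recent = events[-m:]
    return [sum(column) / len(recent) for column in zip(*recent)]


def knn_predict(train_x: List[Sequence[float]], train_y: List[int],
                query: Sequence[float], k: int = 3) -> int:
    """Majority label among the k training rows closest to the query."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], query))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


# Hypothetical event history: each event is (call minutes, complaints).
history = [[1, 10], [2, 20], [3, 30], [4, 40]]
print(summarize_recent(history, m=2))  # averages only the two latest events

# Hypothetical summarized customers with churn labels (1 = churned).
train_x = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10]]
train_y = [0, 0, 0, 1, 1]
print(knn_predict(train_x, train_y, [10, 11], k=3))
```

Collapsing each history to one fixed-length vector is what allows an ordinary non-sequential classifier to be used at all; the cost is that the order of events inside the window is discarded.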