
    Ensemble Committees for Stock Return Classification and Prediction

    This paper considers a portfolio trading strategy formulated by algorithms in the field of machine learning. The profitability of the strategy is measured by the algorithm's capability to consistently and accurately identify stock indices with positive or negative returns, and to generate a preferred portfolio allocation on the basis of a learned model. Stocks are characterized by time series data sets consisting of technical variables that reflect market conditions in a previous time interval, which are utilized to produce binary classification decisions in subsequent intervals. The learned model is constructed as a committee of random forest classifiers, a non-linear support vector machine classifier, a relevance vector machine classifier, and a constituent ensemble of k-nearest neighbors classifiers. The Global Industry Classification Standard (GICS) is used to explore the ensemble model's efficacy within various fields of investment, including Energy, Materials, Financials, and Information Technology. Data from 2006 to 2012, inclusive, are considered; this period is chosen because it provides a range of market circumstances for evaluating the model. The model is observed to achieve an accuracy of approximately 70% when predicting stock price returns three months in advance. Comment: 15 pages, 4 figures, Neukom Institute Computational Undergraduate Research prize - second place
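
    The abstract does not include code; the following is a minimal sketch, assuming scikit-learn-style estimators, of a soft-voting committee built from the classifier families named above. The feature matrix and labels are hypothetical placeholders for the technical indicators and binary return labels, and the relevance vector machine is omitted because scikit-learn has no implementation of it.

```python
# Minimal sketch (assumption): a soft-voting committee of the classifier
# families named in the abstract, built with scikit-learn on synthetic data.
# The relevance vector machine has no scikit-learn implementation and is
# therefore left out of this illustration.
import numpy as np
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))            # hypothetical technical indicators
y = (rng.random(500) > 0.5).astype(int)   # 1 = positive return, 0 = negative

committee = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("svm", make_pipeline(StandardScaler(),
                              SVC(kernel="rbf", probability=True))),
        # stand-in for the constituent ensemble of k-NN classifiers:
        # bagged k-NN over bootstrap subsamples
        ("knn", BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
                                  n_estimators=25, random_state=0)),
    ],
    voting="soft",
)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
committee.fit(X_tr, y_tr)
print("held-out accuracy:", committee.score(X_te, y_te))
```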

    An adaptive nearest neighbor rule for classification

    We introduce a variant of the k-nearest neighbor classifier in which k is chosen adaptively for each query, rather than supplied as a parameter. The choice of k depends on properties of each neighborhood, and therefore may vary significantly between different points. (For example, the algorithm will use a larger k for predicting the labels of points in noisy regions.) We provide theory and experiments demonstrating that the algorithm performs comparably to, and sometimes better than, k-NN with an optimal choice of k. In particular, we derive bounds on the convergence rates of our classifier that depend on a local quantity we call the 'advantage', which is significantly weaker than the Lipschitz conditions used in previous convergence rate proofs. These generalization bounds hinge on a variant of the seminal Uniform Convergence Theorem due to Vapnik and Chervonenkis; this variant concerns conditional probabilities and may be of independent interest.
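
    The adaptive rule in the paper is defined through the 'advantage' quantity; as a rough illustration only, and not the authors' rule, one simple way to pick k per query is to grow the neighborhood until one label leads by more than chance fluctuation would explain:

```python
# Rough illustration (assumption): choose k per query by growing the
# neighborhood until one label leads by a confidence margin. This is a
# simplified heuristic, not the adaptive rule analyzed in the paper.
import numpy as np

def adaptive_knn_predict(X_train, y_train, x_query, k_max=50, margin=2.0):
    """Predict a binary label for x_query with a query-dependent k."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    order = np.argsort(dists)
    k_cap = min(k_max, len(order))
    for k in range(1, k_cap + 1):
        votes = y_train[order[:k]]
        pos = int(votes.sum())
        neg = k - pos
        # stop once the majority leads by more than ~margin * sqrt(k),
        # i.e. by more than chance fluctuation would explain
        if abs(pos - neg) >= margin * np.sqrt(k):
            return int(pos > neg)
    # fall back to a plain majority vote over k_cap neighbors
    votes = y_train[order[:k_cap]]
    return int(votes.sum() * 2 > len(votes))

rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)
print(adaptive_knn_predict(X_train, y_train, np.array([1.0, 0.0])))
```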

    Prediction of water retention of soils from the humid tropics by the nonparametric k-nearest neighbor approach

    Nonparametric approaches such as the k-nearest neighbor (k-NN) approach are considered attractive for pedotransfer modeling in hydrology; however, they have not been applied to predict water retention of highly weathered soils in the humid tropics. Therefore, the objectives of this study were: to apply the k-NN approach to predict soil water retention in a humid tropical region; to test its ability to predict soil water content at eight different matric potentials; to test the benefit of using more input attributes, and combinations of attributes, than most previous studies; to discuss the importance of particular input attributes in the prediction of soil water retention at low, intermediate, and high matric potentials; and to compare this approach with two published tropical pedotransfer functions (PTFs) based on multiple linear regression (MLR). The overall estimation error ranges generated by the k-NN approach were statistically different from, but comparable to, those of the two examined MLR PTFs. When the best combination of input variables (sand + silt + clay + bulk density + cation exchange capacity) was used, the overall error was remarkably low: 0.0360 to 0.0390 m³ m⁻³ in the dry and very wet ranges and 0.0490 to 0.0510 m³ m⁻³ in the intermediate range (i.e., -3 to -50 kPa) of the soil water retention curve. This k-NN variant can be considered a competitive alternative to more classical, equation-based PTFs because of the accuracy of its water retention estimates and, as an added benefit, its flexibility to incorporate new data without the need to develop new equations. This is highly beneficial in developing countries, where soil databases for agricultural planning are at present sparse, though slowly developing.
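
    As a minimal sketch only, a distance-weighted k-NN pedotransfer estimator using the input combination reported as best above (sand, silt, clay, bulk density, CEC) might be set up as follows; the data, target, and parameter choices are hypothetical.

```python
# Minimal sketch (assumption): a distance-weighted k-NN pedotransfer
# estimator using the input combination reported as best in the abstract
# (sand + silt + clay + bulk density + CEC). Data and the target (volumetric
# water content, m3 m-3, at one matric potential) are hypothetical.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# columns: sand %, silt %, clay %, bulk density (g cm-3), CEC (cmol kg-1)
X = rng.uniform([5, 5, 5, 1.0, 2], [80, 60, 70, 1.7, 30], size=(300, 5))
theta = 0.10 + 0.004 * X[:, 2] + 0.02 * rng.normal(size=300)  # toy target

ptf = make_pipeline(
    StandardScaler(),                          # put inputs on a common scale
    KNeighborsRegressor(n_neighbors=5, weights="distance"),
)
ptf.fit(X, theta)
print(ptf.predict([[40.0, 30.0, 30.0, 1.35, 15.0]]))  # estimated water content
```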

    Most Ligand-Based Classification Benchmarks Reward Memorization Rather than Generalization

    Undetected overfitting can occur when there are significant redundancies between training and validation data. We describe AVE, a new measure of training-validation redundancy for ligand-based classification problems that accounts for the similarity amongst inactive molecules as well as active ones. We investigate seven widely used benchmarks for virtual screening and classification, and show that the amount of AVE bias strongly correlates with the performance of ligand-based predictive methods, irrespective of the predicted property, chemical fingerprint, similarity measure, or previously applied unbiasing techniques. Therefore, it may be that the previously reported performance of most ligand-based methods can be explained by overfitting to benchmarks rather than by good prospective accuracy.
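
    The abstract does not give the AVE formula. As an illustration only, a crude nearest-neighbor redundancy score in a similar spirit, contrasting how close validation actives and inactives sit to training actives versus training inactives, could look like this; it is not the paper's exact AVE definition, and the fingerprints are hypothetical.

```python
# Illustration only (assumption): a crude redundancy score in the spirit of
# AVE, contrasting how close validation actives/inactives sit to training
# actives versus training inactives. This is NOT the paper's exact formula.
import numpy as np

def mean_nn_sim(validation, training):
    """Mean, over validation vectors, of the highest cosine similarity to any
    training vector (a stand-in for fingerprint similarity)."""
    v = validation / np.linalg.norm(validation, axis=1, keepdims=True)
    t = training / np.linalg.norm(training, axis=1, keepdims=True)
    return float((v @ t.T).max(axis=1).mean())

def redundancy_score(va, vi, ta, ti):
    """Positive values: validation actives resemble training actives and
    validation inactives resemble training inactives, so a nearest-neighbor
    lookup (memorization) already does well on the benchmark."""
    return ((mean_nn_sim(va, ta) - mean_nn_sim(va, ti))
            + (mean_nn_sim(vi, ti) - mean_nn_sim(vi, ta)))

rng = np.random.default_rng(3)
va, vi = rng.random((50, 128)), rng.random((50, 128))    # hypothetical fingerprints
ta, ti = rng.random((200, 128)), rng.random((200, 128))
print(redundancy_score(va, vi, ta, ti))
```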

    A Non-Sequential Representation of Sequential Data for Churn Prediction

    We investigate the event-sequence length that gives the best predictions when using a continuous HMM approach to churn prediction from sequential data. Motivated by observations that predictions based on only the few most recent events seem to be the most accurate, a non-sequential dataset is constructed from customer event histories by averaging features of the last few events. A simple k-nearest neighbor algorithm on this dataset is found to give significantly improved performance. It is quite intuitive to think that most people will react only to events in the fairly recent past. Events related to telecommunications occurring months or years ago are unlikely to have a large impact on a customer’s future behaviour, and these results bear this out. Methods that deal with sequential data also tend to be much more complex than those dealing with simple non-temporal data, giving an added benefit to expressing the recent information in a non-sequential manner.
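
    A minimal sketch of the representation described above: average the feature vectors of each customer's last few events and classify churn with k-nearest neighbors. The data, feature dimensionality, and the choice of the three most recent events are hypothetical.

```python
# Minimal sketch (assumption): collapse each customer's event history into a
# non-sequential vector by averaging the features of the last few events,
# then classify churn with k-nearest neighbors. Data, feature count, and
# n_last=3 are hypothetical placeholders.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def flatten_history(event_history, n_last=3):
    """event_history: array of shape (n_events, n_features), most recent last."""
    recent = np.asarray(event_history)[-n_last:]
    return recent.mean(axis=0)

rng = np.random.default_rng(4)
histories = [rng.normal(size=(rng.integers(3, 20), 6)) for _ in range(300)]
churned = rng.integers(0, 2, size=300)             # hypothetical churn labels

X = np.vstack([flatten_history(h) for h in histories])
clf = KNeighborsClassifier(n_neighbors=7).fit(X, churned)
print(clf.predict(X[:5]))                          # predicted churn for 5 customers
```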