
    Experimental Study on 164 Algorithms Available in Software Tools for Solving Standard Non-Linear Regression Problems

    In the specialized literature, researchers can find a large number of proposals for solving regression problems, originating from different research areas. However, researchers tend to use only the proposals from the area in which they are experts. This paper analyses the performance of a large number of the regression algorithms available in some of the best-known and most widely used software tools, in order to help non-expert users from other areas properly solve their own regression problems, and to help specialized researchers develop well-founded future proposals by properly comparing and identifying the algorithms on which significant further development should focus. In total, we analyzed 164 algorithms from 14 different families available in 6 software tools (Neural Networks, Support Vector Machines, Regression Trees, Rule-Based Methods, Stacking, Random Forests, Model Trees, Generalized Linear Models, Nearest Neighbor methods, Partial Least Squares and Principal Component Regression, Multivariate Adaptive Regression Splines, Bagging, Boosting, and other methods) over 52 datasets. A new measure is also proposed to express the goodness of each algorithm with respect to the others. Finally, a statistical analysis based on non-parametric tests was carried out over all the algorithms and over the best 30 algorithms, both with and without bagging. Results show that algorithms from the Random Forest, Model Tree and Support Vector Machine families obtain the best positions in the rankings produced by the statistical tests when bagging is not considered. In addition, the use of bagging techniques significantly improves the performance of the algorithms without an excessive increase in computational time. This work was supported in part by the University of Córdoba under the project PPG2019-UCOSOCIAL-03, and in part by the Spanish Ministry of Science, Innovation and Universities under Grant TIN2015-68454-R and Grant TIN2017-89517-P.
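    For readers unfamiliar with this kind of comparison, the sketch below shows how per-dataset rankings and a Friedman non-parametric test can be computed with SciPy. The algorithm names and RMSE values are invented placeholders, and this is not the paper's own goodness measure, only the general ranking-plus-test pattern:

```python
# Illustrative sketch: rank regression algorithms by RMSE on each dataset
# and apply the Friedman non-parametric test to check whether differences
# in average rank are significant. All numbers below are placeholders.
import numpy as np
from scipy import stats

# rows = datasets, columns = algorithms (hypothetical RMSE values)
rmse = np.array([
    [0.82, 0.79, 0.91],
    [1.10, 1.05, 1.20],
    [0.45, 0.48, 0.50],
    [2.30, 2.10, 2.60],
])
algorithms = ["RandomForest", "ModelTree", "SVR"]

# Rank algorithms per dataset (1 = lowest error) and average the ranks.
ranks = np.apply_along_axis(stats.rankdata, 1, rmse)
for name, r in zip(algorithms, ranks.mean(axis=0)):
    print(f"{name}: average rank {r:.2f}")

# Friedman test over the per-dataset error columns.
statistic, p_value = stats.friedmanchisquare(*rmse.T)
print(f"Friedman chi-square = {statistic:.3f}, p = {p_value:.3f}")
```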

    A Classification Framework for Imbalanced Data

    As information technology advances, the demand for reliable and highly accurate predictive models is increasing across many domains. Traditional classification algorithms can be limited in their performance on highly imbalanced data sets. In this dissertation, we study two common problems that arise when training data is imbalanced and propose effective algorithms to solve them. First, we investigate the problem of building a multi-class classification model from an imbalanced class distribution. We develop an effective technique to improve the performance of the model by formulating the problem as a multi-class SVM with the objective of maximizing the G-mean value. A ramp loss function is used to simplify and solve the problem. Experimental results on multiple real-world datasets confirm that our new method can effectively solve the multi-class classification problem when the datasets are highly imbalanced. Second, we explore the problem of learning a global classification model from distributed data sources under privacy constraints. In this problem, not only do the data sources have different class distributions, but combining the data into one central repository is also prohibited. We propose a privacy-preserving framework for building a global SVM from distributed data sources. Our new framework avoids constructing a global kernel matrix by mapping non-linear inputs to a linear feature space and then solving a distributed linear SVM over these virtual points. Our method can solve both the imbalance and privacy problems while achieving the same level of accuracy as a regular SVM. Finally, we extend our framework to handle high-dimensional data by utilizing Generalized Multiple Kernel Learning to select a sparse combination of features and kernels. This new model produces a smaller set of features but yields much higher accuracy.
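    The minimal sketch below illustrates two of the general ideas mentioned in the abstract, using off-the-shelf scikit-learn components rather than the dissertation's own algorithms: an explicit (random Fourier) feature map so that a linear SVM can be trained on mapped points, and the G-mean as an imbalance-aware evaluation metric. The dataset and all parameters are placeholders:

```python
# Sketch: (1) map non-linear inputs to an explicit feature space so a linear
# SVM can be trained on the mapped points, and (2) evaluate with the G-mean,
# the geometric mean of per-class recalls, which is robust to class imbalance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.svm import LinearSVC
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.8, 0.15, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Explicit (random Fourier) approximation of an RBF kernel feature map.
mapper = RBFSampler(gamma=0.2, n_components=300, random_state=0)
clf = LinearSVC(C=1.0, max_iter=5000)
clf.fit(mapper.fit_transform(X_tr), y_tr)
pred = clf.predict(mapper.transform(X_te))

# G-mean: geometric mean of the recall obtained on each class.
per_class_recall = recall_score(y_te, pred, average=None)
g_mean = np.prod(per_class_recall) ** (1.0 / len(per_class_recall))
print(f"per-class recall: {per_class_recall}, G-mean: {g_mean:.3f}")
```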

    Artificial intelligence in wind speed forecasting: a review

    Wind energy production has grown rapidly in recent years, reaching an annual increase of 17% in 2021. Wind speed plays a crucial role in the stability required for power grid operation; however, wind intermittency makes accurate forecasting a complicated process. New technologies have enabled the development of hybrid models and techniques that improve wind speed forecasting accuracy. In addition, statistical and artificial intelligence methods, especially artificial neural networks, have been applied to enhance the results. Nevertheless, identifying the main factors that influence the forecasting process and providing a basis for estimation with artificial neural network models remains a concern. This paper reviews and classifies the forecasting models used in recent years according to the input type, the pre-processing and post-processing techniques, the artificial neural network model, the prediction horizon, the number of steps ahead, and the evaluation metric. The results of the review indicate that artificial neural network (ANN)-based models can provide accurate wind forecasts, as well as essential information about the potential for wind power use at a specific location, by estimating future wind speed values.
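    As a toy illustration of the kind of ANN forecasting setup such reviews classify (lagged inputs, a prediction horizon, a number of steps ahead, and an error metric), here is a minimal sketch on a synthetic wind-speed series; the model and hyperparameters are arbitrary, not drawn from any reviewed study:

```python
# Toy sketch: a feed-forward ANN trained on lagged wind-speed values to
# predict h steps ahead (direct strategy). Series and settings are made up.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
speed = 8 + 2 * np.sin(np.arange(2000) * 2 * np.pi / 24) + rng.normal(0, 0.5, 2000)

lags, horizon = 12, 3      # use the last 12 hours to predict 3 hours ahead
X = np.array([speed[t - lags:t] for t in range(lags, len(speed) - horizon)])
y = np.array([speed[t + horizon - 1] for t in range(lags, len(speed) - horizon)])

split = int(0.8 * len(X))
model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
model.fit(X[:split], y[:split])

pred = model.predict(X[split:])
rmse = np.sqrt(np.mean((pred - y[split:]) ** 2))
print(f"RMSE on held-out portion: {rmse:.3f} m/s")
```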

    Learning to predict under a budget

    Prediction-time budgets in machine learning applications can arise from monetary or computational costs associated with acquiring information; they also arise from latency and power-consumption costs when evaluating increasingly complex models. The goal in such budgeted prediction problems is to learn decision systems that maintain high prediction accuracy while meeting average cost constraints at prediction time. Such decision systems can potentially adapt to the input examples, predicting most of them at low cost while allocating more budget to the few "hard" examples. In this thesis, I present several learning methods that better trade off cost and error during prediction. The conceptual contribution of this thesis is to develop a new bottom-up paradigm instead of the traditional top-down approach. A top-down approach builds up the model by selectively adding the most cost-effective features to improve accuracy. In contrast, a bottom-up approach first learns a highly accurate model and then prunes or adaptively approximates it to trade off cost and error. Training top-down models in the presence of feature acquisition costs leads to fundamental combinatorial issues in multi-stage search over all feature subsets; we show that bottom-up methods bypass many of these issues. To develop this theme, we first propose two top-down methods and then two bottom-up methods. The first top-down method uses margin information from the training data in the partial feature neighborhood of a test point either to select the next best feature in a greedy fashion or to stop and make a prediction. The second top-down method is a variant of the random forest (RF) algorithm: we grow decision trees with low acquisition cost and high strength based on greedy minimax cost-weighted impurity splits, and we establish near-optimal acquisition cost guarantees for the algorithm. The first bottom-up method is based on pruning RFs to optimize expected feature cost and accuracy. Given an RF as input, we pose pruning as a novel 0-1 integer program and show that it can be solved exactly via LP relaxation; we further develop a fast primal-dual algorithm that scales to large datasets. The second bottom-up method is adaptive approximation, which significantly generalizes the RF pruning to accommodate more models and other types of costs besides feature acquisition cost. We first train a high-accuracy, high-cost model. We then jointly learn a low-cost gating function together with a low-cost prediction model to adaptively approximate the high-cost model; the gating function identifies the regions of the input space where the low-cost model suffices for making highly accurate predictions. We demonstrate the empirical performance of these methods and compare them with the state of the art. Finally, we study adaptive approximation in the online setting to obtain regret guarantees and discuss future work.
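    A crude sketch of the adaptive-approximation idea follows. The gate here is just a confidence threshold on the cheap model, not the jointly learned gating function described in the thesis, and the data and models are placeholders:

```python
# Crude illustration: a cheap model handles the "easy" inputs and a gate
# decides when to fall back to an expensive model. The gate is a simple
# confidence threshold, not a jointly learned gating function.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=30, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

cheap = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)                     # low cost
expensive = RandomForestClassifier(n_estimators=300, random_state=1).fit(X_tr, y_tr)

conf = cheap.predict_proba(X_te).max(axis=1)
use_cheap = conf >= 0.9                     # gate: route confident examples to the cheap model

pred = cheap.predict(X_te)
hard = ~use_cheap
if hard.any():                              # only pay for the expensive model when needed
    pred[hard] = expensive.predict(X_te[hard])

acc = (pred == y_te).mean()
print(f"accuracy {acc:.3f}, expensive model used on {hard.mean():.0%} of examples")
```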

    Ensemble Support Vector Machine Models of Radiation-Induced Lung Injury Risk

    Patients undergoing radiation therapy can develop a potentially fatal inflammation of the lungs known as radiation pneumonitis (RP). In practice, modeling RP factors is difficult because existing data are under-sampled and imbalanced. Support vector machines (SVMs), a class of statistical learning methods that implicitly map data into a higher-dimensional space, are one machine learning approach that has recently been applied to the RP problem with encouraging results. In this thesis, we present and evaluate an ensemble SVM method for modeling radiation pneumonitis. The method internalizes kernel/model parameter selection into model building and enables feature scaling via Olivier Chapelle's method. We show that the ensemble method provides statistically significant increases in the cross-folded area under the receiver operating characteristic curve while maintaining model parsimony. Finally, we extend our model with John C. Platt's method to support non-binary outcomes in order to augment clinical relevancy.
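    The sketch below conveys the general shape of such an ensemble, not the thesis' exact method: each member SVM grid-searches its own kernel parameters on a bootstrap sample, and Platt-style probability calibration (as implemented in scikit-learn) lets the members' scores be averaged. The data is synthetic and imbalanced:

```python
# Rough sketch of an ensemble of SVM classifiers with per-member parameter
# selection and Platt-style calibrated probabilities. Data is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rng = np.random.default_rng(0)
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}
members = []
for _ in range(5):                                  # 5 bootstrap replicates
    idx = rng.choice(len(X_tr), size=len(X_tr), replace=True)
    search = GridSearchCV(SVC(kernel="rbf", probability=True), param_grid, cv=3)
    members.append(search.fit(X_tr[idx], y_tr[idx]).best_estimator_)

# Average the calibrated positive-class probabilities across members.
prob = np.mean([m.predict_proba(X_te)[:, 1] for m in members], axis=0)
print(f"ensemble AUC: {roc_auc_score(y_te, prob):.3f}")
```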

    Advances and applications in Ensemble Learning


    Do we need hundreds of classifiers to solve real world classification problems?

    We evaluate 179 classifiers arising from 17 families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbors, partial least squares and principal component regression, logistic and multinomial regression, multivariate adaptive regression splines, and other methods), implemented in Weka, R (with and without the caret package), C and Matlab, including all the relevant classifiers available today. We use 121 data sets, which represent the whole UCI database (excluding the large-scale problems) plus other real-world problems of our own, in order to reach significant conclusions about classifier behavior that do not depend on the data set collection. The classifiers most likely to be the best are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy, exceeding 90% in 84.3% of the data sets. However, the difference is not statistically significant with respect to the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy. A few models are clearly better than the rest: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0, and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package). The random forest is clearly the best family of classifiers (3 out of the 5 best classifiers are RF), followed by SVM (4 classifiers in the top 10), neural networks and boosting ensembles (5 and 3 members in the top 20, respectively). We would like to acknowledge support from the Spanish Ministry of Science and Innovation (MICINN), which supported this work under projects TIN2011-22935 and TIN2012-32262S.
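    As a small illustration of the "percentage of the maximum accuracy" style of summary used in the abstract, the sketch below scores each classifier relative to the best accuracy obtained on each dataset; the accuracy table and classifier labels are invented for illustration only:

```python
# Sketch of the "% of maximum accuracy" summary: the best accuracy attained
# on each dataset defines 100%, and every classifier is scored against it.
import numpy as np

classifiers = ["RF (caret)", "SVM Gaussian (LibSVM)", "C5.0"]
# rows = datasets, columns = classifiers (hypothetical accuracies)
acc = np.array([
    [0.94, 0.93, 0.90],
    [0.81, 0.83, 0.78],
    [0.99, 0.97, 0.98],
])

pct_of_max = acc / acc.max(axis=1, keepdims=True) * 100
for name, col in zip(classifiers, pct_of_max.T):
    print(f"{name}: mean {col.mean():.1f}% of max, "
          f">=90% on {(col >= 90).mean():.0%} of datasets")
```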

    Representation Of Uncertainty And Corridor Dp For Hydropower Optimization

    This thesis focuses on optimization techniques for operating multi-reservoir hydropower systems, with particular attention to the representation and impact of uncertainty. The thesis reports on three research investigations: 1) examination of the impact of uncertainty representations, 2) efficient solution methods for multi-reservoir stochastic dynamic programming (SDP) models, and 3) diagnostic analyses for hydropower system operation. The first investigation explores the value of sophistication in the representation of forecast and inflow uncertainty in stochastic hydropower optimization models using a sampling SDP (SSDP) framework. SSDP models with uncertainty representations ranging in sophistication from simple deterministic forecasts to complex dynamic stochastic models are employed to optimize a single-reservoir system [similar to Faber and Stedinger, 2001]. The effect of uncertainty representation on simulated system performance is examined for varying storage and powerhouse capacities, and for random or mean energy prices. In many cases very simple uncertainty models perform as well as more complex ones, but not always. The second investigation develops a new and efficient algorithm for solving multi-reservoir SDP models: Corridor SDP. Rather than employing a uniform grid across the entire state space, Corridor SDP efficiently concentrates points in the region the system is likely to visit, as determined by historical operations or simulation. Radial basis functions (RBFs) are used for interpolation, and a greedy algorithm places points where they are needed to achieve a good approximation. In a four-reservoir test case, Corridor DP achieves the same accuracy as spline-DP and linear-DP with approximately 1/10 and 1/1100 the number of discrete points, respectively. When local curvature is more pronounced (due to minimum-flow constraints), Corridor DP achieves the same accuracy as spline-DP and linear-DP with approximately 1/30 and 1/215 the number of points, respectively. The third investigation explores three diagnostic approaches for analyzing hydropower system operation. First, several simple diagnostic statistics describe reservoir volume and powerhouse capacity in units of time, allowing scale-invariant comparison and classification of different reservoir systems and their operation. Second, a regression analysis using optimal storage/release sequences identifies the most useful hydrologic state variables. Finally, spectral density estimation identifies critical time scales of operation for several single-reservoir systems considering mean and random energy prices. Deregulation of energy markets has made optimization of hydropower operations an active concern. Another development is the publication of Extended Streamflow Forecasts (ESP) by the National Weather Service (NWS) and others to describe flow forecasts and their precision; the multivariate sampling SDP models employed here are appropriately structured to incorporate such information in operational hydropower decisions. This research contributes to our ability to structure and build effective hydropower optimization models.
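    To make the interpolation ingredient concrete, the sketch below uses SciPy's RBFInterpolator to approximate a value function over scattered two-reservoir storage states concentrated along a "corridor". This is only the radial-basis-function interpolation step, not Corridor DP itself, and the value function is an arbitrary stand-in rather than a hydropower model:

```python
# Tiny sketch: RBF interpolation of a value function over scattered sample
# points in a two-reservoir storage space, instead of a full uniform grid.
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(0)
# Scattered storage states concentrated along a "corridor" (a diagonal band).
s = rng.uniform(0, 1, size=(200, 2))
s = s[np.abs(s[:, 0] - s[:, 1]) < 0.3]

def value(storage):                      # placeholder future-value function
    return np.sqrt(storage[:, 0]) + 0.8 * np.sqrt(storage[:, 1])

interp = RBFInterpolator(s, value(s), kernel="thin_plate_spline")

query = np.array([[0.4, 0.5], [0.7, 0.6]])
print("interpolated values:", interp(query))
print("true values:        ", value(query))
```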