Holistic Robust Data-Driven Decisions
The design of data-driven formulations for machine learning and
decision-making with good out-of-sample performance is a key challenge. The
observation that good in-sample performance does not guarantee good
out-of-sample performance is generally known as overfitting. In practice,
overfitting typically cannot be attributed to a single cause; it arises from
several factors at once. We consider here three overfitting
sources: (i) statistical error as a result of working with finite sample data,
(ii) data noise which occurs when the data points are measured only with finite
precision, and finally (iii) data misspecification in which a small fraction of
all data may be wholly corrupted. We argue that although existing data-driven
formulations may be robust against one of these three sources in isolation they
do not provide holistic protection against all overfitting sources
simultaneously. We design a novel data-driven formulation which does guarantee
such holistic protection and is furthermore computationally viable. Our
distributionally robust optimization formulation can be interpreted as a novel
combination of a Kullback-Leibler and Levy-Prokhorov robust optimization
formulation, which is of independent interest. Nevertheless, we show that, in
the context of classification and regression problems, several popular
regularized and robust formulations reduce to particular cases of our proposed
formulation. Finally, we apply the proposed holistic robust (HR) formulation to
a portfolio selection problem with real stock data, and analyze its risk/return
trade-off against several benchmark formulations. Our experiments show that our
novel ambiguity set provides a significantly better risk/return trade-off.
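The Kullback-Leibler component of such ambiguity sets admits a convenient one-dimensional dual: the worst-case expected loss over a KL ball around the empirical distribution equals an infimum over a single scalar. As a hedged illustration (this is only the KL piece, not the paper's combined KL/Levy-Prokhorov HR formulation; all names and numbers are made up):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def kl_dro_worst_case(losses, radius):
    """Worst-case expected loss over all distributions within a KL
    divergence ball of `radius` around the empirical distribution,
    via the standard dual:  inf_{a>0}  a*log E[exp(loss/a)] + a*radius."""
    losses = np.asarray(losses, dtype=float)

    def dual(alpha):
        # log-sum-exp shift for numerical stability
        m = losses.max()
        lse = m / alpha + np.log(np.mean(np.exp((losses - m) / alpha)))
        return alpha * lse + alpha * radius

    res = minimize_scalar(dual, bounds=(1e-6, 1e6), method="bounded")
    return res.fun

np.random.seed(0)
losses = np.random.exponential(size=200)   # toy per-sample losses
plain = losses.mean()
robust = kl_dro_worst_case(losses, radius=0.1)
```

The robust value always upper-bounds the plain empirical mean (the ball contains the empirical distribution) and never exceeds the maximum loss.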
Randomized Sketches of Convex Programs with Sharp Guarantees
Random projection (RP) is a classical technique for reducing storage and
computational costs. We analyze RP-based approximations of convex programs, in
which the original optimization problem is approximated by the solution of a
lower-dimensional problem. Such dimensionality reduction is essential in
computation-limited settings, since the complexity of general convex
programming can be quite high (e.g., cubic for quadratic programs, and
substantially higher for semidefinite programs). In addition to computational
savings, random projection is also useful for reducing memory usage, and has
useful properties for privacy-sensitive optimization. We prove that the
approximation ratio of this procedure can be bounded in terms of the geometry
of the constraint set. For a broad class of random projections, including those
based on various sub-Gaussian distributions as well as randomized Hadamard and
Fourier transforms, the data matrix defining the cost function can be projected
down to the statistical dimension of the tangent cone of the constraints at the
original solution, which is often substantially smaller than the original
dimension. We illustrate consequences of our theory for various cases,
including unconstrained and ℓ1-constrained least squares and support vector
machines, as well as low-rank matrix estimation, and we discuss implications
for privacy-sensitive optimization and some connections with de-noising and
compressed sensing.
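To make the sketch-and-solve idea concrete for the unconstrained least-squares case, here is a minimal numpy sketch using a Gaussian random projection (the theory also covers sub-Gaussian and randomized Hadamard/Fourier sketches); all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 5000, 20, 200            # tall data matrix; sketch size m << n

A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.01 * rng.standard_normal(n)

# Exact least-squares solution on the full n-dimensional problem
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

# Gaussian random projection: solve the m-dimensional sketched problem
S = rng.standard_normal((m, n)) / np.sqrt(m)
x_sk, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)

rel_err = np.linalg.norm(x_sk - x_ls) / np.linalg.norm(x_ls)
```

Because the statistical dimension of an unconstrained problem is just d, a sketch of size m on the order of d already yields a solution close to the exact one, at the cost of solving a 200-row rather than 5000-row problem.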
A Survey of Contextual Optimization Methods for Decision Making under Uncertainty
Recently there has been a surge of interest in the operations research (OR)
and machine learning (ML) communities in combining prediction algorithms and
optimization techniques to solve decision-making problems in the face of
uncertainty. This gave rise to the field of contextual optimization, under
which data-driven procedures are developed to prescribe actions to the
decision-maker that make the best use of the most recently updated information.
A large variety of models and methods have been presented in both OR and ML
literature under a variety of names, including data-driven optimization,
prescriptive optimization, predictive stochastic programming, policy
optimization, (smart) predict/estimate-then-optimize, decision-focused
learning, (task-based) end-to-end learning/forecasting/optimization, etc.
Focusing on single- and two-stage stochastic programming problems, this review
article identifies three main frameworks for learning policies from data and
discusses their strengths and limitations. We present the existing models and
methods under a uniform notation and terminology and classify them according to
the three main frameworks identified. Our objective with this survey is to both
strengthen the general understanding of this active field of research and
stimulate further theoretical and algorithmic advancements in integrating ML
and stochastic programming.
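To ground the simplest of the frameworks such surveys cover, here is a hedged sketch of a predict-then-optimize pipeline on a toy contextual newsvendor problem (all data, prices, and function names are illustrative, not from the survey):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Contextual newsvendor: demand depends on an observed covariate x.
n = 500
X = rng.uniform(0, 1, size=(n, 1))
demand = 50 + 30 * X[:, 0] + rng.normal(0, 5, size=n)

price, cost = 10.0, 6.0                  # sell price and unit cost
q_critical = (price - cost) / price      # newsvendor critical fractile

# Step 1 (predict): estimate the conditional demand distribution.
model = LinearRegression().fit(X, demand)
resid_quantile = np.quantile(demand - model.predict(X), q_critical)

# Step 2 (optimize): under an additive-noise model, the optimal order
# quantity is the predicted mean shifted by the residual quantile.
def order_quantity(x_new):
    return model.predict(np.asarray(x_new).reshape(1, -1))[0] + resid_quantile

q = order_quantity([0.5])
```

Decision-focused (end-to-end) methods differ precisely in replacing Step 1's generic regression loss with a loss measuring the downstream decision cost.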
Indonesia Composite Index Prediction using Fuzzy Support Vector Regression with Fisher Score Feature Selection
A precise forecast of stock price indexes may return a profit for investors. According to CNN Money, in a single month as many as 93% of global investors lost money trading stocks. One important stock price index is the composite index, and exact predictions of it can be critical for creating powerful market trading strategies. This paper focuses on Fuzzy Support Vector Regression (FSVR), a modified supervised learning method for regression problems. Because many complex factors influence the movement of stock prices, the predictions of standard Support Vector Regression (SVR) cannot always meet the required precision. This study therefore applies an FSVR stock prediction model, in which a fuzzy membership with a mapping function is employed to capture the price fluctuations of the stock precisely. To ensure that the chosen features benefit the model, the Fisher Score is used to find high-quality features that can enhance accuracy. The Indonesia Composite Index, also known as the Jakarta Composite Index (JKSE), serves as input data. The results show that the Fisher Score can be applied for feature selection in Indonesia Composite Index prediction, with the best model using eleven out of fifteen features and 80% of the data for training, achieving an error of 0.043529.
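A minimal sketch of the overall pipeline on synthetic data, with sklearn's univariate `f_regression` scoring standing in for the Fisher Score and a plain SVR standing in for the fuzzy-weighted variant (the fuzzy membership weighting itself is omitted; every dataset detail here is invented):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(2)

# Synthetic stand-in for index data: 15 candidate features, 3 informative.
n = 300
X = rng.standard_normal((n, 15))
y = X[:, 0] + 0.5 * X[:, 1] - 0.3 * X[:, 2] + 0.1 * rng.standard_normal(n)

# 80% of the data for training, as in the abstract's best configuration.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8,
                                          random_state=0)

# Keep the k=11 best-scoring features, then fit an RBF-kernel SVR.
pipe = make_pipeline(
    StandardScaler(),
    SelectKBest(f_regression, k=11),
    SVR(kernel="rbf", C=10.0),
)
pipe.fit(X_tr, y_tr)
score = pipe.score(X_te, y_te)   # R^2 on held-out data
```

In the fuzzy variant, each training point would additionally receive a membership weight (SVR's `fit` accepts `sample_weight`) reflecting how reliable that observation is.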
Responsible AI (RAI) Games and Ensembles
Several recent works have studied the societal effects of AI; these include
issues such as fairness, robustness, and safety. In many of these objectives, a
learner seeks to minimize its worst-case loss over a set of predefined
distributions (known as uncertainty sets), with usual examples being perturbed
versions of the empirical distribution. In other words, the aforementioned problems
can be written as min-max problems over these uncertainty sets. In this work,
we provide a general framework for studying these problems, which we refer to
as Responsible AI (RAI) games. We provide two classes of algorithms for solving
these games: (a) game-play based algorithms, and (b) greedy stagewise
estimation algorithms. The former class is motivated by online learning and
game theory, whereas the latter class is motivated by the classical statistical
literature on boosting and regression. We empirically demonstrate the
applicability and competitive performance of our techniques for solving several
RAI problems, particularly around subpopulation shift.
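A minimal game-play sketch of this min-max structure, not the paper's algorithm: an adversary runs multiplicative weights over per-group losses while the learner takes gradient steps on the adversarially weighted objective, on a toy worst-group regression with three subpopulations (all data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(3)

# Three subpopulations whose labels follow different linear rules.
groups = []
for slope in (1.0, 1.5, 3.0):
    Xg = rng.standard_normal((200, 1))
    yg = slope * Xg[:, 0] + 0.1 * rng.standard_normal(200)
    groups.append((Xg, yg))

w = np.zeros(1)                    # shared linear model
p = np.ones(3) / 3                 # adversary's mixture over groups
w_avg = np.zeros(1)
eta_w, eta_p, T = 0.1, 0.2, 1000

for _ in range(T):
    losses = np.array([np.mean((Xg @ w - yg) ** 2) for Xg, yg in groups])
    p *= np.exp(eta_p * losses)    # adversary: multiplicative weights
    p /= p.sum()
    # Learner: gradient step on the adversarially weighted squared loss.
    grad = sum(pi * 2 * Xg.T @ (Xg @ w - yg) / len(yg)
               for pi, (Xg, yg) in zip(p, groups))
    w -= eta_w * grad
    w_avg += w / T                 # averaged play approximates equilibrium

worst_group_mse = max(np.mean((Xg @ w_avg - yg) ** 2) for Xg, yg in groups)
```

The averaged iterate lands near the minimax slope between the two extreme groups, giving a lower worst-group loss than fitting the pooled data would.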
Auto-Sklearn 2.0: Hands-free AutoML via Meta-Learning
Automated Machine Learning (AutoML) supports practitioners and researchers with the tedious task of designing machine learning pipelines and has recently achieved substantial success. In this paper, we introduce new AutoML approaches motivated by our winning submission to the second ChaLearn AutoML challenge. We develop PoSH Auto-sklearn, which enables AutoML systems to work well on large datasets under rigid time limits by using a new, simple and meta-feature-free meta-learning technique and by employing a successful bandit strategy for budget allocation. However, PoSH Auto-sklearn introduces even more ways of running AutoML and might make it harder for users to set it up correctly. Therefore, we also go one step further and study the design space of AutoML itself, proposing a solution towards truly hands-free AutoML. Together, these changes give rise to the next generation of our AutoML system, Auto-sklearn 2.0. We verify the improvements from these additions in an extensive experimental study on 39 AutoML benchmark datasets. We conclude the paper by comparing to other popular AutoML frameworks and Auto-sklearn 1.0, reducing the relative error by up to a factor of 4.5, and yielding a performance in 10 minutes that is substantially better than what Auto-sklearn 1.0 achieves within an hour.
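A hedged sketch of the bandit-style budget allocation idea: plain successive halving, used here only as a stand-in for the strategy employed in Auto-sklearn 2.0 (the configs, loss model, and noise scale are all invented):

```python
import math
import random

def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Evaluate all candidates on a small budget, keep the best 1/eta
    fraction, multiply the budget by eta, and repeat until one remains."""
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        scores = [(evaluate(c, budget), c) for c in survivors]
        scores.sort(key=lambda t: t[0])          # lower loss is better
        survivors = [c for _, c in scores[: max(1, len(scores) // eta)]]
        budget *= eta
    return survivors[0]

# Toy setting: each "config" has a true loss that is observed with noise
# shrinking as the training budget grows.
random.seed(0)
true_loss = {f"cfg{i}": i / 10 for i in range(8)}

def evaluate(cfg, budget):
    return true_loss[cfg] + random.gauss(0, 0.3 / math.sqrt(budget))

best = successive_halving(list(true_loss), evaluate, min_budget=1)
```

Weak candidates are eliminated cheaply on small budgets, so the large budgets are spent only on the few configurations that survive.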
Intelligent Data Mining using Kernel Functions and Information Criteria
Radial Basis Function (RBF) Neural Networks and Support Vector Machines (SVM) are two powerful kernel-related intelligent data mining techniques. The current major problems with these methods are over-fitting and the existence of too many free parameters. The way the parameters are selected directly affects the generalization performance (test error) of these models. In this research area, current practice in choosing model parameters is an art rather than a science: often, some parameters are predetermined or randomly chosen, while others are selected through repeated experiments that are time-consuming, costly, and computationally very intensive. In this dissertation, we provide a two-stage analytical hybrid-training algorithm that builds a bridge among regression trees, the EM algorithm, and Radial Basis Function Neural Networks. The Information Complexity (ICOMP) criterion of Bozdogan, along with other information-based criteria, is introduced and applied to control model complexity and to decide the optimal number of kernel functions. In the first stage of the hybrid, a regression tree and the EM algorithm are used to determine the kernel function parameters. In the second stage, the weights (coefficients) are calculated and the information criteria are scored. Kernel Principal Component Analysis (KPCA) using the EM algorithm for feature selection and data preprocessing is also introduced and studied. Adaptive Support Vector Machines (ASVM) and some efficient algorithms are given to deal with massive data sets in support vector classification. The versatility and efficiency of the newly proposed approaches are studied on real data sets and via Monte Carlo simulation experiments.
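A minimal sketch of the two-stage idea on toy data: stage one places the kernel centers (k-means here, standing in for the dissertation's regression-tree/EM stage), stage two solves for the output weights by least squares, and BIC stands in for ICOMP when scoring the number of kernel functions:

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_rbf(X, y, n_centers, width=1.0):
    """Two-stage RBF network: (1) place centers by k-means,
    (2) solve for output weights by ordinary least squares,
    and score the fit with a BIC-style information criterion."""
    km = KMeans(n_clusters=n_centers, n_init=10, random_state=0).fit(X)
    C = km.cluster_centers_
    # Design matrix of Gaussian kernel activations plus a bias column.
    D2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    Phi = np.hstack([np.exp(-D2 / (2 * width ** 2)),
                     np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    rss = np.sum((Phi @ w - y) ** 2)
    n, k = len(y), Phi.shape[1]
    bic = n * np.log(rss / n) + k * np.log(n)
    return w, C, bic

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(200)

# Score several model sizes; the criterion trades fit against complexity.
bics = {m: fit_rbf(X, y, m)[2] for m in (2, 4, 8, 16)}
best_m = min(bics, key=bics.get)
```

The information criterion penalizes both the underfit two-kernel model (large residual term) and the oversized sixteen-kernel model (large complexity term).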