
    Data complexity in machine learning

    We investigate the role of data complexity in the context of binary classification problems. The universal data complexity is defined for a data set as the Kolmogorov complexity of the mapping enforced by the data set. It is closely related to several existing principles used in machine learning, such as Occam's razor, the minimum description length, and the Bayesian approach. The data complexity can also be defined based on a learning model, which is more realistic for applications. We demonstrate the application of data complexity in two learning problems, data decomposition and data pruning. In data decomposition, we illustrate that a data set is best approximated by its principal subsets, which are Pareto optimal with respect to the complexity and the set size. In data pruning, we show that outliers usually have high complexity contributions, and we propose methods for estimating the complexity contribution. Since in practice we have to approximate the ideal data complexity measures, we also discuss the impact of such approximations.
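
    The Kolmogorov-based measure in the abstract is uncomputable, so any practical use must go through a proxy. As a hedged illustration (not the paper's algorithm), the sketch below scores each point's complexity contribution by the surprisal a leave-one-out logistic regression assigns to it; the data, model choice, and cutoff are all assumptions made for the example.

```python
# Hypothetical sketch, not the paper's algorithm: approximate a point's
# "complexity contribution" with a leave-one-out surprisal proxy, since the
# Kolmogorov-based measure itself is uncomputable. Data and model are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

def complexity_contribution(X, y):
    """Description-length proxy: the surprisal (-log probability) a model
    trained on the remaining points assigns to each held-out point."""
    n = len(y)
    scores = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        model = LogisticRegression().fit(X[mask], y[mask])
        p = model.predict_proba(X[i:i + 1])[0, y[i]]
        scores[i] = -np.log(max(p, 1e-12))  # large score = hard to describe
    return scores

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
y[:3] = 1 - y[:3]                        # plant three label-noise "outliers"
scores = complexity_contribution(X, y)
print("suspected outliers:", sorted(np.argsort(scores)[-3:]))  # prune these
```

    Consistent with the abstract's data-pruning claim, the planted label-noise points receive the highest contribution scores in this toy setup and are the natural candidates for removal.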

    Approximating data with weighted smoothing splines

    No abstract available. Keywords: Approximation, Residuals, Smoothing Splines, Thin Plate Splines
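
    Since no abstract is available, the following is only a generic illustration of the weighted-smoothing-spline idea named in the title and keywords, not this paper's method: SciPy's UnivariateSpline accepts per-point weights, so observations with small weights influence the fit less. The data and weights are invented for the example.

```python
# Generic weighted smoothing spline illustration (not the paper's method):
# down-weighted points are allowed larger residuals in the fitted curve.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(1)
x = np.linspace(0, 2 * np.pi, 60)
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

w = np.ones_like(x)
w[20:30] = 0.1                    # down-weight a noisy stretch of the data
spline = UnivariateSpline(x, y, w=w, s=len(x))  # s: smoothness trade-off

residuals = y - spline(x)
print("weighted RSS:", float(np.sum(w * residuals ** 2)))
```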

    A Dynamic Semiparametric Factor Model for Implied Volatility String Dynamics

    A primary goal in modelling the implied volatility surface (IVS) for pricing and hedging is to reduce complexity. For this purpose one fits the IVS each day and applies a principal component analysis using a functional norm. This approach, however, neglects the degenerated string structure of the implied volatility data and may result in a modelling bias. We propose a dynamic semiparametric factor model (DSFM), which approximates the IVS in a finite-dimensional function space. The key feature is that we fit only in the local neighborhood of the design points. Our approach is a combination of methods from functional principal component analysis and backfitting techniques for additive models. The model is found to perform approximately 10% better than a sticky moneyness model. Finally, based on the DSFM, we devise a generalized vega-hedging strategy for exotic options that are priced in the local volatility framework. The generalized vega-hedging extends the usual approaches employed in the local volatility framework.
    Keywords: smile, local volatility, generalized additive model, backfitting, functional principal component analysis
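
    As a hedged sketch of one ingredient the abstract names, the toy code below runs a plain backfitting loop for a two-term additive model; the actual DSFM combines such backfitting with functional principal components and local fitting at the design points, none of which is reproduced here. The data, smoother, and smoothing parameters are illustrative assumptions.

```python
# Toy backfitting loop for an additive model y = f1(x1) + f2(x2) + noise.
# Illustrative only: the DSFM itself is a semiparametric factor model, not
# this plain backfitting scheme.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(2)
n = 300
X = rng.uniform(-2, 2, size=(n, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=n)

f = np.zeros((n, 2))                       # current additive components
alpha = y.mean()
for _ in range(20):                        # backfitting sweeps
    for j in range(2):
        partial = y - alpha - f[:, 1 - j]  # residual w.r.t. the other term
        order = np.argsort(X[:, j])
        s = UnivariateSpline(X[order, j], partial[order], s=n * 0.01)
        f[:, j] = s(X[:, j])
        f[:, j] -= f[:, j].mean()          # center for identifiability

fit = alpha + f.sum(axis=1)
print("residual std:", float(np.std(y - fit)))
```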

    HoloDetect: Few-Shot Learning for Error Detection

    We introduce a few-shot learning framework for error detection. We show that data augmentation (a form of weak supervision) is key to training high-quality, ML-based error detection models that require minimal human involvement. Our framework consists of two parts: (1) an expressive model to learn rich representations that capture the inherent syntactic and semantic heterogeneity of errors; and (2) a data augmentation model that, given a small seed of clean records, uses dataset-specific transformations to automatically generate additional training data. Our key insight is to learn data augmentation policies from the noisy input dataset in a weakly supervised manner. We show that our framework detects errors with an average precision of ~94% and an average recall of ~93% across a diverse array of datasets that exhibit different types and amounts of errors. We compare our approach to a comprehensive collection of error detection methods, ranging from traditional rule-based methods to ensemble-based and active learning approaches. We show that data augmentation yields an average improvement of 20 F1 points while it requires access to 3x fewer labeled examples compared to other ML approaches.
    Comment: 18 pages
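
    The augmentation idea can be made concrete with a toy sketch (an illustration, not HoloDetect's implementation): corrupt a small seed of clean records with simple transformations, label the results as errors, and train a character-level classifier on the synthetic pairs. The transformations, features, and records below are all invented stand-ins for the learned, dataset-specific policies the paper describes.

```python
# Toy stand-in for augmentation-based error-detector training: synthesize
# error examples from a clean seed, then fit a character-level classifier.
# Transformations and records are invented; HoloDetect learns its policies.
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

random.seed(3)
clean_seed = ["chicago", "boston", "seattle", "denver", "austin", "portland"]

def corrupt(value):
    """Apply one randomly chosen perturbation: truncation, substitution, casing."""
    ops = [
        lambda v: v[:-1],                               # drop the last char
        lambda v: v.replace(random.choice(v), "#", 1),  # char substitution
        lambda v: v.upper(),                            # casing error
    ]
    return random.choice(ops)(value)

# Augment: pair clean seeds (label 0) with synthetic errors (label 1).
examples = [(v, 0) for v in clean_seed]
examples += [(corrupt(v), 1) for v in clean_seed for _ in range(5)]
texts, labels = zip(*examples)

detector = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),  # char n-gram features
    LogisticRegression(max_iter=1000),
)
detector.fit(texts, labels)
print(detector.predict(["chicago", "chicag#"]))  # 0 = clean, 1 = error
```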