
    Data complexity in machine learning

    We investigate the role of data complexity in the context of binary classification problems. The universal data complexity is defined for a data set as the Kolmogorov complexity of the mapping enforced by the data set. It is closely related to several existing principles used in machine learning, such as Occam's razor, the minimum description length, and the Bayesian approach. The data complexity can also be defined based on a learning model, which is more realistic for applications. We demonstrate the application of data complexity in two learning problems, data decomposition and data pruning. In data decomposition, we illustrate that a data set is best approximated by its principal subsets, which are Pareto optimal with respect to the complexity and the set size. In data pruning, we show that outliers usually have high complexity contributions, and we propose methods for estimating the complexity contribution. Since in practice we have to approximate the ideal data complexity measures, we also discuss the impact of such approximations.
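
    The Kolmogorov-based measure in the abstract is uncomputable, so any practical use must go through a proxy. As a hedged illustration (not the paper's algorithm), the sketch below scores each point's complexity contribution by the surprisal a leave-one-out logistic regression assigns to it; the data, model choice, and cutoff are all assumptions made for the example.

```python
# Hypothetical sketch, not the paper's algorithm: approximate a point's
# "complexity contribution" with a leave-one-out surprisal proxy, since the
# Kolmogorov-based measure itself is uncomputable. Data and model are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

def complexity_contribution(X, y):
    """Description-length proxy: the surprisal (-log probability) a model
    trained on the remaining points assigns to each held-out point."""
    n = len(y)
    scores = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        model = LogisticRegression().fit(X[mask], y[mask])
        p = model.predict_proba(X[i:i + 1])[0, y[i]]
        scores[i] = -np.log(max(p, 1e-12))  # large score = hard to describe
    return scores

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
y[:3] = 1 - y[:3]                        # plant three label-noise "outliers"
scores = complexity_contribution(X, y)
print("suspected outliers:", sorted(np.argsort(scores)[-3:]))  # prune these
```

    Consistent with the abstract's data-pruning claim, the planted label-noise points receive the highest contribution scores in this toy setup and are the natural candidates for removal.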

    Approximating data with weighted smoothing splines

    No abstract available. Keywords: Approximation, Residuals, Smoothing Splines, Thin Plate Splines
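
    Since no abstract is available, the following is only a generic illustration of the weighted-smoothing-spline idea named in the title and keywords, not this paper's method: SciPy's UnivariateSpline accepts per-point weights, so observations with small weights influence the fit less. The data and weights are invented for the example.

```python
# Generic weighted smoothing spline illustration (not the paper's method):
# down-weighted points are allowed larger residuals in the fitted curve.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(1)
x = np.linspace(0, 2 * np.pi, 60)
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

w = np.ones_like(x)
w[20:30] = 0.1                    # down-weight a noisy stretch of the data
spline = UnivariateSpline(x, y, w=w, s=len(x))  # s: smoothness trade-off

residuals = y - spline(x)
print("weighted RSS:", float(np.sum(w * residuals ** 2)))
```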

    A Dynamic Semiparametric Factor Model for Implied Volatility String Dynamics

    A primary goal in modelling the implied volatility surface (IVS) for pricing and hedging is to reduce complexity. For this purpose one fits the IVS each day and applies a principal component analysis using a functional norm. This approach, however, neglects the degenerated string structure of the implied volatility data and may result in a modelling bias. We propose a dynamic semiparametric factor model (DSFM), which approximates the IVS in a finite-dimensional function space. The key feature is that we fit only in the local neighborhood of the design points. Our approach is a combination of methods from functional principal component analysis and backfitting techniques for additive models. The model is found to perform approximately 10% better than a sticky moneyness model. Finally, based on the DSFM, we devise a generalized vega-hedging strategy for exotic options that are priced in the local volatility framework. The generalized vega-hedging extends the usual approaches employed in the local volatility framework.
    Keywords: smile, local volatility, generalized additive model, backfitting, functional principal component analysis
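
    As a hedged sketch of one ingredient the abstract names, the toy code below runs a plain backfitting loop for a two-term additive model; the actual DSFM combines such backfitting with functional principal components and local fitting at the design points, none of which is reproduced here. The data, smoother, and smoothing parameters are illustrative assumptions.

```python
# Toy backfitting loop for an additive model y = f1(x1) + f2(x2) + noise.
# Illustrative only: the DSFM itself is a semiparametric factor model, not
# this plain backfitting scheme.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(2)
n = 300
X = rng.uniform(-2, 2, size=(n, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=n)

f = np.zeros((n, 2))                       # current additive components
alpha = y.mean()
for _ in range(20):                        # backfitting sweeps
    for j in range(2):
        partial = y - alpha - f[:, 1 - j]  # residual w.r.t. the other term
        order = np.argsort(X[:, j])
        s = UnivariateSpline(X[order, j], partial[order], s=n * 0.01)
        f[:, j] = s(X[:, j])
        f[:, j] -= f[:, j].mean()          # center for identifiability

fit = alpha + f.sum(axis=1)
print("residual std:", float(np.std(y - fit)))
```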

    HoloDetect: Few-Shot Learning for Error Detection

    We introduce a few-shot learning framework for error detection. We show that data augmentation (a form of weak supervision) is key to training high-quality, ML-based error detection models that require minimal human involvement. Our framework consists of two parts: (1) an expressive model to learn rich representations that capture the inherent syntactic and semantic heterogeneity of errors; and (2) a data augmentation model that, given a small seed of clean records, uses dataset-specific transformations to automatically generate additional training data. Our key insight is to learn data augmentation policies from the noisy input dataset in a weakly supervised manner. We show that our framework detects errors with an average precision of ~94% and an average recall of ~93% across a diverse array of datasets that exhibit different types and amounts of errors. We compare our approach to a comprehensive collection of error detection methods, ranging from traditional rule-based methods to ensemble-based and active learning approaches. We show that data augmentation yields an average improvement of 20 F1 points while it requires access to 3x fewer labeled examples compared to other ML approaches.
    Comment: 18 pages
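
    The augmentation idea can be made concrete with a toy sketch (an illustration, not HoloDetect's implementation): corrupt a small seed of clean records with simple transformations, label the results as errors, and train a character-level classifier on the synthetic pairs. The transformations, features, and records below are all invented stand-ins for the learned, dataset-specific policies the paper describes.

```python
# Toy stand-in for augmentation-based error-detector training: synthesize
# error examples from a clean seed, then fit a character-level classifier.
# Transformations and records are invented; HoloDetect learns its policies.
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

random.seed(3)
clean_seed = ["chicago", "boston", "seattle", "denver", "austin", "portland"]

def corrupt(value):
    """Apply one randomly chosen perturbation: truncation, substitution, casing."""
    ops = [
        lambda v: v[:-1],                               # drop the last char
        lambda v: v.replace(random.choice(v), "#", 1),  # char substitution
        lambda v: v.upper(),                            # casing error
    ]
    return random.choice(ops)(value)

# Augment: pair clean seeds (label 0) with synthetic errors (label 1).
examples = [(v, 0) for v in clean_seed]
examples += [(corrupt(v), 1) for v in clean_seed for _ in range(5)]
texts, labels = zip(*examples)

detector = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),  # char n-gram features
    LogisticRegression(max_iter=1000),
)
detector.fit(texts, labels)
print(detector.predict(["chicago", "chicag#"]))  # 0 = clean, 1 = error
```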