    Tree Boosting Data Competitions with XGBoost

    The objective of this Master's thesis is to explain how to approach a supervised learning prediction problem and to illustrate the approach with a statistical/machine learning algorithm, Tree Boosting. Tree methodology is reviewed to trace its evolution from Classification and Regression Trees through Bagging and Random Forest to present-day Tree Boosting. The methodology is explained following the XGBoost implementation, which has achieved state-of-the-art results in several data competitions. A framework for applied predictive modelling is presented along with its core concepts: objective function, regularization term, overfitting, hyperparameter tuning, k-fold cross-validation and feature engineering. All these concepts are illustrated with a real dataset of video game churn used in a datathon competition.
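
    As an illustration of two of the concepts listed above (boosting and k-fold cross-validation for hyperparameter tuning), here is a minimal sketch. It is not XGBoost: it is plain gradient boosting of one-split regression stumps under squared loss, without the regularization term and second-order information that XGBoost adds. The synthetic data, the function names and the grid of round counts are all assumptions made for the example.

```python
import math
import random


def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    # Skip the largest x: splitting there would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds, learning_rate):
    """Fit each new stump to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + learning_rate * (lv if x <= t else rv)
                 for x, p in zip(xs, preds)]
    return base, stumps


def predict(model, x, learning_rate):
    base, stumps = model
    return base + learning_rate * sum(lv if x <= t else rv for t, lv, rv in stumps)


def kfold_mse(xs, ys, k, n_rounds, learning_rate):
    """Mean squared error over k held-out folds (strided fold assignment)."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        model = boost(train_x, train_y, n_rounds, learning_rate)
        total += sum((ys[i] - predict(model, xs[i], learning_rate)) ** 2
                     for i in held_out)
    return total / n


# Synthetic 1-D regression data (an assumption for the example).
random.seed(0)
xs = [random.random() for _ in range(60)]
ys = [math.sin(6 * x) + random.gauss(0, 0.1) for x in xs]

# Tune the number of boosting rounds by 5-fold cross-validation.
grid = [0, 5, 20, 50]
scores = {n: kfold_mse(xs, ys, 5, n, 0.3) for n in grid}
best_rounds = min(scores, key=scores.get)
print(scores, best_rounds)
```

    With zero rounds the model predicts the training mean, which gives a baseline against which the boosted models can be compared.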

    A Linear Estimator for Factor-Augmented Fixed-T Panels With Endogenous Regressors

    A novel method-of-moments approach is proposed for the estimation of factor-augmented panel data models with endogenous regressors when T is fixed. The underlying methodology involves approximating the unobserved common factors using observed factor proxies. The resulting moment conditions are linear in the parameters. The proposed approach addresses several issues that arise with existing nonlinear estimators available for fixed-T panels, such as local minima-related problems, sensitivity to particular normalization schemes, and a potential lack of global identification. We apply our approach to a large panel of households and estimate the price elasticity of urban water demand. A simulation study confirms that our approach performs well in finite samples.
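
    To see why linearity of the moment conditions matters, recall that a linear GMM estimator has a closed form, so no numerical search (and hence no local minimum) is involved. The following is the generic textbook expression, not the authors' specific estimator; the symbols (stacked regressors X, instruments Z, outcomes y, weighting matrix W) are assumptions introduced for illustration and are not the paper's notation.

```latex
\mathbb{E}\!\left[ Z_i^{\top} \left( y_i - X_i \beta \right) \right] = 0
\quad\Longrightarrow\quad
\hat{\beta} = \left( X^{\top} Z \, W \, Z^{\top} X \right)^{-1} X^{\top} Z \, W \, Z^{\top} y
```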

    Exploring Interpretable LSTM Neural Networks over Multi-Variable Data

    For recurrent neural networks trained on time series with target and exogenous variables, it is desirable to provide interpretable insights into the data in addition to accurate predictions. In this paper, we explore the structure of LSTM recurrent neural networks to learn variable-wise hidden states, with the aim of capturing different dynamics in multi-variable time series and distinguishing the contribution of each variable to the prediction. With these variable-wise hidden states, a mixture attention mechanism is proposed to model the generative process of the target. We then develop associated training methods to jointly learn the network parameters and the variable and temporal importance with respect to the prediction of the target variable. Extensive experiments on real datasets demonstrate enhanced prediction performance from capturing the dynamics of different variables. We also evaluate the interpretation results both qualitatively and quantitatively. The approach shows promise as an end-to-end framework for both forecasting and knowledge extraction over multi-variable data. Comment: Accepted to International Conference on Machine Learning (ICML), 201

    A Markov blanket-based method for detecting causal SNPs in GWAS

    Background: Detecting epistatic interactions associated with complex and common diseases can help to improve the prevention, diagnosis and treatment of these diseases. With the development of genome-wide association studies (GWAS), designing powerful and robust computational methods for identifying epistatic interactions associated with common diseases has become a great challenge for the bioinformatics community, because the study of epistatic interactions often involves very large genotype datasets and a huge number of combinations of possible genetic factors. Most existing computational detection methods are based on the classification capacity of SNP sets, which may fail to identify SNP sets that are strongly associated with the diseases and may introduce many false positives. In addition, most methods are not suitable for genome-wide scale studies because of their computational complexity. Results: We propose a new Markov blanket-based method, DASSO-MB (Detection of ASSOciations using Markov Blanket), to detect epistatic interactions in case-control GWAS. The Markov blanket of a target variable T completely shields T from all other variables. Thus, we can guarantee that the SNP set detected by DASSO-MB has a strong association with diseases and contains the fewest false positives. Furthermore, DASSO-MB uses a heuristic search strategy that calculates the association between variables, avoiding the time-consuming training process of other machine-learning methods. We apply our algorithm to simulated datasets and a real case-control dataset. We compare DASSO-MB to other commonly used methods and show that our method significantly outperforms them and is capable of finding SNPs strongly associated with diseases. Conclusions: Our study shows that DASSO-MB can identify a minimal set of causal SNPs associated with diseases, which contains fewer false positives than other existing methods. Given the huge size of the genomic datasets produced by GWAS, this is critical for saving the potential costs of biological experiments and for providing an efficient guideline for pathogenesis research.
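
    The abstract describes the search as "calculating the association between variables" but does not say which association measure is used. As an assumption for illustration only, the sketch below uses the G² likelihood-ratio statistic on a genotype-by-disease-status contingency table, a standard test of independence for categorical data. The function name and the example counts are hypothetical.

```python
import math


def g2_statistic(table):
    """G^2 likelihood-ratio statistic for independence in a contingency table.

    table[i][j] is the number of samples with genotype i and disease status j.
    G^2 = 2 * sum(observed * ln(observed / expected)), where the expected count
    under independence is row_total * column_total / n.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed == 0:
                continue  # the term 0 * ln(0) is taken to be 0
            expected = row_totals[i] * col_totals[j] / n
            g2 += 2.0 * observed * math.log(observed / expected)
    return g2


# Rows: three genotypes of one SNP; columns: controls, cases (made-up counts).
independent = [[10, 20], [20, 40], [5, 10]]   # same case/control ratio in every row
associated = [[30, 5], [20, 20], [5, 30]]     # ratio changes with genotype

print(g2_statistic(independent), g2_statistic(associated))
```

    A larger statistic indicates stronger evidence against independence; a Markov blanket search would apply such a test conditionally on the SNPs already selected, which this sketch does not do.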

    ESTIMATION OF IMPLIED VOLATILITY SURFACE AND ITS DYNAMICS: EVIDENCE FROM S&P 500 INDEX OPTION IN POST-FINANCIAL CRISIS MARKET

    There is now an extensive literature on modeling the implied volatility surface (IVS) as a function of options' strike prices and time to maturity. The polynomial parameterization is one of these approaches, and it provides a simple and efficient way for practitioners to estimate implied volatility. This project tests the predictive capability of this methodology in the post-financial crisis market. Using data on European puts and calls on the S&P 500 index for the period from July 1st, 2012 to June 30th, 2015, we estimate a vector autoregressive model to capture the dynamics of the IVS. Our results show that this methodology has better predictive capability for the IVS of index options in the post-financial crisis market than for the IVS of equity options in the pre-financial crisis period.
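
    The abstract does not state the exact specification, so the following is only a sketch of the two-step idea under assumptions: a quadratic polynomial in a strike-based moneyness measure M and time to maturity τ is fitted cross-sectionally each day t, and the resulting coefficient vector is then modeled with a vector autoregression. The particular polynomial terms, the symbols and the lag order of one are all assumptions.

```latex
\sigma_{t}(M, \tau) = \beta_{0,t} + \beta_{1,t} M + \beta_{2,t} M^{2}
                    + \beta_{3,t} \tau + \beta_{4,t} M \tau + \varepsilon_{t},
\qquad
\boldsymbol{\beta}_{t} = \mathbf{c} + \Phi \, \boldsymbol{\beta}_{t-1} + \mathbf{u}_{t}
```

    Forecasting the surface then amounts to forecasting the coefficient vector and evaluating the polynomial at the strikes and maturities of interest.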