Tree Boosting Data Competitions with XGBoost

Author: Bort Escabias Carlos
Publication venue: 'Edicions de la Universitat de Barcelona'
Publication date: 01/01/2017

The objective of this Master's thesis is to explain how to approach a supervised learning prediction problem and to illustrate the approach with a statistical/machine learning algorithm, Tree Boosting. Tree methodology is reviewed to trace its evolution from Classification and Regression Trees through Bagging and Random Forest to present-day Tree Boosting. The methodology is explained following the XGBoost implementation, which achieved state-of-the-art results in several data competitions. A framework for applied predictive modelling is presented together with its core concepts: objective function, regularization term, overfitting, hyperparameter tuning, k-fold cross validation and feature engineering. All these concepts are illustrated with a real dataset of videogame churn that was used in a datathon competition.
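
As a rough illustration of the k-fold cross validation mentioned in the abstract, the following is a minimal sketch in plain Python. It is not taken from the thesis: the function names (`k_fold_indices`, `cross_validate`) and the toy mean-predictor model are hypothetical stand-ins, and a real workflow would plug in a boosted-tree model instead.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, valid_idx) index lists for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread samples as evenly as possible: the first n_samples % k folds get one extra.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size


def cross_validate(xs, ys, fit, predict, loss, k=5, seed=0):
    """Return the validation loss averaged over k folds."""
    scores = []
    for train, valid in k_fold_indices(len(xs), k, seed):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        preds = [predict(model, xs[i]) for i in valid]
        scores.append(loss([ys[i] for i in valid], preds))
    return sum(scores) / len(scores)


# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def predict_mean(model, x):
    return model


def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)
```

Hyperparameter tuning would then typically amount to calling `cross_validate` once per candidate setting and keeping the setting with the lowest averaged loss.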

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Load/Price Forecasting and Management of Demand Response for Smart Grids: Methodologies and Challenges

Author: Chan SC
Hou Y
Tsui KM
Wu FF
Wu HC
Wu YC
Publication date: 01/01/2012

HKU Scholars Hub

A Linear Estimator for Factor-Augmented Fixed-T Panels With Endogenous Regressors

Author: Juodis Artūras
Sarafidis Vasilis
Publication venue: 'Informa UK Limited'
Publication date: 01/01/2022

A novel method-of-moments approach is proposed for the estimation of factor-augmented panel data models with endogenous regressors when T is fixed. The underlying methodology involves approximating the unobserved common factors using observed factor proxies. The resulting moment conditions are linear in the parameters. The proposed approach addresses several issues that arise with the existing nonlinear estimators available for fixed-T panels, such as problems related to local minima, sensitivity to particular normalization schemes, and a potential lack of global identification. We apply our approach to a large panel of households and estimate the price elasticity of urban water demand. A simulation study confirms that our approach performs well in finite samples.

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

Dissertations of the University of Groningen

Exploring Interpretable LSTM Neural Networks over Multi-Variable Data

Author: Antulov-Fantulin Nino
Guo Tian
Lin Tao
Publication date: 01/01/2019

For recurrent neural networks trained on time series with target and exogenous variables, it is desirable to provide interpretable insights into the data in addition to accurate predictions. In this paper, we explore the structure of LSTM recurrent neural networks to learn variable-wise hidden states, with the aim of capturing different dynamics in multi-variable time series and distinguishing the contribution of each variable to the prediction. With these variable-wise hidden states, a mixture attention mechanism is proposed to model the generative process of the target. We then develop associated training methods to jointly learn the network parameters and the variable and temporal importance with respect to the prediction of the target variable. Extensive experiments on real datasets demonstrate enhanced prediction performance from capturing the dynamics of different variables. We also evaluate the interpretation results both qualitatively and quantitatively. The approach shows promise as an end-to-end framework for both forecasting and knowledge extraction over multi-variable data.
Comment: Accepted to International Conference on Machine Learning (ICML), 201

arXiv.org e-Print Archive

Repository for Publications and Research Data

A Markov blanket-based method for detecting causal SNPs in GWAS

Author: A Hamosh
BA McKinney
Bing Han
C Kooperberg
C-c Chang
CF Aliferis
D Koller
D Margaritis
DF Easton
HJ Cordell
I Tsamardinos
J Fellay
J Li
J Marchini
JH McDonald
JH Moore
JK Pritchard
LW Hahn
M Robnik-Šikonja
MD Ritchie
MD Shriver
Meeyoung Park
MY Park
P Spirtes
R Jiang
RJ Klein
RR Sokal
SE Antonarakis
SH Chen
SK Musani
ST Sherry
X-W Chen
Xue-wen Chen
Y Zhang
Publication venue: BioMed Central
Publication date: 01/04/2010

Background: Detecting epistatic interactions associated with complex and common diseases can help to improve the prevention, diagnosis and treatment of these diseases. With the development of genome-wide association studies (GWAS), designing powerful and robust computational methods for identifying epistatic interactions associated with common diseases has become a major challenge for the bioinformatics community, because the study of epistatic interactions often involves very large genotyped datasets and a huge number of combinations of possible genetic factors. Most existing computational detection methods are based on the classification capacity of SNP sets, which may fail to identify SNP sets that are strongly associated with the diseases and may introduce many false positives. In addition, most methods are not suitable for genome-wide scale studies because of their computational complexity.
Results: We propose a new Markov blanket-based method, DASSO-MB (Detection of ASSOciations using Markov Blanket), to detect epistatic interactions in case-control GWAS. The Markov blanket of a target variable T completely shields T from all other variables. Thus, we can guarantee that the SNP set detected by DASSO-MB has a strong association with diseases and contains the fewest false positives. Furthermore, DASSO-MB uses a heuristic search strategy based on calculating the association between variables, which avoids the time-consuming training process required by other machine-learning methods. We apply our algorithm to simulated datasets and a real case-control dataset. We compare DASSO-MB with other commonly used methods and show that our method significantly outperforms them and is capable of finding SNPs strongly associated with diseases.
Conclusions: Our study shows that DASSO-MB can identify a minimal set of causal SNPs associated with diseases, which contains fewer false positives than the sets found by other existing methods. Given the huge size of the genomic datasets produced by GWAS, this is critical for reducing the potential cost of biological experiments and for providing an efficient guideline for pathogenesis research.
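
The "shielding" property of a Markov blanket referred to in the abstract can be written as a conditional independence statement. The notation below is the standard textbook formulation rather than something quoted from the paper; V (the set of all variables) and MB(T) (the Markov blanket of T) are symbols introduced here for illustration.

```latex
% Given its Markov blanket, T is conditionally independent of every other variable set X.
P\bigl(T \mid \mathrm{MB}(T),\, X\bigr) = P\bigl(T \mid \mathrm{MB}(T)\bigr)
\qquad \text{for all } X \subseteq V \setminus \bigl(\mathrm{MB}(T) \cup \{T\}\bigr)
```

Read in the GWAS setting, T would be the disease status and the remaining variables the SNPs, so SNPs outside MB(T) add no further information about T once the blanket is known.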

Crossref

Directory of Open Access Journals

KU ScholarWorks

PubMed Central

ESTIMATION OF IMPLIED VOLATILITY SURFACE AND ITS DYNAMICS: EVIDENCE FROM S&P 500 INDEX OPTION IN POST-FINANCIAL CRISIS MARKET

Author: Shouting Sun
Sijia Ji
Publication date: 01/12/2015

There is now an extensive literature on modeling the implied volatility surface (IVS) as a function of options’ strike prices and time to maturity. The polynomial parameterization is one of these approaches, and it provides a simple and efficient way for practitioners to estimate implied volatility. This project tests the predictive capability of this methodology in the post-financial crisis market. Using data on European puts and calls on the S&P 500 index for the period from July 1st, 2012 to June 30th, 2015, we estimate a vector autoregressive model to capture the dynamics of the IVS. Our results show that this methodology has better predictive capability for the IVS of index options in the post-financial crisis market than for the IVS of equity options in the pre-financial crisis period.
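
For orientation, one common form of polynomial IVS parameterization, followed by a vector autoregression on its coefficients, can be sketched as below. This is an assumed, generic specification: the abstract does not state the exact polynomial terms, the moneyness measure, or the VAR lag order, so the symbols K (strike), τ (time to maturity), β (coefficient vector), c, Φ and the first-order lag are all illustrative choices.

```latex
% Assumed quadratic surface in strike K and time to maturity \tau, fitted on each date t
\sigma_t(K, \tau) = \beta_{0,t} + \beta_{1,t} K + \beta_{2,t} K^2
                  + \beta_{3,t} \tau + \beta_{4,t} \tau^2 + \beta_{5,t} K \tau + \varepsilon_t

% Assumed first-order VAR on the fitted coefficient vector
% \beta_t = (\beta_{0,t}, \dots, \beta_{5,t})^\top
\beta_t = c + \Phi\, \beta_{t-1} + u_t
```

Under this sketch, forecasting the surface means forecasting the coefficient vector with the VAR and then evaluating the polynomial at the strikes and maturities of interest.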

Simon Fraser University Institutional Repository