
    Comparing linear discriminant analysis and supervised learning algorithms for binary classification - a method comparison study

    In psychology, linear discriminant analysis (LDA) is the method of choice for two-group classification tasks based on questionnaire data. In this study, we present a comparison of LDA with several supervised learning algorithms. In particular, we examine to what extent the predictive performance of LDA relies on the multivariate normality assumption. As nonparametric alternatives, the linear support vector machine (SVM), classification and regression tree (CART), random forest (RF), probabilistic neural network (PNN), and ensemble k conditional nearest neighbor (EkCNN) algorithms are applied. Predictive performance is determined using measures of overall performance, discrimination, and calibration, and is compared in two reference data sets as well as in a simulation study. The reference data are Likert-type data and comprise 5 and 10 predictor variables, respectively. Simulations are based on the reference data and are run for a balanced and an unbalanced scenario in each case. To compare the algorithms' performance, data are simulated from multivariate distributions with differing degrees of nonnormality. Results differ depending on the specific performance measure. The main finding is that LDA is always outperformed by RF in the bimodal data with respect to overall performance. The discriminative ability of the RF algorithm is often higher than that of LDA, but its model calibration is usually worse. Still, LDA mostly ranks second in cases where it is outperformed by another algorithm, or the differences are only marginal. In consequence, we still recommend LDA for this type of application.
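    As a rough illustration of the kind of comparison the study performs, the sketch below pits LDA against a random forest on simulated Likert-type data and scores both with the Brier score, one common overall performance measure. The data-generating process, sample size, and settings are illustrative assumptions, not the study's actual simulation design.

```python
# Hedged sketch: LDA vs. random forest on simulated Likert-type data.
# The simulation design here is illustrative, not the study's own.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 500, 5  # 5 Likert-type predictors, mirroring the first reference data set

# Two groups with shifted means; round and clip to a 1-5 Likert scale
X = np.vstack([rng.normal(0.0, 1.0, (n // 2, p)),
               rng.normal(0.8, 1.0, (n // 2, p))])
X = np.clip(np.round(X + 3), 1, 5)  # crude Likert discretisation
y = np.repeat([0, 1], n // 2)

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("RF", RandomForestClassifier(n_estimators=500, random_state=0))]:
    # The Brier score combines aspects of discrimination and calibration
    score = cross_val_score(clf, X, y, cv=5, scoring="neg_brier_score").mean()
    print(f"{name}: mean Brier score = {-score:.3f}")
```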

    Attributed Network Embedding for Learning in a Dynamic Environment

    Network embedding leverages node proximity to learn a low-dimensional vector representation for each node in the network. The learned embeddings can advance various learning tasks such as node classification, network clustering, and link prediction. Most, if not all, of the existing works are performed in the context of plain and static networks. Nonetheless, in reality, network structure often evolves over time with the addition/deletion of links and nodes. Also, a vast majority of real-world networks are associated with a rich set of node attributes, and the attribute values naturally change as well, with the emergence of new content patterns and the fading of old ones. These changing characteristics motivate us to seek an effective embedding representation that captures network and attribute evolution patterns, which is of fundamental importance for learning in a dynamic environment. To the best of our knowledge, we are the first to tackle this problem, which poses two challenges: (1) the inherently correlated network and node attributes can be noisy and incomplete, necessitating a robust consensus representation to capture their individual properties and correlations; (2) the embedding learning needs to be performed in an online fashion to adapt to the changes accordingly. In this paper, we address these challenges by proposing a novel dynamic attributed network embedding framework, DANE. In particular, DANE first provides an offline method for a consensus embedding and then leverages matrix perturbation theory to maintain the freshness of the end embedding results in an online manner. We perform extensive experiments on both synthetic and real attributed networks to corroborate the effectiveness and efficiency of the proposed framework.
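    For intuition, the following minimal sketch shows an offline, consensus-style embedding of an attributed network: spectral embeddings of the network structure and of the node-attribute similarity are computed separately and combined. This is a simplified stand-in; DANE's actual offline method maximises the correlation between the two embeddings, and its online stage updates them via matrix perturbation theory. The function names and the naive concatenation step are assumptions for illustration.

```python
# Simplified offline consensus-style embedding (not DANE's actual algorithm).
import numpy as np
from scipy.sparse.linalg import eigsh

def spectral_embedding(S, d):
    """Top-d eigenvectors of the symmetrically normalised similarity matrix S."""
    deg = np.maximum(S.sum(axis=1), 1e-12)  # assumes non-negative similarities
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L_sym = D_inv_sqrt @ S @ D_inv_sqrt
    _, vecs = eigsh(L_sym, k=d, which="LA")  # eigenvectors for largest eigenvalues
    return vecs

def offline_consensus(A, X, d=16):
    """A: n x n adjacency matrix, X: n x m non-negative attribute matrix (dense)."""
    net_emb = spectral_embedding(A, d)
    attr_emb = spectral_embedding(X @ X.T, d)  # attribute affinity
    # Naive consensus by concatenation; DANE instead maximises correlation
    return np.hstack([net_emb, attr_emb])
```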

    Essays on Machine Learning in Risk Management, Option Pricing, and Insurance Economics

    Dealing with uncertainty is at the heart of financial risk management and asset pricing. This cumulative dissertation consists of four independent research papers that study various aspects of uncertainty, from estimation and model risk over the volatility risk premium to the measurement of unobservable variables. In the first paper, a non-parametric estimator of conditional quantiles is proposed that builds on methods from the machine learning literature. The so-called leveraging estimator is discussed in detail and analyzed in an extensive simulation study. Subsequently, the estimator is used to quantify the estimation risk of Value-at-Risk and Expected Shortfall models. The results suggest that there are significant differences in the estimation risk of various GARCH-type models, while in general estimation risk is higher for the Expected Shortfall than for the Value-at-Risk. In the second paper, the leveraging estimator is applied to realized and implied volatility estimates of US stock options to empirically test whether the volatility risk premium is priced in the cross-section of option returns. A trading strategy that is long (short) in a portfolio with low (high) implied volatility conditional on the realized volatility yields average monthly returns that are economically and statistically significant. The third paper investigates the model risk of multivariate Value-at-Risk and Expected Shortfall models in a comprehensive empirical study on copula GARCH models. The paper finds that model risk is economically significant, especially high during periods of financial turmoil, and mainly due to the choice of the copula. In the fourth paper, the relation between digitalization and the market value of US insurers is analyzed. To this end, a text-based measure of digitalization building on Latent Dirichlet Allocation is proposed. It is shown that a rise in digitalization efforts is associated with an increase in market valuations.
    Contents:
    1 Introduction
      1.1 Motivation
      1.2 Conditional quantile estimation via leveraging optimal quantization
      1.3 Cross-section of option returns and the volatility risk premium
      1.4 Marginals versus copulas: Which account for more model risk in multivariate risk forecasting?
      1.5 Estimating the relation between digitalization and the market value of insurers
    2 Conditional Quantile Estimation via Leveraging Optimal Quantization
      2.1 Introduction
      2.2 Optimal quantization
      2.3 Conditional quantiles through leveraging optimal quantization
      2.4 The hyperparameters N, λ, and γ
      2.5 Simulation study
      2.6 Empirical application
      2.7 Conclusion
    3 Cross-Section of Option Returns and the Volatility Risk Premium
      3.1 Introduction
      3.2 Capturing the volatility risk premium
      3.3 Empirical study
      3.4 Robustness checks
      3.5 Conclusion
    4 Marginals Versus Copulas: Which Account for More Model Risk in Multivariate Risk Forecasting?
      4.1 Introduction
      4.2 Market risk models and model risk
      4.3 Data
      4.4 Analysis of model risk
      4.5 Model risk for models in the model confidence set
      4.6 Model risk and backtesting
      4.7 Conclusion
    5 Estimating the Relation Between Digitalization and the Market Value of Insurers
      5.1 Introduction
      5.2 Measuring digitalization using LDA
      5.3 Financial data & empirical strategy
      5.4 Estimation results
      5.5 Conclusion
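    To make the first paper's idea concrete, the sketch below estimates a conditional quantile by quantising the predictor space with k-means (used here as a rough proxy for optimal quantization) and taking the empirical quantile of the response within the cell that a new observation falls into; applied to lagged returns, this yields a crude Value-at-Risk estimate. The leveraging step of the dissertation's estimator is omitted, and all names, data, and parameters are illustrative assumptions.

```python
# Hedged sketch: conditional quantile estimation via quantization.
# k-means stands in for optimal quantization; the leveraging step is omitted.
import numpy as np
from sklearn.cluster import KMeans

def conditional_quantile(X, y, x_new, tau=0.05, n_cells=20, seed=0):
    km = KMeans(n_clusters=n_cells, n_init=10, random_state=seed).fit(X)
    cell = km.predict(np.atleast_2d(x_new))[0]   # cell containing x_new
    y_cell = y[km.labels_ == cell]               # responses in that cell
    return np.quantile(y_cell, tau)              # empirical conditional quantile

# Toy example: a crude one-day-ahead Value-at-Risk from lagged returns
rng = np.random.default_rng(1)
r = rng.standard_t(df=5, size=1000) * 0.01       # simulated return series
X, y = r[:-1].reshape(-1, 1), r[1:]
var_5 = conditional_quantile(X, y, x_new=[r[-1]], tau=0.05)
print(f"5% one-day VaR estimate: {var_5:.4f}")
```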

    Untangling hotel industry’s inefficiency: An SFA approach applied to a renowned Portuguese hotel chain

    This paper explores the technical efficiency of four hotels from the Teixeira Duarte Group, a renowned Portuguese hotel chain. An efficiency ranking of these four hotel units located in Portugal is established using Stochastic Frontier Analysis. This methodology makes it possible to discriminate between measurement error and systematic inefficiencies in the estimation process, enabling investigation of the main causes of inefficiency. Several suggestions for efficiency improvement are offered for each hotel studied.
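    For readers unfamiliar with Stochastic Frontier Analysis, the sketch below fits a generic cross-sectional normal/half-normal production frontier y = Xβ + v − u by maximum likelihood, where v is symmetric measurement error and u ≥ 0 captures systematic inefficiency; this decomposition is what lets SFA separate noise from inefficiency. It is a textbook illustration under standard assumptions, not the specification used for the Teixeira Duarte hotels.

```python
# Hedged sketch: normal/half-normal stochastic frontier model, y = Xb + v - u.
# Generic textbook SFA, with placeholder data and variable names.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_loglik(theta, y, X):
    k = X.shape[1]
    beta, log_sv, log_su = theta[:k], theta[k], theta[k + 1]
    sv, su = np.exp(log_sv), np.exp(log_su)     # noise and inefficiency scales
    sigma = np.sqrt(sv**2 + su**2)
    lam = su / sv
    eps = y - X @ beta                          # composed error v - u
    ll = (np.log(2.0 / sigma) + norm.logpdf(eps / sigma)
          + norm.logcdf(-eps * lam / sigma))    # Aigner-Lovell-Schmidt density
    return -ll.sum()

def fit_sfa(y, X):
    k = X.shape[1]
    # OLS starting values for beta, zero log-scales for the variance parameters
    theta0 = np.concatenate([np.linalg.lstsq(X, y, rcond=None)[0], [0.0, 0.0]])
    res = minimize(neg_loglik, theta0, args=(y, X), method="BFGS")
    return res.x[:k], np.exp(res.x[k]), np.exp(res.x[k + 1])  # beta, sigma_v, sigma_u
```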

    Topics in high dimensional energy forecasting

    The forecasting of future energy consumption and generation is now an essential part of power system operation. In networks with high renewable power penetration, forecasts are used to help maintain security of supply and to operate the system efficiently. Historically, uncertainties have always been present on the demand side of the network; with the growth of weather-dependent renewables, they are now also present on the generation side. Here, we focus on forecasting for wind energy applications at the day(s)-ahead scale. Most of the work developed is for power forecasting, although we also identify an emerging opportunity in access forecasting for offshore operations. Power forecasts are used by traders, power system operators, and asset owners to optimise decision making based on future generation. Several novel methodologies are presented based on post-processing Numerical Weather Predictions (NWP) with measured data using modern statistical learning techniques; they are linked by the increasingly relevant challenge of dealing with high-dimensional data.
    The term 'high-dimensional' means different things to different people, depending on their background. To statisticians, high dimensionality occurs when the dimensions of the problem exceed the number of observations, i.e. the classic p >> n problem, an example of which can be found in Chapter 7. In this work we take the more general view that a high-dimensional dataset is one with a high number of attributes or features. In wind energy forecasting applications, this can occur in the input and/or output variable space. For example, multivariate forecasting of spatially distributed wind farms can be a potentially very high-dimensional problem, but so is feature engineering using ultra-high-resolution NWP in this framework.
    Most of the work in this thesis is based on various forms of probabilistic forecasting. Probabilistic forecasts are essential for risk management, but also for risk-neutral participants in asymmetrically penalised electricity markets. Uncertainty is always present; it is merely hidden in deterministic, i.e. point, forecasts. This aspect of forecasting has been the subject of a concerted research effort over the last few years in the energy forecasting literature. However, we identify and address gaps in the literature related to dealing with high-dimensional data on both the input and output sides of the modelling chain. It is not a given that increasing the resolution of a weather forecast increases its skill and therefore reduces the associated errors. In fact, under typical average scoring rules, high-resolution forecasts often perform worse than smoother forecasts from lower-resolution models due to spatial and/or temporal displacement errors. Here, we evaluate the potential of using ultra-high-resolution weather models for offshore power forecasting, using feature engineering and modern statistical learning techniques. Two methods for creating improved probabilistic wind power forecasts through the use of turbine-level data are proposed. Although standard-resolution NWP data is used, high dimensionality is now present in the output variable space; both methods scale with the number of turbines in the wind farm, although to different extents. A methodology for regime-switching multivariate wind power forecasting is also elaborated, with a case study on 92 wind balancing mechanism units connected to the GB network.
    Finally, we look at an emerging topic in energy forecasting: offshore access forecasting. Improving access is a priority in the offshore wind sector, driven by the opportunity to increase revenues, reduce costs, and improve safety at operational wind farms. We describe a novel methodology for producing probabilistic forecasts of access conditions during crew transfers.
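    As a small illustration of NWP post-processing with modern statistical learning, the sketch below maps a single NWP feature (forecast wind speed) to wind power quantiles using gradient-boosted quantile regression, producing a simple probabilistic forecast. The feature, data, and settings are illustrative assumptions and do not reproduce the thesis's turbine-level or regime-switching methods.

```python
# Hedged sketch: probabilistic wind power forecasting by post-processing an
# NWP feature with gradient-boosted quantile regression. Data are synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
n = 2000
wind_speed = rng.uniform(0, 25, n)                        # stand-in NWP feature
# Toy power curve (cubic below rated, capped at 1) plus noise
power = np.clip((wind_speed / 12) ** 3, 0, 1) + rng.normal(0, 0.05, n)
X = wind_speed.reshape(-1, 1)

# One quantile model per probability level
quantiles = [0.1, 0.5, 0.9]
models = {q: GradientBoostingRegressor(loss="quantile", alpha=q,
                                       n_estimators=200).fit(X, power)
          for q in quantiles}

x_new = np.array([[10.0]])                                # day-ahead NWP wind speed
forecast = {q: m.predict(x_new)[0] for q, m in models.items()}
print(forecast)  # note: separately fitted quantiles can cross and may need re-sorting
```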

    Methodological contributions to the challenges and opportunities of high dimensional clustering in the context of single-cell data

    Get PDF
    With single-cell sequencing it is possible to measure the gene expression of each individual cell, in contrast to bulk sequencing, which yields only average gene expression. This procedure provides read counts for each single cell and allows the development of methods that automatically allocate single cells to cell types. The determination of cell types is decisive for the analysis of diseases and for understanding human health on the basis of the genetic profile of single cells. Cell types are commonly allocated using clustering procedures developed explicitly for single-cell data. Single-cell consensus clustering (SC3), proposed by Kiselev et al. (Nat Methods 14(5):483-486, 2017), is among the leading clustering methods in this context and is also relevant for the following contributions. This PhD thesis aims at the development of appropriate analysis techniques for the clustering of high-dimensional single-cell data and their reliable validation. It also provides a simulation framework for investigating the influence of distorted single-cell measurements on clustering performance. We further incorporate cluster indices as informative weights into regularized regression, which allows a soft filtering of variables.
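    To give a flavour of consensus clustering in the spirit of SC3, the sketch below combines several k-means runs through a co-association matrix and extracts the final clusters hierarchically. SC3's actual pipeline additionally varies distance metrics and transformations (PCA, Laplacian); everything here, including function names and parameters, is a simplified illustrative assumption.

```python
# Hedged sketch: consensus clustering via a co-association matrix, a simplified
# stand-in for SC3's pipeline. X is a cells-by-features expression matrix.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def consensus_cluster(X, n_types=5, n_runs=20, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    coassoc = np.zeros((n, n))
    # Several k-means runs with different seeds vote on pairwise co-membership
    for _ in range(n_runs):
        labels = KMeans(n_clusters=n_types, n_init=1,
                        random_state=int(rng.integers(1 << 31))).fit_predict(X)
        coassoc += (labels[:, None] == labels[None, :])
    coassoc /= n_runs
    # Hierarchical clustering on (1 - co-association) as a precomputed distance
    hc = AgglomerativeClustering(n_clusters=n_types, metric="precomputed",
                                 linkage="average")
    return hc.fit_predict(1.0 - coassoc)
```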