
    Comparing linear discriminant analysis and supervised learning algorithms for binary classification - a method comparison study

    In psychology, linear discriminant analysis (LDA) is the method of choice for two-group classification tasks based on questionnaire data. In this study, we present a comparison of LDA with several supervised learning algorithms. In particular, we examine to what extent the predictive performance of LDA relies on the multivariate normality assumption. As nonparametric alternatives, the linear support vector machine (SVM), classification and regression tree (CART), random forest (RF), probabilistic neural network (PNN), and ensemble k conditional nearest neighbor (EkCNN) algorithms are applied. Predictive performance is determined using measures of overall performance, discrimination, and calibration, and is compared in two reference data sets as well as in a simulation study. The reference data are Likert-type data and comprise 5 and 10 predictor variables, respectively. Simulations are based on the reference data and are run for a balanced and an unbalanced scenario in each case. To compare the algorithms' performance, data are simulated from multivariate distributions with differing degrees of nonnormality. Results differ depending on the specific performance measure. The main finding is that LDA is always outperformed by RF in the bimodal data with respect to overall performance. The discriminative ability of the RF algorithm is often higher than that of LDA, but its model calibration is usually worse. Still, LDA mostly ranks second in cases where it is outperformed by another algorithm, or the differences are only marginal. In consequence, we still recommend LDA for this type of application.
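    As a rough illustration of the kind of comparison the study performs, the sketch below pits LDA against a random forest on simulated Likert-type data and scores both with the Brier score, one common overall performance measure. The data-generating process, sample size, and settings are illustrative assumptions, not the study's actual simulation design.

```python
# Hedged sketch: LDA vs. random forest on simulated Likert-type data.
# The simulation design here is illustrative, not the study's own.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 500, 5  # 5 Likert-type predictors, mirroring the first reference data set

# Two groups with shifted means; round and clip to a 1-5 Likert scale
X = np.vstack([rng.normal(0.0, 1.0, (n // 2, p)),
               rng.normal(0.8, 1.0, (n // 2, p))])
X = np.clip(np.round(X + 3), 1, 5)  # crude Likert discretisation
y = np.repeat([0, 1], n // 2)

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("RF", RandomForestClassifier(n_estimators=500, random_state=0))]:
    # The Brier score combines aspects of discrimination and calibration
    score = cross_val_score(clf, X, y, cv=5, scoring="neg_brier_score").mean()
    print(f"{name}: mean Brier score = {-score:.3f}")
```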

    Attributed Network Embedding for Learning in a Dynamic Environment

    Network embedding leverages node proximity to learn a low-dimensional vector representation for each node in the network. The learned embeddings can advance various learning tasks such as node classification, network clustering, and link prediction. Most, if not all, of the existing works are performed in the context of plain and static networks. Nonetheless, in reality, network structure often evolves over time with the addition/deletion of links and nodes. Also, a vast majority of real-world networks are associated with a rich set of node attributes, and the attribute values naturally change as well, with the emergence of new content patterns and the fading of old ones. These changing characteristics motivate us to seek an effective embedding representation that captures network and attribute evolution patterns, which is of fundamental importance for learning in a dynamic environment. To the best of our knowledge, we are the first to tackle this problem, which poses two challenges: (1) the inherently correlated network and node attributes can be noisy and incomplete, necessitating a robust consensus representation to capture their individual properties and correlations; (2) the embedding learning needs to be performed in an online fashion to adapt to the changes accordingly. In this paper, we address these challenges by proposing a novel dynamic attributed network embedding framework, DANE. In particular, DANE first provides an offline method for a consensus embedding and then leverages matrix perturbation theory to maintain the freshness of the end embedding results in an online manner. We perform extensive experiments on both synthetic and real attributed networks to corroborate the effectiveness and efficiency of the proposed framework.
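    For intuition, the following minimal sketch shows an offline, consensus-style embedding of an attributed network: spectral embeddings of the network structure and of the node-attribute similarity are computed separately and combined. This is a simplified stand-in; DANE's actual offline method maximises the correlation between the two embeddings, and its online stage updates them via matrix perturbation theory. The function names and the naive concatenation step are assumptions for illustration.

```python
# Simplified offline consensus-style embedding (not DANE's actual algorithm).
import numpy as np
from scipy.sparse.linalg import eigsh

def spectral_embedding(S, d):
    """Top-d eigenvectors of the symmetrically normalised similarity matrix S."""
    deg = np.maximum(S.sum(axis=1), 1e-12)  # assumes non-negative similarities
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L_sym = D_inv_sqrt @ S @ D_inv_sqrt
    _, vecs = eigsh(L_sym, k=d, which="LA")  # eigenvectors for largest eigenvalues
    return vecs

def offline_consensus(A, X, d=16):
    """A: n x n adjacency matrix, X: n x m non-negative attribute matrix (dense)."""
    net_emb = spectral_embedding(A, d)
    attr_emb = spectral_embedding(X @ X.T, d)  # attribute affinity
    # Naive consensus by concatenation; DANE instead maximises correlation
    return np.hstack([net_emb, attr_emb])
```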

    Essays on Machine Learning in Risk Management, Option Pricing, and Insurance Economics

    Dealing with uncertainty is at the heart of financial risk management and asset pricing. This cumulative dissertation consists of four independent research papers that study various aspects of uncertainty, from estimation and model risk over the volatility risk premium to the measurement of unobservable variables. In the first paper, a non-parametric estimator of conditional quantiles is proposed that builds on methods from the machine learning literature. The so-called leveraging estimator is discussed in detail and analyzed in an extensive simulation study. Subsequently, the estimator is used to quantify the estimation risk of Value-at-Risk and Expected Shortfall models. The results suggest that there are significant differences in the estimation risk of various GARCH-type models, while in general estimation risk is higher for the Expected Shortfall than for the Value-at-Risk. In the second paper, the leveraging estimator is applied to realized and implied volatility estimates of US stock options to empirically test whether the volatility risk premium is priced in the cross-section of option returns. A trading strategy that is long (short) in a portfolio with low (high) implied volatility conditional on the realized volatility yields average monthly returns that are economically and statistically significant. The third paper investigates the model risk of multivariate Value-at-Risk and Expected Shortfall models in a comprehensive empirical study on copula GARCH models. The paper finds that model risk is economically significant, especially high during periods of financial turmoil, and mainly due to the choice of the copula. In the fourth paper, the relation between digitalization and the market value of US insurers is analyzed. To this end, a text-based measure of digitalization building on Latent Dirichlet Allocation is proposed. It is shown that a rise in digitalization efforts is associated with an increase in market valuations.
    Contents:
    1 Introduction
      1.1 Motivation
      1.2 Conditional quantile estimation via leveraging optimal quantization
      1.3 Cross-section of option returns and the volatility risk premium
      1.4 Marginals versus copulas: Which account for more model risk in multivariate risk forecasting?
      1.5 Estimating the relation between digitalization and the market value of insurers
    2 Conditional Quantile Estimation via Leveraging Optimal Quantization
      2.1 Introduction
      2.2 Optimal quantization
      2.3 Conditional quantiles through leveraging optimal quantization
      2.4 The hyperparameters N, λ, and γ
      2.5 Simulation study
      2.6 Empirical application
      2.7 Conclusion
    3 Cross-Section of Option Returns and the Volatility Risk Premium
      3.1 Introduction
      3.2 Capturing the volatility risk premium
      3.3 Empirical study
      3.4 Robustness checks
      3.5 Conclusion
    4 Marginals Versus Copulas: Which Account for More Model Risk in Multivariate Risk Forecasting?
      4.1 Introduction
      4.2 Market risk models and model risk
      4.3 Data
      4.4 Analysis of model risk
      4.5 Model risk for models in the model confidence set
      4.6 Model risk and backtesting
      4.7 Conclusion
    5 Estimating the Relation Between Digitalization and the Market Value of Insurers
      5.1 Introduction
      5.2 Measuring digitalization using LDA
      5.3 Financial data & empirical strategy
      5.4 Estimation results
      5.5 Conclusion
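    To make the first paper's idea concrete, the sketch below estimates a conditional quantile by quantising the predictor space with k-means (used here as a rough proxy for optimal quantization) and taking the empirical quantile of the response within the cell that a new observation falls into; applied to lagged returns, this yields a crude Value-at-Risk estimate. The leveraging step of the dissertation's estimator is omitted, and all names, data, and parameters are illustrative assumptions.

```python
# Hedged sketch: conditional quantile estimation via quantization.
# k-means stands in for optimal quantization; the leveraging step is omitted.
import numpy as np
from sklearn.cluster import KMeans

def conditional_quantile(X, y, x_new, tau=0.05, n_cells=20, seed=0):
    km = KMeans(n_clusters=n_cells, n_init=10, random_state=seed).fit(X)
    cell = km.predict(np.atleast_2d(x_new))[0]   # cell containing x_new
    y_cell = y[km.labels_ == cell]               # responses in that cell
    return np.quantile(y_cell, tau)              # empirical conditional quantile

# Toy example: a crude one-day-ahead Value-at-Risk from lagged returns
rng = np.random.default_rng(1)
r = rng.standard_t(df=5, size=1000) * 0.01       # simulated return series
X, y = r[:-1].reshape(-1, 1), r[1:]
var_5 = conditional_quantile(X, y, x_new=[r[-1]], tau=0.05)
print(f"5% one-day VaR estimate: {var_5:.4f}")
```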

    Untangling hotel industry’s inefficiency: An SFA approach applied to a renowned Portuguese hotel chain

    This paper explores the technical efficiency of four hotels from the Teixeira Duarte Group, a renowned Portuguese hotel chain. An efficiency ranking of these four hotel units located in Portugal is established using Stochastic Frontier Analysis. This methodology makes it possible to discriminate between measurement error and systematic inefficiencies in the estimation process, enabling investigation of the main causes of inefficiency. Several suggestions for efficiency improvement are offered for each hotel studied.
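    For readers unfamiliar with Stochastic Frontier Analysis, the sketch below fits a generic cross-sectional normal/half-normal production frontier y = Xβ + v − u by maximum likelihood, where v is symmetric measurement error and u ≥ 0 captures systematic inefficiency; this decomposition is what lets SFA separate noise from inefficiency. It is a textbook illustration under standard assumptions, not the specification used for the Teixeira Duarte hotels.

```python
# Hedged sketch: normal/half-normal stochastic frontier model, y = Xb + v - u.
# Generic textbook SFA, with placeholder data and variable names.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_loglik(theta, y, X):
    k = X.shape[1]
    beta, log_sv, log_su = theta[:k], theta[k], theta[k + 1]
    sv, su = np.exp(log_sv), np.exp(log_su)     # noise and inefficiency scales
    sigma = np.sqrt(sv**2 + su**2)
    lam = su / sv
    eps = y - X @ beta                          # composed error v - u
    ll = (np.log(2.0 / sigma) + norm.logpdf(eps / sigma)
          + norm.logcdf(-eps * lam / sigma))    # Aigner-Lovell-Schmidt density
    return -ll.sum()

def fit_sfa(y, X):
    k = X.shape[1]
    # OLS starting values for beta, zero log-scales for the variance parameters
    theta0 = np.concatenate([np.linalg.lstsq(X, y, rcond=None)[0], [0.0, 0.0]])
    res = minimize(neg_loglik, theta0, args=(y, X), method="BFGS")
    return res.x[:k], np.exp(res.x[k]), np.exp(res.x[k + 1])  # beta, sigma_v, sigma_u
```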

    Topics in high dimensional energy forecasting

    The forecasting of future energy consumption and generation is now an essential part of power system operation. In networks with high renewable power penetration, forecasts are used to help maintain security of supply and to operate the system efficiently. Historically, uncertainties have always been present on the demand side of the network; with the growth of weather-dependent renewables, they are now also present on the generation side. Here, we focus on forecasting for wind energy applications at the day(s)-ahead scale. Most of the work developed is for power forecasting, although we also identify an emerging opportunity in access forecasting for offshore operations. Power forecasts are used by traders, power system operators, and asset owners to optimise decision making based on future generation. Several novel methodologies are presented based on post-processing Numerical Weather Predictions (NWP) with measured data using modern statistical learning techniques; they are linked by the increasingly relevant challenge of dealing with high-dimensional data.
    The term 'high-dimensional' means different things to different people, depending on their background. To statisticians, high dimensionality occurs when the dimensions of the problem exceed the number of observations, i.e. the classic p >> n problem, an example of which can be found in Chapter 7. In this work we take the more general view that a high-dimensional dataset is one with a high number of attributes or features. In wind energy forecasting applications, this can occur in the input and/or output variable space. For example, multivariate forecasting of spatially distributed wind farms can be a potentially very high-dimensional problem, but so is feature engineering using ultra-high-resolution NWP in this framework.
    Most of the work in this thesis is based on various forms of probabilistic forecasting. Probabilistic forecasts are essential for risk management, but also for risk-neutral participants in asymmetrically penalised electricity markets. Uncertainty is always present; it is merely hidden in deterministic, i.e. point, forecasts. This aspect of forecasting has been the subject of a concerted research effort over the last few years in the energy forecasting literature. However, we identify and address gaps in the literature related to dealing with high-dimensional data on both the input and output sides of the modelling chain. It is not a given that increasing the resolution of a weather forecast increases its skill and therefore reduces the associated errors. In fact, under typical average scoring rules, high-resolution forecasts often perform worse than smoother forecasts from lower-resolution models due to spatial and/or temporal displacement errors. Here, we evaluate the potential of using ultra-high-resolution weather models for offshore power forecasting, using feature engineering and modern statistical learning techniques. Two methods for creating improved probabilistic wind power forecasts through the use of turbine-level data are proposed. Although standard-resolution NWP data is used, high dimensionality is now present in the output variable space; both methods scale with the number of turbines in the wind farm, although to different extents. A methodology for regime-switching multivariate wind power forecasting is also elaborated, with a case study on 92 wind balancing mechanism units connected to the GB network.
    Finally, we look at an emerging topic in energy forecasting: offshore access forecasting. Improving access is a priority in the offshore wind sector, driven by the opportunity to increase revenues, reduce costs, and improve safety at operational wind farms. We describe a novel methodology for producing probabilistic forecasts of access conditions during crew transfers.
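    As a small illustration of NWP post-processing with modern statistical learning, the sketch below maps a single NWP feature (forecast wind speed) to wind power quantiles using gradient-boosted quantile regression, producing a simple probabilistic forecast. The feature, data, and settings are illustrative assumptions and do not reproduce the thesis's turbine-level or regime-switching methods.

```python
# Hedged sketch: probabilistic wind power forecasting by post-processing an
# NWP feature with gradient-boosted quantile regression. Data are synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
n = 2000
wind_speed = rng.uniform(0, 25, n)                        # stand-in NWP feature
# Toy power curve (cubic below rated, capped at 1) plus noise
power = np.clip((wind_speed / 12) ** 3, 0, 1) + rng.normal(0, 0.05, n)
X = wind_speed.reshape(-1, 1)

# One quantile model per probability level
quantiles = [0.1, 0.5, 0.9]
models = {q: GradientBoostingRegressor(loss="quantile", alpha=q,
                                       n_estimators=200).fit(X, power)
          for q in quantiles}

x_new = np.array([[10.0]])                                # day-ahead NWP wind speed
forecast = {q: m.predict(x_new)[0] for q, m in models.items()}
print(forecast)  # note: separately fitted quantiles can cross and may need re-sorting
```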

    Methodological contributions to the challenges and opportunities of high dimensional clustering in the context of single-cell data

    Get PDF
    With single-cell sequencing it is possible to measure the gene expression of each individual cell, in contrast to bulk sequencing, which yields only average gene expression. This procedure provides read counts for each single cell and allows the development of methods that automatically allocate single cells to cell types. The determination of cell types is decisive for the analysis of diseases and for understanding human health on the basis of the genetic profile of single cells. Cell types are commonly allocated using clustering procedures developed explicitly for single-cell data. Single-cell consensus clustering (SC3), proposed by Kiselev et al. (Nat Methods 14(5):483-486, 2017), is among the leading clustering methods in this context and is also relevant for the following contributions. This PhD thesis aims at the development of appropriate analysis techniques for the clustering of high-dimensional single-cell data and their reliable validation. It also provides a simulation framework for investigating the influence of distorted single-cell measurements on clustering performance. We further incorporate cluster indices as informative weights into regularized regression, which allows a soft filtering of variables.
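    To give a flavour of consensus clustering in the spirit of SC3, the sketch below combines several k-means runs through a co-association matrix and extracts the final clusters hierarchically. SC3's actual pipeline additionally varies distance metrics and transformations (PCA, Laplacian); everything here, including function names and parameters, is a simplified illustrative assumption.

```python
# Hedged sketch: consensus clustering via a co-association matrix, a simplified
# stand-in for SC3's pipeline. X is a cells-by-features expression matrix.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def consensus_cluster(X, n_types=5, n_runs=20, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    coassoc = np.zeros((n, n))
    # Several k-means runs with different seeds vote on pairwise co-membership
    for _ in range(n_runs):
        labels = KMeans(n_clusters=n_types, n_init=1,
                        random_state=int(rng.integers(1 << 31))).fit_predict(X)
        coassoc += (labels[:, None] == labels[None, :])
    coassoc /= n_runs
    # Hierarchical clustering on (1 - co-association) as a precomputed distance
    hc = AgglomerativeClustering(n_clusters=n_types, metric="precomputed",
                                 linkage="average")
    return hc.fit_predict(1.0 - coassoc)
```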