Clusterwise analysis for multiblock component methods
Multiblock component methods are applied to data sets in which several blocks of variables are measured on the same set of observations, with the goal of analyzing the relationships between these blocks of variables. In this article, we focus on multiblock component methods that integrate the information found in several blocks of explanatory variables in order to describe and explain one set of dependent variables. In the following, multiblock PLS and multiblock redundancy analysis are chosen as particular cases of multiblock component methods in which one set of variables is explained by a set of predictor variables organized into blocks. Because these multiblock techniques assume that the observations come from a homogeneous population, they provide suboptimal results when the observations actually come from different populations. A strategy to alleviate this problem, presented in this article, is to use a technique such as clusterwise regression in order to identify homogeneous clusters of observations. This approach creates two new methods that provide clusters, each with its own set of regression coefficients. This combination of clustering and regression improves the overall quality of the prediction and facilitates the interpretation. In addition, the minimization of a well-defined criterion, by means of a sequential algorithm, ensures that the algorithm converges monotonically. Finally, the proposed method is distribution-free and can be used when the explanatory variables outnumber the observations within clusters. The proposed clusterwise multiblock methods are illustrated with a simulation study and a (simulated) example from marketing.
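The core clusterwise idea (alternating between per-cluster least-squares fits and reassigning each observation to the cluster whose model predicts it best, so a squared-error criterion decreases monotonically) can be sketched as follows. This is a minimal single-block, single-response illustration, not the multiblock method of the paper; all names and the restart strategy are illustrative:

```python
import numpy as np

def clusterwise_regression(X, y, n_clusters=2, n_iter=50, n_init=10, seed=0):
    """Alternate between (a) fitting one least-squares model per cluster and
    (b) reassigning each observation to the cluster whose model predicts it
    best. The reassignment step never increases the total squared error, so
    the criterion decreases monotonically; restarts guard against bad optima."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])          # add intercept column
    best = (np.inf, None, None)
    for _ in range(n_init):
        coefs = rng.normal(size=(n_clusters, Xd.shape[1]))
        labels = np.zeros(n, dtype=int)
        for _ in range(n_iter):
            # assignment step: squared residual under each cluster model
            resid = (y[:, None] - Xd @ coefs.T) ** 2
            labels = resid.argmin(axis=1)
            # fitting step: refit each cluster model on its members
            for k in range(n_clusters):
                idx = labels == k
                if idx.sum() >= Xd.shape[1]:
                    coefs[k], *_ = np.linalg.lstsq(Xd[idx], y[idx], rcond=None)
        sse = ((y - (Xd @ coefs.T)[np.arange(n), labels]) ** 2).sum()
        if sse < best[0]:
            best = (sse, labels.copy(), coefs.copy())
    return best[1], best[2]
```

Each cluster ends up with its own coefficient vector, mirroring the article's point that clusters carry their own sets of regression coefficients.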
Model based clustering of multinomial count data
We consider the problem of inferring an unknown number of clusters in
replicated multinomial data. From a model-based clustering point of view, this
task can be treated by estimating finite mixtures of multinomial distributions
with or without covariates. Both Maximum Likelihood (ML) and Bayesian
estimation are taken into account. Under a Maximum Likelihood approach, we
provide an Expectation--Maximization (EM) algorithm which exploits a careful
initialization procedure combined with a ridge--stabilized implementation of
the Newton--Raphson method in the M--step. Under a Bayesian setup, a stochastic
gradient Markov chain Monte Carlo (MCMC) algorithm embedded within a prior
parallel tempering scheme is devised. The number of clusters is selected
according to the Integrated Completed Likelihood criterion in the ML approach,
and by estimating the number of non-empty components in overfitting mixture
models in the Bayesian case. Our method is illustrated on simulated data and
applied to two real datasets. An R package is available at
https://github.com/mqbssppe/multinomialLogitMix (to appear in ADA).
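A bare-bones EM algorithm for a finite mixture of multinomials (without covariates, and without the careful initialization and ridge-stabilized Newton-Raphson M-step of the paper) can be sketched as below; the restart strategy and smoothing constant are illustrative assumptions:

```python
import numpy as np

def em_multinomial_mixture(X, K, n_iter=200, n_init=5, seed=0):
    """EM for a K-component mixture of multinomials.
    X: (n, d) integer count matrix (rows are replicated multinomial draws).
    Returns mixture weights, component probability vectors, responsibilities."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    best = (-np.inf, None)
    for _ in range(n_init):
        w = np.full(K, 1.0 / K)
        theta = rng.dirichlet(np.ones(d), size=K)      # (K, d) simplex rows
        for _ in range(n_iter):
            # E-step in the log domain (multinomial coefficient cancels)
            log_p = np.log(w)[None, :] + X @ np.log(theta).T
            m = log_p.max(axis=1, keepdims=True)
            ll = (m[:, 0] + np.log(np.exp(log_p - m).sum(axis=1))).sum()
            r = np.exp(log_p - m)
            r /= r.sum(axis=1, keepdims=True)
            # M-step: weighted counts, lightly smoothed to avoid log(0)
            w = r.mean(axis=0)
            theta = r.T @ X + 1e-10
            theta /= theta.sum(axis=1, keepdims=True)
        if ll > best[0]:
            best = (ll, (w.copy(), theta.copy(), r.copy()))
    return best[1]
```

Model selection (e.g. the Integrated Completed Likelihood criterion mentioned above) would then be applied across fits with different K.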
Outlier detection algorithms over fuzzy data with weighted least squares
In the classical leave-one-out procedure for outlier detection in regression analysis, we exclude an observation and then construct a model on the remaining data. If the difference between the predicted and observed value is high, we declare this value an outlier. As a rule, such procedures rely on single comparison testing. The problem becomes much harder when the observations can be associated with a given degree of membership to an underlying population, and the outlier detection should be generalized to operate over fuzzy data. We present a new approach for outlier detection that operates over fuzzy data using two inter-related algorithms. Due to the way outliers enter the observation sample, they may be of various orders of magnitude. To account for this, we divide the outlier detection procedure into cycles. Furthermore, each cycle consists of two phases. In Phase 1, we apply a leave-one-out procedure for each non-outlier in the dataset. In Phase 2, all previously declared outliers are subjected to the Benjamini–Hochberg step-up multiple testing procedure, which controls the false-discovery rate, and the non-confirmed outliers can return to the dataset. Finally, we construct a regression model over the resulting set of non-outliers. In that way, we ensure that a reliable and high-quality regression model is obtained in Phase 1, because the leave-one-out procedure comparatively easily purges dubious observations thanks to single comparison testing. At the same time, confirming the outlier status against the newly obtained high-quality regression model is much harder because of the multiple testing procedure applied; hence only the true outliers remain outside the data sample. The two phases in each cycle are a good trade-off between the desire to construct a high-quality model (i.e., over informative data points) and the desire to use as many data points as possible (thus leaving as many observations as possible in the data sample).
The number of cycles is user-defined, but the procedure can finalize the analysis when a cycle with no new outliers is detected. We offer one illustrative example and two practical case studies (from real-life thrombosis studies) that demonstrate the application and strengths of our algorithms. In the concluding section, we discuss several limitations of our approach and offer directions for future research.
Regression-based heterogeneity analysis to identify overlapping subgroup structure in high-dimensional data
Heterogeneity is a hallmark of complex diseases. Regression-based
heterogeneity analysis, which is directly concerned with outcome-feature
relationships, has led to a deeper understanding of disease biology. Such an
analysis identifies the underlying subgroup structure and estimates the
subgroup-specific regression coefficients. However, most of the existing
regression-based heterogeneity analyses can only address disjoint subgroups;
that is, each sample is assigned to only one subgroup. In reality, some samples
have multiple labels: for example, many genes have several biological functions,
and some cells of pure cell types transition into other types over time. This
suggests that their outcome-feature relationships (regression coefficients) can
be a mixture of the relationships in more than one subgroup, and as a result the
disjoint subgrouping results can be unsatisfactory. To this end, we develop a
novel approach to regression-based heterogeneity analysis which accounts for
possible overlaps between subgroups and high data dimensions. A subgroup
membership vector is introduced for each sample and combined with a loss
function. Considering the lack of information arising from small sample sizes, a
norm penalty is developed for each membership vector to encourage similarity in
its elements. A sparse penalization is also
applied for regularized estimation and feature selection. Extensive simulations
demonstrate its superiority over direct competitors. The analysis of Cancer
Cell Line Encyclopedia data and lung cancer data from The Cancer Genome Atlas
shows that the proposed approach can identify an overlapping subgroup structure
with favorable performance in prediction and stability.
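The description above (per-sample membership vectors combined with a loss function, a norm penalty encouraging similar membership elements, and a sparse penalty for feature selection) is consistent with a criterion of the following generic form. The abstract does not specify the exact norms or weights, so this is only an illustrative sketch with assumed notation:

```latex
\min_{\{\boldsymbol{\pi}_i\},\,\{\boldsymbol{\beta}_k\}}
\sum_{i=1}^{n}\Big(y_i-\mathbf{x}_i^{\top}\sum_{k=1}^{K}\pi_{ik}\boldsymbol{\beta}_k\Big)^{2}
+\lambda_{1}\sum_{i=1}^{n}P(\boldsymbol{\pi}_i)
+\lambda_{2}\sum_{k=1}^{K}\lVert\boldsymbol{\beta}_k\rVert_{1},
\qquad \pi_{ik}\ge 0,\ \ \sum_{k=1}^{K}\pi_{ik}=1,
```

where \(\boldsymbol{\pi}_i\) is the membership vector of sample \(i\) (overlap means several nonzero entries), \(P(\cdot)\) stands for the similarity-encouraging norm penalty, and the \(\ell_1\) term yields regularized estimation and feature selection in high dimensions.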
Discriminative multi-stream postfilters based on deep learning for enhancing statistical parametric speech synthesis
Statistical parametric speech synthesis based on Hidden Markov Models (HMM) has been an important technique for the production of artificial voices, due to its ability to produce results with high intelligibility and sophisticated features such as voice conversion and accent modification with a small footprint, particularly for low-resource languages where deep learning-based techniques remain unexplored. Despite this progress, the quality of HMM-based results does not reach that of the predominant approaches, based on unit selection of speech segments or on deep learning. One of the proposals to improve the quality of HMM-based speech has been to incorporate postfiltering stages, which aim to increase quality while preserving the advantages of the process. In this paper, we present a new approach to postfiltering synthesized voices through the application of discriminative postfilters built from several long short-term memory (LSTM) deep neural networks. Our motivation stems from modeling a specific mapping from synthesized to natural speech on those segments corresponding to voiced or unvoiced sounds, given the different qualities of those sounds and the distinct degradation that HMM-based voices can present on each one.
The paper analyses the discriminative postfilters obtained using five voices, evaluated with three objective measures, including the Mel cepstral distance, and with subjective tests. The results indicate the advantages of the discriminative postfilters in comparison with the HTS voice and the non-discriminative postfilters.
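The Mel cepstral distance used in the evaluation is commonly computed as a frame-averaged distortion in dB between aligned mel-cepstral sequences. A minimal sketch of that standard formula (the exclusion of the 0th coefficient is a common convention, assumed here, not stated in the abstract):

```python
import numpy as np

def mel_cepstral_distortion(C_ref, C_syn, exclude_c0=True):
    """Frame-averaged Mel cepstral distortion in dB between two aligned
    mel-cepstral sequences of shape (frames, coeffs).
    MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), averaged over frames."""
    if exclude_c0:
        C_ref, C_syn = C_ref[:, 1:], C_syn[:, 1:]
    diff2 = ((C_ref - C_syn) ** 2).sum(axis=1)
    return float(np.mean((10.0 / np.log(10)) * np.sqrt(2.0 * diff2)))
```

Lower values indicate synthesized (or postfiltered) speech closer to the natural reference.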
CLADAG 2021 BOOK OF ABSTRACTS AND SHORT PAPERS
The book collects the short papers presented at the 13th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS). The meeting was organized by the Department of Statistics, Computer Science and Applications of the University of Florence, under the auspices of the Italian Statistical Society and the International Federation of Classification Societies (IFCS). CLADAG is a member of the IFCS, a federation of national, regional, and linguistically-based classification societies. It is a non-profit, non-political scientific organization whose aim is to further classification research.
Probabilistic forecasting and interpretability in power load applications
Power load forecasting is a fundamental tool in the modern electric power generation
and distribution industry. The ability to accurately predict future behaviours of the grid,
both in the short and long term, is vital in order to adequately meet demand and scaling
requirements. Over the past few decades Machine Learning (ML) has taken center stage
in this context, with an emphasis on short-term forecasting using both traditional ML
as well as Deep-Learning (DL) models. In this dissertation, we approach forecasting not
only from the angle of improving predictive accuracy, but also with the goal of gaining
interpretability of the behavior of the electric load through models that can offer deeper
insight and extract useful information. Specifically for this reason, we focus on the use of
probabilistic models, which can shed light on valuable information about the underlying
structure of the data through the interpretation of their parameters. Furthermore, the use
of probabilistic models intrinsically provides us with a way of measuring the confidence in
our predictions through the predictive variance. Throughout the dissertation we shall focus
on two specific ideas within the greater field of power load forecasting, which will comprise
our main contributions.
The first contribution addresses the notion of power load profiling, in which ML is used
to identify profiles that represent distinct behaviours in the power load data. These profiles
have two fundamental uses: first, they can be valuable interpretability tools, as they offer
simple yet powerful descriptions of the underlying patterns hidden in the time series data;
second, they can improve forecasting accuracy by allowing us to train specialized predictive
models tailored to each individual profile. However, in most of the literature profiling
and prediction are typically performed sequentially, with an initial clustering algorithm
identifying profiles in the input data and a subsequent prediction stage where independent
regressors are trained on each profile. In this dissertation we propose a novel probabilistic
approach that couples both the profiling and predictive stages by jointly fitting a clustering
model and multiple linear regressors. In training, both the clustering of the input data
and the fitting of the regressors to the output data influence each other through a joint
likelihood function, resulting in a set of clusters that is much better suited to the prediction
task and is therefore much more relevant and informative. The model is tested on two real
world power load databases, provided by the regional transmission organizations ISO New
England and PJM Interconnection LLC, in a 24-hour-ahead prediction scenario. We achieve
better performance than other state-of-the-art approaches while arriving at more consistent and informative profiles of the power load data.
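The coupling of profiling and prediction through a joint likelihood can be illustrated with an EM algorithm for a mixture of linear regressions, where the cluster responsibilities and the per-cluster regressors update each other. This is a generic sketch of that family of models, not the dissertation's exact formulation; the Gaussian noise model and restart strategy are assumptions:

```python
import numpy as np

def mixture_of_regressions(X, y, K=2, n_iter=100, n_init=5, seed=0):
    """EM for a K-component mixture of linear regressions: the clustering of
    inputs and the per-cluster regression fits are coupled through one joint
    likelihood, echoing the profiling-plus-prediction idea above."""
    rng = np.random.default_rng(seed)
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    p = Xd.shape[1]
    best = (-np.inf, None)
    for _ in range(n_init):
        beta = rng.normal(size=(K, p))
        w = np.full(K, 1.0 / K)
        sigma2 = np.full(K, y.var() + 1e-6)
        for _ in range(n_iter):
            # E-step: responsibilities under each Gaussian regression component
            res2 = (y[:, None] - Xd @ beta.T) ** 2
            log_p = np.log(w) - 0.5 * np.log(2 * np.pi * sigma2) - res2 / (2 * sigma2)
            m = log_p.max(axis=1, keepdims=True)
            ll = (m[:, 0] + np.log(np.exp(log_p - m).sum(axis=1))).sum()
            r = np.exp(log_p - m)
            r /= r.sum(axis=1, keepdims=True)
            # M-step: responsibility-weighted least squares per component
            w = r.mean(axis=0)
            for k in range(K):
                A = Xd * r[:, k][:, None]
                beta[k] = np.linalg.solve(A.T @ Xd + 1e-8 * np.eye(p), A.T @ y)
                sigma2[k] = (r[:, k] * (y - Xd @ beta[k]) ** 2).sum() / r[:, k].sum() + 1e-8
        if ll > best[0]:
            best = (ll, (w.copy(), beta.copy(), sigma2.copy()))
    return best[1]
```

Because the regression residuals feed back into the responsibilities, the resulting clusters are shaped by the prediction task rather than by input similarity alone.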
Our second contribution applies the idea of multi-task prediction to the context of 24-
hour ahead forecasting. In a multi-task prediction problem there are multiple outputs that
are assumed to be correlated in some way. Identifying and exploiting these relationships can
result in much better performance as well as a better understanding of a multi-task problem.
Even though the load forecasting literature is scarce on this subject, it seems
natural to assume that there exist important correlations between the outputs in
a 24-hour prediction scenario. To tackle this, we develop a multi-task Gaussian process model that addresses
the relationships between the outputs by assuming the existence of, and subsequently
estimating, both an inter-task covariance matrix and a multitask noise covariance matrix
that capture these important interactions. Our model improves on other multi-task Gaussian
process approaches in that it greatly reduces the number of parameters to be inferred
while maintaining the interpretability provided by the estimation and visualization of the
multi-task covariance matrices. We first test our model on a wide selection of general
synthetic and real world multi-task problems with excellent results. We then apply it to
a 24-hour ahead power load forecasting scenario using the ISO New England database,
outperforming other standard multi-task Gaussian processes and providing very useful
visual information through the estimation of the covariance matrices.
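The inter-task and multi-task noise covariance structure described for the second contribution can be sketched as an intrinsic-coregionalization-style multi-task GP. Here the RBF kernel, fixed lengthscale, and fixed (rather than learned) covariance matrices are simplifying assumptions; the dissertation's model estimates these quantities:

```python
import numpy as np

def rbf(X1, X2, ls=1.0):
    """Squared-exponential kernel between two input sets."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def multitask_gp_posterior(X, Y, Xs, B, Sn, ls=1.0):
    """Posterior mean of a multi-task GP with
    cov((x,t),(x',t')) = B[t,t'] k(x,x') + Sn[t,t'] delta(x,x').
    X: (n,d) train inputs, Y: (n,T) train outputs, Xs: (m,d) test inputs,
    B: (T,T) inter-task covariance, Sn: (T,T) multi-task noise covariance."""
    n, T = Y.shape
    Kx = rbf(X, X, ls)
    # joint train covariance in task-major ordering
    K = np.kron(B, Kx) + np.kron(Sn, np.eye(n))
    Ks = np.kron(B, rbf(X, Xs, ls))                 # train-test cross-covariance
    alpha = np.linalg.solve(K + 1e-8 * np.eye(n * T), Y.T.reshape(-1))
    mean = Ks.T @ alpha                              # (m*T,) task-major
    return mean.reshape(T, -1).T                     # (m, T)
```

Estimating and visualizing B and Sn is what provides the interpretability discussed above: large off-diagonal entries of B indicate hours of the day whose loads move together.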
Untangling hotel industry’s inefficiency: An SFA approach applied to a renowned Portuguese hotel chain
The present paper explores the technical efficiency of four hotels from the Teixeira Duarte Group, a renowned Portuguese hotel chain. An efficiency ranking is established for these four hotel units located in Portugal using Stochastic Frontier Analysis. This methodology makes it possible to discriminate between measurement error and systematic inefficiencies in the estimation process, enabling investigation of the main causes of inefficiency. Several suggestions concerning efficiency improvement are made for each hotel studied.
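Stochastic Frontier Analysis decomposes the regression error into symmetric measurement noise and a one-sided inefficiency term. A minimal sketch of the classical normal/half-normal production-frontier likelihood, fitted by maximum likelihood (the paper's exact specification, frontier form, and estimator settings are not given, so everything below is illustrative):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def sfa_half_normal(X, y):
    """ML estimation of a production frontier y = Xb + v - u, where
    v ~ N(0, s_v^2) is noise and u ~ |N(0, s_u^2)| is inefficiency.
    Returns (beta, s_v, s_u)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    p = Xd.shape[1]

    def nll(theta):
        beta, ls_v, ls_u = theta[:p], theta[p], theta[p + 1]
        s_v, s_u = np.exp(ls_v), np.exp(ls_u)       # log-params keep scales > 0
        sigma = np.hypot(s_v, s_u)
        lam = s_u / s_v
        eps = y - Xd @ beta
        # normal / half-normal composed-error log-density
        ll = (np.log(2) - np.log(sigma) + norm.logpdf(eps / sigma)
              + norm.logcdf(-eps * lam / sigma))
        return -ll.sum()

    b0, *_ = np.linalg.lstsq(Xd, y, rcond=None)     # OLS warm start
    theta0 = np.concatenate([b0, [np.log(y.std()), np.log(y.std())]])
    res = minimize(nll, theta0, method="Nelder-Mead",
                   options={"maxiter": 5000, "xatol": 1e-8, "fatol": 1e-8})
    return res.x[:p], np.exp(res.x[p]), np.exp(res.x[p + 1])
```

Separating s_v (measurement error) from s_u (systematic inefficiency) is exactly the discrimination the abstract attributes to the methodology; unit-level efficiency scores would follow from the conditional distribution of u given the residual.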