95 research outputs found
Computational Methods for the Analysis of Complex Data
This PhD dissertation bridges the disciplines of Operations Research and Statistics to develop
novel computational methods for the extraction of knowledge from complex data. In this research,
complex data refers to datasets with many instances and/or variables, with different
types of variables, with dependence structures among the variables, collected from different
sources (heterogeneous), possibly with non-identical population class sizes, with different misclassification
costs, or characterized by extreme instances (heavy-tailed data), among others.
Recently, the complexity of raw data, together with new demands posed by practitioners
(interpretable models, cost-sensitive models, or models that are efficient in terms of running
time), has presented a challenge from a scientific perspective. The main contributions of this PhD dissertation
are encompassed in three different research frameworks: Regression, Classification
and Bayesian inference. Concerning the first, we consider linear regression models, where a
continuous outcome variable is to be predicted by a set of features. On the one hand, seeking
interpretable solutions in heterogeneous datasets, we propose a novel version of the Lasso
in which the performance of the method on groups of interest is controlled. On the other hand,
we use mathematical optimization tools to propose a sparse linear regression model (that is, a
model whose solution only depends on a subset of predictors) specifically designed for datasets
with categorical and hierarchical features. Regarding the task of Classification, in this PhD dissertation
we have explored the Naïve Bayes classifier in depth. This method has been adapted
to obtain a sparse solution and modified to deal with cost-sensitive datasets.
For both problems, novel strategies for reducing high running times are presented. Finally, the
last contribution of this dissertation concerns Bayesian inference methods. In particular, in the
setting of heavy-tailed data, we consider a semi-parametric Bayesian approach to estimate the
Elliptical distribution.
The structure of this dissertation is as follows. Chapter 1 contains the theoretical background
needed to develop the following chapters. In particular, two main research areas are
reviewed: sparse and cost-sensitive statistical learning and Bayesian Statistics.
Chapter 2 proposes a Lasso-based method in which quadratic performance constraints to
bound the prediction errors in the individuals of interest are added to Lasso-based objective
functions. This constrained sparse regression model is defined by a nonlinear optimization
problem. Specifically, it has a direct application in heterogeneous samples where data are collected from distinct sources, as is standard in many biomedical contexts.
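As a rough sketch of this idea (not the dissertation's exact formulation, which solves the constrained problem directly), the group-performance constraints can be handled with a quadratic penalty inside a proximal-gradient (ISTA) Lasso solver; the names `groups`, `thresholds` and all parameter values below are illustrative assumptions:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1 (component-wise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def constrained_lasso(X, y, groups, thresholds, lam=0.1, rho=10.0, n_iter=3000):
    """Sketch of a Lasso with quadratic group-performance constraints:
        minimize (1/n)||y - Xb||^2 + lam*||b||_1
        s.t.     (1/n_g)||y_g - X_g b||^2 <= thresholds[g] for each group g,
    handled here with a quadratic penalty (weight rho) on violated
    constraints inside proximal-gradient iterations with backtracking."""
    n, p = X.shape

    def smooth(b):
        # fit term plus quadratic penalty on violated group constraints
        val = np.mean((y - X @ b) ** 2)
        for g, f_g in zip(groups, thresholds):
            v = max(np.mean((y[g] - X[g] @ b) ** 2) - f_g, 0.0)
            val += 0.5 * rho * v ** 2
        return val

    def grad(b):
        g_ = -2.0 / n * X.T @ (y - X @ b)
        for g, f_g in zip(groups, thresholds):
            Xg, yg = X[g], y[g]
            v = np.mean((yg - Xg @ b) ** 2) - f_g
            if v > 0:  # penalize only violated constraints
                g_ += rho * v * (-2.0 / len(g)) * Xg.T @ (yg - Xg @ b)
        return g_

    b, step = np.zeros(p), 1.0
    for _ in range(n_iter):
        gb = grad(b)
        while True:  # backtracking line search on the proximal step
            b_new = soft_threshold(b - step * gb, step * lam)
            d = b_new - b
            if smooth(b_new) <= smooth(b) + gb @ d + (d @ d) / (2 * step):
                break
            step *= 0.5
        b = b_new
    return b
```

The penalty weight `rho` only approximates the hard constraints; the actual method enforces them exactly within a nonlinear optimization problem.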
Chapter 3 studies linear regression models built on categorical predictor variables that have
a hierarchical structure. The model is flexible in the sense that the user decides the level of
detail in the information used to build it, taking into account data privacy considerations. To
trade off the accuracy of the linear regression model and its complexity, a Mixed Integer Convex
Quadratic Problem with Linear Constraints is solved.
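The accuracy-versus-complexity trade-off can be conveyed with a toy example in which a categorical predictor is encoded either at its fine level or aggregated to a coarser level of the hierarchy; the scoring rule (mean squared error plus a per-coefficient charge) and the helper names below are assumptions for illustration, not the Mixed Integer Convex Quadratic formulation itself:

```python
import numpy as np

def one_hot(labels):
    """One-hot encode a 1-D sequence of category labels."""
    cats = sorted(set(labels))
    return np.array([[1.0 if l == c else 0.0 for c in cats] for l in labels])

def level_tradeoff(y, fine, coarse_of, alpha=0.05):
    """Toy illustration of the trade-off: fit least squares on a
    categorical predictor either at its fine level or aggregated via the
    mapping coarse_of, and score each option with
        MSE + alpha * (number of coefficients).
    Returns the winning level and the scores of both."""
    results = {}
    for name, labels in [("fine", list(fine)),
                         ("coarse", [coarse_of[l] for l in fine])]:
        Z = one_hot(labels)
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        mse = np.mean((y - Z @ beta) ** 2)
        results[name] = mse + alpha * Z.shape[1]
    return min(results, key=results.get), results
```

When the outcome depends only on the coarse grouping, the coarser encoding wins the trade-off because it matches the fit of the fine encoding with fewer coefficients.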
In Chapter 4, a sparse version of the Naïve Bayes classifier, which is characterized by the
following three properties, is proposed. First, the selection of the subset of variables
is done in terms of the correlation structure of the predictor variables. Second, such
selection can be guided by different performance measures. Third, performance constraints
on groups of higher interest can be included. This smart search integrates flexibility
in the choice of classification performance measure while yielding competitive running times.
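A hypothetical greedy variant conveys the flavour of such a correlation-aware search (the dissertation's actual procedure differs): candidates too correlated with already selected features are skipped, and training accuracy stands in as the assumed performance measure:

```python
import numpy as np

def gaussian_nb_fit(X, y):
    """Per-class means, variances and priors for a Gaussian Naive Bayes."""
    classes = np.unique(y)
    stats = {c: (X[y == c].mean(0), X[y == c].var(0) + 1e-9,
                 np.mean(y == c)) for c in classes}
    return classes, stats

def gaussian_nb_predict(X, classes, stats):
    """Pick the class maximizing the log Naive Bayes posterior."""
    scores = []
    for c in classes:
        mu, var, prior = stats[c]
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var,
                           axis=1)
        scores.append(ll + np.log(prior))
    return classes[np.argmax(np.array(scores), axis=0)]

def sparse_nb(X, y, max_corr=0.9, k=3):
    """Sketch of a correlation-aware sparse Naive Bayes: greedily add the
    feature that most improves training accuracy, skipping candidates
    whose absolute correlation with an already-selected feature exceeds
    max_corr, until k features are chosen."""
    R = np.corrcoef(X, rowvar=False)
    selected = []
    while len(selected) < k:
        best, best_acc = None, -1.0
        for j in range(X.shape[1]):
            if j in selected:
                continue
            if any(abs(R[j, s]) > max_corr for s in selected):
                continue  # too correlated: violates conditional independence
            trial = selected + [j]
            classes, stats = gaussian_nb_fit(X[:, trial], y)
            acc = np.mean(gaussian_nb_predict(X[:, trial], classes,
                                              stats) == y)
            if acc > best_acc:
                best, best_acc = j, acc
        if best is None:
            break
        selected.append(best)
    return selected
```

On data with a near-duplicate informative feature, the search keeps one copy and fills the remaining slots from the uncorrelated candidates.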
The approach introduced in Chapter 2 is also explored in Chapter 5 for improving the performance
of the Naïve Bayes classifier in the classes of most interest to the user. Unlike the traditional
version of the classifier, which is a two-step classifier (estimation first and classification
next), the novel approach integrates both stages. The method is formulated via an optimization
problem where the likelihood function is maximized with constraints on the classification rates
for the groups of interest.
When dealing with datasets with special characteristics (for example, heavy tails in contexts
such as Economics and Finance), Bayesian statistical techniques have shown their potential in the
literature. In Chapter 6, Elliptical distributions, which are generalizations of the multivariate
normal distribution to both longer tails and elliptical contours, are examined, and Bayesian
methods to perform semi-parametric inference for them are used.
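For intuition, heavy-tailed elliptical distributions such as the multivariate Student-t can be generated as normal scale mixtures; the sketch below only illustrates this generative structure, not the semi-parametric Bayesian inference itself:

```python
import numpy as np

def sample_elliptical_t(n, mu, Sigma, df, rng=None):
    """Sample from a multivariate Student-t, a classic heavy-tailed
    elliptical distribution, via its normal scale-mixture representation
        X = mu + sqrt(W) * A Z,  Z ~ N(0, I),  A A^T = Sigma,
        W = df / chi2_df.
    A semi-parametric Bayesian approach would place a nonparametric prior
    on the mixing distribution of W instead of fixing it."""
    rng = np.random.default_rng(rng)
    d = len(mu)
    A = np.linalg.cholesky(Sigma)
    Z = rng.standard_normal((n, d))
    W = df / rng.chisquare(df, size=n)  # heavy-tailed mixing weights
    return mu + np.sqrt(W)[:, None] * (Z @ A.T)
```

Varying the law of W sweeps through the elliptical family while the contours of the density stay elliptical, which is exactly the structure exploited by the semi-parametric approach.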
Finally, Chapter 7 closes the thesis with general conclusions and future lines of research.
North Adriatic Dense Water: what we know 60 years after the pioneering work of Mira Zore-Armanda
This review first pays tribute to the famous Croatian oceanographer, Mira Zore-Armanda, and her seminal work on the
Adriatic water masses in 1963, and emphasises the importance of the densest Mediterranean water mass: North Adriatic Dense
Water (NAddW). This water mass is generated through substantial wintertime surface cooling and evaporation over the wide
northern Adriatic and is known to (1) influence the Adriatic-Ionian thermohaline circulation, (2) bring oxygen and carbon to the
deep Adriatic layers and, (3) more generally, have a substantial impact on the physics and biogeochemistry of the whole Adriatic.
Second, the NAddW physics, from preconditioning, through generation and spreading, to accumulation in Adriatic depressions,
is reviewed. Then, the temporal evolution of the NAddW properties, influenced by and connected to (1) basin-wide interannual and
decadal variability and (2) trends towards warmer and saltier source characteristics due to ongoing climate change, is discussed. The
importance of long-term observations and atmosphere-ocean modelling in event, decadal and climate studies is then presented.
Finally, a review of the identified gaps and perspectives for future research concludes this article.
This paper pays tribute to the Croatian oceanographer Mira Zore-Armanda and her pioneering 1963 work on the Adriatic water masses, and describes in detail the North Adriatic Dense Water, the densest water mass in the Mediterranean. This water mass is generated during pronounced wintertime outbreaks of cold and dry air over the northern Adriatic, and is known to (1) drive the Adriatic-Ionian thermohaline circulation, (2) bring oxygen and carbon compounds to the deep layers of the Adriatic, and (3) have a substantial impact on the physical and biogeochemical properties of the whole Adriatic. The paper reviews recent research on the dynamics of the North Adriatic Dense Water in all its phases, from preconditioning, through generation and spreading, to accumulation in the Adriatic depressions. Changes in the source properties (temperature and salinity) of the North Adriatic Dense Water are then discussed and linked to (1) interannual and decadal oscillations in the Adriatic as well as (2) thermohaline trends arising from climate change. The review emphasises the importance of long-term observations and of developing numerical atmosphere and ocean models, intended both for studying the individual events during which the water is formed and for quantifying trends and variability on decadal and climate scales. Finally, an overview of potential research topics related to the North Adriatic Dense Water that could advance the understanding of oceanographic processes in the Adriatic is given.
Stochastic Surrogate Model for Meteotsunami Early Warning System in the Eastern Adriatic Sea
The meteotsunami early warning system prototype using a stochastic surrogate approach and running operationally in the eastern Adriatic Sea is presented. First, the atmospheric internal gravity waves (IGWs) driving the meteotsunamis are either forecasted with state-of-the-art deterministic models at least a day in advance or detected through measurements at least 2 hr before the meteotsunami reaches sensitive locations. The extreme sea-level hazard forecast at endangered locations is then derived with an innovative stochastic surrogate model, implemented with the generalized polynomial chaos expansion (gPCE) method and synthetic IGWs forcing a barotropic ocean model, used with the input parameters extracted from deterministic model results and/or measurements. The evaluation of the system, both against five historical events and for all the detected potential meteotsunamis since late 2018 when the early warning system prototype became operational, reveals that the meteotsunami hazard is conservatively assessed but often overestimated at some locations. Despite some needed improvements and developments, this study demonstrates that gPCE-based methods can be used for atmospherically driven extreme sea-level hazard assessment and in geosciences more widely.
Plain Language Summary: Atmospherically driven extreme sea-level events are one of the major threats to people and assets in coastal regions. Assessing the hazard associated with such events, together with uncertainty quantification, in a precise and timely manner is thus of primary importance in modern societies. In this study, an early warning system for the eastern Adriatic meteotsunamis, destructive long waves with periods from a few minutes up to an hour generated by traveling atmospheric disturbances, is presented and evaluated. The system is based on state-of-the-art deterministic atmospheric and ocean models as well as an innovative statistical model developed to forecast the meteotsunami hazard. The evaluation reveals that the meteotsunami hazard is conservatively assessed but often overestimated. This study demonstrates that the presented methodology can be used for extreme sea-level hazard assessment and in general for hazard studies in geosciences.
Key Points:
- Design and evaluation of an innovative meteotsunami early warning system prototype using a stochastic surrogate approach
- Forecast of the atmospheric internal gravity waves driving meteotsunami events with deterministic state-of-the-art models
- Stochastic surrogate model based on generalized polynomial chaos expansion methods and running at nearly no computational cost
Peer Reviewed
https://deepblue.lib.umich.edu/bitstream/2027.42/152998/1/jgrc23744.pdf
https://deepblue.lib.umich.edu/bitstream/2027.42/152998/2/jgrc23744_am.pd
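The surrogate idea can be illustrated in one dimension: fit an expansion in orthogonal polynomials of a random input to a handful of expensive model runs, then evaluate the cheap expansion instead of the model. The sketch below assumes a single standard-normal input and probabilists' Hermite polynomials, unlike the operational multivariate setting over the IGW forcing parameters:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermevander

def fit_gpce(model, xi_train, degree):
    """Sketch of a 1-D generalized polynomial chaos expansion surrogate:
    for a standard-normal input xi, expand the model output in
    probabilists' Hermite polynomials He_k and fit the coefficients by
    least squares on a small set of (expensive) model evaluations.
    Returns a cheap callable surrogate."""
    V = hermevander(xi_train, degree)  # columns: He_0(xi) ... He_deg(xi)
    coeffs, *_ = np.linalg.lstsq(V, model(xi_train), rcond=None)
    return lambda xi: hermevander(np.atleast_1d(xi), degree) @ coeffs
```

Once the coefficients are fitted, hazard statistics can be estimated by evaluating the surrogate on thousands of input samples at nearly no computational cost, which is the property the early warning system exploits.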
Performance of the Adriatic Sea and Coast (AdriSC) climate component – a COAWST V3.3-based coupled atmosphere–ocean modelling suite: atmospheric dataset
In this evaluation study, the coupled atmosphere–ocean Adriatic Sea and Coast (AdriSC) climate model, which was implemented to carry out 31-year evaluation and climate projection simulations in the Adriatic and northern Ionian seas, is briefly presented. The kilometre-scale AdriSC atmospheric results, derived with the Weather Research and Forecasting (WRF) 3 km model for the 1987–2017 period, are then thoroughly compared to a comprehensive, publicly and freely available observational dataset. The evaluation shows that overall, except for the summer surface temperatures, which are systematically underestimated, the AdriSC WRF 3 km model has a far better capacity to reproduce surface climate variables (and particularly the rain) than the WRF regional climate models at 0.11° resolution. In addition, several spurious data points were found in both gridded products and in situ measurements, which should thus be used with care in the Adriatic region for climate studies at local and regional scales. Long-term simulations with the AdriSC climate model, which couples the WRF 3 km model with a 1 km ocean model, might thus be a new avenue to substantially improve the reproduction, at the climate scale, of the Adriatic Sea dynamics driving the Eastern Mediterranean thermohaline circulation. As such, it may also provide new standards for climate studies of orographically developed coastal regions in general.
Variable selection for Naive Bayes classification
The Naive Bayes has proven to be a tractable and efficient method for classification in multivariate analysis. However, features are usually correlated, a fact that violates the Naive Bayes' assumption of conditional independence and may deteriorate the method's performance. Moreover, datasets are often characterized by a large number of features, which may complicate the interpretation of the results as well as slow down the method's execution. In this paper we propose a sparse version of the Naive Bayes classifier that is characterized by three properties. First, the sparsity is achieved taking into account the correlation structure of the covariates. Second, different performance measures can be used to guide the selection of features. Third, performance constraints on groups of higher interest can be included. Our proposal leads to a smart search that yields competitive running times while integrating flexibility in the choice of classification performance measure. Our findings show that, when compared against well-referenced feature selection approaches, the proposed sparse Naive Bayes obtains competitive results regarding accuracy, sparsity and running times for balanced datasets. In the case of datasets with unbalanced (or with different importance) classes, a better compromise between classification rates for the different classes is achieved.
This research is partially supported by research grants and projects MTM2015-65915-R (Ministerio de Economia y Competitividad, Spain) and PID2019-110886RB-I00 (Ministerio de Ciencia, Innovacion y Universidades, Spain), FQM-329 and P18-FR-2369 (Junta de Andalucia, Spain), PR2019-029 (Universidad de Cadiz, Spain), Fundacion BBVA and EC H2020 MSCA RISE NeEDS Project (Grant agreement ID: 822214). This support is gratefully acknowledged.
A new multivariate data analysis model: constrained Naïve Bayes
Universidad de Sevilla. Máster Universitario en Matemática
A cost-sensitive constrained Lasso
The Lasso has become a benchmark data analysis procedure, and numerous variants have been proposed in the literature. Although the Lasso formulations are stated so that overall prediction error is optimized, no full control over the prediction accuracy on certain individuals of interest is allowed.
In this work we propose a novel version of the Lasso in which quadratic performance constraints are added to Lasso-based objective functions, in such a way that threshold values are set to bound the prediction errors in the different groups of interest (not necessarily disjoint). As a result, a constrained sparse regression model is defined by a nonlinear optimization problem. This cost-sensitive constrained Lasso has a direct application in heterogeneous samples where data are collected from distinct sources, as is standard in many biomedical contexts. Both theoretical properties and empirical studies concerning the new method are explored in this paper. In addition, two illustrations of the method on biomedical and sociological contexts are considered.
- …