11 research outputs found

    Unsupervised Learning via Mixtures of Skewed Distributions with Hypercube Contours

    Full text link
    Mixture models whose components have skewed hypercube contours are developed via a generalization of the multivariate shifted asymmetric Laplace density. Specifically, we develop mixtures of multiple scaled shifted asymmetric Laplace distributions. The component densities have two unique features: they include a multivariate weight function, and the marginal distributions are also asymmetric Laplace. We use these mixtures of multiple scaled shifted asymmetric Laplace distributions for clustering applications, but they could equally well be used in the supervised or semi-supervised paradigms. The expectation-maximization algorithm is used for parameter estimation and the Bayesian information criterion is used for model selection. Simulated and real data sets are used to illustrate the approach and, in some cases, to visualize the skewed hypercube structure of the components.
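    The multiple scaled shifted asymmetric Laplace mixtures described above are not available in standard Python libraries; as a rough stand-in, the same workflow (EM fitting plus BIC-based selection of the number of components) can be sketched on synthetic data with scikit-learn's Gaussian mixtures:

    ```python
    # Sketch of the EM-fit + BIC-selection workflow on synthetic data,
    # using Gaussian mixtures as a stand-in for the paper's multiple
    # scaled shifted asymmetric Laplace mixtures.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Two well-separated synthetic clusters in 2-D.
    X = np.vstack([
        rng.normal(loc=[0, 0], scale=0.5, size=(200, 2)),
        rng.normal(loc=[4, 4], scale=0.5, size=(200, 2)),
    ])

    # Fit models with G = 1..5 components; keep the one minimizing BIC.
    fits = {g: GaussianMixture(n_components=g, random_state=0).fit(X)
            for g in range(1, 6)}
    best_g = min(fits, key=lambda g: fits[g].bic(X))
    labels = fits[best_g].predict(X)
    print(best_g)
    ```

    On data this well separated, BIC recovers the two clusters; with skewed components, a Gaussian mixture would need the paper's asymmetric component densities to avoid over-fitting extra components.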

    Hidden semi-Markov-switching quantile regression for time series

    Get PDF
    A hidden semi-Markov-switching quantile regression model is introduced as an extension of the hidden Markov-switching one. The proposed model allows for arbitrary sojourn-time distributions in the states of the Markov-switching chain. Parameter estimation is carried out via maximum likelihood using the asymmetric Laplace distribution. As a by-product of the model specification, the formulae and methods for forecasting, state prediction, decoding, and model checking that exist for ordinary hidden Markov-switching models can be applied to the proposed model. A simulation study covering several experimental settings is performed to investigate the behaviour of the proposed model. The empirical analysis studies the relationship between the stock index from the emerging market of China and those from the advanced markets, and investigates the determinants of high levels of pollution in a small Italian city.
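    The link exploited above is that maximizing an asymmetric Laplace likelihood at quantile level τ is equivalent to minimizing the check (pinball) loss. A minimal illustration on synthetic data, without the Markov-switching layer:

    ```python
    # Quantile regression via the check (pinball) loss, whose minimization
    # is equivalent to maximizing an asymmetric Laplace likelihood.
    # Synthetic data; no hidden semi-Markov structure here.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, size=500)
    y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=500)

    def check_loss(beta, tau):
        """Sum of pinball losses rho_tau(u) = u * (tau - 1{u < 0})."""
        u = y - (beta[0] + beta[1] * x)
        return np.sum(u * (tau - (u < 0)))

    # tau = 0.5 gives median regression (least absolute deviations).
    res = minimize(check_loss, x0=[0.0, 0.0], args=(0.5,),
                   method="Nelder-Mead")
    intercept, slope = res.x
    print(round(intercept, 2), round(slope, 2))
    ```

    The fitted intercept and slope should land near the true values (2.0, 0.5); other τ values trace out the corresponding conditional quantiles.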

    Bayesian Cluster Analysis

    Get PDF

    Quantile hidden semi-Markov models for multivariate time series

    Get PDF
    This paper develops a quantile hidden semi-Markov regression to jointly estimate multiple quantiles for the analysis of multivariate time series. The approach is based upon the Multivariate Asymmetric Laplace (MAL) distribution, which allows the quantiles of all univariate conditional distributions of a multivariate response to be modelled simultaneously, incorporating the correlation structure among the outcomes. Unobserved serial heterogeneity across observations is modelled by introducing regime-dependent parameters that evolve according to a latent finite-state semi-Markov chain. Exploiting the hierarchical representation of the MAL, inference is carried out using an efficient Expectation-Maximization algorithm based on closed-form updates for all model parameters, without parametric assumptions about the states' sojourn distributions. The validity of the proposed methodology is analysed both by a simulation study and through the empirical analysis of air pollutant concentrations in a small Italian city.
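    The MAL-based joint quantile model above has no off-the-shelf Python implementation; as a rough univariate stand-in on synthetic data, separate regressions at several quantile levels τ can be fitted with scikit-learn:

    ```python
    # Univariate stand-in for joint multiple-quantile estimation:
    # separate pinball-loss regressions at several levels tau
    # (the paper's MAL model fits all quantiles jointly instead).
    import numpy as np
    from sklearn.linear_model import QuantileRegressor

    rng = np.random.default_rng(5)
    x = rng.uniform(0, 10, size=400).reshape(-1, 1)
    y = 1.0 + 0.8 * x.ravel() + rng.normal(scale=1.0, size=400)

    coefs = {}
    for tau in (0.25, 0.50, 0.75):
        qr = QuantileRegressor(quantile=tau, alpha=0.0).fit(x, y)
        coefs[tau] = (qr.intercept_, qr.coef_[0])
        print(tau, round(qr.intercept_, 2), round(qr.coef_[0], 2))
    ```

    With homoskedastic noise the slopes agree across τ while the intercepts shift, ordering the fitted quantile lines; the MAL formulation additionally shares the correlation structure across outcomes.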

    Model-based clustering for High Dimensional genomic data

    Get PDF
    Model-based clustering has found an increasing number of applications, as it allows for proper statistical inference, in contrast to distance-based approaches. Working with high-dimensional data leads to certain challenges, the most common being over-parametrization. Such high-dimensional data occur very often in genomic experiments, so clustering individuals/patients is a challenging but important task.
The current thesis includes a review of finite mixture models of various distributions. Factor analysis is considered a reliable tool for dealing with high dimensions; hence, some extensions of the mixtures of factor analyzers (MFA) model are presented. For each model reviewed, examples of model-based clustering are given using R packages on three different datasets. The thesis concludes with an application of the extensions of MFA models to a real dataset from a microarray experiment with women with breast cancer. The gene-selection step of the EMMIX-GENE algorithm is briefly discussed and used for dimension reduction, with different values of the threshold leading to sub-samples of different sizes. The results of this application, and the failure to distinguish the women's tissue samples into the desired groups of "good prognosis" and "bad prognosis", agree with previous attempts to tackle the same problem. The application of these techniques to other kinds of genomic data is recommended for future research.
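    The MFA idea sketched above (clustering in a low-dimensional factor space rather than the full gene space) can be illustrated, very loosely, by a two-step stand-in: factor scores followed by a Gaussian mixture. This is not the joint MFA likelihood fitted by the R packages the thesis reviews, just a synthetic-data sketch of the concept:

    ```python
    # Two-step stand-in for mixtures of factor analyzers on synthetic
    # "microarray-like" data: reduce with factor analysis, then cluster
    # the factor scores with a Gaussian mixture.
    import numpy as np
    from sklearn.decomposition import FactorAnalysis
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(2)
    # Toy data: 2 groups, 50 samples each, 50 "genes";
    # the groups differ in mean on the first 10 features.
    means = np.zeros((2, 50))
    means[1, :10] = 3.0
    X = np.vstack([rng.normal(means[g], 1.0, size=(50, 50))
                   for g in (0, 1)])

    scores = FactorAnalysis(n_components=3, random_state=0).fit_transform(X)
    labels = GaussianMixture(n_components=2,
                             random_state=0).fit_predict(scores)
    print(np.bincount(labels))
    ```

    A genuine MFA fits the factor loadings and mixture jointly within one likelihood, which is why the thesis relies on dedicated R packages rather than this two-step shortcut.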

    Methods for Merging, Parsimony and Interpretability of Finite Mixture Models

    Get PDF
    To combat increasing data dimensionality, parsimonious modelling for finite mixture models has become an active research area. These modelling frameworks offer various constraints that can reduce the number of free parameters in a finite mixture model. However, the constraint selection process is not always clear to the user, and the relationship between the chosen constraint and the data set is often left unexplained. Such issues adversely affect the interpretability of the fitted model: one may end up with a model with a reduced number of free parameters, but how it was selected, and what the parameter-reducing constraints mean, remain mysterious. Over-estimation of the number of mixture components is another way in which model interpretability may suffer. When the individual components of a mixture model fail to capture the underlying clusters of a data set adequately, the model may compensate by introducing extra components, thereby representing a single cluster with multiple components. This reality challenges the common assumption that a single component represents a cluster. Addressing these interpretability-related issues can improve the informativeness of model-based clustering, better assisting the user during exploratory analysis and/or data segmentation.
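    The parameter-reducing constraints discussed above can be illustrated with scikit-learn's Gaussian mixtures, where each `covariance_type` constrains the component covariances differently and so changes the free-parameter count and the BIC. This is only a stand-in on synthetic data; the thesis's framework is broader:

    ```python
    # Comparing parsimonious covariance constraints by BIC on synthetic
    # data: each covariance_type restricts the component covariances,
    # trading fit against the number of free parameters.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(3)
    # Two spherical clusters, so the most constrained model is "true".
    X = np.vstack([rng.normal([0, 0], [1.0, 1.0], size=(150, 2)),
                   rng.normal([5, 5], [1.0, 1.0], size=(150, 2))])

    bics = {}
    for cov in ("full", "tied", "diag", "spherical"):
        gm = GaussianMixture(n_components=2, covariance_type=cov,
                             random_state=0).fit(X)
        bics[cov] = gm.bic(X)
        print(cov, round(bics[cov], 1))
    ```

    On data generated from spherical components, the spherical constraint attains a BIC at least as good as the unconstrained fit, which is exactly the kind of data-to-constraint relationship the work argues should be made transparent to the user.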

    The distribution of autistic traits across the autism spectrum: Evidence for discontinuous dimensional subpopulations underlying the autism continuum

    Get PDF
    BACKGROUND A considerable amount of research has discussed whether autism, and psychiatric/neurodevelopmental conditions in general, are best described categorically or dimensionally. In recent years, finite mixture models have been increasingly applied to mixed populations of autistic and non-autistic individuals to answer this question. However, the use of such methods with mixed populations may not be appropriate for two reasons. First, subgroups within mixed populations are often skewed and thus violate the assumptions of mixture models based on weighted sums of Gaussian distributions. Second, these analyses have, to our knowledge, been applied solely to enriched samples, where the prevalence of the clinical condition within the study sample far exceeds epidemiological estimates. METHOD We employed a dual Weibull mixture model to examine the distribution of the Autism Spectrum Quotient scores of a mixed sample of autistic and non-autistic adults (N = 4717; autism = 811), as well as of a sample derived from the enriched sample (N = 3973; autism = 67) that reflects the current prevalence of autism within the general population. RESULTS In a mixed autistic and non-autistic population, our model provided a better description of the underlying structure of autistic traits than traditional finite Gaussian mixture models, and performed well when applied to a sample that reflected the prevalence of autism in the general population. The model yielded results consistent with predictions of current theories advocating the co-existence of a mixed categorical and dimensional architecture within the autism spectrum. CONCLUSION The results provide insight into the continuum nature of the distribution of autistic traits, support the complementary role of both categorical and dimensional approaches to autism spectrum condition, and underscore the importance of analysing samples that reflect the epidemiological prevalence of the condition. Owing to its flexibility to represent a wide variety of distributions, the Weibull distribution might be better suited for latent-structure studies within enriched and prevalence-true samples.
The project leading to this application has received funding from the Innovative Medicines Initiative 2 Joint Undertaking (JU) under grant agreement No 777394. The JU receives support from the European Union's Horizon 2020 research and innovation programme, EFPIA, AUTISM SPEAKS, Autistica, and SFARI. This work was supported by the National Institute for Health Research (NIHR) Collaboration for Leadership in Applied Health Research and Care (CLAHRC) East of England at Cambridgeshire and Peterborough NHS Foundation Trust, the Medical Research Council, the NIHR Cambridge Biomedical Research Centre, and the Autism Research Trust. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.
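    In the spirit of the dual Weibull mixture above, a two-component Weibull mixture can be fitted by EM on synthetic scores (not the actual Autism Spectrum Quotient data); since the weighted Weibull likelihood has no closed-form maximizer, each M-step is solved numerically:

    ```python
    # EM for a two-component Weibull mixture on synthetic scores,
    # sketching the kind of dual Weibull model described above.
    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import weibull_min

    x = np.concatenate([
        weibull_min.rvs(2.0, scale=1.0, size=300, random_state=1),
        weibull_min.rvs(4.0, scale=3.0, size=300, random_state=2),
    ])

    def neg_ll(logp, w):
        shape, scale = np.exp(logp)  # exp keeps parameters positive
        return -np.sum(w * weibull_min.logpdf(x, shape, scale=scale))

    # Initial guesses: mixing weight pi and (shape, scale) per component.
    pi, params = 0.5, [np.array([1.5, 0.8]), np.array([3.0, 2.5])]
    for _ in range(30):
        # E-step: responsibility of component 1 for each observation.
        d0 = weibull_min.pdf(x, params[0][0], scale=params[0][1])
        d1 = weibull_min.pdf(x, params[1][0], scale=params[1][1])
        r = pi * d1 / ((1 - pi) * d0 + pi * d1)
        # M-step: no closed form for weighted Weibull fits, so each
        # component's parameters are updated by numerical optimization.
        params[0] = np.exp(minimize(neg_ll, np.log(params[0]),
                                    args=(1 - r,), method="Nelder-Mead").x)
        params[1] = np.exp(minimize(neg_ll, np.log(params[1]),
                                    args=(r,), method="Nelder-Mead").x)
        pi = r.mean()

    print(round(pi, 2), np.round(params[0], 2), np.round(params[1], 2))
    ```

    On this balanced synthetic sample the mixing weight settles near 0.5 and the two components recover distinct shapes and scales; the paper's contribution is applying such skew-capable components where Gaussian mixtures fail.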

    Statistical and machine learning for credit and market risk management

    Get PDF
    Financial institutions play an important role in the stability of the financial sector. They act as crucial intermediaries in the provision of money and credit and in the transfer of risk between firms. This intermediary function, however, exposes financial institutions to various types of risk. Identifying and measuring these risks is particularly important in difficult times, when a distressed financial sector can lead to a decline in lending. Especially in times of economic downturn, the role of providing liquidity and credit is more important than ever. Accurately estimating the determinants of different sources of risk is therefore an extremely important task for the economy in general and for financial institutions in particular. In recent decades, computing power and storage capacity have increased considerably while costs have fallen sharply. This allows researchers and practitioners to use more advanced and computationally intensive models, which is especially relevant for machine learning models, but also for Bayesian models. This thesis examines the application of advanced statistical and machine learning methods to credit and market risk management, covered in four independent research papers. The first applies advanced Bayesian methods to study the difficult risk parameter exposure at default (EAD) and its behaviour in downturn periods. The second paper focuses on combining statistical and machine learning methods to examine various aspects of the loss given default (LGD), with a particular emphasis on explainability methods for machine learning.
The third research paper applies neural networks to the calibration of financial models, with a particular focus on their usefulness in practice. The last research paper examines in depth the nonlinearity associated with equity market movements.