
    Stratified stochastic variational inference for high-dimensional network factor model

    There has been considerable recent interest in Bayesian modeling of high-dimensional networks via latent space approaches. As the number of nodes increases, estimation based on Markov chain Monte Carlo can become extremely slow and show poor mixing, motivating research on alternative algorithms that scale well in high-dimensional settings. In this article, we focus on the latent factor model, a widely used approach for latent space modeling of network data. We develop scalable algorithms to conduct approximate Bayesian inference via stochastic optimization. Leveraging sparse representations of network data, the proposed algorithms yield substantial computational and storage benefits, and allow inference to be conducted in settings with thousands of nodes.
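    As a rough illustration of the kind of model involved, the sketch below fits a simple latent factor network model by stochastic gradient ascent over random minibatches of dyads. All choices here (dimension, fixed intercept, step size, batch size, prior scale) are illustrative assumptions, not the paper's algorithm, which uses stratified stochastic variational inference:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Simulate a small undirected network from a 2-dimensional latent factor
# model: edge i~j occurs with probability sigmoid(z_i' z_j - 1).
n, d = 60, 2
Z_true = rng.normal(scale=0.8, size=(n, d))
A = (rng.random((n, n)) < sigmoid(Z_true @ Z_true.T - 1.0)).astype(float)
A = np.triu(A, 1)
A = A + A.T  # symmetric, no self-loops

# Stochastic MAP estimation: each step uses the Bernoulli log-likelihood
# gradient on a random minibatch of dyads, rescaled to the full set of
# pairs, plus a Gaussian prior on the latent factors.
Z = rng.normal(scale=0.1, size=(n, d))
iu, ju = np.triu_indices(n, 1)
n_pairs = iu.size
step, batch, prior_prec = 0.02, 500, 1.0
for it in range(3000):
    idx = rng.integers(0, n_pairs, size=batch)
    i, j = iu[idx], ju[idx]
    resid = A[i, j] - sigmoid(np.sum(Z[i] * Z[j], axis=1) - 1.0)
    grad = np.zeros_like(Z)
    np.add.at(grad, i, resid[:, None] * Z[j])  # accumulate over duplicates
    np.add.at(grad, j, resid[:, None] * Z[i])
    grad = grad * (n_pairs / batch) - prior_prec * Z
    Z += step / (1.0 + it / 500.0) * grad  # decaying step size

# Fitted edge probabilities should correlate with the observed edges.
P_hat = sigmoid(Z @ Z.T - 1.0)
corr = np.corrcoef(P_hat[iu, ju], A[iu, ju])[0, 1]
```

    Because only observed dyads enter each minibatch, a sparse adjacency representation would let the same update run on much larger networks, which is the storage benefit the abstract refers to.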

    Removing the influence of a group variable in high-dimensional predictive modelling

    In many application areas, predictive models are used to support or make important decisions. There is increasing awareness that these models may contain spurious or otherwise undesirable correlations, which may arise from a variety of sources, including batch effects, systematic measurement errors, or sampling bias. Without explicit adjustment, machine learning algorithms trained on these data can produce poor out-of-sample predictions that propagate the undesirable correlations. We propose a method to pre-process the training data, producing an adjusted dataset that is statistically independent of the nuisance variables with minimal information loss. We develop a conceptually simple approach for creating an adjusted dataset in high-dimensional settings based on a constrained form of matrix decomposition. The resulting dataset can then be used in any predictive algorithm with the guarantee that predictions will be statistically independent of the group variable. We develop a scalable algorithm implementing the method, along with theoretical support in the form of independence guarantees and optimality results. The method is illustrated on simulation examples and applied to two case studies: removing machine-specific correlations from brain scan data, and removing race and ethnicity information from a dataset used to predict recidivism. That the motivation for removing undesirable correlations is quite different in the two applications illustrates the broad applicability of our approach.
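    A minimal sketch of the basic idea, assuming only that the nuisance variable is a discrete group label: subtracting group-wise means projects the data onto the orthogonal complement of the group-indicator subspace, which is a simple linear surrogate for the full statistical independence the paper's constrained matrix decomposition targets. The function name and simulated data are illustrative, not the paper's implementation:

```python
import numpy as np

def remove_group_means(X, groups):
    """Subtract group-wise column means, i.e. project each column of X
    onto the orthogonal complement of the group-indicator subspace.
    Any linear predictor fit to the result has zero sample covariance
    with the group dummies."""
    X_adj = np.asarray(X, dtype=float).copy()
    for g in np.unique(groups):
        mask = groups == g
        X_adj[mask] -= X_adj[mask].mean(axis=0)
    return X_adj

rng = np.random.default_rng(1)
groups = rng.integers(0, 3, size=200)             # nuisance "batch" labels
X = rng.normal(size=(200, 5)) + groups[:, None]   # columns confounded by batch
X_adj = remove_group_means(X, groups)             # group signal removed
```

    After adjustment, each column has mean zero within every group, so no predictor trained on `X_adj` can linearly recover the group labels from shifted means.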

    Dynamic modeling of mortality via mixtures of skewed distribution functions

    There has been growing interest in forecasting mortality. In this article, we propose a novel dynamic Bayesian approach for modeling and forecasting the age-at-death distribution, focusing on a three-component mixture of a Dirac mass, a Gaussian distribution, and a skew-normal distribution. Under the specified model, the age-at-death distribution is characterized by seven parameters corresponding to the main aspects of infant, adult, and old-age mortality. The proposed approach focuses on coherent modeling of multiple countries; following a Bayesian approach to inference, we borrow information across populations and shrink parameters towards a common mean level, implicitly penalizing diverging scenarios. Dynamic modeling across years is induced through a hierarchical dynamic prior distribution that characterizes the temporal evolution of each mortality component and allows the age-at-death distribution to be forecast. Empirical results on multiple countries indicate that the proposed approach outperforms popular methods for forecasting mortality, providing interpretable insights into the evolution of mortality.
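    The seven parameters are the two free mixture weights, the Gaussian mean and standard deviation, and the skew-normal location, scale, and shape. A sketch of such a mixture, with purely illustrative parameter values (not fitted to any real population): the infant component is a Dirac mass at age 0, so it contributes a point mass rather than a density value:

```python
import numpy as np
from scipy.stats import norm, skewnorm

def age_at_death_density(ages, w, mu_a, sd_a, loc_o, scale_o, shape_o):
    """Continuous part of a three-component age-at-death mixture.
    w = (w_infant, w_adult, w_old) are mixture weights summing to one;
    the infant Dirac mass at age 0 is returned separately as a point
    mass, since it has no density."""
    w_inf, w_adult, w_old = w
    dens = (w_adult * norm.pdf(ages, mu_a, sd_a)
            + w_old * skewnorm.pdf(ages, shape_o, loc=loc_o, scale=scale_o))
    return dens, w_inf

# Illustrative values: a small infant mass, a diffuse adult component,
# and a left-skewed old-age component (negative shape parameter).
ages = np.linspace(0.0, 110.0, 1101)
dens, point_mass = age_at_death_density(
    ages, w=(0.01, 0.14, 0.85),
    mu_a=45.0, sd_a=15.0, loc_o=88.0, scale_o=12.0, shape_o=-4.0)

# Total probability = point mass at age 0 + integral of the density,
# approximated here by a Riemann sum on the age grid.
total = point_mass + dens.sum() * (ages[1] - ages[0])
```

    In the dynamic model, these seven parameters would evolve over calendar years under the hierarchical prior, yielding a forecast distribution rather than fixed values.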

    Projected t-SNE for batch correction

    Biomedical research often produces high-dimensional data confounded by batch effects, such as systematic experimental variation, different protocols, and subject identifiers. Without proper correction, low-dimensional representations of high-dimensional data may encode and reproduce the same systematic variations observed in the original data, compromising the interpretation of the results. In this article, we propose a novel procedure to remove batch effects from low-dimensional embeddings obtained with t-SNE dimensionality reduction. The proposed methods are based on linear algebra and constrained optimization, leading to efficient algorithms and fast computation in many high-dimensional settings. Results on artificial single-cell transcription profiling data show that the proposed procedure successfully removes multiple batch effects from t-SNE embeddings while retaining fundamental information on cell types. When applied to single-cell gene expression data from a study of mouse medulloblastoma, the proposed method successfully removes batch effects related to mouse identifiers and the date of the experiment, while preserving clusters of oligodendrocytes, astrocytes, endothelial cells, and microglia, which are expected to lie in the stroma within or adjacent to the tumors.
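    The linear-algebra flavor of the constraint can be sketched as follows: given an already-computed embedding and batch labels, project the embedding coordinates onto the orthogonal complement of the batch-indicator column space, so the corrected coordinates are uncorrelated with batch membership. This is a simplified post-hoc projection, not the paper's exact constrained t-SNE procedure, and the data here are synthetic:

```python
import numpy as np

def project_out_batches(Y, batches):
    """Remove linear batch structure from an embedding Y (n x 2):
    with B the n x k batch-indicator matrix, apply the projector
    I - B (B'B)^{-1} B' so that each batch has zero mean in the
    corrected coordinates."""
    labels = np.unique(batches)
    B = (batches[:, None] == labels[None, :]).astype(float)  # dummies
    return Y - B @ np.linalg.solve(B.T @ B, B.T @ Y)

rng = np.random.default_rng(2)
batches = np.repeat([0, 1, 2], 50)                       # 3 batches of 50
Y = rng.normal(size=(150, 2)) + 3.0 * batches[:, None]   # batch-shifted embedding
Y_adj = project_out_batches(Y, batches)                  # shifts removed
```

    Projecting the embedding, rather than the raw data, is what makes the correction cheap: the projector acts on an n x 2 matrix regardless of the original dimensionality.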

    α-Synuclein is a Novel Microtubule Dynamase.

    α-Synuclein is a presynaptic protein associated with Parkinson's disease; it is unstructured when free in the cytoplasm and adopts an α-helical conformation when bound to vesicles. After decades of intense study, α-Synuclein physiology remains difficult to elucidate, owing to its interaction with multiple partners and its involvement in a plethora of neuronal functions. Here, we examined the remarkably neglected interplay between α-Synuclein and microtubules, which potentially impacts synaptic functionality. To identify the mechanisms underlying these actions, we investigated the interaction between purified α-Synuclein and tubulin. We demonstrated that α-Synuclein binds to microtubules and to the tubulin α2β2 tetramer, the latter interaction inducing the formation of helical segment(s) in the α-Synuclein polypeptide. This structural change appears to enable α-Synuclein to promote microtubule nucleation and to enhance microtubule growth rate and catastrophe frequency, both in vitro and in cells. We also showed that Parkinson's disease-linked α-Synuclein variants do not undergo tubulin-induced folding and cause tubulin aggregation rather than polymerization. Our data lead us to propose α-Synuclein as a novel, foldable microtubule dynamase, which influences microtubule organisation through its binding to tubulin and its regulatory effects on microtubule nucleation and dynamics.

    Bayesian modelling of complex dependence structures

    Complex dependence structures characterising modern data are routinely encountered in a large variety of research fields. Medicine, biology, psychology, and the social sciences are enriched by intricate architectures such as networks, tensors, and, more generally, high-dimensional dependent data. Rich dependence structures stimulate challenging research questions and open wide methodological avenues in different areas of statistical research, providing an exciting atmosphere in which to develop innovative tools. A primary interest in statistical modelling of complex data is in adequately extracting information to conduct meaningful inference, providing reliable results in terms of uncertainty quantification and generalisability to future samples. These aims require ad-hoc statistical methodologies to appropriately characterise the dependence structures that define complex data as such, further improving the understanding of the mechanisms underlying the observed configurations. The focus of the thesis is on Bayesian modelling of complex dependence structures via latent variable constructs. This strategy characterises the dependence structure in an unobservable latent space, specifying the observed quantities as conditionally independent given a set of latent attributes, facilitating tractable posterior inference and a clear interpretation. The thesis is organised into three main parts, illustrating case studies from different fields of application and focused on modern challenges in neuroscience, psychology, and criminal justice. Bayesian modelling of the complex data arising in these domains via latent features provides valuable insights on different aspects of such structures, addressing the questions of interest and contributing to scientific understanding.