2,286 research outputs found
Parametric Modelling of Multivariate Count Data Using Probabilistic Graphical Models
Multivariate count data are defined as the number of items of different
categories issued from sampling within a population, which individuals are
grouped into categories. The analysis of multivariate count data is a recurrent
and crucial issue in numerous modelling problems, particularly in the fields of
biology and ecology (where the data can represent, for example, children counts
associated with multitype branching processes), sociology and econometrics. We
focus on I) Identifying categories that appear simultaneously, or on the
contrary that are mutually exclusive. This is achieved by identifying
conditional independence relationships between the variables; II)Building
parsimonious parametric models consistent with these relationships; III)
Characterising and testing the effects of covariates on the joint distribution
of the counts. To achieve these goals, we propose an approach based on
graphical probabilistic models, and more specifically partially directed
acyclic graphs
Gene network inference by fusing data from diverse distributions
Markov networks are undirected graphical models that are widely used to infer relations between genes from experimental data. Their state-of-the-art inference procedures assume the data arise from a Gaussian distribution. High-throughput omics data, such as that from next generation sequencing, often violates this assumption. Furthermore, when collected data arise from multiple related but otherwise nonidentical distributions, their underlying networks are likely to have common features. New principled statistical approaches are needed that can deal with different data distributions and jointly consider collections of datasets. We present FuseNet, a Markov network formulation that infers networks from a collection of nonidentically distributed datasets. Our approach is computationally efficient and general: given any number of distributions from an exponential family, FuseNet represents model parameters through shared latent factors that define neighborhoods of network nodes. In a simulation study, we demonstrate good predictive performance of FuseNet in comparison to several popular graphical models. We show its effectiveness in an application to breast cancer RNA-sequencing and somatic mutation data, a novel application of graphical models. Fusion of datasets offers substantial gains relative to inference of separate networks for each dataset. Our results demonstrate that network inference methods for non-Gaussian data can help in accurate modeling of the data generated by emergent high-throughput technologies
Gene network inference by fusing data from diverse distributions
Markov networks are undirected graphical models that are widely used to infer relations between genes from experimental data. Their state-of-the-art inference procedures assume the data arise from a Gaussian distribution. High-throughput omics data, such as that from next generation sequencing, often violates this assumption. Furthermore, when collected data arise from multiple related but otherwise nonidentical distributions, their underlying networks are likely to have common features. New principled statistical approaches are needed that can deal with different data distributions and jointly consider collections of datasets. We present FuseNet, a Markov network formulation that infers networks from a collection of nonidentically distributed datasets. Our approach is computationally efficient and general: given any number of distributions from an exponential family, FuseNet represents model parameters through shared latent factors that define neighborhoods of network nodes. In a simulation study, we demonstrate good predictive performance of FuseNet in comparison to several popular graphical models. We show its effectiveness in an application to breast cancer RNA-sequencing and somatic mutation data, a novel application of graphical models. Fusion of datasets offers substantial gains relative to inference of separate networks for each dataset. Our results demonstrate that network inference methods for non-Gaussian data can help in accurate modeling of the data generated by emergent high-throughput technologies
Model-based clustering with data correction for removing artifacts in gene expression data
The NIH Library of Integrated Network-based Cellular Signatures (LINCS)
contains gene expression data from over a million experiments, using Luminex
Bead technology. Only 500 colors are used to measure the expression levels of
the 1,000 landmark genes measured, and the data for the resulting pairs of
genes are deconvolved. The raw data are sometimes inadequate for reliable
deconvolution leading to artifacts in the final processed data. These include
the expression levels of paired genes being flipped or given the same value,
and clusters of values that are not at the true expression level. We propose a
new method called model-based clustering with data correction (MCDC) that is
able to identify and correct these three kinds of artifacts simultaneously. We
show that MCDC improves the resulting gene expression data in terms of
agreement with external baselines, as well as improving results from subsequent
analysis.Comment: 28 page
A Posterior Probability Approach for Gene Regulatory Network Inference in Genetic Perturbation Data
Inferring gene regulatory networks is an important problem in systems
biology. However, these networks can be hard to infer from experimental data
because of the inherent variability in biological data as well as the large
number of genes involved. We propose a fast, simple method for inferring
regulatory relationships between genes from knockdown experiments in the NIH
LINCS dataset by calculating posterior probabilities, incorporating prior
information. We show that the method is able to find previously identified
edges from TRANSFAC and JASPAR and discuss the merits and limitations of this
approach
Analysis of High-dimensional and Left-censored Data with Applications in Lipidomics and Genomics
Recently, there has been an occurrence of new kinds of high- throughput measurement techniques enabling biological research to focus on fundamental building blocks of living organisms such as genes, proteins, and lipids. In sync with the new type of data that is referred to as the omics data, modern data analysis techniques have emerged. Much of such research is focusing on finding biomarkers for detection of abnormalities in the health status of a person as well as on learning unobservable network structures representing functional associations of biological regulatory systems. The omics data have certain specific qualities such as left-censored observations due to the limitations of the measurement instruments, missing data, non-normal observations and very large dimensionality, and the interest often lies in the connections between the large number of variables.
There are two major aims in this thesis. First is to provide efficient methodology for dealing with various types of missing or censored omics data that can be used for visualisation and biomarker discovery based on, for example, regularised regression techniques. Maximum likelihood based covariance estimation method for data with censored values is developed and the algorithms are described in detail. Second major aim is to develop novel approaches for detecting interactions displaying functional associations from large-scale observations. For more complicated data connections, a technique based on partial least squares regression is investigated. The technique is applied for network construction as well as for differential network analyses both on multiple imputed censored data and next- generation sequencing count data.Uudet mittausteknologiat ovat mahdollistaneet kokonaisvaltaisen ymmärryksen lisäämisen elollisten organismien molekyylitason prosesseista. Niin kutsutut omiikka-teknologiat, kuten genomiikka, proteomiikka ja lipidomiikka, kykenevät tuottamaan valtavia määriä mittausdataa yksittäisten geenien, proteiinien ja lipidien ekspressio- tai konsentraatiotasoista ennennäkemättömällä tarkkuudella. Samanaikaisesti tarve uusien analyysimenetelmien kehittämiselle on kasvanut. Kiinnostuksen kohteena ovat olleet erityisesti tiettyjen sairauksien riskiä tai prognoosia ennustavien merkkiaineiden tunnistaminen sekä biologisten verkkojen rekonstruointi.
Omiikka-aineistoilla on useita erityisominaisuuksia, jotka rajoittavat tavanomaisten menetelmien suoraa ja tehokasta soveltamista. Näistä tärkeimpiä ovat vasemmalta sensuroidut ja puuttuvat havainnot, sekä havaittujen muuttujien suuri lukumäärä. Tämän väitöskirjan ensimmäisenä tavoitteena on tarjota räätälöityjä analyysimenetelmiä epätäydellisten omiikka-aineistojen visualisointiin ja mallin valintaan käyttäen esimerkiksi regularisoituja regressiomalleja. Kuvailemme myös sensuroidulle aineistolle sopivan suurimman uskottavuuden estimaattorin kovarianssimatriisille. Toisena tavoitteena on kehittää uusia menetelmiä omiikka-aineistojen assosiaatiorakenteiden tarkasteluun. Monimutkaisempien rakenteiden tarkasteluun, visualisoimiseen ja vertailuun esitetään erilaisia variaatioita osittaisen pienimmän neliösumman menetelmään pohjautuvasta algoritmista, jonka avulla voidaan rekonstruoida assosiaatioverkkoja sekä multi-imputoidulle sensuroidulle että lukumääräaineistoille.Siirretty Doriast
- …