Bayesian Conditional Tensor Factorizations for High-Dimensional Classification
In many application areas, data are collected on a categorical response and
high-dimensional categorical predictors, with the goals being to build a
parsimonious model for classification while doing inferences on the important
predictors. In settings such as genomics, there can be complex interactions
among the predictors. By using a carefully-structured Tucker factorization, we
define a model that can characterize any conditional probability, while
facilitating variable selection and modeling of higher-order interactions.
Following a Bayesian approach, we propose a Markov chain Monte Carlo algorithm
for posterior computation accommodating uncertainty in the predictors to be
included. Under near sparsity assumptions, the posterior distribution for the
conditional probability is shown to achieve close to the parametric rate of
contraction even in ultra high-dimensional settings. The methods are
illustrated using simulation examples and biomedical applications.
Predictive response-relevant clustering of expression data provides insights into disease processes
This article describes and illustrates a novel method of microarray data analysis that couples model-based clustering and binary classification to form clusters of 'response-relevant' genes; that is, genes that are informative when discriminating between the different values of the response. Predictions are subsequently made using an appropriate statistical summary of each gene cluster, which we call the 'meta-covariate' representation of the cluster, in a probit regression model. We first illustrate this method by analysing a leukaemia expression dataset, before focusing closely on the meta-covariate analysis of a renal gene expression dataset in a rat model of salt-sensitive hypertension. We explore the biological insights provided by our analysis of these data. In particular, we identify a highly influential cluster of 13 genes, including three transcription factors (Arntl, Bhlhe41 and Npas2), that is implicated as being protective against hypertension in response to increased dietary sodium. Functional and canonical pathway analysis of this cluster using Ingenuity Pathway Analysis implicated transcriptional activation and circadian rhythm signalling, respectively. Although we illustrate our method using only expression data, the method is applicable to any high-dimensional dataset.
Inference algorithms for gene networks: a statistical mechanics analysis
The inference of gene regulatory networks from high throughput gene
expression data is one of the major challenges in systems biology. This paper
aims at analysing and comparing two different algorithmic approaches. The first
approach uses pairwise correlations between regulated and regulating genes; the
second one uses message-passing techniques for inferring activating and
inhibiting regulatory interactions. The performance of these two algorithms can
be analysed theoretically on well-defined test sets, using tools from the
statistical physics of disordered systems, such as the replica method. We find that
the second algorithm outperforms the first one since it takes into account
collective effects of multiple regulators.
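The first approach mentioned above, scoring candidate regulators by pairwise correlation with a regulated gene, can be illustrated with a minimal sketch. The abstract does not specify the scoring rule, so the use of Pearson correlation, the function names, and the toy expression profiles below are all assumptions made for illustration only.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no usable signal
    return cov / math.sqrt(var_x * var_y)


def rank_regulators(target, candidates):
    """Rank candidate regulators by |correlation| with the target gene.

    The sign of each score is read as activation (+) or inhibition (-).
    Each candidate is scored on its own, so joint effects of several
    regulators are ignored.
    """
    scores = {name: pearson(profile, target)
              for name, profile in candidates.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical toy data: one regulated gene and three candidate regulators.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "gA": [2.0, 4.0, 6.0, 8.0, 10.0],   # moves with the target
    "gB": [5.0, 4.0, 3.0, 2.0, 1.0],    # moves against the target
    "gC": [1.0, -1.0, 0.0, -1.0, 1.0],  # unrelated to the target
}
ranked = rank_regulators(target, candidates)
```

Because every candidate is scored independently, a sketch like this cannot capture the collective effects of multiple regulators that the abstract credits for the advantage of the message-passing approach.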
Weighted-Lasso for Structured Network Inference from Time Course Data
We present a weighted-Lasso method to infer the parameters of a first-order
vector auto-regressive model that describes time course expression data
generated by directed gene-to-gene regulation networks. These networks are
assumed to have a prior internal structure of connectivity which drives the
inference method. This prior structure can be either derived from prior
biological knowledge or inferred by the method itself. We illustrate the
performance of this structure-based penalization both on synthetic data and on
two canonical regulatory networks: first, the yeast cell cycle regulation network,
using the dataset of Spellman et al., and second, the E. coli S.O.S. DNA repair
network, using data from U. Alon's lab.
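The idea of a weighted Lasso for a first-order vector auto-regressive model can be sketched as follows. The abstract does not state how the optimisation is carried out or how weights are derived from the prior structure, so the coordinate-descent solver, the function names, the weight matrices, and the toy series are assumptions chosen for illustration, not the authors' implementation.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the Lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def weighted_lasso(X, y, weights, lam, n_iter=200):
    """Coordinate descent for
    min_a 0.5 * sum_t (y[t] - a . X[t])^2 + lam * sum_j weights[j] * |a[j]|.
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    col_sq = [sum(X[t][j] ** 2 for t in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero predictor cannot explain anything
            # Correlation of predictor j with the residual that excludes j.
            rho = sum(
                X[t][j] * (y[t] - sum(coef[k] * X[t][k]
                                      for k in range(p) if k != j))
                for t in range(n)
            )
            coef[j] = soft_threshold(rho, lam * weights[j]) / col_sq[j]
    return coef


def fit_var1(series, weight_matrix, lam):
    """Estimate A in x_t = A x_{t-1} + noise, one target gene per row of A."""
    lagged = series[:-1]
    n_genes = len(series[0])
    return [
        weighted_lasso(lagged, [row[i] for row in series[1:]],
                       weight_matrix[i], lam)
        for i in range(n_genes)
    ]


# Hypothetical toy series: two genes that drive each other with a lag of one.
series = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

# Uniform weights: every possible edge is penalised equally.
A_uniform = fit_var1(series, [[1.0, 1.0], [1.0, 1.0]], lam=0.5)

# Structured weights (assumed prior): cross-regulation edges are unpenalised.
A_structured = fit_var1(series, [[1.0, 0.0], [0.0, 1.0]], lam=0.5)
```

In this toy case the uniform penalty shrinks the two true cross-regulation coefficients, while the structured weights leave them unshrunk, which is the kind of effect a structure-based penalisation is meant to produce.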
MissForest - nonparametric missing value imputation for mixed-type data
Modern data acquisition based on high-throughput technology often faces
the problem of missing data. Algorithms commonly used in the analysis of such
large-scale data often depend on a complete set. Missing value imputation
offers a solution to this problem. However, the majority of available
imputation methods are restricted to one type of variable only: continuous or
categorical. For mixed-type data the different types are usually handled
separately. Therefore, these methods ignore possible relations between variable
types. We propose a nonparametric method which can cope with different types of
variables simultaneously. We compare several state-of-the-art methods for the
imputation of missing values. We propose and evaluate an iterative imputation
method (missForest) based on a random forest. By averaging over many unpruned
classification or regression trees, random forest intrinsically constitutes a
multiple imputation scheme. Using the built-in out-of-bag error estimates of
random forest, we are able to estimate the imputation error without the need for
a test set. Evaluation is performed on multiple data sets coming from a diverse
selection of biological fields with artificially introduced missing values
ranging from 10% to 30%. We show that missForest can successfully handle
missing values, particularly in data sets including different types of
variables. In our comparative study, missForest outperforms other methods of
imputation, especially in data settings where complex interactions and nonlinear
relations are suspected. The out-of-bag imputation error estimates of
missForest prove to be adequate in all settings. Additionally, missForest
exhibits attractive computational efficiency and can cope with high-dimensional
data.
Comment: Submitted to Oxford Journal's Bioinformatics on 3rd of May 201
Predicting gene expression in the human malaria parasite Plasmodium falciparum using histone modification, nucleosome positioning, and 3D localization features.
Empirical evidence suggests that the malaria parasite Plasmodium falciparum employs a broad range of mechanisms to regulate gene transcription throughout the organism's complex life cycle. To better understand this regulatory machinery, we assembled a rich collection of genomic and epigenomic data sets, including information about transcription factor (TF) binding motifs, patterns of covalent histone modifications, nucleosome occupancy, GC content, and global 3D genome architecture. We used these data to train machine learning models to discriminate between high-expression and low-expression genes, focusing on three distinct stages of the red blood cell phase of the Plasmodium life cycle. Our results highlight the importance of histone modifications and 3D chromatin architecture in Plasmodium transcriptional regulation and suggest that AP2 transcription factors may play a limited regulatory role, perhaps operating in conjunction with epigenetic factors.
Methods for Epigenetic Analyses from Long-Read Sequencing Data
Epigenetics, particularly the study of DNA methylation, is a cornerstone field for our understanding of human development and disease.
DNA methylation has been included in the "hallmarks of cancer" due to its important function as a biomarker and its contribution to carcinogenesis and cancer cell plasticity.
Long-read sequencing technologies, such as the Oxford Nanopore Technologies platform, have advanced the study of structural variations, while at the same time allowing direct measurement of DNA methylation on the same reads.
With this, new avenues of analysis have opened up, such as long-range allele-specific methylation analysis, methylation analysis on structural variations, or relating nearby epigenetic modalities on the same read to one another.
Basecalling and methylation calling of Nanopore reads is a computationally expensive task which requires complex machine learning architectures.
Read-level methylation calls require different approaches to data management and analysis than ones developed for methylation frequencies measured from short-read technologies or array data.
Because read- and genome-associated DNA methylation calls are 2-dimensional and include methylation caller uncertainties, they are much more costly to store than 1-dimensional methylation frequencies.
Methods for storage, retrieval, and analysis of such data therefore require careful consideration.
Downstream analysis tasks, such as methylation segmentation or differential methylation calling, can potentially benefit from read-level information and allow uncertainty propagation.
These avenues had not been considered in existing tools.
In my work, I explored the potential of long-read DNA methylation analysis and tackled some of the challenges of data management and downstream analysis using state-of-the-art software architecture and machine learning methods.
I defined a storage standard for reference-anchored and read-assigned DNA methylation calls, including methylation calling uncertainties and read annotations such as haplotype or sample information.
This storage container is defined as a schema for the hierarchical data format version 5, includes an index for rapid access to genomic coordinates, and is optimized for parallel computing with even load balancing.
It further includes a Python API for creation, modification, and data access, including convenience functions for the extraction of important quality statistics via a command line interface.
Furthermore, I developed software solutions for the segmentation and differential methylation testing of DNA methylation calls from Nanopore sequencing.
This implementation takes advantage of the performance benefits provided by my high performance storage container.
It includes a Bayesian methylome segmentation algorithm which allows for the consensus instance segmentation of multiple sample and/or haplotype assigned DNA methylation profiles, while considering methylation calling uncertainties.
Based on this segmentation, the software can then perform differential methylation testing and provides a large number of options for statistical testing and multiple testing correction.
I benchmarked all tools on both simulated and publicly available real data, and show the performance benefits compared to previously existing and concurrently developed solutions.
Next, I applied the methods to a cancer study on a chromothriptic cancer sample from a patient with Sonic Hedgehog Medulloblastoma.
Here I report regulatory genomic regions differentially methylated before and after treatment, allele-specific methylation in the tumor, as well as methylation on chromothriptic structures.
Finally, I developed specialized methylation callers for the combined DNA methylation profiling of CpG, GpC, and context-free adenine methylation.
These callers can be used to measure chromatin accessibility in a NOMe-seq like setup, showing the potential of long-read sequencing for the profiling of transcription factor co-binding.
In conclusion, this thesis presents and subsequently benchmarks new algorithmic and infrastructural solutions for the analysis of DNA methylation data from long-read sequencing.
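The abstract mentions multiple testing correction after differential methylation testing without naming the procedures offered. As one standard example of such a correction, and not necessarily one the described software implements, the Benjamini-Hochberg adjustment can be sketched as below. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    The adjusted value for the p-value of rank r (1 = smallest) among m
    tests is the minimum of p * m / r' over all ranks r' >= r, capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum
    # keeps the adjusted values monotone in the original ranks.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical p-values, e.g. one per candidate segment.
segment_p_values = [0.01, 0.04, 0.03, 0.20]
adjusted_p_values = benjamini_hochberg(segment_p_values)
significant = [q <= 0.05 for q in adjusted_p_values]
```

Segments whose adjusted value falls at or below the chosen false discovery rate threshold (0.05 in this sketch) would be reported as differentially methylated.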