17 research outputs found

    A Posterior Probability Approach for Gene Regulatory Network Inference in Genetic Perturbation Data

    Full text link
    Inferring gene regulatory networks is an important problem in systems biology. However, these networks can be hard to infer from experimental data because of the inherent variability in biological data as well as the large number of genes involved. We propose a fast, simple method for inferring regulatory relationships between genes from knockdown experiments in the NIH LINCS dataset by calculating posterior probabilities, incorporating prior information. We show that the method is able to find previously identified edges from TRANSFAC and JASPAR and discuss the merits and limitations of this approach

    Evaluation and improvement of the regulatory inference for large co-expression networks with limited sample size

    Get PDF
    Abstract Background Co-expression has been widely used to identify novel regulatory relationships using high throughput measurements, such as microarray and RNA-seq data. Evaluation studies on co-expression network analysis methods mostly focus on networks of small or medium size of up to a few hundred nodes. For large networks, simulated expression data usually consist of hundreds or thousands of profiles with different perturbations or knock-outs, which is uncommon in real experiments due to their cost and the amount of work required. Thus, the performances of co-expression network analysis methods on large co-expression networks consisting of a few thousand nodes, with only a small number of profiles with a single perturbation, which more accurately reflect normal experimental conditions, are generally uncharacterized and unknown. Methods We proposed a novel network inference methods based on Relevance Low order Partial Correlation (RLowPC). RLowPC method uses a two-step approach to select on the high-confidence edges first by reducing the search space by only picking the top ranked genes from an intial partial correlation analysis and, then computes the partial correlations in the confined search space by only removing the linear dependencies from the shared neighbours, largely ignoring the genes showing lower association. Results We selected six co-expression-based methods with good performance in evaluation studies from the literature: Partial correlation, PCIT, ARACNE, MRNET, MRNETB and CLR. The evaluation of these methods was carried out on simulated time-series data with various network sizes ranging from 100 to 3000 nodes. Simulation results show low precision and recall for all of the above methods for large networks with a small number of expression profiles. We improved the inference significantly by refinement of the top weighted edges in the pre-inferred partial correlation networks using RLowPC. We found improved performance by partitioning large networks into smaller co-expressed modules when assessing the method performance within these modules. Conclusions The evaluation results show that current methods suffer from low precision and recall for large co-expression networks where only a small number of profiles are available. The proposed RLowPC method effectively reduces the indirect edges predicted as regulatory relationships and increases the precision of top ranked predictions. Partitioning large networks into smaller highly co-expressed modules also helps to improve the performance of network inference methods. The RLowPC R package for network construction, refinement and evaluation is available at GitHub: https://github.com/wyguo/RLowPC

    Model-based clustering with data correction for removing artifacts in gene expression data

    Full text link
    The NIH Library of Integrated Network-based Cellular Signatures (LINCS) contains gene expression data from over a million experiments, using Luminex Bead technology. Only 500 colors are used to measure the expression levels of the 1,000 landmark genes measured, and the data for the resulting pairs of genes are deconvolved. The raw data are sometimes inadequate for reliable deconvolution leading to artifacts in the final processed data. These include the expression levels of paired genes being flipped or given the same value, and clusters of values that are not at the true expression level. We propose a new method called model-based clustering with data correction (MCDC) that is able to identify and correct these three kinds of artifacts simultaneously. We show that MCDC improves the resulting gene expression data in terms of agreement with external baselines, as well as improving results from subsequent analysis.Comment: 28 page

    Gene Regulatory Network Inference Using Machine Learning Techniques

    Get PDF
    Systems Biology is a field that models complex biological systems in order to better understand the working of cells and organisms. One of the systems modeled is the gene regulatory network that plays the critical role of controlling an organism's response to changes in its environment. Ideally, we would like a model of the complete gene regulatory network. In recent years, several advances in technology have permitted the collection of an unprecedented amount and variety of data such as genomes, gene expression data, time-series data, and perturbation data. This has stimulated research into computational methods that reconstruct, or infer, models of the gene regulatory network from the data. Many solutions have been proposed, yet there remain open challenges in utilising the range of available data as it is inherently noisy, and must be integrated by the inference techniques. The thesis seeks to contribute to this discourse by investigating challenges of performance, scale, and data integration. We propose a new algorithm BENIN that views network inference as feature selection to address issues of scale, that uses elastic net regression for improved performance, and adapts elastic net to integrate different types of biological data. The BENIN algorithm is benchmarked on a synthetic dataset from the DREAM4 challenge, and on real expression data for the human HeLa cell cycle. On the DREAM4 dataset BENIN out-performed all DREAM4 competitors on the size 100 subchallenge, and is also competitive with more recent state-of-the-art methods. Moreover, on the HeLa cell cycle data, BENIN could infer known regulatory interactions and propose new interactions that warrant further experimental investigation. Keys words: gene regulatory network, network inference, feature selection, elastic net regression

    Learning the structure of Bayesian Networks: A quantitative assessment of the effect of different algorithmic schemes

    Full text link
    One of the most challenging tasks when adopting Bayesian Networks (BNs) is the one of learning their structure from data. This task is complicated by the huge search space of possible solutions, and by the fact that the problem is NP-hard. Hence, full enumeration of all the possible solutions is not always feasible and approximations are often required. However, to the best of our knowledge, a quantitative analysis of the performance and characteristics of the different heuristics to solve this problem has never been done before. For this reason, in this work, we provide a detailed comparison of many different state-of-the-arts methods for structural learning on simulated data considering both BNs with discrete and continuous variables, and with different rates of noise in the data. In particular, we investigate the performance of different widespread scores and algorithmic approaches proposed for the inference and the statistical pitfalls within them

    Inference of regulatory networks with a convergence improved MCMC sampler

    Get PDF

    Development and evaluation of machine learning algorithms for biomedical applications

    Get PDF
    Gene network inference and drug response prediction are two important problems in computational biomedicine. The former helps scientists better understand the functional elements and regulatory circuits of cells. The latter helps a physician gain full understanding of the effective treatment on patients. Both problems have been widely studied, though current solutions are far from perfect. More research is needed to improve the accuracy of existing approaches. This dissertation develops machine learning and data mining algorithms, and applies these algorithms to solve the two important biomedical problems. Specifically, to tackle the gene network inference problem, the dissertation proposes (i) new techniques for selecting topological features suitable for link prediction in gene networks; a graph sparsification method for network sampling; (iii) combined supervised and unsupervised methods to infer gene networks; and (iv) sampling and boosting techniques for reverse engineering gene networks. For drug sensitivity prediction problem, the dissertation presents (i) an instance selection technique and hybrid method for drug sensitivity prediction; (ii) a link prediction approach to drug sensitivity prediction; a noise-filtering method for drug sensitivity prediction; and (iv) transfer learning approaches for enhancing the performance of drug sensitivity prediction. Substantial experiments are conducted to evaluate the effectiveness and efficiency of the proposed algorithms. Experimental results demonstrate the feasibility of the algorithms and their superiority over the existing approaches

    Distributed Bayesian networks reconstruction on the whole genome scale

    Get PDF
    Background Bayesian networks are directed acyclic graphical models widely used to represent the probabilistic relationships between random variables. They have been applied in various biological contexts, including gene regulatory networks and protein–protein interactions inference. Generally, learning Bayesian networks from experimental data is NP-hard, leading to widespread use of heuristic search methods giving suboptimal results. However, in cases when the acyclicity of the graph can be externally ensured, it is possible to find the optimal network in polynomial time. While our previously developed tool BNFinder implements polynomial time algorithm, reconstructing networks with the large amount of experimental data still leads to computations on single CPU growing exceedingly. Results In the present paper we propose parallelized algorithm designed for multi-core and distributed systems and its implementation in the improved version of BNFinder—tool for learning optimal Bayesian networks. The new algorithm has been tested on different simulated and experimental datasets showing that it has much better efficiency of parallelization than the previous version. BNFinder gives comparable results in terms of accuracy with respect to current state-of-the-art inference methods, giving significant advantage in cases when external information such as regulators list or prior edge probability can be introduced, particularly for datasets with static gene expression observations. Conclusions We show that the new method can be used to reconstruct networks in the size range of thousands of genes making it practically applicable to whole genome datasets of prokaryotic systems and large components of eukaryotic genomes. Our benchmarking results on realistic datasets indicate that the tool should be useful to a wide audience of researchers interested in discovering dependencies in their large-scale transcriptomic datasets

    Inferring sparse networks for noisy transient processes

    Get PDF
    Inferring causal structures of real world complex networks from measured time series signals remains an open issue. The current approaches are inadequate to discern between direct versus indirect influences (i.e., the presence or absence of a directed arc connecting two nodes) in the presence of noise, sparse interactions, as well as nonlinear and transient dynamics of real world processes. We report a sparse regression (referred to as the [Image: see text]-min) approach with theoretical bounds on the constraints on the allowable perturbation to recover the network structure that guarantees sparsity and robustness to noise. We also introduce averaging and perturbation procedures to further enhance prediction scores (i.e., reduce inference errors), and the numerical stability of [Image: see text]-min approach. Extensive investigations have been conducted with multiple benchmark simulated genetic regulatory network and Michaelis-Menten dynamics, as well as real world data sets from DREAM5 challenge. These investigations suggest that our approach can significantly improve, oftentimes by 5 orders of magnitude over the methods reported previously for inferring the structure of dynamic networks, such as Bayesian network, network deconvolution, silencing and modular response analysis methods based on optimizing for sparsity, transients, noise and high dimensionality issues
    corecore