104 research outputs found

    Combining Bayesian Approaches and Evolutionary Techniques for the Inference of Breast Cancer Networks

    Gene and protein networks are very important for modelling complex large-scale systems in molecular biology. Inferring or reverse-engineering such networks can be defined as the process of identifying gene/protein interactions from experimental data through computational analysis. However, this task is typically complicated by the enormous number of unknowns relative to a rather small sample size. Furthermore, when the goal is to study causal relationships within the network, tools capable of overcoming the limitations of correlation networks are required. In this work, we use Bayesian Graphical Models to attack this problem and, specifically, we perform a comparative study of different state-of-the-art heuristics, analyzing their performance in inferring the structure of the Bayesian Network from breast cancer data.

    StreamFlow: cross-breeding cloud with HPC

    Workflows are among the most commonly used tools in a variety of execution environments. Many of them target a specific environment; few of them make it possible to execute an entire workflow in different environments, e.g. Kubernetes and batch clusters. We present a novel approach to workflow execution, called StreamFlow, that complements the workflow graph with a declarative description of potentially complex execution environments, and that makes it possible to execute a workflow on multiple sites that do not share a common data space. StreamFlow is then exemplified on a novel bioinformatics pipeline for single-cell transcriptomic data analysis.
    Comment: 30 pages - 2020 IEEE Transactions on Emerging Topics in Computing

    Improving eQTL Analysis Using a Machine Learning Approach for Data Integration: A Logistic Model Tree Solution

    Beretta, S., Castelli, M., Gonçalves, I., Kel, I., Giansanti, V., & Merelli, I. (2018). Improving eQTL Analysis Using a Machine Learning Approach for Data Integration: A Logistic Model Tree Solution. Journal of Computational Biology, 25(10), 1091-1105. DOI: 10.1089/cmb.2017.0167
    Expression quantitative trait loci (eQTL) analysis is an emerging method for establishing the impact of genetic variations (such as single nucleotide polymorphisms) on the expression levels of genes. Although different methods for evaluating the impact of these variations have been proposed in the literature, their results mostly disagree, entailing a considerable number of false-positive predictions. For this reason, we propose an approach based on Logistic Model Trees that integrates the predictions of different eQTL mapping tools to produce more reliable results. More precisely, we employ a machine learning-based method using logistic functions to perform a linear regression able to classify the predictions of three eQTL analysis tools (namely, R/qtl, MatrixEQTL, and mRMR). Given the lack of a reference dataset, and given that computational predictions are not easy to test experimentally, the performance of our approach is assessed using data from the DREAM5 challenge. The results show that the quality of the aggregated prediction is better than that obtained by each single tool in terms of both precision and recall. We also performed a test on real data, employing genotypes and microRNA expression profiles from Caenorhabditis elegans, which showed that we were able to correctly classify all the experimentally validated eQTLs. These good results come both from the integration of the different predictions and from the ability of this machine learning algorithm to find the best cutoff thresholds for each tool. This combination makes our integration approach suitable for improving eQTL predictions for testing in a laboratory, reducing the number of false-positive results.

    The Genome Conformation As an Integrator of Multi-Omic Data: The Example of Damage Spreading in Cancer.

    Publicly available multi-omic databases, in particular if associated with medical annotations, are rich resources with the potential to lead a rapid transition from high-throughput molecular biology experiments to better clinical outcomes for patients. In this work, we propose a model for multi-omic data integration (i.e., genetic variations, gene expression, genome conformation, and epigenetic patterns), which exploits a multi-layer network approach to analyse, visualize, and obtain insights from such biological information, in order to use the results achieved at a macroscopic level. Using this representation, we can describe how driver and passenger mutations accumulate during the development of diseases, providing, for example, a tool able to characterize the evolution of cancer. Indeed, our test case concerns the MCF-7 breast cancer cell line, before and after stimulation with estrogen, since many datasets are available for this case study. In particular, the integration of data about cancer mutations, gene functional annotations, genome conformation, epigenetic patterns, gene expression, and metabolic pathways in our multi-layer representation will allow a better interpretation of the mechanisms behind a complex disease such as cancer. Thanks to this multi-layer approach, we focus on the interplay of chromatin conformation and cancer mutations in different pathways, such as metabolic processes, that are very important for tumor development. Working on this model, a variance analysis can be implemented to identify normal variations within each omics layer and to characterize, by contrast, variations that can be attributed to pathological samples compared to normal ones. This integrative model can be used to identify novel biomarkers and to provide innovative omic-based guidelines for treating many diseases, improving the efficacy of decision trees currently used in the clinic.

    The Human EST Ontology Explorer: a tissue-oriented visualization system for ontologies distribution in human EST collections

    Background: The NCBI dbEST currently contains more than eight million human Expressed Sequence Tags (ESTs). This wide collection represents an important source of information for gene expression studies, provided it can be inspected according to biologically relevant criteria. EST data can be browsed using different dedicated web resources, which allow users to investigate library-specific gene expression levels and to make comparisons among libraries, highlighting significant differences in gene expression. Nonetheless, no tool is available to examine distributions of quantitative EST collections in Gene Ontology (GO) categories, nor to retrieve information concerning library-dependent EST involvement in metabolic pathways. In this work we present the Human EST Ontology Explorer (HEOE) (http://www.itb.cnr.it/ptp/human_est_explorer), a web facility for comparison of expression levels among libraries from several healthy and diseased tissues. Results: The HEOE provides library-dependent statistics on the distribution of sequences in the GO Directed Acyclic Graph (DAG) that can be browsed at each GO hierarchical level. The tool is based on large-scale BLAST annotation of EST sequences. Due to the huge number of input sequences, this BLAST analysis was performed with the aid of grid computing technology, which is particularly suitable for data-parallel tasks. Relying on the resulting annotation, library-specific distributions of ESTs in the GO graph were inferred. A pathway-based search interface was also implemented, for a quick evaluation of the representation of libraries in metabolic pathways. EST processing steps were integrated in a semi-automatic procedure that relies on Perl scripts and stores results in a MySQL database. A PHP-based web interface makes it possible to simultaneously visualize, retrieve and compare data from the different libraries. Statistically significant differences in GO categories among user-selected libraries can also be computed. Conclusion: The HEOE provides an alternative and complementary way to inspect EST expression levels with respect to approaches currently offered by other resources. Furthermore, BLAST computation on the whole human EST dataset was a suitable test of grid scalability in the context of large-scale bioinformatics analysis. The HEOE currently comprises sequence analysis from 70 non-normalized libraries, representing a comprehensive overview of healthy and unhealthy tissues. As the analysis procedure can be easily applied to other libraries, the number of represented tissues is intended to increase.

    PWHATSHAP: efficient haplotyping for future generation sequencing

    Background: Haplotype phasing is an important problem in the analysis of genomics information. Given a set of DNA fragments of an individual, it consists of determining which one of the possible alleles (alternative forms of a gene) each fragment comes from. Haplotype information is relevant to gene regulation, epigenetics, genome-wide association studies, evolutionary and population studies, and the study of mutations. Haplotyping is currently addressed as an optimisation problem aiming at solutions that minimise, for instance, error correction costs, where costs are a measure of the confidence in the accuracy of the information acquired from DNA sequencing. Solutions typically have an exponential computational complexity. WhatsHap is a recent optimal approach which moves computational complexity from DNA fragment length to fragment overlap, i.e. coverage, and is hence of particular interest given current trends in sequencing technology, which is producing longer fragments. Results: Given the potential relevance of efficient haplotyping in several analysis pipelines, we have designed and engineered pWhatsHap, a parallel, high-performance version of WhatsHap. pWhatsHap is embedded in a toolkit developed in Python and supports genomics datasets in standard file formats. Building on WhatsHap, pWhatsHap exhibits the same complexity, exploring a number of possible solutions which is exponential in the coverage of the dataset. The parallel implementation on multi-core architectures allows for a relevant reduction of the execution time for haplotyping, while the provided results enjoy the same high accuracy as that provided by WhatsHap, which increases with coverage. Conclusions: Due to its structure and management of large datasets, the parallelisation of WhatsHap posed demanding technical challenges, which have been addressed by exploiting a high-level parallel programming framework. The result, pWhatsHap, is a freely available toolkit that improves the efficiency of the analysis of genomics information.
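
    The optimisation view of haplotyping mentioned in this abstract can be illustrated with a minimal sketch. It assumes an unweighted minimum error correction objective and a brute-force search over all bipartitions of the fragments, which is exponential in the number of fragments. It is not the WhatsHap or pWhatsHap algorithm, which uses confidence-weighted costs and is exponential only in coverage. The names `mec_cost`, `best_bipartition`, and the toy `fragments` data are hypothetical.

```python
from itertools import product


def mec_cost(fragments, assignment):
    """Unweighted error-correction cost of one bipartition.

    fragments:  list of dicts mapping variant position -> observed allele (0 or 1)
    assignment: tuple with one entry per fragment, 0 or 1, naming its haplotype
    """
    cost = 0
    for side in (0, 1):
        counts = {}  # position -> [reads showing allele 0, reads showing allele 1]
        for frag, chosen_side in zip(fragments, assignment):
            if chosen_side != side:
                continue
            for pos, allele in frag.items():
                counts.setdefault(pos, [0, 0])[allele] += 1
        # At each position the minority allele must be corrected.
        cost += sum(min(c) for c in counts.values())
    return cost


def best_bipartition(fragments):
    """Try every bipartition and return (minimum cost, assignment)."""
    best = None
    for assignment in product((0, 1), repeat=len(fragments)):
        cost = mec_cost(fragments, assignment)
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best


# Toy data (an assumption for illustration): five fragments over four positions.
fragments = [
    {0: 0, 1: 0, 2: 1},
    {1: 0, 2: 1, 3: 1},
    {0: 1, 1: 1, 2: 0},
    {2: 0, 3: 0},
    {1: 1, 2: 1, 3: 0},  # disagrees with the third and fourth fragments at position 2
]

cost, assignment = best_bipartition(fragments)
print(cost, assignment)
```

    In this toy input the last fragment cannot be placed on either haplotype without a conflict, so at least one allele has to be corrected.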