
    The EM Algorithm and the Rise of Computational Biology

    In the past decade, computational biology has grown from a cottage industry with a handful of researchers into an attractive interdisciplinary field, catching the attention and imagination of many quantitatively minded scientists. Of interest to us is the key role played by the EM algorithm during this transformation. We survey the use of the EM algorithm in a few important computational biology problems surrounding the "central dogma" of molecular biology: from DNA to RNA and then to proteins. Topics of this article include sequence motif discovery, protein sequence alignment, population genetics, evolutionary models and mRNA expression microarray data analysis. Comment: Published in Statistical Science (http://www.imstat.org/sts/) at http://dx.doi.org/10.1214/09-STS312 by the Institute of Mathematical Statistics (http://www.imstat.org

    Structural Prediction of Protein–Protein Interactions by Docking: Application to Biomedical Problems

    A huge amount of genetic information is available thanks to recent advances in sequencing technologies and greater computational capabilities, but the interpretation of such genetic data at the phenotypic level remains elusive. One reason is that proteins do not act alone but interact specifically with other proteins and biomolecules, forming intricate interaction networks that are essential for the majority of cell processes and pathological conditions. Thus, characterizing such interaction networks is an important step in understanding how information flows from gene to phenotype. Indeed, structural characterization of protein–protein interactions at atomic resolution has many applications in biomedicine, from diagnosis and vaccine design to drug discovery. However, despite advances in experimental structure determination, the number of interactions for which structural data are available is still very small. In this context, a complementary approach is computational modeling of protein interactions by docking, which usually comprises two major phases: (i) sampling of the possible binding modes between the interacting molecules and (ii) scoring to identify the correct orientations. In addition, prediction of interface and hot-spot residues is very useful for guiding and interpreting mutagenesis experiments, as well as for understanding functional and mechanistic aspects of the interaction. Computational docking is already being applied to specific biomedical problems within the context of personalized medicine, for instance, helping to interpret pathological mutations involved in protein–protein interactions, or providing modeled structural data for drug discovery targeting protein–protein interactions. Funding: Spanish Ministry of Economy grant number BIO2016-79960-R; D.B.B. is supported by a predoctoral fellowship from CONACyT; M.R. is supported by an FPI fellowship from the Severo Ochoa program. We are grateful to the Joint BSC-CRG-IRB Programme in Computational Biology. Peer Reviewed. Postprint (author's final draft

    Using Bioinformatics Tools for Identification and Characterization of Transcriptome Derived EST-SSRs in Silver Fir (Abies alba Mill.)

    Bioinformatics tools were used to evaluate the de novo assembled 454 transcriptome of silver fir. A total of 3500 EST-SSRs were detected in this transcriptome. Tri-nucleotide SSRs were the most abundant, followed by tetra-nucleotide and di-nucleotide SSRs. In addition, we determined the density, frequency, average length and average repeat number of EST-SSRs in the 454 transcriptome of silver fir.
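    The kind of SSR statistics described above can be illustrated with a minimal Python sketch. It is not the tool or pipeline used in the study, which the abstract does not name. The minimum repeat thresholds, the function names `find_ssrs` and `summarize`, and the per-kilobase definitions of frequency and density are all assumptions made for illustration; published studies define these quantities in different ways.

```python
import re

# Assumed minimum repeat counts per motif length (not taken from the abstract).
MIN_REPEATS = {2: 6, 3: 5, 4: 4}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, motif, repeat_count) for perfect di-, tri- and tetra-nucleotide SSRs."""
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # A k-mer followed by at least n-1 further copies of itself.
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(2)
            if len(set(motif)) == 1:
                continue  # homopolymer run, not counted here
            if k == 4 and motif[:2] == motif[2:]:
                continue  # e.g. ACAC is a di-nucleotide repeat in disguise
            hits.append((m.start(), motif, len(m.group(1)) // k))
    return hits


def summarize(hits, total_bases):
    """Summary statistics under the assumed per-kb definitions."""
    n = len(hits)
    ssr_bases = sum(len(motif) * count for _, motif, count in hits)
    return {
        "count": n,
        "frequency_per_kb": 1000 * n / total_bases,
        "density_bp_per_kb": 1000 * ssr_bases / total_bases,
        "mean_length_bp": ssr_bases / n if n else 0.0,
        "mean_repeats": sum(c for _, _, c in hits) / n if n else 0.0,
    }


if __name__ == "__main__":
    toy = "GG" + "AC" * 7 + "TTT" + "AGC" * 5 + "G"  # toy sequence, not real data
    found = find_ssrs(toy)
    print(found)
    print(summarize(found, len(toy)))
```

    The sketch only finds perfect repeats. Compound or interrupted SSRs, which dedicated tools usually report as well, are outside its scope.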

    J-SPACE: a Julia package for the simulation of spatial models of cancer evolution and of sequencing experiments

    Background: The combined effects of biological variability and measurement-related errors on cancer sequencing data remain largely unexplored. However, the spatio-temporal simulation of multi-cellular systems provides a powerful instrument to address this issue. In particular, efficient algorithmic frameworks are needed to overcome the harsh trade-off between scalability and expressivity, so as to allow one to simulate both realistic cancer evolution scenarios and the related sequencing experiments, which can then be used to benchmark downstream bioinformatics methods. Result: We introduce a Julia package for SPAtial Cancer Evolution (J-SPACE), which allows one to model and simulate a broad set of experimental scenarios, phenomenological rules and sequencing settings. Specifically, J-SPACE simulates the spatial dynamics of cells as a continuous-time multi-type birth-death stochastic process on an arbitrary graph, employing different rules of interaction and an optimised Gillespie algorithm. The evolutionary dynamics of genomic alterations (single-nucleotide variants and indels) is simulated either under the Infinite Sites Assumption or under several different substitution models, including one based on mutational signatures. After mimicking the spatial sampling of tumour cells, J-SPACE returns the related phylogenetic model, and allows one to generate synthetic reads from several Next-Generation Sequencing (NGS) platforms via the ART read simulator. The results are finally returned in standard FASTA, FASTQ, SAM, ALN and Newick file formats. Conclusion: J-SPACE is designed to efficiently simulate the heterogeneous behaviour of a large number of cancer cells and produces a rich set of outputs. Our framework is useful to investigate the emergent spatial dynamics of cancer subpopulations, as well as to assess the impact of incomplete sampling and of experiment-specific errors. Importantly, the output of J-SPACE is designed to allow the performance assessment of downstream bioinformatics pipelines processing NGS data. J-SPACE is freely available at: https://github.com/BIMIB-DISCo/J-Space.jl
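    For readers unfamiliar with the Gillespie algorithm mentioned above, the following Python sketch simulates the simplest related case: a single-type, well-mixed, continuous-time birth-death process. It is not J-SPACE code and does not use the J-SPACE API. The spatial graph, multiple cell types, interaction rules and optimisations described in the abstract are all omitted, and the function name `gillespie_birth_death` and its parameters are hypothetical.

```python
import random


def gillespie_birth_death(n0, birth_rate, death_rate, t_max, seed=None, max_cells=10_000):
    """Simulate a linear birth-death process; return a list of (time, population) pairs.

    Each cell divides at rate `birth_rate` and dies at rate `death_rate`
    (assumed constant per-cell rates, with no spatial structure).
    """
    rng = random.Random(seed)
    t, n = 0.0, n0
    trajectory = [(t, n)]
    while 0 < n < max_cells:
        total_rate = n * (birth_rate + death_rate)
        if total_rate <= 0:
            break  # no event can ever fire
        # Waiting time to the next event is exponential with the total rate.
        t += rng.expovariate(total_rate)
        if t > t_max:
            break
        # Choose the event type in proportion to its share of the total rate.
        if rng.random() < birth_rate / (birth_rate + death_rate):
            n += 1
        else:
            n -= 1
        trajectory.append((t, n))
    return trajectory


if __name__ == "__main__":
    traj = gillespie_birth_death(n0=10, birth_rate=1.0, death_rate=0.5, t_max=2.0, seed=42)
    print(len(traj), "events recorded; final population:", traj[-1][1])
```

    In a spatial variant, the birth step would additionally need a rule for placing the daughter cell on a neighbouring node of the graph. That rule is exactly the part this sketch leaves out.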

    Using Bacterial Artificial Chromosomes in Leukemia Research: The Experience at the University Cytogenetics Laboratory in Brest, France

    The development of the bacterial artificial chromosome (BAC) system was driven in part by the Human Genome Project, which needed genomic DNA libraries and physical maps for genomic sequencing. BAC clones have since become a valuable tool for identifying cancer genes. We report here our experience in identifying genes located at breakpoints of chromosomal rearrangements and in defining the size and boundaries of deletions in hematological diseases. The methodology used in our laboratory consists of a three-step approach: conventional cytogenetics, followed by FISH with commercial probes, then FISH with BAC clones. One limitation of the BAC system is that it can only accommodate inserts of up to 300 kb. As a consequence, analyzing the extent of deletions requires a large amount of material. Array comparative genomic hybridization (array-CGH) using a BAC/PAC system can be an alternative. However, this technique also has limitations, and it cannot be used to identify candidate genes at breakpoints of chromosomal rearrangements such as translocations, insertions, and inversions.

    Finding subtypes of transcription factor motif pairs with distinct regulatory roles

    DNA sequences bound by a transcription factor (TF) are presumed to contain sequence elements that reflect its DNA binding preferences and its downstream regulatory effects. Experimentally identified TF binding sites (TFBSs) are usually similar enough to be summarized by a ‘consensus’ motif, representative of the TF's DNA binding specificity. Studies have shown that groups of nucleotide TFBS variants (subtypes) can contribute to distinct modes of downstream regulation by the TF via differential recruitment of cofactors. For example, a factor TF A may bind to TFBS subtype a1 or a2 depending on whether it associates with cofactor TF B or TF C, respectively. While some approaches can discover motif pairs (dyads), none addresses the problem of identifying ‘variants’ of dyads. TFs are key components of multiple regulatory pathways targeting different sets of genes, perhaps with different binding preferences. Identifying the discriminating TF–DNA associations that lead to the differential downstream regulation is thus essential. We present DiSCo (Discovery of Subtypes and Cofactors), a novel approach for identifying variants of dyad motifs (and their respective target sequence sets) that are instrumental for differential downstream regulation. Using both simulated and experimental datasets, we demonstrate how current motif discovery can be successfully leveraged to address this question.

    RNA-seq based SNP discovery in liver transcriptome of Polish Landrace pigs

    Background: RNA-seq technology is most commonly used for quantitative measurement of gene expression levels and identification of non-annotated transcripts. It is also an efficient and cost-effective way to discover coding SNPs (cSNPs). The aim of this study was to identify putative cSNP variants in the liver transcriptome of Polish Landrace pigs fed diets high or low (normal) in omega-6 and omega-3 polyunsaturated fatty acids (PUFAs). Methods: An RNA-seq based NGS experiment was performed on Polish Landrace pigs fed high- and low-PUFA diets. Total RNA was isolated from liver tissues of the low-PUFA (n=6) and high-PUFA (n=6) dietary groups. RNA-seq libraries were prepared by mRNA enrichment, mRNA fragmentation, second-strand cDNA synthesis, adaptor ligation, size selection and PCR amplification using the Illumina TruSeq RNA Sample Prep Kit v2 (Illumina, San Diego, CA, USA), followed by NGS sequencing on the Illumina MiSeq platform. Quality control (QC) of the raw RNA-seq data was performed using the Trimmomatic and FastQC tools. Paired-end mapping of the liver transcriptome RNA-seq data (n=12) was performed against the reference genome Sus scrofa v.10.2, followed by cSNP discovery using the GATK and SAMtools SNP callers. Results: Two pooled paired-end libraries of 151 bp from the liver transcriptome of Polish Landrace pigs were generated on the MiSeq instrument, and the resulting FASTQ RNA-seq data were submitted to the NCBI SRA database (https://www.ncbi.nlm.nih.gov/sra). Our study identified 25.3 million paired-end reads: 13,509,248 from the high-PUFA dietary group and 11,815,696 from the low-PUFA dietary group. SNP discovery identified 25909 homozygous and 23290 heterozygous cSNPs in the liver transcriptome of both dietary groups. With regard to SNP alleles encoding the same or alternative amino acids, a total of 27141 synonymous and 5989 non-synonymous cSNPs were identified in the liver transcriptome of the high-PUFA dietary group, whereas 15128 synonymous and 3900 non-synonymous cSNPs were identified in the low-PUFA dietary group. The single nucleotide variations (SNVs) representing substitutions of all four bases (A, T, G, C) ranged from 2872 to 6868 SNVs (high PUFAs) and from 2574 to 3654 SNVs (low PUFAs) among the homozygous cSNPs, and from 2452 to 2678 SNVs (high PUFAs) and from 2094 to 2230 SNVs (low PUFAs) among the heterozygous cSNPs. Conclusions: The cSNP dataset identified for the liver transcriptome of Polish Landrace pigs fed a control (low-PUFA) diet or a high-PUFA diet may help to develop a new set of genetic markers for trait-association studies (viz., growth and metabolic traits) specific to the Polish Landrace breed. Such cSNP markers could eventually be used for genetic improvement of pig production traits through genome-wide association studies (GWAS) and, ultimately, in marker-assisted selection (MAS) and genomic selection (GS) programs in the active breeding population of Polish Landrace pigs in Poland.

    True single-cell proteomics using advanced ion mobility mass spectrometry

    In this thesis, I present the development of a novel mass spectrometry (MS) platform and scan modes in conjunction with a versatile and robust liquid chromatography (LC) platform, which together address current sensitivity and robustness limitations in MS-based proteomics. I demonstrate how this technology benefits high-speed, ultra-high-sensitivity proteomics studies on a large scale. This work culminated in the first label-free MS-based single-cell proteomics platform of its kind and its application to spatial tissue proteomics. I also investigate the vastly underexplored ‘dark matter’ of the proteome, validating novel microproteins that contribute to human cellular function. First, we developed a novel trapped ion mobility spectrometry (TIMS) platform for proteomics applications, which multiplies sequencing speed and sensitivity by ‘parallel accumulation – serial fragmentation’ (PASEF), and applied it to the first high-sensitivity and large-scale projects in the biomedical arena. Next, to explore the collisional cross section (CCS) dimension in TIMS, we measured over 1 million peptide CCS values, which enabled us to train a deep learning model for CCS prediction based solely on the linear amino acid sequence. We also translated the principles of TIMS and PASEF to the field of lipidomics, highlighting parallel benefits in terms of throughput and sensitivity. The core of my PhD is the development of a robust ultra-high-sensitivity LC-MS platform for the high-throughput analysis of single-cell proteomes. Improvements in ion transfer efficiency, robust very-low-flow LC and a PASEF data-independent acquisition scan mode together increased measurement sensitivity by up to 100-fold. We quantified single-cell proteomes to a depth of up to 1,400 proteins per cell. A fundamental result of the comparison with single-cell RNA sequencing data was that single cells have a stable core proteome, whereas the transcriptome is dominated by Poisson noise, emphasizing the need for both complementary technologies. Building on our achievements with the single-cell proteomics technology, we elucidated the image-guided spatial and cell-type-resolved proteome in whole organs and tissues from minute sample amounts. We combined clearing of rodent and human organs, unbiased 3D imaging, target tissue identification, isolation and MS-based unbiased proteomics to describe early-stage β-amyloid plaque proteome profiles in a disease model of familial Alzheimer’s. Automated, artificial-intelligence-driven isolation and pooling of single cells of the same phenotype allowed us to analyze the cell-type-resolved proteome of cancer tissues, revealing a remarkable spatial difference in the proteome. Last, we systematically elucidated pervasive translation of noncanonical human open reading frames by combining state-of-the-art ribosome profiling, CRISPR screens, imaging and MS-based proteomics. We performed an unbiased analysis of small novel proteins and proved their physical existence by LC-MS as HLA peptides and as essential interaction partners of protein complexes, and showed their contribution to cellular function.

    Application of Semantics to Solve Problems in Life Sciences

    Thesis defense date: 10 December 2018. The amount of information generated on the Web has increased in recent years. Most of this information is available as text, since humans are the Web's main users. However, despite all the advances in natural language processing, computers still struggle to process this textual information. In this context, there are application domains, such as the Life Sciences, in which large amounts of information are being published as structured data. Analyzing these data is vitally important not only for the advancement of science but also for progress in healthcare. However, these data are located in different repositories and stored in different formats, which makes them difficult to integrate. Here the Linked Data paradigm emerges as a technology that applies several standards proposed by the W3C community, such as HTTP URIs, RDF and OWL. Using this technology, this doctoral thesis was developed to address the following main objectives: 1) to promote the use of Linked Data among the Life Sciences user community; 2) to facilitate the design of SPARQL queries by discovering the underlying model of RDF repositories; 3) to create a collaborative environment that makes it easier for end users to consume Linked Data; 4) to develop an algorithm that automatically discovers the OWL semantic model of an RDF repository; 5) to develop an OWL representation of ICD-10-CM, called Dione, that provides an automatic methodology for classifying patients' diseases and subsequently validating the classification with an OWL reasoner.