35 research outputs found

    Computational methods for RNA splicing

    Learning the Regulatory Code of Gene Expression

    Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology

    Understanding cellular differentiation by modelling of single-cell gene expression data

    Over the course of the last decade single-cell RNA sequencing (scRNA-seq) has revolutionized the study of cellular heterogeneity, as one experiment routinely covers the expression of thousands of genes in tens or hundreds of thousands of cells. By quantifying differences between the single-cell transcriptomes it is possible to reconstruct the process that gives rise to different cell fates from a progenitor population and gain access to trajectories of gene expression over developmental time. Tree reconstruction algorithms must deal with the high levels of noise, the high dimensionality of gene expression space, and strong non-linear dependencies between genes. In this thesis we address three aspects of working with scRNA-seq data: (1) lineage tree reconstruction, where we propose MERLoT, a novel trajectory inference method, (2) method comparison, where we propose PROSSTT, a novel algorithm that simulates scRNA-seq count data of complex differentiation trajectories, and (3) noise modelling, where we propose a novel probabilistic description of count data, a statistically motivated local averaging strategy, and an adaptation of the cross validation approach for the evaluation of gene expression imputation strategies. While statistical modelling of the data was our primary motivation, due to time constraints we did not manage to fully realize our plans for it. Increasingly complex processes like whole-organism development are being studied by single-cell transcriptomics, producing large amounts of data. Methods for trajectory inference must therefore efficiently reconstruct a priori unknown lineage trees with many cell fates. We propose MERLoT, a method that can reconstruct trees in sub-quadratic time by utilizing a local averaging strategy, scaling very well on large datasets. MERLoT compares favorably to the state of the art, both on real data and a large synthetic benchmark.
The absence of data with known complex underlying topologies makes it challenging to quantitatively compare tree reconstruction methods to each other. PROSSTT is a novel algorithm that simulates count data from complex differentiation processes, facilitating comparisons between algorithms. We created the largest synthetic dataset to date, and the first to contain simulations with up to 12 cell fates. Additionally, PROSSTT can learn simulation parameters from reconstructed lineage trees and produce cells with expression profiles similar to the real data. Quantifying similarity between single-cell transcriptomes is crucial for clustering scRNA-seq profiles to cell types or inferring developmental trajectories, and appropriate statistical modelling of the data should improve such similarity calculations. We propose a Gaussian mixture of negative binomial distributions where gene expression variance depends on the square of the average expression. The model hyperparameters can be learned via the hybrid Monte Carlo algorithm, and a good initialization of average expression and variance parameters can be obtained by trajectory inference. A way to limit noise in the data is to apply local averaging, using the nearest neighbours of each cell to recover expression of non-captured mRNA. Our proposal, nearest neighbour smoothing with optimal bias-variance trade-off, optimizes the k-nearest neighbours approach by reducing the contribution of inappropriate neighbours. We also propose a way to assess the quality of gene expression imputation. After reconstructing a trajectory with imputed data, each cell can be projected to the trajectory using non-overlapping subsets of genes. The robustness of these assignments over multiple partitions of the genes is a novel estimator of imputation performance. Finally, I was involved in the planning and initial stages of a mouse ovary cell atlas as a collaboration
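
The local averaging idea mentioned in this abstract can be illustrated with a minimal sketch of plain k-nearest-neighbour smoothing. This is not the thesis's bias-variance-optimized method, which down-weights inappropriate neighbours; the function name `knn_smooth`, the Euclidean distance and the default `k` are assumptions made purely for illustration.

```python
import math


def knn_smooth(counts, k=2):
    """Replace each cell's profile with the mean over itself and its k nearest cells.

    counts: list of cells, each a list of per-gene expression values.
    Plain, unweighted averaging (hypothetical illustration, brute-force search).
    """
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    smoothed = []
    for cell in counts:
        # The cell is at distance 0 from itself, so its own profile is always included.
        neighbours = sorted(counts, key=lambda other: distance(cell, other))[:k + 1]
        n_genes = len(cell)
        smoothed.append(
            [sum(nb[g] for nb in neighbours) / len(neighbours) for g in range(n_genes)]
        )
    return smoothed
```

For example, `knn_smooth([[0, 10], [2, 10], [100, 0]], k=1)` averages each cell with its single closest neighbour. The brute-force neighbour search here is quadratic in the number of cells and is meant only to show the principle.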

    piRNN: deep learning algorithm for piRNA prediction

    Piwi-interacting RNAs (piRNAs) are the largest class of small non-coding RNAs discovered in germ cells. Identifying piRNAs from small RNA data is a challenging task because piRNAs lack conserved sequences and structural features. Many programs have been developed to identify piRNAs from small RNA data. However, these programs have limitations: they either rely on extracting complicated features or only demonstrate strong performance on transposon-related piRNAs. Here we propose a new program called piRNN for piRNA identification. The software applies a convolutional neural network classifier trained on datasets from four different species (Caenorhabditis elegans, Drosophila melanogaster, rat and human). A matrix of k-mer frequency values is used to represent each sequence. piRNN is easy to use and performs better than other programs. It is freely available at https://github.com/bioinfolabmu/piRNN
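
As a rough illustration of the k-mer frequency representation mentioned in this abstract, the sketch below computes the relative frequency of every possible k-mer in a sequence. It is not taken from piRNN; the function name `kmer_frequencies`, the choice of `k`, and the handling of `U` and ambiguous bases are assumptions, and how piRNN arranges these values into its input matrix is not described here.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=3, alphabet="ACGT"):
    """Return the relative frequency of every possible k-mer, in a fixed order.

    Hypothetical helper for illustration only.
    """
    sequence = sequence.upper().replace("U", "T")  # treat RNA and DNA alike
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Only k-mers made of unambiguous bases contribute to the total.
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]
```

Calling `kmer_frequencies(seq, k=2)` yields a vector of 16 values; vectors for several values of `k` could then be combined into whatever input layout a classifier expects.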

    Embryo selection through image analysis: a deep learning approach

    Infertility affects about 186 million people worldwide and 9-10% of couples in Portugal, causing financial, social and medical problems. Evaluation of embryo quality based on morphological features is the standard in in vitro fertilization (IVF) clinics around the world. This process is subjective and time-consuming, and results in discrepant classifications among embryologists and clinics, leading to failure to accurately predict embryo implantation and live birth potential. Although assisted reproductive technologies (ART) such as IVF coupled with time-lapse imaging (TLI), which eliminates periodic transfer to a microscope for assessment and maintains stable culture conditions for embryo development, have alleviated the infertility problem, there are significant limitations even considering morphokinetic analysis. Likewise, many patients require multiple IVF cycles to achieve pregnancy, making the selection of a single embryo for transfer a critical challenge. Here, we demonstrate, as a proof of concept, the reliability of machine learning, especially deep learning based on the open-source TensorFlow and Keras libraries, for feature extraction and classification of raw embryo TLI images in clinical practice. Equally, we present a follow-up pipeline that lets clinicians and researchers with no expertise in machine learning use deep learning easily, rapidly and accurately as a clinical decision support tool in embryo viability studies, as well as in other medical fields where image analysis is prominent. Master's in Molecular and Cell Biology

    A SVM-based method to classify RBM20 affected and not affected exons

    Mutations of RNA binding motif protein 20 (RBM20) have recently been reported to cause human dilated cardiomyopathy (DCM) (Brauch et al., 2009, Li et al., 2010). DCM is the major cause of heart failure and mortality around the world (Jefferies and Towbin, 2010). Overall, 25–50% of DCM cases are familial, and causative mutations have been described in more than 50 genes encoding mostly structural components of cardiomyocytes. RBM20 belongs to the family of SR and SR-related RNA binding proteins, which assemble in the spliceosome and take part in the splicing of pre-mRNA. RBM20 is mainly expressed in striated muscle, with the highest levels in the heart (Guo et al., 2012). Due to its involvement in DCM, RBM20 has been studied extensively to unveil its mechanism of action and its RNA targets (Guo et al., 2012, Li et al., 2013). Guo and colleagues reported a set of 31 genes showing RBM20-dependent splicing from a whole transcriptome analysis in rats and humans (Guo et al., 2012). More recently, Maatz and colleagues reported an additional set of 18 rat genes and observed that RNA sequences recognized by RBM20 are likely to be located in the 400 nucleotides flanking the exons whose alternative splicing is regulated by RBM20 (Maatz et al., 2014). However, both the suggested RNA sequence recognized by RBM20 and its over-representation in the flanking regions of affected exons remain poor predictors of target genes with splicing events regulated by RBM20. The aim of this work was thus to characterize, through a bioinformatic approach, the sequence motifs of the exons whose alternative splicing is affected by RBM20, in order to improve the prediction of the genes (exons) affected by RBM20. A differential expression analysis was performed to select the dataset of RBM20-affected exons; a further dataset was retrieved from literature data (Maatz et al., 2014).
A Support Vector Machine (SVM) approach evaluating several kinds of genetic elements binding in the flanking regions of our target exons was used. An SVM method was chosen to classify RBM20-affected and unaffected exons; other machine learning algorithms could have been used as well, but SVM is among the most commonly used. In our analyses, the model discriminated well between RBM20-affected and unaffected exons. From a biological and functional point of view, this approach helps to identify novel candidate genes associated with diseases that depend on a dysregulation of RBM20. This study provided additional information about RBM20 regulation of target exons, based not only on the RNA binding site but also on other genetic elements associated with the binding site. Furthermore, we proposed the first model based on an SVM algorithm for the classification of RBM20-affected and unaffected exons

    Use of genome sequencing to hunt for cryptic second-hit variants: analysis of 31 cases recruited to the 100 000 Genomes Project

    Background: Current clinical testing methods used to uncover the genetic basis of rare disease have inherent limitations, which can lead to causative pathogenic variants being missed. Within the rare disease arm of the 100 000 Genomes Project (100kGP), families were recruited under the clinical indication ‘single autosomal recessive mutation in rare disease’. These participants presented with strong clinical suspicion for a specific autosomal recessive disorder, but only one suspected pathogenic variant had been identified through standard-of-care testing. Whole genome sequencing (WGS) aimed to identify cryptic ‘second-hit’ variants. Methods: To investigate the 31 families with available data that remained unsolved following formal review within the 100kGP, SVRare was used to aggregate structural variants present in <1% of 100kGP participants. Small variants were assessed using population allele frequency data and SpliceAI. Literature searches and publicly available online tools were used for further annotation of pathogenicity. Results: Using these strategies, 8/31 cases were solved, increasing the overall diagnostic yield of this cohort from 10/41 (24.4%) to 18/41 (43.9%). Exemplar cases include a patient with cystic fibrosis harbouring a novel exonic LINE1 insertion in CFTR and a patient with generalised arterial calcification of infancy with complex interlinked duplications involving exons 2–6 of ENPP1. Although ambiguous by short-read WGS, the ENPP1 variant structure was resolved using optical genome mapping and RNA analysis. Conclusion: Systematic examination of cryptic variants across a multi-disease cohort successfully identifies additional pathogenic variants. WGS data analysis in autosomal recessive rare disease should consider complex structural and small intronic variants as potentially pathogenic second hits

    Modeling meiotic recombination hotspots using deep learning

    Meiotic recombination plays a critical role in the proper segregation of chromosomes during meiosis and in forming new combinations of genetic material within sexually reproducing species. For a long time, its side effects were observed as a deviation from Mendel's principle of independent assortment; however, its molecular mechanisms remain only partially understood to this day. We know that it is a highly regulated process and that many proteins are involved in this tight control, directing meiotic recombination into genomic regions of 1-2 kilobases called hotspots. During the past few years, deep learning has been successfully applied to the classification of genomic sequences. In this work, we apply deep learning to human DNA sequences in order to predict whether a specific stretch of DNA is a meiotic recombination hotspot or not. We applied convolutional neural networks to a dataset describing the hotspots of four unrelated male individuals, achieving an accuracy of over 88% with precision and recall above 90% for the best models. We explored the impact of different input sequence lengths, train/validation split strategies, and showing the model the genomic coordinates of the input sequence. We explored different ways to construct the motifs learnt by the network and how they relate to the classical methods of constructing position-weight matrices, and we were able to infer relevant biological knowledge uncovered by the network. We also developed a tool for visualizing the outputs of the different models in order to help interpret the different aspects of each model. Overall, our work shows the ability of deep learning methods to study meiotic recombination from genomic data
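
For readers unfamiliar with the position-weight matrices this abstract refers to, here is a minimal sketch of the classical construction from aligned, equal-length sequences. It is not the thesis's method for deriving motifs from network filters; the function names, the pseudocount and the uniform background of 0.25 are assumptions chosen for illustration.

```python
import math


def position_weight_matrix(sequences, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per position) from equal-length DNA sequences.

    Hypothetical helper: uniform background and a fixed pseudocount are assumed.
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    pwm = []
    for pos in range(length):
        column = [s[pos].upper() for s in sequences]
        total = len(column) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, sequence):
    """Sum the log-odds of the observed base at each position of the sequence."""
    return sum(col[base] for col, base in zip(pwm, sequence.upper()))
```

A sequence that matches the consensus of the aligned set receives a higher `score` than one that does not, which is the same kind of quantity a convolutional filter computes when it slides over a one-hot encoded sequence.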

    From human genetics to radiobiology : in vitro radiosensitivity in individuals with a germline defect in DNA damage response genes

    All currently known high to intermediate risk “breast cancer genes”, including BRCA1 and BRCA2, are involved in the DNA damage response pathway. Heterozygous germline mutations in these genes predispose to breast and ovarian cancer. In addition, such mutations may also result in enhanced radiosensitivity mediated by chromosomal instability after exposure to ionizing radiation, leading to a higher risk of developing radiation-induced breast cancer. However, results of currently available clinical studies evaluating carcinogenesis and in vitro studies comparing chromosomal radiosensitivity in mutation carriers and non-carriers are inconclusive. Nevertheless, insight into the radiosensitive phenotype of healthy tissues of mutation carriers is of the utmost importance for the safe use of ionizing radiation for diagnostic purposes or radiotherapy treatment. In this thesis, we evaluated in vitro radiosensitivity in carriers of a mutation in DNA damage response genes by means of two different assays. The first assay, the G2 micronucleus assay, is a cytogenetic assay in which micronuclei (MN) are analyzed in cells irradiated in the G2 phase of the cell cycle. This assay was developed to evaluate radiosensitivity in cells with a heterozygous BRCA1 or BRCA2 mutation. BRCA1 and BRCA2 have a function in homologous recombination (HR), the main DNA double strand break repair pathway activated in late S and G2 phase of the cell cycle. Furthermore, BRCA1 is also involved in the G2/M cell cycle checkpoint. The G2 micronucleus assay allows evaluation of both functions by means of two distinct endpoints: (1) the radiation-induced micronucleus yield, which reflects DNA double strand break repair capacity and (2) the G2/M checkpoint efficiency ratio, which allows evaluation of the G2 arrest capacity. Before applying the G2 micronucleus assay to BRCA mutation carriers, the assay was validated in a patient with Ataxia Telangiectasia (AT).
AT patients are characterized by a manifest increased radiosensitivity. AT patients show biallelic inactivation of ATM, involved in both DNA double strand break repair by means of HR and G2/M checkpoint activation. We demonstrated a severely increased radiosensitivity with both endpoints when applying the G2 micronucleus assay in lymphocytes of this AT patient. In lymphocytes of healthy relatives with a heterozygous ATM mutation the radiosensitivity observed with this assay was intermediate between the AT patient and the control cohort. When applying the G2 micronucleus assay on lymphocytes of healthy BRCA1/2 mutation carriers, we demonstrated significantly enhanced radiation-induced MN yields in both BRCA1 and BRCA2 germline mutation carriers, pointing to an impaired DNA double strand break repair capacity in both groups. Furthermore, an impaired G2 arrest capacity was observed in BRCA1 mutation carriers. In healthy relatives who did not inherit the familial mutation, no enhanced radiosensitivity was observed. Although a significantly enhanced radiosensitivity was demonstrated for the cohort of BRCA1 and BRCA2 mutation carriers compared to the control cohort, individual radiosensitivity evaluation was less straightforward due to overlap in micronucleus yields between both cohorts. Therefore, a scoring system to evaluate individual radiosensitivity was implemented. As both BRCA1 and BRCA2 are involved in HR, we evaluated if the accumulation of RAD51, a key protein involved in this pathway, at the double strand break site can be used to assess HR functionality and radiosensitivity. To this end, a radiation-induced RAD51 foci assay was optimized in a breast epithelial cell line (MCF10A) expressing ±50% reduced BRCA1 and BRCA2 protein levels, obtained by RNA interference. RAD51 foci were analyzed in cells synchronized in S phase by aphidicolin as HR is upregulated during this phase of the cell cycle. 
We demonstrated significantly reduced RAD51 foci formation, and thus impaired HR capacity, in response to radiation-induced double strand breaks in the BRCA knockdown cells compared to control cells. As no overlap in RAD51 foci distribution is observed between knockdown and control cells, we think that this assay could differentiate better than the G2 micronucleus assay between normal cells and cells with a heterozygous BRCA1 or BRCA2 mutation. This will be further explored in synchronized lymphocytes of heterozygous germline mutation carriers. In addition to the detection of unequivocal deleterious mutations in BRCA1 and BRCA2, variants of unknown clinical significance (VUS) are detected during diagnostic screening. The associated breast cancer risk is unknown, which creates a challenge for genetic counselling. mRNA analysis is widely used to assess variants that might impair proper RNA splicing, a highly regulated process. We evaluated the outcome at cDNA level of 21 putative splicing variants in BRCA1 and BRCA2 and demonstrated aberrant splicing for 12 variants, suggesting that these are likely pathogenic. Furthermore, we demonstrated that in silico prediction tools might assist in the evaluation of these putative splicing variants. However, further optimization is warranted to allow reliable application outside the highly conserved consensus splice sites. The results obtained in this thesis may indicate that care should be taken when applying ionizing radiation for diagnostic or therapeutic purposes in individuals with a germline mutation in BRCA1 or BRCA2 as they may be at higher risk of developing radiation-induced breast cancer