222 research outputs found

    KIRMES: kernel-based identification of regulatory modules in euchromatic sequences

    Motivation: Understanding transcriptional regulation is one of the main challenges in computational biology. An important problem is the identification of transcription factor (TF) binding sites in promoter regions of potential TF target genes. It is typically approached with position weight matrix-based motif identification algorithms that use Gibbs sampling or heuristics to extend seed oligos. Such algorithms succeed in identifying single, relatively well-conserved binding sites, but tend to fail at identifying combinations of several degenerate binding sites, such as those often found in cis-regulatory modules.
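
    For readers unfamiliar with the position weight matrix (PWM) approach that this abstract contrasts itself with, the following is a minimal sketch of PWM scoring. It is not the KIRMES method. The function names, the pseudocount, the uniform background frequency and the example sites are all illustrative assumptions, and the input is assumed to contain only the letters A, C, G and T.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length, aligned binding sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        total = len(column) + 4 * pseudocount
        # Log-odds of the smoothed base frequency against a uniform background.
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def best_hit(pwm, sequence):
    """Return (score, offset) of the highest-scoring window in the sequence."""
    width = len(pwm)
    scores = [
        (sum(pwm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scores)

# Hypothetical aligned sites and promoter fragment, for illustration only.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
pwm = build_pwm(sites)
score, offset = best_hit(pwm, "GGCGTATAATGCGC")
```

    A single matrix like this scores one well-conserved site at a time, which is why such methods can struggle with combinations of several degenerate sites.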

    Modern Computing Techniques for Solving Genomic Problems

    With the advent of high-throughput genomics, biological big data challenges scientists to handle, analyze, process and mine massive datasets. In this new interdisciplinary field, diverse theories, methods, tools and knowledge are used to solve a wide variety of problems. As an exploration, this dissertation project combines concepts and principles from multiple areas, including signal processing, information-coding theory, artificial intelligence and cloud computing, to solve the following problems in computational biology: (1) comparative gene structure detection, (2) DNA sequence annotation, and (3) investigation of CpG islands (CGIs) for epigenetic studies. Briefly, in problem #1, sequences are transformed into signal series or binary codes. As in speech recognition, similarity is calculated between two signal series, and the signals are subsequently stitched or matched into a temporal sequence. Because the operations are binary, all calculations can be performed efficiently and accurately. Improving accuracy and specificity is key for a comparative method. In problem #2, DNA sequences are encoded as numeric representations for deep learning methods. Because the encoding scheme greatly influences the performance of deep learning algorithms, finding the best encoding scheme for a particular application is important. Three applications (detection of protein-coding splicing sites, detection of lincRNA splicing sites and improvement of comparative gene structure identification) are used to show the computing power of deep neural networks. In problem #3, CpG sites are assigned an energy and a Gaussian filter is applied to detect CpG islands. Using the CpG box and a Markov model, we investigate the properties of CGIs and redefine CGIs using emerging epigenetic data. In summary, these three problems and their solutions are not isolated; they are linked to modern techniques in such diverse areas as signal processing, information-coding theory, artificial intelligence and cloud computing. These novel methods are expected to improve the efficiency and accuracy of computational tools and bridge the gap between biology and scientific computing.
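
    The abstract does not say which encoding schemes were compared. As one common possibility, the sketch below shows one-hot encoding of a DNA sequence. The function name, the A/C/G/T column order and the all-zero row for unknown symbols are assumptions made for illustration, not details taken from the dissertation.

```python
BASES = "ACGT"  # assumed column order

def one_hot(sequence):
    """Encode a DNA string as a list of 4-element rows, one row per base.

    Symbols outside A/C/G/T (for example N) become an all-zero row,
    which is an illustrative choice rather than a standard.
    """
    rows = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        index = BASES.find(base)
        if index >= 0:
            row[index] = 1
        rows.append(row)
    return rows

encoded = one_hot("ACGTN")
```

    The resulting rows can be stacked into an array of shape (sequence length, 4) and passed to a neural network input layer.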

    Explainable deep learning models for biological sequence classification

    Biological sequences - DNA, RNA and proteins - orchestrate the behavior of all living cells and trying to understand the mechanisms that govern and regulate the interactions among these molecules has motivated biological research for many years. The introduction of experimental protocols that analyze such interactions on a genome- or transcriptome-wide scale has also established the usage of machine learning in our field to make sense of the vast amounts of generated data. Recently, deep learning, a branch of machine learning based on artificial neural networks, and especially convolutional neural networks (CNNs) were shown to deliver promising results for predictive tasks and automated feature extraction. However, the resulting models are often very complex and thus make model application and interpretation hard, but the possibility to interpret which features a model has learned from the data is crucial to understand and to explain new biological mechanisms. This work therefore presents pysster, our open source software library that enables researchers to more easily train, apply and interpret CNNs on biological sequence data. We evaluate and implement different feature interpretation and visualization strategies and show that the flexibility of CNNs allows for the integration of additional data beyond pure sequences to improve the biological feature interpretability. We demonstrate this by building, among others, predictive models for transcription factor and RNA-binding protein binding sites and by supplementing these models with structural information in the form of DNA shape and RNA secondary structure. Features learned by models are then visualized as sequence and structure motifs together with information about motif locations and motif co-occurrence. By further analyzing an artificial data set containing implanted motifs we also illustrate how the hierarchical feature extraction process in a multi-layer deep neural network operates. 
Finally, we present a larger biological application by predicting RNA-binding of proteins for transcripts for which experimental protein-RNA interaction data is not yet available. Here, the comprehensive interpretation options of CNNs made us aware of potential technical bias in the experimental eCLIP data (enhanced crosslinking and immunoprecipitation) that were used as a basis for the models. This allowed for subsequent tuning of the models and data to get more meaningful predictions in practice

    NOVEL APPLICATIONS OF MACHINE LEARNING IN BIOINFORMATICS

    Technological advances in next-generation sequencing and biomedical imaging have led to a rapid increase in biomedical data dimension and acquisition rate, which is challenging the conventional data analysis strategies. Modern machine learning techniques promise to leverage large data sets for finding hidden patterns within them, and for making accurate predictions. This dissertation aims to design novel machine learning-based models to transform biomedical big data into valuable biological insights. The research presented in this dissertation focuses on three bioinformatics domains: splice junction classification, gene regulatory network reconstruction, and lesion detection in mammograms. A critical step in defining gene structures and mRNA transcript variants is to accurately identify splice junctions. In the first work, we built the first deep learning-based splice junction classifier, DeepSplice. It outperforms the state-of-the-art classification tools in terms of both classification accuracy and computational efficiency. To uncover transcription factors governing metabolic reprogramming in non-small-cell lung cancer patients, we developed TFmeta, a machine learning approach to reconstruct relationships between transcription factors and their target genes in the second work. Our approach achieves the best performance on benchmark data sets. In the third work, we designed deep learning-based architectures to perform lesion detection in both 2D and 3D whole mammogram images

    Feature selection for classification of nucleic acid sequences


    Domain adaptation algorithms for biological sequence classification

    Doctor of Philosophy. Department of Computing and Information Sciences. Doina Caragea.
    The large volume of data generated in recent years has created opportunities for discoveries in various fields. In biology, next generation sequencing technologies determine the exact order of nucleotides within a DNA or RNA fragment faster and more cheaply than before. This large volume of data requires automated tools to extract information and generate knowledge. Machine learning classification algorithms provide an automated means to annotate data but require some of these data to be manually labeled by human experts, a process that is costly and time consuming. An alternative to labeling data is to use existing labeled data from a related domain, the source domain, if any such data is available, to train a classifier for the domain of interest, the target domain. However, the classification accuracy usually decreases for the domain of interest as the distance between the source and target domains increases. Another alternative is to label some data, complement it with abundant unlabeled data from the same domain, and train a semi-supervised classifier, although the unlabeled data can mislead such a classifier. In this work another alternative is considered, domain adaptation, in which the goal is to train an accurate classifier for a domain with limited labeled data and abundant unlabeled data, the target domain, by leveraging labeled data from a related domain, the source domain. Several domain adaptation classifiers are proposed, derived from a supervised discriminative classifier (logistic regression) or a supervised generative classifier (naïve Bayes), and some of the factors that influence their accuracy are studied: features, data used from the source domain, how to incorporate the unlabeled data, and how to combine all available data. The proposed approaches were evaluated on two biological problems: protein localization and ab initio splice site prediction. The former is motivated by the fact that predicting where a protein is localized provides an indication of its function, whereas the latter is an essential step in gene prediction.

    Machine Learning Approaches for Cancer Analysis

    In addition, we propose many machine learning models, each of which contributes to solving a biological problem. First, we present Zseq, a linear time method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq measures the complexity of each sequence by counting its unique k-mers and using that count as the sequence's score; it also takes into account other factors, such as ambiguous nucleotides or a high GC-content percentage in k-mers. Zseq then sweeps through the sequences again and filters out those whose z-score is below the user-defined threshold. Zseq provides a better mapping rate, and it reduces the number of ambiguous bases significantly in comparison with other methods. The filtered reads were evaluated by aligning the reads and assembling the transcripts using the reference genome as well as de novo assembly. The assembled transcripts separate cancer and normal samples better than those produced by another state-of-the-art method. Studying the abundance of select mRNA species throughout prostate cancer progression may provide some insight into the molecular mechanisms that advance the disease. In the second contribution of this dissertation, we show that a combination of appropriate clustering, a distance function and index validation for clusters is suitable for identifying outlier transcripts, whose trend differs from that of the majority of transcripts; here, a transcript's trend is its abundance across the different stages of prostate cancer. We compare this model with a standard hierarchical time-series clustering method based on Euclidean distance. 
Using time-series profile hierarchical clustering methods, we identified stage-specific mRNA species termed outlier transcripts that exhibit unique trending patterns as compared to most other transcripts during disease progression. This method is able to identify those outliers rather than finding patterns among the trending transcripts compared to the hierarchical clustering method based on Euclidean distance. A wet-lab experiment on a biomarker (CAM2G gene) confirmed the result of the computational model. Genes related to these outlier transcripts were found to be strongly associated with cancer, and in particular, prostate cancer. Further investigation of these outlier transcripts in prostate cancer may identify them as potential stage-specific biomarkers that can predict the progression of the disease. Breast cancer, on the other hand, is a widespread type of cancer in females and accounts for a lot of cancer cases and deaths in the world. Identifying the subtype of breast cancer plays a crucial role in selecting the best treatment. In the third contribution, we propose an optimized hierarchical classification model that is used to predict the breast cancer subtype. Suitable filter feature selection methods and new hybrid feature selection methods are utilized to find discriminative genes. Our proposed model achieves 100% accuracy for predicting the breast cancer subtypes using the same or even fewer genes. Studying breast cancer survivability among different patients who received various treatments may help understand the relationship between the survivability and treatment therapy based on gene expression. In the fourth contribution, we have built a classifier system that predicts whether a given breast cancer patient who underwent some form of treatment, which is either hormone therapy, radiotherapy, or surgery will survive beyond five years after the treatment therapy. 
Our classifier is a tree-based hierarchical approach that partitions breast cancer patients based on survivability classes; each node in the tree is associated with a treatment therapy and finds a predictive subset of genes that can best predict whether a given patient will survive after that particular treatment. We applied our tree-based method to a gene expression dataset that consists of 347 treated breast cancer patients and identified potential biomarker subsets with prediction accuracies ranging from 80.9% to 100%. We have further investigated the roles of many biomarkers through the literature. Studying gene expression through various time intervals of breast cancer survival may provide insights into the recovery of the patients. Discovery of gene indicators can be a crucial step in predicting survivability and handling of breast cancer patients. In the fifth contribution, we propose a hierarchical clustering method to separate dissimilar groups of genes in time-series data as outliers. These isolated outliers, genes that trend differently from other genes, can serve as potential biomarkers of breast cancer survivability. In the last contribution, we introduce a method that uses machine learning techniques to identify transcripts that correlate with prostate cancer development and progression. We have isolated transcripts that have the potential to serve as prognostic indicators and may have significant value in guiding treatment decisions. Our study also supports PTGFR, NREP, scaRNA22, DOCK9, FLVCR2, IK2F3, USP13, and CLASP1 as potential biomarkers to predict prostate cancer progression, especially between stage II and subsequent stages of the disease
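
    The Zseq filtering step described at the start of this abstract (count unique k-mers per sequence, convert the counts to z-scores, drop sequences below a threshold) can be sketched roughly as follows. This is not the Zseq implementation: it omits the GC-content and ambiguous-nucleotide adjustments the abstract mentions, and the function names, the value of k, the threshold and the example reads are illustrative assumptions.

```python
from statistics import mean, pstdev

def unique_kmers(sequence, k):
    """Count the distinct k-mers in a sequence (the complexity score)."""
    return len({sequence[i:i + k] for i in range(len(sequence) - k + 1)})

def filter_by_zscore(sequences, k=3, threshold=-1.0):
    """Keep sequences whose complexity z-score is at or above the threshold."""
    scores = [unique_kmers(seq, k) for seq in sequences]
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All sequences are equally complex, so nothing is filtered out.
        return list(sequences)
    return [
        seq for seq, score in zip(sequences, scores)
        if (score - mu) / sigma >= threshold
    ]

# Hypothetical reads; the poly-A read has only one distinct 3-mer.
reads = ["ACGTACGTAGCT", "AAAAAAAAAAAA", "ACGGTCATGCAT", "TGCATGACCGTA"]
kept = filter_by_zscore(reads)
```

    With these example reads, the low-complexity poly-A read falls below the threshold and is removed, while the other three are kept.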

    Detection and characterisation of RNA processing variation from deep RNA sequencing data

    The introduction of high-throughput sequencing technologies has opened unprecedented opportunities for research on the regulation of ribonucleic acid (RNA) processing, which is central to cellular information processing. By enabling accurate and extensive measurements of various properties of cellular RNAs, these techniques allow systematic investigation of the transcriptome and its regulation on a genome-wide scale. The development of computational methods to analyse the resulting data, however, is still lagging behind the advances in experimental data generation. In this thesis, we present novel approaches to leverage the potential of high-throughput sequencing technologies for studying the regulation of RNA processing. More specifically, we focused on the following three research problems: First, we investigated how to best extract information from RNA-sequencing (RNA-Seq) data and how to design RNA-Seq experiments in order to maximise their utility for answering the investigated question. For this purpose, we derived a probabilistic model to estimate the utility of RNA-Seq experiments as a function of the experimental parameters for typical analyses such as the identification of transcripts and the detection of differential splicing. Application of our models provided fundamental, experimentally supported insights into how particular experimental parameters influence the amount of information gained from an RNA-Seq experiment. Based on these insights, we suggest strategies for an improved experimental design of transcriptome analysis experiments. The second investigated aspect was the detection of differential RNA processing based on high-throughput sequencing data. 
Here, we proposed novel statistical tests to detect changes in RNA processing for two distinct settings: when the gene annotation is complete (which is often the case for model organisms) and when the gene annotation is incomplete or unknown (as is the case for non-model organisms or pathological phenotypes). We showed that, on both realistically simulated and experimental data, our newly developed tests outperformed state-of-the-art methods. Furthermore, we showed how our methods could be extended to detect differential RNA secondary structure and to associate changes in RNA processing with genetic variation. Finally, we successfully applied our methods to investigate the role of splicing in human cancer cells, to understand mechanisms of nonsense-mediated decay in A. thaliana and to reveal regulatory structural motifs of translation in humans. The third investigated aspect was the characterisation of changes in RNA processing. We showed that combining RNA-Seq data with information on genomic variation and transcription factor binding preferences explained causes of gene expression variation. For this, we first performed a comprehensive analysis of the gene expression landscape in an A. thaliana population. Furthermore, we showed that there is a significant enrichment of genetic variants associated with gene expression in predicted transcription factor binding sites. Finally, we showed that alterations of transcription factor binding sites are a major driver of gene expression variation. Overall, we addressed different aspects of the detection and characterisation of RNA processing. Using our new methods, we have gained novel insights into the regulation of RNA processing. However, the work has also shown that there are still several open questions, which should be addressed in future studies.