1,240 research outputs found

    ChimeRScope: a novel alignment-free algorithm for fusion gene prediction using paired-end short reads

    Get PDF
    Fusion genes are those that result from the fusion of two or more genes, and they are typically generated due to the perturbations in the genome structure in cancer cells. In turn, fusion genes can contribute to tumor formation and progression by promoting the expression of an oncogene, deregulation of a tumor-suppressor, or producing much more active abnormal proteins. More importantly, oncogenic fusion genes are specifically expressed in the tumor cells, which provide enormous diagnostic and therapeutic advantages for cancer treatment. With the development of next-generation sequencing (NGS) technology, RNA-Seq becomes increasingly popular for transcriptomic study because of its high sensitivity and the capability of detecting novel transcripts including fusion genes. To date, many fusion gene detection tools have been developed, most of which attempt to find reliable alignment evidence for chimeric transcripts from RNA-Seq data. It is well accepted that the alignment quality of sequencing reads against the reference genome is often limited when significant differences in the genomes exist, which is the case with cancer genomes that contain many genomic perturbations and structural variations. Hence, regions where fusion genes occur in the cancer genome tend to be largely different from those in the reference genome, which prevents the alignment-based fusion gene detection methods from achieving good accuracies. We developed a tool called ChimeRScope. ChimeRScope, being an alignment-free method, bypasses the sequence alignment step by assessing the gene fingerprint profiles (in the form of k-mers) from RNA-Seq paired-end reads for fusion gene prediction (Chapter Two). We also optimized the data structure and ChimeRScope algorithms, in order to overcome the common limitations (memory-utilization, low accuracies) that are commonly seen in alignment-free methods (Chapter Two). Results on simulated datasets, previously studied cancer RNA-Seq datasets, and experimental validations on in-house datasets have shown that ChimeRScope consistently performed better than other popular alignment-based methods irrespective of the read length and depth of sequencing coverage (Chapter Three). ChimeRScope also generates graphical outputs for illustrations of the fusion patterns. Lastly, we also developed downloadable software for ChimeRScope and implemented an online data analysis server using the Galaxy platform (Chapter Four). ChimeRScope is available at https://github.com/ChimeRScope/ChimeRScope/

    Statistical methods for biological sequence analysis for DNA binding motifs and protein contacts

    Get PDF
    Over the last decades a revolution in novel measurement techniques has permeated the biological sciences filling the databases with unprecedented amounts of data ranging from genomics, transcriptomics, proteomics and metabolomics to structural and ecological data. In order to extract insights from the vast quantity of data, computational and statistical methods are nowadays crucial tools in the toolbox of every biological researcher. In this thesis I summarize my contributions in two data-rich fields in biological sciences: transcription factor binding to DNA and protein structure prediction from protein sequences with shared evolutionary ancestry. In the first part of my thesis I introduce our work towards a web server for analysing transcription factor binding data with Bayesian Markov Models. In contrast to classical PWM or di-nucleotide models, Bayesian Markov models can capture complex inter-nucleotide dependencies that can arise from shape-readout and alternative binding modes. In addition to giving access to our methods in an easy-to-use, intuitive web-interface, we provide our users with novel tools and visualizations to better evaluate the biological relevance of the inferred binding motifs. We hope that our tools will prove useful for investigating weak and complex transcription factor binding motifs which cannot be predicted accurately with existing tools. The second part discusses a statistical attempt to correct out the phylogenetic bias arising in co-evolution methods applied to the contact prediction problem. Co-evolution methods have revolutionized the protein-structure prediction field more than 10 years ago, and, until very recently, have retained their importance as crucial input features to deep neural networks. As the co-evolution information is extracted from evolutionarily related sequences, we investigated whether the phylogenetic bias to the signal can be corrected out in a principled way using a variation of the Felsenstein's tree-pruning algorithm applied in combination with an independent-pair assumption to derive pairwise amino counts that are corrected for the evolutionary history. Unfortunately, the contact prediction derived from our corrected pairwise amino acid counts did not yield a competitive performance.2021-09-2

    Paired is better: local assembly algorithms for NGS paired reads and applications to RNA-Seq

    Get PDF
    The analysis of biological sequences is one of the main research areas of Bioinformatics. Sequencing data are the input for almost all types of studies concerning genomic as well as transcriptomic sequences, and sequencing experiments should be conceived specifically for each type of application. The challenges posed by fundamental biological questions are usually addressed by firstly aligning or assemblying the reads produced by new sequencing technologies. Assembly is the first step when a reference sequence is not available. Alignment of genomic reads towards a known genome is fundamental, e.g., to find the differences among organisms of related species, and to detect mutations proper of the so-called "diseases of the genome". Alignment of transcriptomic reads against a reference genome, allows to detect the expressed genes as well as to annotate and quantify alternative transcripts. In this thesis we overview the approaches proposed in literature for solving the above mentioned problems. In particular, we deeply analyze the sequence assembly problem, with particular emphasys on genome reconstruction, both from a more theoretical point of view and in light of the characteristics of sequencing data produced by state-of-the-art technologies. We also review the main steps in a pipeline for the analysis of the transcriptome, that is, alignment, assembly, and transcripts quantification, with particular emphasys on the opportunities given by RNA-Seq technologies in enhancing precision. The thesis is divided in two parts, the first one devoted to the study of local assembly methods for Next Generation Sequencing data, the second one concerning the development of tools for alignment of RNA-Seq reads and transcripts quantification. The permanent theme is the use of paired reads in all fields of applications discussed in this thesis. In particular, we emphasyze the benefits of assemblying inserts from paired reads in a wide range of applications, from de novo assembly, to the analysis of RNA. The main contribution of this thesis lies in the introduction of innovative tools, based on well-studied heuristics fine tuned on the data. Software is always tested to specifically assess the correctness of prediction. The aim is to produce robust methods, that, having low false positives rate, produce a certified output characterized by high specificity.openDottorato di ricerca in InformaticaopenNadalin, Francesc

    The art of PCR assay development: data-driven multiplexing

    Get PDF
    The present thesis describes the discovery and application of a novel methodology, named Data-Driven Multiplexing, which uses artificial intelligence and conventional molecular instruments to develop rapid, scalable and cost-effective clinical diagnostic tests. Detection of genetic material from living organisms is a biologically engineered process where organic molecules interact with each other and with chemical components to generate a meaningful signal of the presence, quantity or quality of target nucleic acids. Nucleic acid detection, such as DNA or RNA detection, identifies a specific organism based on its genetic material. In particular, DNA amplification approaches, such as for antimicrobial resistance (AMR) or COVID-19 detection, are crucial for diagnosing and managing various infectious diseases. One of the most widely used methods is Polymerase Chain Reaction (PCR), which can detect the presence of nucleic acids rapidly and accurately. The unique interaction of the genetic material and synthetic short DNA sequences called primers enable this harmonious biological process. This thesis aims to bioinformatically modulate the interaction between primers and genetic material, enhancing the diagnostic capabilities of conventional PCR instruments by applying artificial intelligence processing to the resulting signals. To achieve the goal mentioned above, experiments and data from several conventional platforms, such as real-time and digital PCR, are used in this thesis, along with state-of-the-art and innovative algorithms for classification problems and final application in real-world clinical scenarios. This work exhibits a powerful technology to optimise the use of the data, conveying the following message: the better use of the data in clinical diagnostics enables higher throughput of conventional instruments without the need for hardware modification, maintaining the standard practice workflows. In Part I, a novel method to analyse amplification data is proposed. Using a state-of-the-art digital PCR instrument and multiplex PCR assays, we demonstrate the simultaneous detection of up to nine different nucleic acids in a single-well and single-channel format. This novel concept called Amplification Curve Analysis (ACA) leverages kinetic information encoded in the amplification curve to classify the biological nature of the target of interest. This method is applied to the novel design of PCR assays for multiple detections of AMR genes and further validated with clinical samples collected at Charing Cross Hospital, London, UK. The ACA showed a high classification accuracy of 99.28% among 253 clinical isolates when multiplexing. Similar performance is also demonstrated with isothermal amplification chemistries using synthetic DNA, showing a 99.9% of classification accuracy for detecting respiratory-related infectious pathogens. In Part II, two intelligent mathematical algorithms are proposed to solve two significant challenges when developing a Data-driven multiplex PCR assay. Chapter 7 illustrates the use of filtering algorithms to remove the presence of outliers in the amplification data. This demonstrates that the information contained in the kinetics of the reaction itself provides a novel way to remove non-specific and not efficient reactions. By extracting meaningful features and adding custom selection parameters to the amplification data, we increase the machine learning classifier performance of the ACA by 20% when outliers are removed. In Chapter 8, a patented algorithm called Smart-Plexer is presented. This allows the hybrid development of multiplex PCR assays by computing the optimal single primer set combination in a multiplex assay. The algorithm's effectiveness stands in using experimental laboratory data as input, avoiding heavy computation and unreliable predictions of the sigmoidal shape of PCR curves. The output of the Smart-Plexer is an optimal assay for the simultaneous detection of seven coronavirus-related pathogens in a single well, scoring an accuracy of 98.8% in identifying the seven targets correctly among 14 clinical samples. Moreover, Chapter 9 focuses on applying novel multiplex assays in point-of-care devices and developing a new strategy for improving clinical diagnostics. In summary, inspired by the emerging requirement for more accurate, cost-effective and higher throughput diagnostics, this thesis shows that coupling artificial intelligence with assay design pipelines is crucial to address current diagnostic challenges. This requires crossing different fields, such as bioinformatics, molecular biology and data science, to develop an optimal solution and hence to maximise the value of clinical tests for nucleic acid detection, leading to more precise patient treatment and easier management of infectious control.Open Acces

    A Comprehensive Mapping of the Druggable Cavities within the SARS-CoV-2 Therapeutically Relevant Proteins by Combining Pocket and Docking Searches as Implemented in Pockets 2.0

    Get PDF
    (1) Background: Virtual screening studies on the therapeutically relevant proteins of the severe acute respiratory syndrome Coronavirus 2 (SARS-CoV-2) require a detailed characterization of their druggable binding sites, and, more generally, a convenient pocket mapping represents a key step for structure-based in silico studies; (2) Methods: Along with a careful literature search on SARS-CoV-2 protein targets, the study presents a novel strategy for pocket mapping based on the combination of pocket (as performed by the well-known FPocket tool) and docking searches (as performed by PLANTS or AutoDock/Vina engines); such an approach is implemented by the Pockets 2.0 plug-in for the VEGA ZZ suite of programs; (3) Results: The literature analysis allowed the identification of 16 promising binding cavities within the SARS-CoV-2 proteins and the here proposed approach was able to recognize them showing performances clearly better than those reached by the sole pocket detection; and (4) Conclusions: Even though the presented strategy should require more extended validations, this proved successful in precisely characterizing a set of SARS-CoV-2 druggable binding pockets including both orthosteric and allosteric sites, which are clearly amenable for virtual screening campaigns and drug repurposing studies. All results generated by the study and the Pockets 2.0 plug-in are available for download

    Learning the Regulatory Code of Gene Expression

    Get PDF
    Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology

    Combining DNA Methylation with Deep Learning Improves Sensitivity and Accuracy of Eukaryotic Genome Annotation

    Get PDF
    Thesis (Ph.D.) - Indiana University, School of Informatics, Computing, and Engineering, 2020The genome assembly process has significantly decreased in computational complexity since the advent of third-generation long-read technologies. However, genome annotations still require significant manual effort from scientists to produce trust-worthy annotations required for most bioinformatic analyses. Current methods for automatic eukaryotic annotation rely on sequence homology, structure, or repeat detection, and each method requires a separate tool, making the workflow for a final product a complex ensemble. Beyond the nucleotide sequence, one important component of genetic architecture is the presence of epigenetic marks, including DNA methylation. However, no automatic annotation tools currently use this valuable information. As methylation data becomes more widely available from nanopore sequencing technology, tools that take advantage of patterns in this data will be in demand. The goal of this dissertation was to improve the annotation process by developing and training a recurrent neural network (RNN) on trusted annotations to recognize multiple classes of elements from both the reference sequence and DNA methylation. We found that our proposed tool, RNNotate, detected fewer coding elements than GlimmerHMM and Augustus, but those predictions were more often correct. When predicting transposable elements, RNNotate was more accurate than both Repeat-Masker and RepeatScout. Additionally, we found that RNNotate was significantly less sensitive when trained and run without DNA methylation, validating our hypothesis. To our best knowledge, we are not only the first group to use recurrent neural networks for eukaryotic genome annotation, but we also innovated in the data space by utilizing DNA methylation patterns for prediction

    Managing the sequence-specificity of antisense oligonucleotides in drug discovery

    Get PDF
    All drugs perturb the expression of many genes in the cells that are exposed to them. These gene expression changes can be divided into effects resulting from engaging the intended target and effects resulting from engaging unintended targets. For antisense oligonucleotides, developments in bioinformatics algorithms, and the quality of sequence databases, allow oligonucleotide sequences to be analyzed computationally, in terms of the predictability of their interactions with intended and unintended RNA targets. Applying these tools enables selection of sequence-specific oligonucleotides where no- or only few unintended RNA targets are expected. To evaluate oligonucleotide sequence-specificity experimentally, we recommend a transcriptomics protocol where two or more oligonucleotides targeting the same RNA molecule, but with entirely different sequences, are evaluated together. This helps to clarify which changes in cellular RNA levels result from downstream processes of engaging the intended target, and which are likely to be related to engaging unintended targets. As required for all classes of drugs, the toxic potential of oligonucleotides must be evaluated in cell- and animal models before clinical testing. Since potential adverse effects related to unintended targeting are sequence-dependent and therefore species-specific, in vitro toxicology assays in human cells are especially relevant in oligonucleotide drug discovery
    • …
    corecore