1,154 research outputs found

    Spectral Sequence Motif Discovery

    Sequence discovery tools play a central role in several fields of computational biology. In the framework of Transcription Factor binding studies, motif finding algorithms of increasingly high performance are required to process the big datasets produced by new high-throughput sequencing technologies. Most existing algorithms are computationally demanding and often cannot handle the large size of new experimental data. We present a new motif discovery algorithm that is built on a recent machine learning technique, referred to as the Method of Moments. Based on spectral decompositions, this method is robust under model misspecification and is not prone to locally optimal solutions. We obtain an algorithm that is extremely fast and designed for the analysis of big sequencing data. In a few minutes, we can process datasets of hundreds of thousands of sequences and extract motif profiles that match those computed by various state-of-the-art algorithms. Comment: 20 pages, 3 figures, 1 table

    Deep learning methods for mining genomic sequence patterns

    Nowadays, with the growing availability of large-scale genomic datasets and advanced computational techniques, more and more data-driven computational methods have been developed to analyze genomic data and help to solve incompletely understood biological problems. Among them, deep learning methods have been proposed to automatically learn and recognize the functional activity of DNA sequences from genomics data. Techniques for efficiently mining genomic sequence patterns will help to improve our understanding of gene regulation, and thus accelerate our progress toward using personal genomes in medicine. This dissertation focuses on the development of deep learning methods for mining genomic sequences. First, we compare the performance of deep learning models and traditional machine learning methods in recognizing various genomic sequence patterns. Through extensive experiments on both simulated data and real genomic sequence data, we demonstrate that an appropriate deep learning model can generally be built to recognize various genomic sequence patterns successfully. Next, we develop deep learning methods to help solve two specific biological problems: (1) inference of the polyadenylation code and (2) tRNA gene detection and functional prediction. Polyadenylation is a pervasive mechanism used by eukaryotes for regulating mRNA transcription, localization, and translation efficiency. Polyadenylation signals in plants are particularly noisy and challenging to decipher. A deep convolutional neural network approach, DeepPolyA, is proposed to predict poly(A) sites from genomic sequences of the plant Arabidopsis thaliana. It employs various deep neural network architectures and demonstrates its superiority in comparison with competing methods, including classical machine learning algorithms and several popular deep learning models. Transfer RNAs (tRNAs) represent a highly complex class of genes and play a central role in protein translation. There remains a de facto tool, tRNAscan-SE, for identifying tRNA genes encoded in genomes. Despite its popularity and success, tRNAscan-SE is still not powerful enough to separate tRNAs from pseudo-tRNAs, and a significant number of false positives can be output as a result. To address this issue, tRNA-DL, a hybrid approach combining a convolutional neural network and a recurrent neural network, is proposed. It is shown that the proposed method can help to substantially reduce the false positive rate of the state-of-the-art tRNA prediction tool tRNAscan-SE. Coupled with tRNAscan-SE, tRNA-DL can serve as a useful complementary tool for tRNA annotation. Taken together, the experiments and applications demonstrate the superiority of deep learning in automatic feature generation for characterizing genomic sequence patterns.

    Human Promoter Prediction Using DNA Numerical Representation

    With the emergence of genomic signal processing, numerical representation techniques for the DNA alphabet {A, G, C, T} play a key role in applying digital signal processing and machine learning techniques to the processing and analysis of DNA sequences. The choice of the numerical representation of a DNA sequence affects how well the biological properties can be reflected in the numerical domain for the detection and identification of the characteristics of special regions of interest within the DNA sequence. This dissertation presents a comprehensive study of various DNA numerical and graphical representation methods and their applications in processing and analyzing long DNA sequences. Discussions on the relative merits and demerits of the various methods, experimental results, and possible future developments have also been included. Another area of research focus is promoter prediction in human (Homo sapiens) DNA sequences with a neural-network-based multi-classifier system using DNA numerical representation methods. In spite of the recent development of several computational methods for human promoter prediction, there is a need for performance improvement. In particular, the high false positive rate of the feature-based approaches decreases the prediction reliability and leads to erroneous results in gene annotation. To improve the prediction accuracy and reliability, DigiPromPred, a numerical-representation-based promoter prediction system, is proposed to characterize DNA alphabets in different regions of a DNA sequence. The DigiPromPred system is found to be able to predict promoters with a sensitivity of 90.8% while reducing the false prediction rate for non-promoter sequences with a specificity of 90.4%. The comparative study with state-of-the-art promoter prediction systems for human chromosome 22 shows that our proposed system maintains a good balance between prediction accuracy and reliability. To reduce the architectural and computational complexity compared to the existing system, a simple feed-forward neural network classifier known as SDigiPromPred is proposed. The SDigiPromPred system is found to be able to predict promoters with sensitivities of 87%, 87%, and 99% while reducing the false prediction rate for non-promoter sequences with specificities of 92%, 94%, and 99% for human, Drosophila, and Arabidopsis sequences, respectively, with reconfigurable capability compared to the existing system.

    Conformational dissection of a viral intrinsically disordered domain involved in cellular transformation

    Intrinsic disorder is abundant in viral genomes and provides conformational plasticity to their protein products. In order to gain insight into its structure-function relationships, we carried out a comprehensive analysis of structural propensities within the intrinsically disordered N-terminal domain from the human papillomavirus type-16 E7 oncoprotein (E7N). Two E7N segments located within the conserved CR1 and CR2 regions present transient α-helix structure. The helix in the CR1 region spans residues L8 to L13 and overlaps with the E2F mimic linear motif. The second helix, located within the highly acidic CR2 region, presents a pH-dependent structural transition. At neutral pH the helix spans residues P17 to N29, which include the retinoblastoma tumor suppressor LxCxE binding motif (residues 21-29), while the acidic CKII-PEST region spanning residues E33 to I38 populates polyproline type II (PII) structure. At pH 5.0, the CR2 helix propagates up to residue I38 at the expense of loss of PII due to charge neutralization of acidic residues. Using truncated forms of HPV-16 E7, we confirmed that pH-induced changes in α-helix content are governed by the intrinsically disordered E7N domain. Interestingly, while at both pH values the region encompassing the LxCxE motif adopts α-helical structure, the isolated 21-29 fragment including this stretch is unable to populate an α-helix even at high TFE concentrations. Thus, the E7N domain can populate dynamic but discrete structural ensembles by sampling α-helix-coil-PII-β-sheet structures. This high plasticity may modulate the exposure of linear binding motifs responsible for its multi-target binding properties, leading to interference with key cell signaling pathways and eventually to cellular transformation by the virus. Instituto de Física de Líquidos y Sistemas Biológicos

    Investigations into RNA-binding proteins involved in eukaryotic gene regulation

    The flood of RNA-related research in recent decades has revealed RNA to be a structurally and functionally diverse class of molecule, one that generates an intricate network of regulation that has been pivotal to the evolution of complex lifeforms. In order to elucidate how RNA achieves biological function through the formation of ribonucleoprotein (RNP) complexes, characterisation of RNA recognition by RNA-binding proteins (RBPs) is an essential step. The rules governing the interaction of RNA and RBPs have proved difficult to define, and in many instances, it is not understood how specificity is achieved. Knowledge of these rules is crucial to our understanding of RNA-related functions and their role in disease, and requires further in-depth characterisation of a wide variety of RNP complexes. The research in this thesis details the RNA-binding behaviour of two reported RBPs. Firstly, the RNA-binding behaviour of the Drosophila transcription factor bicoid is investigated. For many years it has been believed that the bicoid homeodomain binds the 3′-UTR of the caudal mRNA transcript, yet no binding site or specificity determinants have been reported. The work here attempts to characterise this interaction. Further, other domains in the protein are examined with a view to understanding how biological specificity might be achieved. Secondly, characterisation of the RNA-binding behaviour of the heterodimeric pair of transcription elongation factors, Spt4 and Spt5, is reported. This heterodimer is known to be an important player in transcription and yet remarkably little is known about its function. In the present work, the AA-repeat RNA-binding properties of these proteins are investigated, and complex binding behaviour is reported. Overall, it is shown that the elucidation of RNA-binding activity by proteins is often not straightforward, requiring the application of multiple and increasingly sophisticated techniques if we are to grasp the underlying biology.

    Selected Works in Bioinformatics

    This book consists of nine chapters covering a variety of bioinformatics subjects, ranging from database resources for protein allergens, unravelling genetic determinants of complex disorders, characterization and prediction of regulatory motifs, computational methods for identifying the best classifiers and key disease genes in large-scale transcriptomic and proteomic experiments, and functional characterization of inherently unfolded proteins/regions, to protein interaction networks and flexible protein-protein docking. The computational algorithms are in general presented in a way that is accessible to advanced undergraduate students, graduate students and researchers in molecular biology and genetics. The book should also serve as a stepping stone for mathematicians, biostatisticians, and computational scientists to cross their academic boundaries into the dynamic and ever-expanding field of bioinformatics.

    Structural characterization of CYP144A1 - a cytochrome P450 enzyme expressed from alternative transcripts in Mycobacterium tuberculosis.

    Mycobacterium tuberculosis (Mtb) causes the disease tuberculosis (TB). The virulent Mtb H37Rv strain encodes 20 cytochrome P450 (CYP) enzymes, many of which are implicated in Mtb survival and pathogenicity in the human host. Bioinformatics analysis revealed that CYP144A1 is retained exclusively within the Mycobacterium genus, particularly in species causing human and animal disease. Transcriptomic annotation revealed two possible CYP144A1 start codons, leading to expression of (i) a "full-length" 434 amino acid version (CYP144A1-FLV) and (ii) a "truncated" 404 amino acid version (CYP144A1-TRV). Computational analysis predicted that the extended N-terminal region of CYP144A1-FLV is largely unstructured. The CYP144A1-FLV and -TRV forms were purified in heme-bound states. Mass spectrometry confirmed production of intact, His6-tagged forms of CYP144A1-FLV and -TRV, with EPR demonstrating cysteine thiolate coordination of heme iron in both cases. Hydrodynamic analysis indicated that both CYP144A1 forms are monomeric. CYP144A1-TRV was crystallized and the first structure of a CYP144 family P450 protein determined. CYP144A1-TRV has an open structure primed for substrate binding, with a large active site cavity. Our data provide the first evidence that Mtb produces two different forms of CYP144A1 from alternative transcripts, with CYP144A1-TRV generated from a leaderless transcript lacking a 5'-untranslated region and Shine-Dalgarno ribosome binding site.

    HSV-1 ICP4, A Multifaceted RNA PolII Transcription Factor

    ICP4 of herpes simplex virus type 1 (HSV-1) is responsible for activation of viral Early and Late genes, and is necessary for viral replication. ICP4 contains two transactivation domains separated by a DNA binding domain. The complex structure of ICP4 indicates the possible diversity of the cellular and viral proteins it interacts with to function. ICP4 interacts with a variety of transcription complexes to promote RNA polymerase II-mediated transcription. The structural basis for these interactions has not yet been clearly defined. To more closely examine the structural requirements for ICP4 activities, mutants in conserved and degenerate regions of the N-terminus, in the presence and absence of the carboxyl terminus, were examined for effects on viral gene expression. It was found that i) the amino terminal transactivation domain is strictly required for E gene transcription, ii) multiple conserved regions within the N-terminus contribute to transcription, and iii) the amino terminal and carboxyl terminal transactivation domains cooperate to mediate transcription. Affinity purification assays demonstrated that many of the observed defects in transcription probably resulted from the deletion of regions involved in stabilizing TFIID. Complementation analyses demonstrated that TFIID interactions are stabilized by the presence of one functional N-terminal and C-terminal transactivation domain within an ICP4 dimer. Affinity purification and mass spectrometry were used to determine the complexity of ICP4-mediated interactions throughout infection in addition to the structural requirements provided by ICP4 for these interactions. Mass spectrometry and western blot data indicated that ICP4 was found in complex with TFIID prior to other components of the transcription machinery including Mediator and TFIIH. Additionally, the amino terminal 774 amino acids were sufficient for interactions with TFIID, Mediator and TFIIH. While ICP4 has previously only been associated with preinitiation complex formation, components of initiation, elongation, mRNA processing, and mRNA export machinery were also found in complexes with ICP4, suggesting that ICP4 functions as a multifaceted RNA PolII transcription factor. Together, the data presented herein provide an understanding of how the structural complexities of ICP4 provide an interface for the formation of transcription complexes. Additionally, a new model for viral transcription is presented.