13,988 research outputs found

    Genomics and proteomics: a signal processor's tour

    The theory and methods of signal processing are becoming increasingly important in molecular biology. Digital filtering techniques, transform domain methods, and Markov models have played important roles in gene identification, biological sequence analysis, and alignment. This paper contains a brief review of molecular biology, followed by a review of the applications of signal processing theory. This includes the problem of gene finding using digital filtering, and the use of transform domain methods in the study of protein binding spots. The relatively new topic of noncoding genes, and the associated problem of identifying ncRNA buried in DNA sequences, are also described. This includes a discussion of hidden Markov models and context free grammars. Several new directions in genomic signal processing are briefly outlined at the end.
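As a rough illustration of the transform domain idea mentioned in this abstract, the sketch below measures spectral power at a period of three bases, a property widely reported for protein-coding DNA. The abstract does not spell out the paper's own method, so the choice of binary indicator sequences and the function names `indicator_sequences` and `period3_power` are assumptions made for illustration only.

```python
import cmath


def indicator_sequences(seq):
    """Return one binary indicator sequence per base (1 where the base occurs)."""
    seq = seq.upper()
    return {base: [1 if ch == base else 0 for ch in seq] for base in "ACGT"}


def period3_power(seq):
    """Sum over the four bases of the squared DFT magnitude at 1/3 cycles per base."""
    if not seq:
        return 0.0
    total = 0.0
    for values in indicator_sequences(seq).values():
        # Evaluate the Fourier sum directly at frequency 1/3 (period of three bases).
        coeff = sum(
            v * cmath.exp(-2j * cmath.pi * pos / 3) for pos, v in enumerate(values)
        )
        total += abs(coeff) ** 2
    return total
```

In practice such a measure would be computed in a sliding window along a long sequence, and windows with a pronounced peak would be flagged as candidate coding regions; the window length and any threshold are left to the reader.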

    De novo prediction of PTBP1 binding and splicing targets reveals unexpected features of its RNA recognition and function.

    The splicing regulator Polypyrimidine Tract Binding Protein (PTBP1) has four RNA binding domains that each binds a short pyrimidine element, allowing recognition of diverse pyrimidine-rich sequences. This variation makes it difficult to evaluate PTBP1 binding to particular sites based on sequence alone and thus to identify target RNAs. Conversely, transcriptome-wide binding assays such as CLIP identify many in vivo targets, but do not provide a quantitative assessment of binding and are informative only for the cells where the analysis is performed. A general method of predicting PTBP1 binding and possible targets in any cell type is needed. We developed computational models that predict the binding and splicing targets of PTBP1. A Hidden Markov Model (HMM), trained on CLIP-seq data, was used to score probable PTBP1 binding sites. Scores from this model are highly correlated (ρ = -0.9) with experimentally determined dissociation constants. Notably, we find that the protein is not strictly pyrimidine specific, as interspersed Guanosine residues are well tolerated within PTBP1 binding sites. This model identifies many previously unrecognized PTBP1 binding sites, and can score PTBP1 binding across the transcriptome in the absence of CLIP data. Using this model to examine the placement of PTBP1 binding sites in controlling splicing, we trained a multinomial logistic model on sets of PTBP1 regulated and unregulated exons. Applying this model to rank exons across the mouse transcriptome identifies known PTBP1 targets and many new exons that were confirmed as PTBP1-repressed by RT-PCR and RNA-seq after PTBP1 depletion. We find that PTBP1 dependent exons are diverse in structure and do not all fit previous descriptions of the placement of PTBP1 binding sites. Our study uncovers new features of RNA recognition and splicing regulation by PTBP1.
This approach can be applied to other multi-RRM domain proteins to assess binding site degeneracy and multifactorial splicing regulation.
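To make the idea of scoring a sequence with an HMM concrete, here is a minimal sketch of the forward algorithm in log space, used to compare a sequence against a uniform background. It is not the authors' model: the two states, every probability in `START`, `TRANS` and `EMIT`, and the helper `binding_score` are hypothetical values chosen only so that pyrimidine-rich sequences with tolerated guanosines score well.

```python
import math

# Hypothetical two-state model; none of these numbers come from the paper.
START = {"site": 0.5, "background": 0.5}
TRANS = {
    "site": {"site": 0.9, "background": 0.1},
    "background": {"site": 0.1, "background": 0.9},
}
EMIT = {
    "site": {"A": 0.05, "C": 0.35, "G": 0.10, "U": 0.50},
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25},
}


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list of finite values."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def forward_loglik(seq, start, trans, emit):
    """Log-likelihood of seq under the HMM, summed over all state paths."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    states = list(start)
    alpha = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    for ch in seq[1:]:
        alpha = {
            t: logsumexp([alpha[s] + math.log(trans[s][t]) for s in states])
            + math.log(emit[t][ch])
            for t in states
        }
    return logsumexp(list(alpha.values()))


def binding_score(seq):
    """Log-likelihood ratio of the toy HMM against a uniform base composition."""
    return forward_loglik(seq, START, TRANS, EMIT) - len(seq) * math.log(0.25)
```

With these toy parameters a pyrimidine-rich sequence receives a positive score and a purine-rich one a negative score. A model trained on CLIP-seq data, as described in the abstract, would estimate its states and probabilities from data rather than fix them by hand.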

    Copy number variants and selective sweeps in natural populations of the house mouse (Mus musculus domesticus)

    Copy–number variants (CNVs) may play an important role in early adaptations, potentially facilitating rapid divergence of populations. We describe an approach to study this question by investigating CNVs present in natural populations of mice in the early stages of divergence and their involvement in selective sweeps. We have analyzed individuals from two recently diverged natural populations of the house mouse (Mus musculus domesticus) from Germany and France using custom, high–density, comparative genome hybridization arrays (CGH) that covered almost 164 Mb and 2444 genes. One thousand eight hundred and sixty one of those genes we previously identified as differentially expressed between these populations, while the expression of the remaining genes was invariant. In total, we identified 1868 CNVs across all 10 samples, 200 bp to 600 kb in size and affecting 424 genic regions. Roughly two thirds of all CNVs found were deletions. We found no enrichment of CNVs among the differentially expressed genes between the populations compared to the invariant ones, nor any meaningful correlation between CNVs and gene expression changes. Among the CNV genes, we found cellular component gene ontology categories of the synapse overrepresented among all the 2444 genes tested. To investigate potential adaptive significance of the CNV regions, we selected six that showed large differences in frequency of CNVs between the two populations and analyzed variation in at least two microsatellites surrounding the loci in a sample of 46 unrelated animals from the same populations collected in field trappings. We identified two loci with large differences in microsatellite heterozygosity (Sfi1 and Glo1/Dnahc8 regions) and one locus with low variation across the populations (Cmah), thus suggesting that these genomic regions might have recently undergone selective sweeps. 
Interestingly, the Glo1 CNV has previously been implicated in anxiety–like behavior in mice, suggesting a differential evolution of a behavioral trait.

    RNA-Seq analysis of splicing in Plasmodium falciparum uncovers new splice junctions, alternative splicing and splicing of antisense transcripts.

    Over 50% of genes in Plasmodium falciparum, the deadliest human malaria parasite, contain predicted introns, yet experimental characterization of splicing in this organism remains incomplete. We present here a transcriptome-wide characterization of intraerythrocytic splicing events, as captured by RNA-Seq data from four timepoints of a single highly synchronous culture. Gene model-independent analysis of these data in conjunction with publicly available RNA-Seq data with HMMSplicer, an in-house developed splice site detection algorithm, revealed a total of 977 new 5' GU-AG 3' and 5 new 5' GC-AG 3' junctions absent from gene models and ESTs (11% increase to the current annotation). In addition, 310 alternative splicing events were detected in 254 (4.5%) genes, most of which truncate open reading frames. Splicing events antisense to gene models were also detected, revealing complex transcriptional arrangements within the parasite's transcriptome. Interestingly, antisense introns overlap sense introns more than would be expected by chance, perhaps indicating a functional relationship between overlapping transcripts or an inherent organizational property of the transcriptome. Independent experimental validation confirmed over 30 new antisense and alternative junctions. Thus, this largest assemblage of new and alternative splicing events to date in Plasmodium falciparum provides a more precise, dynamic view of the parasite's transcriptome.

    Human Promoter Prediction Using DNA Numerical Representation

    With the emergence of genomic signal processing, numerical representation techniques for the DNA alphabet {A, G, C, T} play a key role in applying digital signal processing and machine learning techniques to the processing and analysis of DNA sequences. The choice of the numerical representation of a DNA sequence affects how well the biological properties can be reflected in the numerical domain for the detection and identification of the characteristics of special regions of interest within the DNA sequence. This dissertation presents a comprehensive study of various DNA numerical and graphical representation methods and their applications in processing and analyzing long DNA sequences. Discussions on the relative merits and demerits of the various methods, experimental results and possible future developments have also been included. Another research focus is promoter prediction in human (Homo sapiens) DNA sequences with a neural network-based multi-classifier system using DNA numerical representation methods. In spite of the recent development of several computational methods for human promoter prediction, there is a need for performance improvement. In particular, the high false positive rate of the feature-based approaches decreases the prediction reliability and leads to erroneous results in gene annotation. To improve the prediction accuracy and reliability, DigiPromPred, a numerical representation-based promoter prediction system, is proposed to characterize DNA alphabets in different regions of a DNA sequence. The DigiPromPred system is found to be able to predict promoters with a sensitivity of 90.8% while reducing the false prediction rate for non-promoter sequences with a specificity of 90.4%. The comparative study with state-of-the-art promoter prediction systems for human chromosome 22 shows that our proposed system maintains a good balance between prediction accuracy and reliability.
To reduce the system architecture and computational complexity compared to the existing system, a simple feed forward neural network classifier known as SDigiPromPred is proposed. The SDigiPromPred system is found to be able to predict promoters with a sensitivity of 87%, 87%, and 99% while reducing the false prediction rate for non-promoter sequences with a specificity of 92%, 94%, and 99% for Human, Drosophila, and Arabidopsis sequences respectively, with reconfigurable capability compared to the existing system.

    A Role for Pre-mRNA-PROCESSING PROTEIN 40C in the Control of Growth, Development, and Stress Tolerance in Arabidopsis thaliana

    Because of their sessile nature, plants have adopted varied strategies for growing and reproducing in an ever-changing environment. Control of mRNA levels and pre-mRNA alternative splicing are key regulatory layers that contribute to adjust and synchronize plant growth and development with environmental changes. Transcription and alternative splicing are thought to be tightly linked and coordinated, at least in part, through a network of transcriptional and splicing regulatory factors that interact with the carboxyl-terminal domain (CTD) of the largest subunit of RNA polymerase II. One of the proteins that has been shown to play such a role in yeast and mammals is pre-mRNA-PROCESSING PROTEIN 40 (PRP40, also known as CA150, or TCERG1). In plants, members of the PRP40 family have been identified and shown to interact with the CTD of RNA Pol II, but their biological functions remain unknown. Here, we studied the role of AtPRP40C, in Arabidopsis thaliana growth, development and stress tolerance, as well as its impact on the global regulation of gene expression programs. We found that the prp40c knockout mutants display a late-flowering phenotype under long day conditions, associated with minor alterations in red light signaling. An RNA-seq based transcriptome analysis revealed differentially expressed genes related to biotic stress responses and also differentially expressed as well as differentially spliced genes associated with abiotic stress responses. Indeed, the characterization of stress responses in prp40c mutants revealed an increased sensitivity to salt stress and an enhanced tolerance to Pseudomonas syringae pv. maculicola (Psm) infections. This constitutes the most thorough analysis of the transcriptome of a prp40 mutant in any organism, as well as the first characterization of the molecular and physiological roles of a member of the PRP40 protein family in plants. 
Our results suggest that PRP40C is an important factor linking the regulation of gene expression programs to the modulation of plant growth, development, and stress responses.
Affiliation: Hernando, Carlos Esteban. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Parque Centenario. Instituto de Investigaciones Bioquímicas de Buenos Aires. Fundación Instituto Leloir. Instituto de Investigaciones Bioquímicas de Buenos Aires; Argentina
Affiliation: García Hourquet, Mariano. Fundación Instituto Leloir; Argentina
Affiliation: de Leone, María José. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Parque Centenario. Instituto de Investigaciones Bioquímicas de Buenos Aires. Fundación Instituto Leloir. Instituto de Investigaciones Bioquímicas de Buenos Aires; Argentina
Affiliation: Careno, Daniel Alejandro. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Parque Centenario. Instituto de Investigaciones Bioquímicas de Buenos Aires. Fundación Instituto Leloir. Instituto de Investigaciones Bioquímicas de Buenos Aires; Argentina
Affiliation: Iserte, Javier Alonso. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Parque Centenario. Instituto de Investigaciones Bioquímicas de Buenos Aires. Fundación Instituto Leloir. Instituto de Investigaciones Bioquímicas de Buenos Aires; Argentina
Affiliation: Mora Garcia, Santiago. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Parque Centenario. Instituto de Investigaciones Bioquímicas de Buenos Aires. Fundación Instituto Leloir. Instituto de Investigaciones Bioquímicas de Buenos Aires; Argentina
Affiliation: Yanovsky, Marcelo Javier. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Parque Centenario. Instituto de Investigaciones Bioquímicas de Buenos Aires. Fundación Instituto Leloir. Instituto de Investigaciones Bioquímicas de Buenos Aires; Argentina

    Recombination and its impact on the genome of the haplodiploid parasitoid wasp Nasonia

    Homologous meiotic recombination occurs in most sexually reproducing organisms, yet its evolutionary advantages are elusive. Previous research explored recombination in the honeybee, a eusocial hymenopteran with an exceptionally high genome-wide recombination rate. A comparable study in a non-social member of the Hymenoptera that would disentangle the impact of sociality from Hymenoptera-specific features such as haplodiploidy on the evolution of the high genome-wide recombination rate in social Hymenoptera is missing. Utilizing single-nucleotide polymorphisms (SNPs) between two Nasonia parasitoid wasp genomes, we developed a SNP genotyping microarray to infer a high-density linkage map for Nasonia. The map comprises 1,255 markers with an average distance of 0.3 cM. The mapped markers enabled us to arrange 265 scaffolds of the Nasonia genome assembly 1.0 on the linkage map, representing 63.6% of the assembled N. vitripennis genome. We estimated a genome-wide recombination rate of 1.4-1.5 cM/Mb for Nasonia, which is less than one tenth of the rate reported for the honeybee. The local recombination rate in Nasonia is positively correlated with the distance to the center of the linkage groups, GC content, and the proportion of simple repeats. In contrast to the honeybee genome, gene density in the parasitoid wasp genome is positively associated with the recombination rate; regions of low recombination are characterized by fewer genes with larger introns and by a greater distance between genes. Finally, we found that genes in regions of the genome with a low recombination frequency tend to have a higher ratio of non-synonymous to synonymous substitutions, likely due to the accumulation of slightly deleterious non-synonymous substitutions. These findings are consistent with the hypothesis that recombination reduces interference between linked sites and thereby facilitates adaptive evolution and the purging of deleterious mutations. 
Our results imply that the genomes of haplodiploid and of diploid higher eukaryotes do not differ systematically in their recombination rates and associated parameters. Publisher PDF. Peer reviewed.

    Identification and evolutionary analysis of novel exons and alternative splicing events using cross-species EST-to-genome comparisons in human, mouse and rat

    BACKGROUND: Alternative splicing (AS) is important for evolution and major biological functions in complex organisms. However, the extent of AS in mammals other than human and mouse is largely unknown, making it difficult to study AS evolution in mammals and its biomedical implications. RESULTS: Here we describe a cross-species EST-to-genome comparison algorithm (ENACE) that can identify novel exons for EST-scanty species and distinguish conserved and lineage-specific exons. The identified exons represent not only novel exons but also evolutionarily meaningful AS events that were not previously annotated. A genome-wide AS analysis in human, mouse and rat using ENACE reveals a total of 758 novel cassette-on exons and 167 novel retained introns that have no EST evidence from the same species. RT-PCR-sequencing experiments validated ~50% to ~80% of the tested exons, indicating a high presence rate for exons predicted by ENACE. ENACE is particularly powerful when applied to closely related species. In addition, our analysis shows that the ENACE-identified AS exons tend not to pass the nonsynonymous-to-synonymous substitution ratio test and not to contain protein domains, implying that such exons may be under positive selection or relaxed negative selection. These AS exons may contribute to considerable inter-species functional divergence. Our analysis further indicates that a large number of exons may have been gained or lost during mammalian evolution. Moreover, a functional analysis shows that inter-species divergence of AS events may be substantial in protein carriers and receptor proteins in mammals. These exons may be of interest to studies of AS evolution. The ENACE programs and sequences of the ENACE-identified AS events are available for download. CONCLUSION: ENACE can identify potential novel cassette exons and retained introns between closely related species using a comparative approach.
It can also provide information regarding lineage- or species-specificity in transcript isoforms, which are important for evolutionary and functional studies.