Predicting non-coding RNA genes in Escherichia coli with boosted genetic programming
Several methods exist for predicting non-coding RNA (ncRNA) genes in Escherichia coli (E. coli). In addition to about sixty known ncRNA genes excluding tRNAs and rRNAs, various methods have predicted more than a thousand ncRNA genes, but only 95 of these candidates were confirmed by more than one study. Here, we introduce a new method that uses automatic discovery of sequence patterns to predict ncRNA genes. The method predicts 135 novel candidates. In addition, the method predicts 152 genes that overlap with predictions in the literature. We test sixteen predictions experimentally, and show that twelve of these are actual ncRNA transcripts. Six of the twelve verified candidates were novel predictions. The relatively high confirmation rate indicates that many of the untested novel predictions are also ncRNAs, and we therefore speculate that E. coli contains more ncRNA genes than previously estimated.
Distance constraints between microRNA target sites dictate efficacy and cooperativity
MicroRNAs (miRNAs) have the potential to regulate the expression of thousands of genes, but the mechanisms that determine whether a gene is targeted or not are poorly understood. We studied the genomic distribution of distances between pairs of identical miRNA seeds and found a propensity for moderate distances greater than about 13 nt between seed starts. Experimental data show that optimal down-regulation is obtained when two seed sites are separated by between 13 and 35 nt. By analyzing the distance between seed sites of endogenous miRNAs and transfected small interfering RNAs (siRNAs), we also find that cooperative targeting of sites with a separation in the optimal range can explain some of the siRNA off-target effects that have been reported in the literature.
Toxicity in mice expressing short hairpin RNAs gives new insight into RNAi
Short hairpin RNAs can provide stable gene silencing via RNA interference. Recent studies have shown toxicity in vivo that appears to be related to saturation of the endogenous microRNA pathway. Will these findings limit the therapeutic use of such hairpins?
Hardware-accelerated analysis of non-protein-coding RNAs
A tremendous amount of genomic sequence data of relatively high quality has become publicly available due to the human genome sequencing projects that were completed a few years ago. Despite considerable efforts, we do not yet know everything there is to know about the various parts of the genome, what all the regions code for, and how their gene products contribute to the myriad biological processes that are performed within the cells. New high-performance methods are needed to extract knowledge from this vast amount of information. Furthermore, the traditional view that DNA codes for RNA that codes for protein, which is known as the central dogma of molecular biology, seems to be only part of the story. The discovery of many non-protein-coding gene families with housekeeping and regulatory functions brings an entirely new perspective to molecular biology. Also, sequence analysis of the new gene families requires new methods, as there are significant differences between protein-coding and non-protein-coding genes. This work describes a new search processor that can search for complex patterns in sequence data for which no efficient lookup-index is known. When several chips are mounted on search cards that are fitted into PCs in a small cluster configuration, the system’s performance is orders of magnitude higher than that of comparable solutions for selected applications. The applications treated in this work fall into two main categories, namely pattern screening and data mining, and both take advantage of the search capacity of the cluster to achieve adequate performance. Specifically, the thesis describes an interactive system for exploration of all types of genomic sequence data. Moreover, a genetic programming-based data mining system finds classifiers that consist of potentially complex patterns that are characteristic of groups of sequences.
The screening and mining capacity has been used to develop an algorithm for identification of new non-protein-coding genes in bacteria; a system for rational design of effective and specific short interfering RNA for sequence-specific silencing of protein-coding genes; and an improved algorithmic step for identification of new regulatory targets for the microRNA family of non-protein-coding genes. Papers V, VI, and VII are reprinted with kind permission of Elsevier, sciencedirect.co
Weighted sequence motifs as an improved seeding step in microRNA target prediction algorithms
We present a new microRNA target prediction algorithm called TargetBoost, and show that the algorithm is stable and identifies more true targets than do existing algorithms. TargetBoost uses machine learning on a set of validated microRNA targets in lower organisms to create weighted sequence motifs that capture the binding characteristics between microRNAs and their targets. Existing algorithms require candidates to have (1) near-perfect complementarity between microRNAs’ 5′ end and their targets; (2) relatively high thermodynamic duplex stability; (3) multiple target sites in the target’s 3′ UTR; and (4) evolutionary conservation of the target between species. Most algorithms use one of the first two requirements in a seeding step, and use the three others as filters to improve the method’s specificity. The initial seeding step determines an algorithm’s sensitivity and also influences its specificity. As all algorithms may add filters to increase the specificity, we propose that methods should be compared before such filtering. We show that TargetBoost’s weighted sequence motif approach is preferable to using either the duplex stability or the sequence complementarity step. (TargetBoost is available as a Web tool from http://www.interagon.com/demo/.)
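For orientation, the conventional sequence-complementarity seeding step (requirement 1 above) can be sketched in a few lines of Python. This is not TargetBoost, whose weighted sequence motifs are learned from validated targets; it is the plain exact-match baseline, under the assumption that the seed is nucleotides 2 to 8 of the microRNA and that a candidate site is a perfect reverse-complement match. All function names are hypothetical.

```python
# Watson-Crick complements for RNA; G:U wobble pairs are deliberately ignored.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_match_site(mirna, start=1, end=8):
    """Return the target-site string that pairs perfectly with the miRNA seed.

    The seed is taken as nucleotides 2-8 of the miRNA (0-based slice 1:8),
    which is an assumption of this sketch, and the site is its reverse complement.
    """
    seed = mirna.upper().replace("T", "U")[start:end]
    return seed.translate(COMPLEMENT)[::-1]


def seed_candidates(mirna, utr):
    """Return 0-based start positions of exact seed matches in a 3' UTR."""
    site = seed_match_site(mirna)
    utr = utr.upper().replace("T", "U")  # accept DNA or RNA input
    hits = []
    pos = utr.find(site)
    while pos != -1:
        hits.append(pos)
        pos = utr.find(site, pos + 1)
    return hits


# let-7a as the example miRNA; the UTR fragment is made up for illustration.
let7a = "UGAGGUAGUAGGUUGUAUAGUU"
print(seed_match_site(let7a))
print(seed_candidates(let7a, "AAACTACCTCAAA"))
```

Every position returned by `seed_candidates` would then pass to the filtering stages (duplex stability, multiple sites, conservation), which is why the abstract argues that the seeding step bounds an algorithm's sensitivity.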
Ola Snøve Jr. Hardware-accelerated analysis of
A tremendous amount of genomic sequence data of relatively high quality has become publicly available due to the human genome sequencing projects that were completed a few years ago. Despite considerable efforts, we do not yet know everything there is to know about the various parts of the genome, what all the regions code for, and how their gene products contribute to the myriad biological processes that are performed within the cells. New high-performance methods are needed to extract knowledge from this vast amount of information. Furthermore, the traditional view that DNA codes for RNA that codes for protein, which is known as the central dogma of molecular biology, seems to be only part of the story. The discovery of many non-protein-coding gene families with housekeeping and regulatory functions brings an entirely new perspective to molecular biology. Also, sequence analysis of the new gene families requires new methods, as there are significan