23 research outputs found
Mining Phenotypes for Protein Function Prediction
Until very recently, phenotypes were rarely studied in a systematic manner. While ontologies for describing gene functions have a ten-year tradition, similar vocabularies for describing the phenotypes of genes are only now emerging; likewise, techniques for determining phenotypes on a large scale (especially RNAi) have been available for only a few years, whereas genomic sequencing and gene expression studies have been established for much longer.
In this talk, we describe results from a study on exploiting phenotype descriptions for protein function prediction. We used data from PhenomicDB, a phenotype database integrated from several publicly available data sources. Due to the lack of standardization, phenotypes in PhenomicDB can only be treated as text (short statements, abstracts, single terms, ...). We clustered these texts and analyzed the corresponding gene clusters in terms of the coherence of their functional annotation and their interconnectedness by protein-protein interactions. We also devised a method that uses close similarity between phenotype descriptions to predict the function of proteins. We show that this method yields very good precision at acceptable coverage.
PhenomicDB: a new cross-species genotype/phenotype resource
Phenotypes are an important subject of biomedical research for which many repositories have already been created. Most of these databases are dedicated either to a single species or to a single disease of interest. With the advent of technologies to generate phenotypes in a high-throughput manner, not only is the volume of phenotype data growing fast, but so is the need to organize these data in more useful ways. We have created PhenomicDB (freely available at ), a multi-species genotype/phenotype database, which shows phenotypes associated with their corresponding genes and grouped by gene orthologies across a variety of species. We have recently enhanced PhenomicDB by incorporating quantitative and descriptive RNA interference (RNAi) screening data, by enabling the use of phenotype ontology terms, and by providing information on assays and cell lines. We envision that integrating classical phenotypes with high-throughput data will bring new momentum and insights to our understanding. Modern analysis tools under development may help to exploit this wealth of information and transform it into knowledge and, eventually, into novel therapeutic approaches.
Mining phenotypes for gene function prediction
Background: Health and disease of organisms are reflected in their phenotypes. Often, a genetic component to a disease is discovered only after clearly defining its phenotype. In the past years, many technologies to systematically generate phenotypes in a high-throughput manner, such as RNA interference or gene knock-out, have been developed and used to decipher functions for genes. However, there have been relatively few efforts to make use of phenotype data beyond the single genotype-phenotype relationships. Results: We present results of a study in which we use a large set of phenotype data, in textual form, to predict gene annotation. To this end, we use text clustering to group genes based on their phenotype descriptions. We show that these clusters correlate well with several indicators for biological coherence in gene groups, such as functional annotations from the Gene Ontology (GO) and protein-protein interactions. We exploit these clusters for predicting gene function by carrying over annotations from well-annotated genes to other, less-characterized genes in the same cluster. For a subset of groups selected by applying objective criteria, we can predict GO-term annotations from the biological process sub-ontology with up to 72.6% precision and 16.7% recall, as evaluated by cross-validation. We manually verified some of these clusters and found them to exhibit high biological coherence, e.g. a group containing all available antennal Drosophila odorant receptors despite inconsistent GO-annotations. Conclusion: The intrinsic nature of phenotypes to visibly reflect genetic activity underlines their usefulness in inferring new gene functions. Thus, systematically analyzing these data on a large scale offers many possibilities for inferring functional annotation of genes. We show that text clustering can play an important role in this process.
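The annotation-transfer step described in this abstract can be illustrated with a minimal sketch. It assumes the gene clusters have already been computed, and it uses a simple leave-one-out evaluation with a majority-style support threshold. The function names, the `min_support` parameter, and the toy genes and terms are all hypothetical; the abstract does not specify the authors' exact transfer rule or cross-validation scheme.

```python
from collections import Counter


def predict_terms(gene, cluster, annotations, min_support=0.5):
    """Predict terms for `gene` from the other annotated genes in its cluster.

    A term is carried over if at least `min_support` of the annotated
    neighbours have it (an assumed rule, not the published one).
    """
    neighbours = [g for g in cluster if g != gene and annotations.get(g)]
    if not neighbours:
        return set()
    counts = Counter(term for g in neighbours for term in annotations[g])
    return {t for t, c in counts.items() if c / len(neighbours) >= min_support}


def leave_one_out(clusters, annotations, min_support=0.5):
    """Hide each annotated gene in turn and score the transferred terms."""
    true_pos = n_predicted = n_true = 0
    for cluster in clusters:
        for gene in cluster:
            truth = annotations.get(gene, set())
            if not truth:
                continue  # nothing to evaluate against
            predicted = predict_terms(gene, cluster, annotations, min_support)
            true_pos += len(predicted & truth)
            n_predicted += len(predicted)
            n_true += len(truth)
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_true if n_true else 0.0
    return precision, recall


# Toy data: one cluster of three genes with made-up term labels.
toy_clusters = [["g1", "g2", "g3"]]
toy_annotations = {"g1": {"A", "B"}, "g2": {"A"}, "g3": {"A", "C"}}
print(leave_one_out(toy_clusters, toy_annotations, min_support=0.5))
```

Raising `min_support` trades coverage for precision, which mirrors the precision/recall trade-off reported above only in a qualitative sense.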
Deciphering Seed Sequence Based Off-Target Effects in a Large-Scale RNAi Reporter Screen for E-Cadherin Expression
Functional RNAi-based screening is affected by large numbers of false positive and false negative hits due to prevalent sequence-based off-target effects. We performed a druggable-genome-targeting siRNA screen intended to identify novel regulators of E-cadherin (CDH1) expression, a known key player in epithelial mesenchymal transition (EMT). Analysis of primary screening results indicated a large number of false-positive hits. To address these difficulties we developed an analysis method, SENSORS, which, similar to published methods, is a seed enrichment strategy for analyzing siRNA off-targets in RNAi screens. Using our approach, we were able to demonstrate that accounting for seed-based off-target effects stratifies primary screening results and enables the discovery of additional screening hits. While traditional hit detection methods are prone to false positive results that go undetected, we were able to identify false positive hits robustly. The transcription factor MYBL1 was identified as a putative novel target required for CDH1 expression and verified experimentally. No siRNA pool targeting MYBL1 was present in the siRNA library used. Instead, MYBL1 was identified as a putative CDH1-regulating target solely based on the SENSORS off-target score, i.e. as a gene that is a cause of off-target effects downregulating E-cadherin expression.
CDK5R1 false positive prediction and validation.
(A) ZEB1 and KRAS were the most significant off-targets in our screen causing an E-cadherin up regulation while CDH1 and MYBL1 are strong negative off-targets causing a loss of E-cadherin expression. The red dashed line is the hit threshold for primary screening data (shown on the y-axis). Pools that fell within the orange zone (i.e. pools showing a primary score above the primary screen threshold but that have no significant off-target z-score) and that have at least one seed matching into the strong positive off-targets are considered likely false positives (red circles). These pools were deconvoluted and validated experimentally. (B) Common seed analysis for the CDK5R1 pool. While no other siRNAs with the seed sequence GTACCTC exhibited a significant phenotypic score, some of the siRNAs with the seed sequence AACAATG (match in ZEB1 3’UTR) showed a similar phenotype to the CDK5R1 pool (red points). One seed sequence is only present in the CDK5R1 pool. (C) Deconvolution of CDK5R1 siRNAs. The siRNA containing the seed AACAATG (si16899) was the only one showing a significant up regulation of E-cadherin expression, while all other siRNAs targeted against CDK5R1 showed no phenotype. (D) C911 control. The C911 control for si16899 kept the phenotype of the unaltered siRNA, indicating that the observed phenotype is due to a seed sequence-mediated off-target effect. The ZEB1 C911 siRNA showed no phenotype indicating that the ZEB1 phenotype is a true positive (on-target) result.
Seed based off-target effects in pooled siRNA screens.
(A) An on-target siRNA match is generally understood as a perfect match of nucleotides 1–19 of a 21-nucleotide siRNA guide strand within the coding sequence of an intended transcript [8]. We define an off-target heptamer seed match as a perfect match of nucleotides 2–8 of the guide strand within the 3'UTR of an unintended transcript. (B) While an on-target siRNA effect is limited to one or a few different transcripts, mostly of one gene, a match for a seed can occur in thousands of different transcripts and several times within one 3'UTR. (C) For pooled screens the elucidation of seed-based off-target effects is much more complex than for individually screened siRNAs. The seeds of the three pool siRNAs may match thousands of transcripts and may translate into unintentional transcript silencing. For an on-target pool situation (left) it is always known which transcript knock-down the phenotype results from (yellow flash symbols near the transcript), while for the off-target situation it is unknown which on- or off-target transcript knock-down causes the phenotype of a pool (yellow and grey flash symbols near the pool).
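The heptamer seed definition in panel (A) can be sketched as follows. This is an illustration under stated assumptions rather than the SENSORS implementation: sequences are written 5'→3' in the DNA alphabet, and a "match" is taken to mean that the reverse complement of guide nucleotides 2–8 occurs in the 3'UTR. The function names and the example sequences are hypothetical.

```python
# Assumption: DNA alphabet (A, C, G, T), all sequences written 5'->3'.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def seed_of(guide):
    """Return nucleotides 2-8 (1-based) of the guide strand."""
    if len(guide) < 8:
        raise ValueError("guide strand must have at least 8 nucleotides")
    return guide[1:8].upper()


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def seed_match_positions(guide, utr):
    """0-based start positions of all (possibly overlapping) seed sites.

    A site is assumed to be the reverse complement of the guide seed.
    """
    site = reverse_complement(seed_of(guide))
    utr = utr.upper()
    positions = []
    start = utr.find(site)
    while start != -1:
        positions.append(start)
        start = utr.find(site, start + 1)
    return positions


# Made-up 21-nt guide and a made-up 3'UTR fragment with two seed sites.
example_guide = "TAACAATGCCGGATTACGTAA"
example_utr = "GGCATTGTTAACATTGTTCC"
print(seed_of(example_guide), seed_match_positions(example_guide, example_utr))
```

Because a seed is only seven nucleotides long, the same scan run across all annotated 3'UTRs would be expected to return many sites, which is the point made in panel (B).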
Seed enrichment visualizations for 4 high scoring off-targets.
Density curves show the tendency of high scoring positive (red) and negative (blue) off-targets. The x-axes show the rank of the indicated numbers of seeds while the density of the respective ranks is shown on the y-axes. The difference in trends of high and low scoring off-targets is clearly visible by left- and right-skewed densities, respectively. ZEB1 (top left) and CDH1 (bottom right) were the most significant off-targets observed.
Primary screening results and expression of screened targets.
(A) Overlaid box and violin plot showing the primary screen phenotype distribution. Colored circles show effects of the ZEB1 positive control pool (red), the CDH1 pool (gold) and the CDK5R1 pool (blue), respectively. The dashed grey line indicates the hit threshold. (B) Histogram of log values of primary screening results combined with the expression status for a subset of 8,977 genes. The red dashed line indicates the hit threshold.
Contingency table.
8,977 genes were assigned an absent or present expression status by integrating two gene expression data sets for PANC-1 cells under stringent criteria (only genes that were consistently absent or present in all data sets were considered). For genes targeted by siRNA pools exhibiting a significant phenotype (primary screening hits), no significant difference between expressed and non-expressed genes could be detected (p = 0.32).
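A comparison of this kind (hit versus non-hit, expressed versus non-expressed) can be tested on a 2×2 contingency table. The caption does not say which test produced p = 0.32, so the sketch below uses a two-sided Fisher's exact test purely as an assumed example. The function name and the counts in the usage line are hypothetical and are not the study's data.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(k):
        # Probability that the top-left cell equals k given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_value = sum(
        prob(k)
        for k in range(lowest, highest + 1)
        if prob(k) <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )
    return min(1.0, p_value)


# Hypothetical counts: rows = hit / non-hit, columns = expressed / non-expressed.
print(fisher_exact_two_sided(30, 20, 500, 400))
```

For large counts such as the 8,977 genes mentioned above, a chi-square test would be a common alternative; the exact test is shown here only because it needs no external library.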