46 research outputs found
RAId_DbS: Peptide Identification using Database Searches with Realistic Statistics
<p>Abstract</p> <p>Background</p> <p>The key to mass-spectrometry-based proteomics is peptide identification. A major challenge in peptide identification is to obtain realistic <it>E</it>-values when assigning statistical significance to candidate peptides.</p> <p>Results</p> <p>Using a simple scoring scheme, we propose a database search method with theoretically characterized statistics. Taking into account possible skewness in the random variable distribution and the effect of finite sampling, we provide a theoretical derivation for the tail of the score distribution. For every experimental spectrum examined, we collect the scores of peptides in the database, and find good agreement between the collected score statistics and our theoretical distribution. Using Student's <it>t</it>-tests, we quantify the degree of agreement between the theoretical distribution and the score statistics collected. The T-tests may be used to measure the reliability of reported statistics. When combined with reported <it>P</it>-value for a peptide hit using a score distribution model, this new measure prevents exaggerated statistics. Another feature of RAId_DbS is its capability of detecting multiple co-eluted peptides. The peptide identification performance and statistical accuracy of RAId_DbS are assessed and compared with several other search tools. The executables and data related to RAId_DbS are freely available upon request.</p
RAId_DbS: mass-spectrometry based peptide identification web server with knowledge integration
<p>Abstract</p> <p>Background</p> <p>Existing scientific literature is a rich source of biological information such as disease markers. Integration of this information with data analysis may help researchers to identify possible controversies and to form useful hypotheses for further validations. In the context of proteomics studies, individualized proteomics era may be approached through consideration of amino acid substitutions/modifications as well as information from disease studies. Integration of such information with peptide searches facilitates speedy, dynamic information retrieval that may significantly benefit clinical laboratory studies.</p> <p>Description</p> <p>We have integrated from various sources annotated single amino acid polymorphisms, post-translational modifications, and their documented disease associations (if they exist) into one enhanced database per organism. We have also augmented our peptide identification software RAId_DbS to take into account this information while analyzing a tandem mass spectrum. In principle, one may choose to respect or ignore the <it>correlation </it>of amino acid polymorphisms/modifications within each protein. The former leads to targeted searches and avoids scoring of unnecessary polymorphism/modification combinations; the latter explores possible polymorphisms in a controlled fashion. To facilitate new discoveries, RAId_DbS also allows users to conduct searches permitting <it>novel </it>polymorphisms as well as to search a knowledge database created by the users.</p> <p>Conclusion</p> <p>We have finished constructing enhanced databases for 17 organisms. The web link to RAId_DbS and the enhanced databases is <url>http://www.ncbi.nlm.nih.gov/CBBResearch/qmbp/RAId_DbS/index.html</url>. The relevant databases and binaries of RAId_DbS for Linux, Windows, and Mac OS X are available for download from the same web page.</p
Evolution of alternative and constitutive regions of mammalian 5'UTRs
<p>Abstract</p> <p>Background</p> <p>Alternative splicing (AS) in protein-coding sequences has emerged as an important mechanism of regulation and diversification of animal gene function. By contrast, the extent and roles of alternative events including AS and alternative transcription initiation (ATI) within the 5'-untranslated regions (5'UTRs) of mammalian genes are not well characterized.</p> <p>Results</p> <p>We evaluated the abundance, conservation and evolution of putative regulatory control elements, namely, upstream start codons (uAUGs) and open reading frames (uORFs), in the 5'UTRs of human and mouse genes impacted by alternative events. For genes with alternative 5'UTRs, the fraction of alternative sequences (those present in a subset of the transcripts) is much greater than that in the corresponding coding sequence, conceivably, because 5'UTRs are not bound by constraints on protein structure that limit AS in coding regions. Alternative regions of mammalian 5'UTRs evolve faster and are subject to a weaker purifying selection than constitutive portions. This relatively weak selection results in over-abundance of uAUGs and uORFs in the alternative regions of 5'UTRs compared to constitutive regions. Nevertheless, even in alternative regions, uORFs evolve under a stronger selection than the rest of the sequences, indicating that some of the uORFs are conserved regulatory elements; some of the non-conserved uORFs could be involved in species-specific regulation.</p> <p>Conclusion</p> <p>The findings on the evolution and selection in alternative and constitutive regions presented here are consistent with the hypothesis that alternative events, namely, AS and ATI, in 5'UTRs of mammalian genes are likely to contribute to the regulation of translation.</p
Calibrating E-values for MS2 database search methods
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licens
Distinct Patterns of Expression and Evolution of Intronless and Intron-Containing Mammalian Genes
Comparison of expression levels and breadth and evolutionary rates of intronless and intron-containing mammalian genes shows that intronless genes are expressed at lower levels, tend to be tissue specific, and evolve significantly faster than spliced genes. By contrast, monomorphic spliced genes that are not subject to detectable alternative splicing and polymorphic alternatively spliced genes show similar statistically indistinguishable patterns of expression and evolution. Alternative splicing is most common in ancient genes, whereas intronless genes appear to have relatively recent origins. These results imply tight coupling between different stages of gene expression, in particular, transcription, splicing, and nucleocytosolic transport of transcripts, and suggest that formation of intronless genes is an important route of evolution of novel tissue-specific functions in animals
Detection of co-eluted peptides using database search methods
<p>Abstract</p> <p>Background</p> <p>Current experimental techniques, especially those applying liquid chromatography mass spectrometry, have made high-throughput proteomic studies possible. The increase in throughput however also raises concerns on the accuracy of identification or quantification. Most experimental procedures select in a given MS scan only a few relatively most intense parent ions, each to be fragmented (MS<sup>2</sup>) separately, and most other minor co-eluted peptides that have similar chromatographic retention times are ignored and their information lost.</p> <p>Results</p> <p>We have computationally investigated the possibility of enhancing the information retrieval during a given LC/MS experiment by selecting the two or three most intense parent ions for simultaneous fragmentation. A set of spectra is created via superimposing a number of MS<sup>2 </sup>spectra, each can be identified by all search methods tested with high confidence, to mimick the spectra of co-eluted peptides. The generated convoluted spectra were used to evaluate the capability of several database search methods β SEQUEST, Mascot, X!Tandem, OMSSA, and RAId_DbS β in identifying true peptides from superimposed spectra of co-eluted peptides. We show that using these simulated spectra, all the database search methods will gain eventually in the number of true peptides identified by using the compound spectra of co-eluted peptides.</p> <p>Open peer review</p> <p>Reviewed by Vlad Petyuk (nominated by Arcady Mushegian), King Jordan and Shamil Sunyaev. For the full reviews, please go to the Reviewers' comments section.</p
Comparison of approaches for rational siRNA design leading to a new efficient and transparent method
Current literature describes several methods for the design of efficient siRNAs with 19 perfectly matched base pairs and 2βnt overhangs. Using four independent databases totaling 3336 experimentally verified siRNAs, we compared how well several of these methods predict siRNA cleavage efficiency. According to receiver operating characteristics (ROC) and correlation analyses, the best programs were BioPredsi, ThermoComposition and DSIR. We also studied individual parameters that significantly and consistently correlated with siRNA efficacy in different databases. As a result of this work we developed a new method which utilizes linear regression fitting with local duplex stability, nucleotide position-dependent preferences and total G/C content of siRNA duplexes as input parameters. The new method's discrimination ability of efficient and inefficient siRNAs is comparable with that of the best methods identified, but its parameters are more obviously related to the mechanisms of siRNA action in comparison with BioPredsi. This permits insight to the underlying physical features and relative importance of the parameters. The new method of predicting siRNA efficiency is faster than that of ThermoComposition because it does not employ time-consuming RNA secondary structure calculations and has much less parameters than DSIR. It is available as a web tool called βsiRNA scalesβ
Expansion of the human ΞΌ-opioid receptor gene architecture: novel functional variants
The ΞΌ-opioid receptor (OPRM1) is the principal receptor target for both endogenous and exogenous opioid analgesics. There are substantial individual differences in human responses to painful stimuli and to opiate drugs that are attributed to genetic variations in OPRM1. In searching for new functional variants, we employed comparative genome analysis and obtained evidence for the existence of an expanded human OPRM1 gene locus with new promoters, alternative exons and regulatory elements. Examination of polymorphisms within the human OPRM1 gene locus identified strong association between single nucleotide polymorphism (SNP) rs563649 and individual variations in pain perception. SNP rs563649 is located within a structurally conserved internal ribosome entry site (IRES) in the 5β²-UTR of a novel exon 13-containing OPRM1 isoforms (MOR-1K) and affects both mRNA levels and translation efficiency of these variants. Furthermore, rs563649 exhibits very strong linkage disequilibrium throughout the entire OPRM1 gene locus and thus affects the functional contribution of the corresponding haplotype that includes other functional OPRM1 SNPs. Our results provide evidence for an essential role for MOR-1K isoforms in nociceptive signaling and suggest that genetic variations in alternative OPRM1 isoforms may contribute to individual differences in opiate responses
RAId_aPS: MS/MS analysis with multiple scoring functions and spectrum-specific statistics
Statistically meaningful comparison/combination of peptide identification
results from various search methods is impeded by the lack of a universal
statistical standard. Providing an E-value calibration protocol, we
demonstrated earlier the feasibility of translating either the score or
heuristic E-value reported by any method into the textbook-defined E-value,
which may serve as the universal statistical standard. This protocol, although
robust, may lose spectrum-specific statistics and might require a new
calibration when changes in experimental setup occur. To mitigate these issues,
we developed a new MS/MS search tool, RAId_aPS, that is able to provide
spectrum-specific E-values for additive scoring functions. Given a selection of
scoring functions out of RAId score, K-score, Hyperscore and XCorr, RAId_aPS
generates the corresponding score histograms of all possible peptides using
dynamic programming. Using these score histograms to assign E-values enables a
calibration-free protocol for accurate significance assignment for each scoring
function. RAId_aPS features four different modes: (i) compute the total number
of possible peptides for a given molecular mass range, (ii) generate the score
histogram given a MS/MS spectrum and a scoring function, (iii) reassign
E-values for a list of candidate peptides given a MS/MS spectrum and the
scoring functions chosen, and (iv) perform database searches using selected
scoring functions. In modes (iii) and (iv), RAId_aPS is also capable of
combining results from different scoring functions using spectrum-specific
statistics. The web link is
http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/raid_aps/index.html. Relevant
binaries for Linux, Windows, and Mac OS X are available from the same page.Comment: 34 pages, 10 figures, 1 supplementary information file
(RAId_aPS_support.pdf). To view the supplementary file, please download and
extract the gzipped tar source file listed under "Other formats
Expression Patterns of Protein Kinases Correlate with Gene Architecture and Evolutionary Rates
Protein kinase (PK) genes comprise the third largest superfamily that occupy βΌ2% of the human genome. They encode regulatory enzymes that control a vast variety of cellular processes through phosphorylation of their protein substrates. Expression of PK genes is subject to complex transcriptional regulation which is not fully understood.Our comparative analysis demonstrates that genomic organization of regulatory PK genes differs from organization of other protein coding genes. PK genes occupy larger genomic loci, have longer introns, spacer regions, and encode larger proteins. The primary transcript length of PK genes, similar to other protein coding genes, inversely correlates with gene expression level and expression breadth, which is likely due to the necessity to reduce metabolic costs of transcription for abundant messages. On average, PK genes evolve slower than other protein coding genes. Breadth of PK expression negatively correlates with rate of non-synonymous substitutions in protein coding regions. This rate is lower for high expression and ubiquitous PKs, relative to low expression PKs, and correlates with divergence in untranslated regions. Conversely, rate of silent mutations is uniform in different PK groups, indicating that differing rates of non-synonymous substitutions reflect variations in selective pressure. Brain and testis employ a considerable number of tissue-specific PKs, indicating high complexity of phosphorylation-dependent regulatory network in these organs. There are considerable differences in genomic organization between PKs up-regulated in the testis and brain. PK genes up-regulated in the highly proliferative testicular tissue are fast evolving and small, with short introns and transcribed regions. In contrast, genes up-regulated in the minimally proliferative nervous tissue carry long introns, extended transcribed regions, and evolve slowly.PK genomic architecture, the size of gene functional domains and evolutionary rates correlate with the pattern of gene expression. Structure and evolutionary divergence of tissue-specific PK genes is related to the proliferative activity of the tissue where these genes are predominantly expressed. Our data provide evidence that physiological requirements for transcription intensity, ubiquitous expression, and tissue-specific regulation shape gene structure and affect rates of evolution