Users Guide for SnadiOpt: A Package Adding Automatic Differentiation to Snopt
SnadiOpt is a package that supports the use of the automatic differentiation
package ADIFOR with the optimization package Snopt. Snopt is a general-purpose
system for solving optimization problems with many variables and constraints.
It minimizes a linear or nonlinear function subject to bounds on the variables
and sparse linear or nonlinear constraints. It is suitable for large-scale
linear and quadratic programming and for linearly constrained optimization, as
well as for general nonlinear programs. The method used by Snopt requires the
first derivatives of the objective and constraint functions to be available.
The SnadiOpt package allows users to avoid the time-consuming and error-prone
process of evaluating and coding these derivatives. Given Fortran code for
evaluating only the values of the objective and constraints, SnadiOpt
automatically generates the code for evaluating the derivatives and builds the
relevant Snopt input files and sparse data structures.
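The idea of obtaining derivatives from code that evaluates only function values can be sketched with forward-mode automatic differentiation on dual numbers. This is an illustration of the general technique only: ADIFOR works by transforming Fortran source, and the `Dual` class, the `objective` function, and the `gradient` helper below are hypothetical names, not part of SnadiOpt, ADIFOR, or Snopt.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Dual:
    """A value paired with its derivative with respect to one chosen input."""
    val: float
    der: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants, i.e. with zero derivative."""
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)


def sin(x):
    # Chain rule: d/dt sin(u) = cos(u) * u'
    return Dual(math.sin(x.val), math.cos(x.val) * x.der)


def objective(x, y):
    """Hypothetical objective, written with values only: x*y + sin(x)."""
    return x * y + sin(x)


def gradient(f, point):
    """Seed each input in turn with derivative 1 to obtain one partial each."""
    grads = []
    for i in range(len(point)):
        args = [Dual(v, 1.0 if j == i else 0.0) for j, v in enumerate(point)]
        grads.append(f(*args).der)
    return grads
```

Calling `gradient(objective, [2.0, 3.0])` evaluates the partial derivatives `y + cos(x)` and `x` at that point without any hand-written derivative code, which is the kind of work the package is described as automating.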
Evaluating annotations of an Agilent expression chip suggests that many features cannot be interpreted
BACKGROUND: While attempting to reanalyze published data from Agilent 4 × 44 human expression chips, we found that some of the 60-mer oligonucleotide features could not be interpreted as representing single human genes. For example, some of the oligonucleotides align with the transcripts of more than one gene. We decided to check the annotations for all autosomes and the X chromosome systematically using bioinformatics methods. RESULTS: Out of 42683 reporters, we found that 25505 (60%) passed all our tests and are considered "fully valid". 9964 (23%) reporters did not have a meaningful identifier, mapped to the wrong chromosome, or did not pass basic alignment tests, preventing us from correlating the expression values of these reporters with a unique annotated human gene. The remaining 7214 (17%) reporters could be associated with either a unique gene or a unique intergenic location, but could not be mapped to a transcript in RefSeq. The 7214 reporters are further partitioned into three different levels of validity. CONCLUSION: Expression array studies should evaluate the annotations of reporters and remove those reporters that have suspect annotations. This evaluation can be done systematically and semi-automatically, but one must recognize that data sources are frequently updated, leading to slightly changing validation results over time.
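The reporter counts quoted in the abstract can be checked with a few lines of arithmetic. The category labels other than "fully valid" are shorthand introduced here for illustration and are not the authors' terms.

```python
# Counts taken from the abstract; labels other than "fully valid" are assumed.
total = 42683
counts = {
    "fully valid": 25505,
    "not interpretable": 9964,
    "no RefSeq transcript": 7214,
}

# The three groups should partition the full set of reporters.
covers_all_reporters = sum(counts.values()) == total

# Percentages rounded to whole numbers, as quoted in the abstract.
percent = {label: round(100 * n / total) for label, n in counts.items()}
```

The three groups add up to the full set of 42683 reporters, and the rounded shares match the 60%, 23%, and 17% stated above.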
Genome-wide changes in protein translation efficiency are associated with autism
We previously proposed that changes in the efficiency of protein translation are associated with autism spectrum disorders (ASDs). This hypothesis connects environmental factors and genetic factors because each can alter translation efficiency. For genetic factors, we previously tested our hypothesis using a small set of ASD-associated genes, a small set of ASD-associated variants, and a statistic to quantify by how much a single nucleotide variant (SNV) in a protein coding region changes translation speed. In this study, we confirm and extend our hypothesis using a published set of 1,800 autism quartets (parents, one affected child and one unaffected child) and genome-wide variants. Then, we extend the test statistic to combine translation efficiency with other possibly relevant variables: ribosome profiling data, presence/absence of CpG dinucleotides, and phylogenetic conservation. The inclusion of ribosome profiling abundances strengthens our results for male–male sibling pairs. The inclusion of CpG information strengthens our results for female–female pairs, giving an insight into the significant gender differences in autism incidence. By combining the single-variant test statistic for all variants in a gene, we obtain a single gene score to evaluate how well a gene distinguishes between affected and unaffected siblings. Using statistical methods, we compute gene sets that have some power to distinguish between affected and unaffected siblings by translation efficiency of gene variants. Pathway and enrichment analysis of those gene sets suggest the importance of Wnt signaling pathways, some other pathways related to cancer, ATP binding, and ATP-ase pathways in the etiology of ASDs.
Promoter-distal RNA polymerase II binding discriminates active from inactive CCAAT/enhancer-binding protein beta binding sites
Transcription factors (TFs) bind to thousands of DNA sequences in mammalian genomes, but most of these binding events appear to have no direct effect on gene expression. It is unclear why only a subset of TF-bound sites are actively involved in transcriptional regulation. Moreover, the key genomic features that accurately discriminate between active and inactive TF binding events remain ambiguous. Recent studies have identified promoter-distal RNA polymerase II (RNAP2) binding at enhancer elements, suggesting that these interactions may serve as a marker for active regulatory sequences. Despite these correlative analyses, a thorough functional validation of these genomic co-occupancies is still lacking. To characterize the gene regulatory activity of DNA sequences underlying promoter-distal TF binding events that co-occur with RNAP2 and TF sites devoid of RNAP2 occupancy using a functional reporter assay, we performed cis-regulatory element sequencing (CRE-seq). We tested more than 1000 promoter-distal CCAAT/enhancer-binding protein beta (CEBPB)-bound sites in HepG2 and K562 cells, and found that CEBPB-bound sites co-occurring with RNAP2 were more likely to exhibit enhancer activity. CEBPB-bound sites further maintained substantial cell-type specificity, indicating that local DNA sequence can accurately convey cell-type–specific regulatory information. By comparing our CRE-seq results to a comprehensive set of genome annotations, we identified a variety of genomic features that are strong predictors of regulatory element activity and cell-type–specific activity. Collectively, our functional assay results indicate that RNAP2 occupancy can be used as a key genomic marker that can distinguish active from inactive TF-bound sites.
Composition-based statistics and translated nucleotide searches: Improving the TBLASTN module of BLAST
BACKGROUND: TBLASTN is a mode of operation for BLAST that aligns protein sequences to a nucleotide database translated in all six frames. We present the first description of the modern implementation of TBLASTN, focusing on new techniques that were used to implement composition-based statistics for translated nucleotide searches. Composition-based statistics use the composition of the sequences being aligned to generate more accurate E-values, which allows for a more accurate distinction between true and false matches. Until recently, composition-based statistics were available only for protein-protein searches. They are now available as a command line option for recent versions of TBLASTN and as an option for TBLASTN on the NCBI BLAST web server. RESULTS: We evaluate the statistical and retrieval accuracy of the E-values reported by a baseline version of TBLASTN and by two variants that use different types of composition-based statistics. To test the statistical accuracy of TBLASTN, we ran 1000 searches using scrambled proteins from the mouse genome and a database of human chromosomes. To test retrieval accuracy, we modernize and adapt to translated searches a test set previously used to evaluate the retrieval accuracy of protein-protein searches. We show that composition-based statistics greatly improve the statistical accuracy of TBLASTN, at a small cost to the retrieval accuracy. CONCLUSION: TBLASTN is widely used, as it is common to wish to compare proteins to chromosomes or to libraries of mRNAs. Composition-based statistics improve the statistical accuracy, and therefore the reliability, of TBLASTN results. The algorithms used by TBLASTN are not widely known, and some of the most important are reported here. The data used to test TBLASTN are available for download and may be useful in other studies of translated search algorithms.
Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches
Protein sequence database search programs may be evaluated both for their retrieval accuracy (the ability to separate meaningful from chance similarities) and for the accuracy of their statistical assessments of reported alignments. However, methods for improving statistical accuracy can degrade retrieval accuracy by discarding compositional evidence of sequence relatedness. This evidence may be preserved by combining essentially independent measures of alignment and compositional similarity into a unified measure of sequence similarity. A version of the BLAST protein database search program, modified to employ this new measure, outperforms the baseline program in both retrieval and statistical accuracy on ASTRAL, a SCOP-based test set.
PSI-BLAST pseudocounts and the minimum description length principle
Position-specific score matrices (PSSMs) are derived from multiple sequence alignments to aid in the recognition of distant protein sequence relationships. The PSI-BLAST protein database search program derives the column scores of its PSSMs with the aid of pseudocounts, added to the observed amino acid counts in a multiple alignment column. In the absence of theory, the number of pseudocounts used has been a completely empirical parameter. This article argues that the minimum description length principle can motivate the choice of this parameter. Specifically, for realistic alignments, the principle supports the practice of using a number of pseudocounts essentially independent of alignment size. However, it also implies that more highly conserved columns should use fewer pseudocounts, increasing the inter-column contrast of the implied PSSMs. A new method for calculating pseudocounts that significantly improves PSI-BLAST's retrieval accuracy is now employed by default.
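The role of pseudocounts in a single alignment column can be sketched as a weighted mix of observed frequencies and prior frequencies. The abstract gives no formula, so this mixing form, the function name `column_probabilities`, and the parameters `alpha` (weight on observations) and `beta` (number of pseudocounts) are assumptions made for illustration; the sketch does not reproduce the new method the article describes.

```python
def column_probabilities(observed, prior, alpha, beta):
    """Mix observed column frequencies with prior frequencies.

    observed: dict of residue -> observed frequency in the column
    prior:    dict of residue -> pseudocount (prior) frequency
    alpha:    weight given to the observations (assumed parameter)
    beta:     weight given to the pseudocounts (assumed parameter)
    """
    if alpha < 0 or beta < 0 or alpha + beta == 0:
        raise ValueError("alpha and beta must be non-negative and not both zero")
    return {
        residue: (alpha * observed.get(residue, 0.0) + beta * prior[residue])
        / (alpha + beta)
        for residue in prior
    }


# Toy two-letter alphabet: a fully conserved column of residue "A".
observed = {"A": 1.0}
prior = {"A": 0.5, "B": 0.5}

many_pseudocounts = column_probabilities(observed, prior, alpha=9.0, beta=9.0)
few_pseudocounts = column_probabilities(observed, prior, alpha=9.0, beta=1.0)
```

With fewer pseudocounts the estimate for the conserved residue stays closer to its observed frequency, which is one way to picture the increased inter-column contrast mentioned in the abstract.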
Trees on networks: resolving statistical patterns of phylogenetic similarities among interacting proteins
BACKGROUND: Phylogenies capture the evolutionary ancestry linking extant species. Correlations and similarities among a set of species are mediated by and need to be understood in terms of the phylogenetic tree. In a similar way it has been argued that biological networks also induce correlations among sets of interacting genes or their protein products. RESULTS: We develop suitable statistical resampling schemes that can incorporate these two potential sources of correlation into a single inferential framework. To illustrate our approach we apply it to protein interaction data in yeast and investigate whether the phylogenetic trees of interacting proteins in a panel of yeast species are more similar than would be expected by chance. CONCLUSIONS: While we find only negligible evidence for such increased levels of similarities, our statistical approach allows us to resolve the previously reported contradictory results on the levels of co-evolution induced by protein-protein interactions. We conclude with a discussion as to how we may employ the statistical framework developed here in further functional and evolutionary analyses of biological networks and systems.
The Drosophila Gap Gene Network Is Composed of Two Parallel Toggle Switches
Drosophila “gap” genes provide the first response to maternal gradients in the early fly embryo. Gap genes are expressed in a series of broad bands across the embryo during the first hours of development. The gene network controlling the gap gene expression patterns includes inputs from maternal gradients and mutual repression between the gap genes themselves. In this study we propose a modular design for the gap gene network, involving two relatively independent network domains. The core of each network domain includes a toggle switch corresponding to a pair of mutually repressive gap genes, operated in space by maternal inputs. The toggle switches present in the gap network are evocative of the phage lambda switch, but they are operated positionally (in space) by the maternal gradients, so the synthesis rates for the competing components change along the embryo anterior-posterior axis. A dynamic model constructed on the proposed principle, with elements of fractional site occupancy, required 5–7 parameters to fit quantitative spatial expression data for gap gradients. The identified model solutions (parameter combinations) reproduced major dynamic features of the gap gradient system and explained gap expression in a variety of segmentation mutants.
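A toggle switch operated by position-dependent synthesis rates can be sketched with a generic pair of mutually repressing genes. This is a textbook Hill-type model integrated with the forward Euler method, not the authors' fractional-site-occupancy model; the function name, parameter values, and the two example "positions" are assumptions chosen for illustration.

```python
def simulate_toggle(syn_a, syn_b, k=1.0, n=2, dt=0.01, steps=5000):
    """Integrate two mutually repressing genes to an approximate steady state.

    syn_a, syn_b: maximal synthesis rates of genes A and B (assumed to be
                  set by maternal input at a given position)
    k, n:         repression threshold and Hill coefficient (assumed values)
    dt, steps:    forward Euler step size and number of steps
    """
    a = b = 0.1  # small, equal starting levels
    for _ in range(steps):
        # Each gene is synthesized less when its partner is abundant,
        # and decays at unit rate.
        da = syn_a / (1.0 + (b / k) ** n) - a
        db = syn_b / (1.0 + (a / k) ** n) - b
        a, b = a + dt * da, b + dt * db
    return a, b


# Two hypothetical positions along the axis with opposite maternal bias.
a_front, b_front = simulate_toggle(syn_a=4.0, syn_b=1.0)
a_back, b_back = simulate_toggle(syn_a=1.0, syn_b=4.0)
```

In this toy setting, gene A dominates where its synthesis rate is higher and gene B dominates at the position with the opposite bias, so changing the synthesis rates along the axis flips the outcome of the switch.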
Streptococcus pneumoniae Clonal Complex 199: Genetic Diversity and Tissue-Specific Virulence
Streptococcus pneumoniae is an important cause of otitis media and invasive disease. Since introduction of the heptavalent pneumococcal conjugate vaccine, there has been an increase in replacement disease due to serotype 19A clonal complex (CC)199 isolates. The goals of this study were to 1) describe genetic diversity among nineteen CC199 isolates from carriage, middle ear, blood, and cerebrospinal fluid, 2) compare CC199 19A (n = 3) and 15B/C (n = 2) isolates in the chinchilla model for pneumococcal disease, and 3) identify accessory genes associated with tissue-specific disease among a larger collection of S. pneumoniae isolates. CC199 isolates were analyzed by comparative genome hybridization. One hundred and twenty-seven genes were variably present. The CC199 phylogeny split into two main clades, one composed predominantly of carriage isolates and another of disease isolates. Ability to colonize and cause disease did not differ by serotype in the chinchilla model. However, isolates from the disease clade were associated with faster time to bacteremia compared to carriage clade isolates. One 19A isolate exhibited hypervirulence. Twelve tissue-specific genes/regions were identified by correspondence analysis. After screening a diverse collection of 326 isolates, spr0282 was associated with carriage. Four genes/regions, SP0163, SP0463, SPN05002 and RD8a, were associated with middle ear isolates. SPN05002 was also associated with blood and CSF, while RD8a was associated with blood isolates. The hypervirulent isolate's genome was sequenced using the Solexa paired-end sequencing platform and compared to that of a reference serotype 19A isolate, revealing the presence of a novel 20 kb region with sequence similarity to bacteriophage genes. Genetic factors other than serotype may modulate virulence potential in CC199. These studies have implications for the long-term effectiveness of conjugate vaccines. Ideally, future vaccines would target common proteins to effectively reduce carriage and disease in the vaccinated population.