    Developing and validating predictive decision tree models from mining chemical structural fingerprints and high–throughput screening data in PubChem

    Background: Recent advances in high-throughput screening (HTS) techniques and readily available compound libraries generated using combinatorial chemistry or derived from natural products enable the testing of millions of compounds in a matter of days. Due to the amount of information produced by HTS assays, it is very challenging to mine the HTS data for results of potential interest to drug development research. Computational approaches for the analysis of HTS results face great challenges due to the large quantity of information and the significant amounts of erroneous data produced.
    Results: In this study, Decision Tree (DT) based models were developed to discriminate compound bioactivities by using their chemical structure fingerprints provided in the PubChem system (http://pubchem.ncbi.nlm.nih.gov). The DT models were examined for filtering biological activity data contained in four assays deposited in the PubChem Bioassay Database, including assays tested for 5HT1a agonists, antagonists, and HIV-1 RT-RNase H inhibitors. The 10-fold cross-validation (CV) sensitivity, specificity, and Matthews Correlation Coefficient (MCC) for the models are 57.2-80.5%, 97.3-99.0%, and 0.4-0.5, respectively. A further evaluation was performed for DT models built for two independent bioassays, in which inhibitors for the same HIV RNase target were screened using different compound libraries; this experiment yielded enrichment factors of 4.4 and 9.7.
    Conclusion: Our results suggest that the designed DT models can be used as a virtual screening technique as well as a complement to traditional approaches for hit selection.
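The evaluation measures named in this abstract can be sketched from a confusion matrix. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the enrichment factor uses one common definition (hit rate among predicted actives divided by hit rate in the whole library), which the abstract itself does not spell out.

```python
import math


def screening_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of true actives recovered
    specificity = tn / (tn + fp)  # fraction of true inactives rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


def enrichment_factor(tp, fp, tn, fn):
    """Assumed definition: hit rate in the selected set / hit rate overall."""
    selected = tp + fp
    total = tp + fp + tn + fn
    if selected == 0 or (tp + fn) == 0:
        return 0.0
    return (tp / selected) / ((tp + fn) / total)


if __name__ == "__main__":
    # Made-up counts for illustration only; not data from the study.
    print(screening_metrics(tp=80, fp=200, tn=9700, fn=20))
    print(enrichment_factor(tp=80, fp=200, tn=9700, fn=20))
```

An enrichment factor above 1 means the model's predicted actives contain a higher proportion of true actives than the library as a whole.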

    PubChem3D: Shape compatibility filtering using molecular shape quadrupoles

    Background: PubChem provides a 3-D neighboring relationship, which involves finding the maximal shape overlap between two static compound 3-D conformations, a computationally intensive step. It is highly desirable to avoid this overlap computation, especially if it can be determined with certainty that a conformer pair cannot meet the criteria to be a 3-D neighbor. As such, PubChem employs a series of pre-filters, based on the concept of volume, to remove approximately 65% of all conformer neighbor pairs prior to shape overlap optimization. Given that molecular volume, a somewhat vague concept, is rather effective, it leads one to wonder: can the existing PubChem 3-D neighboring relationship, which consists of billions of shape-similar conformer pairs from tens of millions of unique small molecules, be used to identify additional shape descriptor relationships? Or, put more specifically, can one place an upper bound on shape similarity using other "fuzzy" shape-like concepts like length, width, and height?
    Results: Using a basis set of 4.18 billion 3-D neighbor pairs identified from single-conformer-per-compound neighboring of 17.1 million molecules, shape descriptors were computed for all conformers. These steric shape descriptors included several forms of molecular volume and shape quadrupoles, which essentially embody the length, width, and height of a conformer. For a given 3-D neighbor conformer pair, the volume and each quadrupole component (Q_x, Q_y, and Q_z) were binned and their frequency of occurrence was examined. Per molecular volume type, this effectively produced three different maps, one per quadrupole component (Q_x, Q_y, and Q_z), of allowed values for the similarity metric, shape Tanimoto (ST) ≥ 0.8.
    The efficiency of these relationships (in terms of true positives, true negatives, false positives, and false negatives) as a function of ST threshold was determined in a test run of 13.2 billion conformer pairs not previously considered by the 3-D neighbor set. At ST ≥ 0.8, a filtering efficiency of 40.4% of true negatives was achieved with only 32 false negatives out of 24 million true positives when applying the separate Q_x, Q_y, and Q_z maps in series (Q_xyz). This efficiency increased linearly as a function of ST threshold in the range 0.8-0.99. The Q_x filter was consistently the most efficient, followed by Q_y and then by Q_z. Use of a monopole volume showed the best overall performance, followed by the self-overlap volume and then by the analytic volume.
    Application of the monopole-based Q_xyz filter in a "real world" test of 3-D neighboring of 4,218 chemicals of biomedical interest against 26.1 million molecules in PubChem reduced the total CPU cost of neighboring by 24-38% and, if used as the initial filter, removed from consideration 48.3% of all conformer pairs at almost negligible computational overhead.
    Conclusion: Basic shape descriptors, such as those embodied by size, length, width, and height, can be highly effective in identifying shape-incompatible compound conformer pairs. When performing a 3-D search using a shape similarity cut-off, computation can be avoided by identifying conformer pairs that cannot meet the result criteria. Applying this methodology as a filter for PubChem 3-D neighboring computation, an improvement of 31% was realized, increasing the average conformer pair throughput from 154,000 to 202,000 per second per CPU core.
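The idea of bounding shape similarity before computing an overlap can be illustrated with the simplest volume-based case. This is a sketch under stated assumptions, not PubChem's filter: it uses the standard shape Tanimoto definition ST = V_AB / (V_A + V_B - V_AB) and assumes the overlap volume V_AB cannot exceed the smaller of the two volumes, which gives ST ≤ min(V_A, V_B) / max(V_A, V_B). The function names are hypothetical, and the quadrupole maps described in the abstract are not reproduced here.

```python
def shape_tanimoto(v_a, v_b, v_ab):
    """Shape Tanimoto from two self-volumes and their overlap volume."""
    return v_ab / (v_a + v_b - v_ab)


def st_upper_bound(v_a, v_b):
    """Upper bound on ST from the volumes alone.

    Assumption: the overlap volume is at most min(v_a, v_b). Because ST
    grows with the overlap, the best case is ST = min / max.
    """
    return min(v_a, v_b) / max(v_a, v_b)


def can_skip_overlap(v_a, v_b, st_threshold=0.8):
    """True if the pair cannot reach the threshold, so the costly
    overlap optimization can be skipped without losing a neighbor."""
    return st_upper_bound(v_a, v_b) < st_threshold


if __name__ == "__main__":
    # Illustrative volumes only; not values from the study.
    print(can_skip_overlap(300.0, 500.0))  # bound is 0.6, below 0.8
    print(can_skip_overlap(450.0, 500.0))  # bound is 0.9, must be computed
```

A filter of this kind never discards a true neighbor as long as its assumption holds, which is why it can run before the overlap optimization.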

    Improving protein structure similarity searches using domain boundaries based on conserved sequence information

    Background: The identification of protein domains plays an important role in protein structure comparison. Domain query size and composition are critical to structure similarity search algorithms such as the Vector Alignment Search Tool (VAST), the method employed for computing related protein structures in the NCBI Entrez system. Currently, domains identified on the basis of structural compactness are used for VAST computations. In this study, we have investigated how alternative definitions of domains derived from conserved sequence alignments in the Conserved Domain Database (CDD) would affect the domain comparisons and structure similarity search performance of VAST.
    Results: Alternative domains, which have significantly different secondary structure composition from those based on structurally compact units, were identified based on the alignment footprints of curated protein sequence domain families. Our analysis indicates that domain boundaries disagree on roughly 8% of protein chains in the medium redundancy subset of the Molecular Modeling Database (MMDB). These conflicting sequence-based domain boundaries perform slightly better than structure domains in structure similarity searches, and there are interesting cases in which structure similarity search performance is markedly improved.
    Conclusion: Structure similarity searches using domain boundaries based on conserved sequence information can provide an additional method for investigators to identify interesting similarities between proteins with known structures. Because of the improvement in performance of structure similarity searches using sequence domain boundaries, we are in the process of implementing their inclusion into the VAST search and MMDB resources in the NCBI Entrez system.

    Predicting drug–drug interactions through drug structural similarities and interaction networks incorporating pharmacokinetics and pharmacodynamics knowledge

    Additional file 1. Table S1. Average structural similarity scores for the DDI/non–DDI pairs in the network of each De. Table S2-1. Top 10 predicted drugs with DDIs for warfarin. Table S2-2. Top 10 predicted drugs with DDIs for simvastatin. Table S3. Four-fold cross-validation test results. Text S1. Drugs that show DDI (DrugBank ID). Figure S1. Illustration of construction of training and test set for 4-fold cross validation. Figure S2. ROC curves using the models with score set 1 in a 4-fold validation

    Knowledge-based annotation of small molecule binding sites in proteins

    Background: The study of protein-small molecule interactions is vital for understanding protein function and for practical applications in drug discovery. To benefit from the rapidly increasing structural data, it is essential to improve the tools that enable large-scale binding site prediction with greater emphasis on their biological validity.
    Results: We have developed a new method for the annotation of protein-small molecule binding sites, using inference by homology, which allows us to extend annotation onto protein sequences without experimental data available. To ensure biological relevance of binding sites, our method clusters similar binding sites found in homologous protein structures based on their sequence and structure conservation. Binding sites that appear evolutionarily conserved among non-redundant sets of homologous proteins are given higher priority. After binding sites are clustered, position-specific score matrices (PSSMs) are constructed from the corresponding binding site alignments. Together with other measures, the PSSMs are subsequently used to rank binding sites to assess how well they match the query and to better gauge their biological relevance. The method also facilitates a succinct and informative representation of observed and inferred binding sites from homologs with known three-dimensional structures, thereby providing the means to analyze conservation and diversity of binding modes. Furthermore, the chemical properties of small molecules bound to the inferred binding sites can be used as a starting point in small molecule virtual screening. The method was validated by comparison to other binding site prediction methods and to a collection of manually curated binding site annotations. We show that our method achieves a sensitivity of 72% at predicting biologically relevant binding sites and can accurately discriminate those sites that bind biological small molecules from non-biological ones.
    Conclusions: A new algorithm has been developed to predict binding sites with high accuracy in terms of their biological validity. It also provides a common platform for function prediction, knowledge-based docking, and small molecule virtual screening. The method can be applied even to a query sequence without a structure. The method is available at http://www.ncbi.nlm.nih.gov/Structure/ibis/ibis.cgi.
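The PSSM step mentioned in this abstract can be sketched generically. This is a textbook-style log-odds construction, not the scoring used by the published method: the uniform background, the pseudocount value, the ungapped equal-length input, and all names are assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_sites, pseudocount=1.0):
    """Build a log-odds PSSM from equal-length, ungapped site sequences.

    Assumptions: uniform background frequency (1/20) and a flat
    pseudocount added to every residue count in every column.
    """
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all aligned sites must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned_sites) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(site[col] for site in aligned_sites)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_site(pssm, query):
    """Sum the per-column log-odds scores for a query of matching length."""
    if len(query) != len(pssm):
        raise ValueError("query length must match the PSSM length")
    return sum(column[res] for column, res in zip(pssm, query))


if __name__ == "__main__":
    # Toy binding-site alignment, invented for illustration.
    sites = ["GKS", "GKT", "GKS", "AKS"]
    pssm = build_pssm(sites)
    print(score_site(pssm, "GKS"), score_site(pssm, "WWW"))
```

A query that matches the conserved columns scores higher than one that does not, which is the property a ranking step can exploit.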

    Refining multiple sequence alignments with conserved core regions

    Accurate multiple sequence alignments of proteins are very important to several areas of computational biology and provide an understanding of the phylogenetic history of domain families, their identification, and their classification. This article presents a new algorithm, REFINER, that refines a multiple sequence alignment by iterative realignment of its individual sequences with the predetermined conserved core (block) model of a protein family. Realignment of each sequence can correct misalignments between a given sequence and the rest of the profile while preserving the family's overall block model. Large-scale benchmarking studies showed a noticeable improvement of alignment after refinement. This can be inferred from the increased alignment score and the enhanced sensitivity of database searches that use sequence profiles derived from refined alignments, compared with the original alignments. A standalone version of the program is available by ftp distribution and will be incorporated into the next release of the Cn3D structure/alignment viewer.

    A Case of Infectious Purpura Fulminans: An Unusual Organism and Method of Diagnosis

    Infectious purpura fulminans is a rapidly progressive skin necrosis that carries a mortality rate of 30%. Here, we describe a case of infectious purpura fulminans caused by Capnocytophaga and diagnosed by a blood film.