Chemical databases: curation or integration by user-defined equivalence?
There is a wealth of valuable chemical information in publicly available databases for use by scientists undertaking drug discovery. However, finite curation resources, limitations of chemical structure software and differences in individual database applications mean that exact chemical structure equivalence between databases is unlikely ever to be a reality. The ability to identify compound equivalence has been made significantly easier by the use of the International Chemical Identifier (InChI), a non-proprietary line notation for describing a chemical structure. More importantly, advances in methods to identify compounds that are the same at various levels of similarity, such as those containing the same parent component or having the same connectivity, are now enabling related compounds to be linked between databases where the structure matches are not exact.
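The connectivity-level matching described above can be sketched using the structure of the standard InChIKey, whose first 14-character block encodes molecular connectivity (skeleton) while the second block carries stereochemistry and other layers. The helper names below are my own, and the example keys are shown for illustration only:

```python
def inchikey_blocks(inchikey: str):
    """Split a standard InChIKey into its three hyphen-separated blocks:
    skeleton (connectivity), stereo/isotope block, and protonation flag."""
    skeleton, stereo, proton = inchikey.split("-")
    return skeleton, stereo, proton

def same_connectivity(key_a: str, key_b: str) -> bool:
    """Two standard InChIKeys describe the same molecular skeleton when
    their first 14-character blocks match, even if stereochemistry or
    protonation differs -- the basis for linking 'related' compounds."""
    return inchikey_blocks(key_a)[0] == inchikey_blocks(key_b)[0]

# Illustrative keys for two stereoisomers sharing one parent skeleton:
key_isomer_1 = "WQZGKKKJIJFFOK-GASJEMHNSA-N"
key_isomer_2 = "WQZGKKKJIJFFOK-DVKNGEFBSA-N"
print(same_connectivity(key_isomer_1, key_isomer_2))  # True: same skeleton
```

In practice the keys would be generated from structures with a toolkit such as RDKit rather than typed by hand; the comparison logic itself is this simple.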
Improving the odds of drug development success through human genomics: modelling study.
Lack of efficacy in the intended disease indication is the major cause of clinical phase drug development failure. Explanations could include the poor external validity of pre-clinical (cell, tissue, and animal) models of human disease and the high false discovery rate (FDR) in preclinical science. FDR is related to the proportion of true relationships available for discovery (γ), and the type 1 (false-positive) and type 2 (false-negative) error rates of the experiments designed to uncover them. We estimated the FDR in preclinical science, its effect on drug development success rates, and improvements expected from use of human genomics rather than preclinical studies as the primary source of evidence for drug target identification. Calculations were based on a sample space defined by all human diseases - the 'disease-ome' - represented as columns; and all protein-coding genes - the 'protein-coding genome' - represented as rows, producing a matrix of unique gene- (or protein-) disease pairings. We parameterised the space based on 10,000 diseases, 20,000 protein-coding genes, 100 causal genes per disease and 4000 genes encoding druggable targets, examining the effect of varying the parameters and a range of underlying assumptions on the inferences drawn. We estimated γ, defined mathematical relationships between preclinical FDR and drug development success rates, and estimated improvements in success rates based on human genomics (rather than orthodox preclinical studies). Around one in every 200 protein-disease pairings was estimated to be causal (γ = 0.005), giving an FDR in preclinical research of 92.6%, which likely makes a major contribution to the reported drug development failure rate of 96%. Observed success rate was only slightly greater than expected for a random pick from the sample space. Values for γ back-calculated from reported preclinical and clinical drug development success rates were also close to the a priori estimates.
Substituting genome-wide (or druggable genome-wide) association studies for preclinical studies as the major information source for drug target identification was estimated to reverse the probability of late-stage failure, because of the more stringent type 1 error rate employed and the ability to interrogate every potential druggable target in the same experiment. Genetic studies conducted at much larger scale, with greater resolution of disease end-points, e.g. by connecting genomics and electronic health record data within healthcare systems, have the potential to produce a radical improvement in drug development success rates.
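The headline numbers in this abstract can be reproduced from the stated sample-space parameters and the standard FDR formula. The α and power values below are assumptions (a conventional α = 0.05 and power of 0.8 are not stated in the abstract, but they recover the quoted 92.6%):

```python
# Reconstructing gamma and the preclinical FDR from the abstract's parameters.
diseases = 10_000
genes = 20_000
causal_per_disease = 100

# gamma: proportion of gene-disease pairings that are truly causal
gamma = (diseases * causal_per_disease) / (diseases * genes)

alpha = 0.05   # type 1 (false-positive) error rate -- assumed, not stated
power = 0.80   # 1 - type 2 (false-negative) error rate -- assumed, not stated

# FDR = false positives / (false positives + true positives)
fdr = (alpha * (1 - gamma)) / (alpha * (1 - gamma) + power * gamma)

print(f"gamma = {gamma}")    # 0.005, i.e. one in 200 pairings causal
print(f"FDR   = {fdr:.1%}")  # 92.6%
```

The calculation shows why a low prior probability of a true relationship (γ) dominates the FDR even with conventional error rates, which is the core argument of the paper.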
Whole-genome sequencing to understand the genetic architecture of common gene expression and biomarker phenotypes.
Initial results from sequencing studies suggest that there are relatively few low-frequency (<5%) variants associated with large effects on common phenotypes. We performed low-pass whole-genome sequencing in 680 individuals from the InCHIANTI study to test two primary hypotheses: (i) that sequencing would detect single low-frequency, large-effect variants that explained similar amounts of phenotypic variance as single common variants, and (ii) that some common variant associations could be explained by low-frequency variants. We tested two sets of disease-related common phenotypes for which we had statistical power to detect large numbers of common variant-common phenotype associations: 11 132 cis-gene expression traits in 450 individuals and 93 circulating biomarkers in all 680 individuals. From a total of 11 657 229 high-quality variants, of which 6 129 221 and 5 528 008 were common and low frequency (<5%), respectively, low-frequency, large-effect associations comprised 7% of detectable cis-gene expression traits [89 of 1314 cis-eQTLs at P < 1 × 10^-6 (false discovery rate ∼5%)] and one of eight biomarker associations at P < 8 × 10^-10. Very few (30 of 1232; 2%) common variant associations were fully explained by low-frequency variants. Our data show that whole-genome sequencing can identify low-frequency variants undetected by genotyping-based approaches when sample sizes are sufficiently large to detect substantial numbers of common variant associations, and that common variant associations are rarely explained by single low-frequency variants of large effect.
The EBI RDF platform: linked open data for the life sciences
Motivation: Resource description framework (RDF) is an emerging technology for describing, publishing and linking life science data. As a major provider of bioinformatics data and services, the European Bioinformatics Institute (EBI) is committed to making data readily accessible to the community in ways that meet existing demand. The EBI RDF platform has been developed to meet an increasing demand to coordinate RDF activities across the institute and provides a new entry point to querying and exploring integrated resources available at the EBI.
Availability: http://www.ebi.ac.uk/rdf
Contact: [email protected]
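The linked-data model behind an RDF platform is simply facts expressed as subject-predicate-object triples, queried by pattern matching. The toy sketch below illustrates that idea in plain Python; the identifiers are illustrative and this is not the EBI platform's actual API, which is queried with SPARQL over HTTP:

```python
# A minimal in-memory triple store, illustrating the RDF data model.
triples = {
    ("chembl:CHEMBL25", "rdfs:label", "ASPIRIN"),
    ("chembl:CHEMBL25", "chembl:hasMechanism", "chembl:MEC_1"),
    ("chembl:MEC_1", "chembl:hasTarget", "uniprot:P23219"),
}

def match(s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard,
    analogous to a variable in a SPARQL basic graph pattern."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# 'What is the label of CHEMBL25?' -- a one-pattern query
print(match(s="chembl:CHEMBL25", p="rdfs:label"))
```

Because every resource is named by a URI-style identifier, triples from different databases that mention the same identifier link up automatically, which is what makes the integrated querying described in the abstract possible.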
An open source chemical structure curation pipeline using RDKit.
Funder: European Molecular Biology Laboratory; doi: http://dx.doi.org/10.13039/100013060
BACKGROUND: The ChEMBL database is one of a number of public databases that contain bioactivity data on small-molecule compounds curated from diverse sources. Incoming compounds are typically not standardised according to consistent rules. In order to maintain the quality of the final database, and to easily compare and integrate data on the same compound from different sources, it is necessary for the chemical structures in the database to be appropriately standardised. RESULTS: A chemical curation pipeline has been developed using the open source toolkit RDKit. It comprises three components: a Checker, which tests the validity of chemical structures and flags any serious errors; a Standardizer, which formats compounds according to defined rules and conventions; and a GetParent component, which removes any salts and solvents from the compound to create its parent. This pipeline has been applied to the latest version of the ChEMBL database, as well as to uncurated datasets from other sources, to test the robustness of the process and to identify common issues in database molecular structures. CONCLUSION: All the components of the structure pipeline have been made freely available for other researchers to use and adapt for their own use. The code is available in a GitHub repository and can also be accessed via the ChEMBL Beaker webservices. It has been used successfully to standardise the nearly 2 million compounds in the ChEMBL database, and the compound validity checker has been used to identify the compounds with the most serious issues so that they can be prioritised for manual curation.
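The three-stage flow described above (check, standardize, get parent) can be sketched as a toy pipeline over SMILES strings. The function names, rules, and salt list here are illustrative only; the real pipeline operates on molblocks via RDKit with far more comprehensive chemistry:

```python
# Toy sketch of a Checker -> Standardizer -> GetParent curation flow.
KNOWN_SALT_COMPONENTS = {"[Na+]", "[Cl-]", "[K+]", "O"}  # illustrative subset

def check(smiles: str) -> list:
    """Flag obvious structural problems; the real Checker returns a
    penalty-scored list of issues."""
    issues = []
    if not smiles:
        issues.append("empty structure")
    elif smiles.count("(") != smiles.count(")"):
        issues.append("unbalanced parentheses")
    return issues

def standardize(smiles: str) -> str:
    """Apply formatting conventions; here, just whitespace cleanup."""
    return smiles.strip()

def get_parent(smiles: str) -> str:
    """Drop known salt/solvent components, keeping the largest remainder."""
    components = [c for c in smiles.split(".") if c not in KNOWN_SALT_COMPONENTS]
    return max(components, key=len) if components else smiles

sodium_acetate = "CC(=O)[O-].[Na+]"
mol = standardize(sodium_acetate)
assert check(mol) == []          # no serious errors flagged
print(get_parent(mol))           # CC(=O)[O-] : the parent acetate
```

The design point the sketch captures is the separation of concerns: validity checking never mutates the structure, standardisation is deterministic, and parent extraction is a distinct, reversible step.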
Shouldn't enantiomeric purity be included in the minimum information about a bioactive entity? Response from the MIABE group
[Reply to Letter]