
    Benchmarking network propagation methods for disease gene identification

    In-silico identification of potential target genes for disease is an essential aspect of drug target discovery. Recent studies suggest that successful targets can be found by leveraging genetic, genomic and protein interaction information. Here, we systematically tested the ability of 12 varied algorithms, based on network propagation, to identify genes that have been targeted by any drug, on gene-disease data from 22 common non-cancerous diseases in OpenTargets. We considered two biological networks and six performance metrics, and compared two types of input gene-disease association scores. The impact of the design factors on performance was quantified through additive explanatory models. Standard cross-validation led to over-optimistic performance estimates due to the presence of protein complexes. In order to obtain realistic estimates, we introduced two novel protein complex-aware cross-validation schemes. When seeding biological networks with known drug targets, machine learning and diffusion-based methods found around 2-4 true targets within the top 20 suggestions. Seeding the networks with genes associated with disease by genetics decreased performance below 1 true hit on average. The use of a larger network, although noisier, improved overall performance. We conclude that diffusion-based prioritisers and machine learning applied to diffusion-based features are suited for drug discovery in practice and improve over simpler neighbour-voting methods. We also demonstrate the large impact of choosing an adequate validation strategy and the definition of seed disease genes. Peer reviewed. Postprint (published version).
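To make the idea of network propagation concrete, here is a minimal sketch of one common diffusion scheme, random walk with restart, followed by a count of true targets in the top-k suggestions. It is an illustration only and not one of the 12 algorithms benchmarked in the paper; the function names, the restart probability of 0.3 and the toy network are all assumptions.

```python
import numpy as np

def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Propagate seed scores over a network; returns one score per node."""
    adjacency = np.asarray(adjacency, dtype=float)
    # Column-normalise so each column sums to 1 (isolated nodes keep a zero column).
    col_sums = adjacency.sum(axis=0)
    col_sums[col_sums == 0] = 1.0
    transition = adjacency / col_sums
    # Start with all probability mass spread evenly over the seed nodes.
    p0 = np.zeros(adjacency.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * (transition @ p) + restart * p0
        delta = np.abs(p_next - p).sum()
        p = p_next
        if delta < tol:
            break
    return p

def hits_in_top_k(scores, true_targets, exclude=(), k=20):
    """Count true targets among the k highest-scoring nodes, skipping excluded (seed) nodes."""
    excluded = set(exclude)
    ranked = [int(i) for i in np.argsort(-scores) if int(i) not in excluded]
    return len(set(ranked[:k]) & set(true_targets))

# Toy path network 0-1-2-3-4 (hypothetical), seeded at node 0.
toy_adjacency = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
])
toy_scores = random_walk_with_restart(toy_adjacency, seeds=[0])
toy_hits = hits_in_top_k(toy_scores, true_targets={1, 4}, exclude={0}, k=2)
```

In a cross-validated benchmark, the seeds would be the training-fold genes and `true_targets` the held-out genes; the complex-aware schemes described in the abstract would change how those folds are drawn, not the propagation step itself.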

    New functional families (FunFams) in CATH to improve the mapping of conserved functional sites to 3D structures.

    CATH version 3.5 (Class, Architecture, Topology, Homology, available at http://www.cathdb.info/) contains 173 536 domains, 2626 homologous superfamilies and 1313 fold groups. When focusing on structural genomics (SG) structures, we observe that the number of new folds for CATH v3.5 is slightly less than for previous releases, and this observation suggests that we may now know the majority of folds that are easily accessible to structure determination. We have improved the accuracy of our functional family (FunFams) sub-classification method and the CATH sequence domain search facility has been extended to provide FunFam annotations for each domain. The CATH website has been redesigned. We have improved the display of functional data and of conserved sequence features associated with FunFams within each CATH superfamily

    BRDF correction of S3 OLCI water reflectance products

    Ocean Optics XXV, 2-7 October 2022, Quy Nhon, Binh Dinh, Vietnam. 1 page, figures. Ongoing study to minimize the effects of the Bidirectional Reflectance Distribution Function (BRDF) and deliver Sentinel-3 OLCI fully normalized water reflectances. EUMETSAT Contract Ref.: RB_EUM-CO-21-4600002626-JIG. Peer reviewed.

    GeMMA: functional subfamily classification within superfamilies of predicted protein structural domains

    GeMMA (Genome Modelling and Model Annotation) is a new approach to automatic functional subfamily classification within families and superfamilies of protein sequences. A major advantage of GeMMA is its ability to subclassify very large and diverse superfamilies with tens of thousands of members, without the need for an initial multiple sequence alignment. Its performance is shown to be comparable to the established high-performance method SCI-PHY. GeMMA follows an agglomerative clustering protocol that uses existing software for sensitive and accurate multiple sequence alignment and profile–profile comparison. The produced subfamilies are shown to be equivalent in quality whether whole protein sequences are used or just the sequences of component predicted structural domains. A faster, heuristic version of GeMMA that also uses distributed computing is shown to maintain the performance levels of the original implementation. The use of GeMMA to increase the functional annotation coverage of functionally diverse Pfam families is demonstrated. It is further shown how GeMMA clusters can help to predict the impact of experimentally determining a protein domain structure on comparative protein modelling coverage, in the context of structural genomics

    Expansion of the Protein Repertoire in Newly Explored Environments: Human Gut Microbiome Specific Protein Families

    The microbes that inhabit particular environments must be able to perform molecular functions that provide them with a competitive advantage to thrive in those environments. As most molecular functions are performed by proteins and are conserved between related proteins, we can expect that organisms successful in a given environmental niche would contain protein families that are specific for functions that are important in that environment. For instance, the human gut is rich in polysaccharides from the diet or secreted by the host, and is dominated by Bacteroides, whose genomes contain a highly expanded repertoire of protein families involved in carbohydrate metabolism. To identify other protein families that are specific to this environment, we investigated the distribution of protein families in the currently available human gut genomic and metagenomic data. Using an automated procedure, we identified a group of protein families strongly overrepresented in the human gut. These include not only many families described previously but also, interestingly, a large group of previously unrecognized protein families, which suggests that we still have much to discover about this environment. The identification and analysis of these families could provide us with new information about an environment critical to our health and well-being.

    Prediction of hot spot residues at protein-protein interfaces by combining machine learning and energy-based methods

    Background: Alanine scanning mutagenesis is a powerful experimental methodology for investigating the structural and energetic characteristics of protein complexes. Individual amino acids are systematically mutated to alanine and the changes in free energy of binding (ΔΔG) are measured. Several experiments have shown that protein-protein interactions are critically dependent on just a few residues ("hot spots") at the interface. Hot spots make a dominant contribution to the free energy of binding and, if mutated, can disrupt the interaction. As mutagenesis studies require significant experimental effort, there is a need for accurate and reliable computational methods. Such methods would also add to our understanding of the determinants of affinity and specificity in protein-protein recognition. Results: We present a novel computational strategy to identify hot spot residues, given the structure of a complex. We consider the basic energetic terms that contribute to hot spot interactions, i.e. van der Waals potentials, solvation energy, hydrogen bonds and Coulomb electrostatics. We treat them as input features and use machine learning algorithms such as Support Vector Machines and Gaussian Processes to optimally combine and integrate them, based on a set of training examples of alanine mutations. We show that our approach is effective in predicting hot spots and that it compares favourably to other available methods. In particular, we find the best performance using Transductive Support Vector Machines, a semi-supervised learning scheme. When hot spots are defined as those residues for which ΔΔG ≥ 2 kcal/mol, our method achieves a precision of 56% and a recall of 65%. Conclusion: We have developed a hybrid scheme in which energy terms are used as input features of machine learning models. This strategy combines the strengths of machine learning and energy-based methods. Although so far these two types of approaches have mainly been applied separately to biomolecular problems, the results of our investigation indicate that there are substantial benefits to be gained by their integration.
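The hot spot definition and the two reported metrics can be illustrated with a short sketch. It labels residues as hot spots when the measured change in binding free energy is at least 2 kcal/mol, as in the abstract, and computes precision and recall for a set of predictions. The function name and the example values are hypothetical and are not data from the paper.

```python
def precision_recall(ddg_measured, predicted_hot, threshold=2.0):
    """Precision and recall of hot spot predictions.

    ddg_measured: measured binding free energy changes (kcal/mol), one per residue.
    predicted_hot: booleans, True where the model predicts a hot spot.
    threshold: a residue is a true hot spot when its value is >= threshold.
    """
    actual_hot = [ddg >= threshold for ddg in ddg_measured]
    pairs = list(zip(actual_hot, predicted_hot))
    true_pos = sum(1 for actual, pred in pairs if actual and pred)
    false_pos = sum(1 for actual, pred in pairs if not actual and pred)
    false_neg = sum(1 for actual, pred in pairs if actual and not pred)
    # Guard against empty denominators (no predictions or no true hot spots).
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall

# Made-up example: five interface residues.
example_ddg = [2.5, 0.3, 2.0, 1.1, 3.2]
example_pred = [True, True, False, False, True]
example_precision, example_recall = precision_recall(example_ddg, example_pred)
```

Precision answers "of the residues flagged as hot spots, how many really are", while recall answers "of the real hot spots, how many were flagged".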

    Optical classification of contrasted coastal waters

    International audience. Optical water types were identified from an in situ data set of concomitant biogeochemical and optical parameters collected in contrasted turbid coastal areas of the eastern English Channel, southern North Sea and French Guiana at different seasons (211 stations). Four optical classes have been defined using a clustering approach performed on the spectrally normalized reflectance spectra. Normalization of the reflectance spectra was carried out during the statistical analysis to emphasize the shape of the reflectances rather than their magnitude. Each optical water type is associated with a specific bio-optical environment, in agreement with previous works. Two classes present a very marked optical character, one being mostly determined by strong phytoplankton and dissolved material loads, and the other by a high proportion of mineral particles. The two remaining classes are related to optically mixed situations, although there are some differences in the relative proportion of particulate mineral material. Applying this optical typology to the SeaWiFS daily reflectance data, we found that these four optical water types are highly representative: they describe about two thirds of the reflectance spectra found within the development sites, whatever the season. The adequacy of the optical water type definition for monitoring the spatio-temporal variability of the optical quality of coastal water masses, which reflects the impact of biological and hydrodynamic processes occurring at different time scales (i.e. from high frequency to seasonal processes), has been demonstrated. The four-class typology has been shown to remain highly representative at global scale. This underlines the effective optical vicinity of some parts of the coastal ocean during some periods of the year, in spite of the recognized high optical diversity of coastal waters. This further demonstrates the high pertinence of the class-based approach for large-scale coastal applications. Finally, the potential of class-based inversion algorithms for improving the retrieval of ocean color products, as well as the applicability of such class-specific algorithms with respect to satellite information, has been illustrated from the estimation of the suspended matter concentration. This work provides very encouraging evidence of the potential and adequacy of class-based inversion methods for deriving bio-optical products in optically complex waters such as the coastal ocean.
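The abstract does not state which normalization was applied, so the sketch below assumes one common choice: dividing each reflectance spectrum by its integral over wavelength, so that spectra differing only in magnitude collapse onto the same shape before clustering. The function name and the toy spectra are hypothetical.

```python
import numpy as np

def normalize_spectra(spectra, wavelengths):
    """Divide each spectrum (one per row) by its trapezoidal integral over wavelength."""
    spectra = np.asarray(spectra, dtype=float)
    wavelengths = np.asarray(wavelengths, dtype=float)
    # Trapezoidal rule written out explicitly: mean of neighbouring values times band width.
    band_widths = np.diff(wavelengths)
    integrals = 0.5 * ((spectra[:, 1:] + spectra[:, :-1]) * band_widths).sum(axis=1)
    return spectra / integrals[:, np.newaxis]

# Two toy spectra with the same shape but a factor of two in magnitude.
toy_wavelengths = [400.0, 500.0, 600.0]
toy_spectra = [[1.0, 2.0, 3.0],
               [2.0, 4.0, 6.0]]
toy_normalized = normalize_spectra(toy_spectra, toy_wavelengths)
```

After this step both toy spectra are identical, which is the property that lets a clustering algorithm group stations by spectral shape rather than by overall brightness.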

    Assessment of the colored dissolved organic matter in coastal waters from ocean color remote sensing

    International audience. Knowledge of the spatio-temporal variability of absorption by colored dissolved organic matter, $a_{cdom}$, in coastal areas is of fundamental importance in many fields of research related to biogeochemical cycles, coastal area management, and land-water interactions in the coastal domain. A new method, based on the theoretical link between the vertical attenuation coefficient, $K_d$, and the absorption coefficient, has been developed to assess $a_{cdom}$. This method, confirmed by radiative transfer simulations and in situ measurements, and tested on an independent in situ data set (N = 126), allows $a_{cdom}$ to be assessed with a Mean Relative Absolute Difference, MRAD, of 33% over two orders of magnitude (from 0.01 to 1.16 m$^{-1}$). In the frame of ocean color observation, $K_d$ is not directly measured but estimated from the remote sensing reflectance, $R_{rs}$. Based on 109 coincident satellite (SeaWiFS) and in situ (i.e. match-up) data points, $a_{cdom}$ is retrieved with an MRAD value of 37%. This simple model generally performs slightly better than recently developed empirical or semi-analytical algorithms.
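The abstract reports MRAD values without giving the formula. The sketch below assumes the usual form of this metric, the mean of the absolute differences between estimated and measured values, each divided by the measured value and expressed in percent. That definition, the function name and the example numbers are assumptions, not taken from the paper.

```python
def mean_relative_absolute_difference(estimated, measured):
    """MRAD in percent: 100 / N * sum(|estimated_i - measured_i| / measured_i).

    Assumes every measured value is strictly positive, as for an absorption coefficient.
    """
    if len(estimated) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    relative_errors = [abs(est - obs) / obs for est, obs in zip(estimated, measured)]
    return 100.0 * sum(relative_errors) / len(relative_errors)

# Made-up values: relative errors of 10%, 10% and 0%.
example_mrad = mean_relative_absolute_difference([1.1, 0.9, 2.0], [1.0, 1.0, 2.0])
```

Because each error is scaled by the measured value, the metric weights small and large absorption values equally, which suits a quantity spanning two orders of magnitude.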

    Assessing the Impact of a Two-Layered Spherical Geometry of Phytoplankton Cells on the Bulk Backscattering Ratio of Marine Particulate Matter

    The bulk backscattering ratio ($\tilde{b}_{bp}$) is commonly used as a descriptor of the bulk real refractive index of the particulate assemblage in natural waters. Based on numerical simulations, we analyze the impact of modeled structural heterogeneity of phytoplankton cells on $\tilde{b}_{bp}$. $\tilde{b}_{bp}$ is modeled considering viruses, heterotrophic bacteria, phytoplankton, organic detritus, and minerals. Three case studies are defined according to the relative abundance of the components. Two case studies represent typical open-ocean situations: oligotrophic waters and a phytoplankton bloom. The third case study is typical of coastal waters with the presence of minerals. Phytoplankton cells are modeled by a two-layered spherical geometry representing a chloroplast surrounding the cytoplasm. The $\tilde{b}_{bp}$ values are higher when structural heterogeneity is considered because the contribution of coated spheres to light backscattering is higher than that of homogeneous spheres. The impact of heterogeneity is, however, strongly conditioned by the hyperbolic slope ξ of the particle size distribution. Even if the relative abundance of phytoplankton is small (<1%), $\tilde{b}_{bp}$ increases by about 58% (for ξ = 4 and for oligotrophic waters) when the heterogeneity is taken into account, in comparison with a particulate population composed only of homogeneous spheres. As expected, heterogeneity has a much smaller impact (about 12% for ξ = 4) on $\tilde{b}_{bp}$ in the presence of suspended minerals, whose increased light scattering overwhelms that of phytoplankton.
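For reference, the two quantities named in this abstract are conventionally written as follows. The abstract itself gives no formulas, so the symbols $b_{bp}$ (particulate backscattering coefficient), $b_p$ (particulate scattering coefficient), $N(D)$ (number of particles per unit volume and unit diameter) and $D$ (particle diameter) are introduced here as assumptions based on standard ocean-optics usage.

```latex
\tilde{b}_{bp} = \frac{b_{bp}}{b_{p}}, \qquad N(D) \propto D^{-\xi}
```

Under this power-law form, a larger slope ξ means that small particles are relatively more abundant, which is consistent with the abstract's statement that the effect of cell heterogeneity depends strongly on ξ.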

    Effect of inherent optical properties variability on the chlorophyll retrieval from ocean color remote sensing: an in situ approach

    The impact of the variability of inherent optical properties (IOP) on the retrieval of chlorophyll, Chl, by ocean color remote sensing algorithms is analyzed from an in situ data set covering a large dynamic range. The effect of the variability of the specific phytoplankton absorption coefficient, a(phy)/Chl, the specific particulate backscattering coefficient, b(bp)/Chl, and the ratio of colored detrital matter absorption to non-water absorption, a(cdm)/a(nw), on the performance of standard operational algorithms is examined. This study confirms that empirical algorithms are highly dependent on the specific IOP values (especially b(bp)/Chl and a(cdm)/a(nw)): Chl is over-estimated in waters with specific IOP values higher than average, and vice versa. These results clearly indicate the necessity of accounting for the influence of specific IOP variability in Chl retrieval algorithms.