525 research outputs found

    Nat Methods

    R01 OD010929/ODCDC CDC HHS/Office of the Director/United States; 2020-02-01T00:00:00Z; 31217594; PMC6669099; 6525; vault:3358

    Investigating the biological relevance in trained embedding representations of protein sequences

    As genome sequencing becomes faster and cheaper, an abundance of DNA and protein sequence data is available. However, experimental annotation of structural or functional information develops at a much slower pace. Therefore, machine learning techniques have been widely adopted to make accurate predictions on unseen sequence data. In recent years, deep learning has been gaining popularity, as it allows for effective end-to-end learning. One consideration for its application to sequence data is the choice of a suitable and effective sequence representation strategy. In this paper, we investigate the significance of three common encoding schemes for the multi-label prediction problem of Gene Ontology (GO) term annotation, namely a one-hot encoding, an ad-hoc trainable embedding, and pre-trained protein vectors, using different hyper-parameters. We found that traditional unigram one-hot encodings achieved very good results, only slightly outperformed by unigram ad-hoc trainable embeddings and bigram pre-trained embeddings (by at most 3% for the F_max score), suggesting that exploring different encoding strategies is potentially beneficial. Most interestingly, when analyzing and visualizing the trained embeddings, we found that biologically relevant (dis)similarities between amino acid n-grams were implicitly learned, which were consistent with their physicochemical properties.
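The unigram and bigram one-hot encodings mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 20-letter alphabet, the use of overlapping n-grams, and the all-zero row for non-standard residues are all assumptions made for illustration.

```python
from itertools import product

# Assumption: only the 20 standard amino acids are in the vocabulary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_vocab(n):
    """Map every possible amino acid n-gram to an integer index."""
    return {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=n))}


def one_hot_ngrams(sequence, n=1):
    """One-hot encode the overlapping n-grams of a protein sequence.

    Returns one row per n-gram, each of length 20**n. An n-gram that
    contains a non-standard residue yields an all-zero row (an
    illustrative choice, not taken from the paper).
    """
    vocab = build_vocab(n)
    rows = []
    for start in range(len(sequence) - n + 1):
        row = [0] * len(vocab)
        idx = vocab.get(sequence[start:start + n])
        if idx is not None:
            row[idx] = 1
        rows.append(row)
    return rows


unigrams = one_hot_ngrams("ACD", n=1)  # 3 rows of length 20
bigrams = one_hot_ngrams("ACD", n=2)   # 2 rows of length 400
```

A trainable embedding would replace each one-hot row with a learned dense vector looked up by the same n-gram index, which is why the vocabulary is built separately from the encoding step.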

    A novel family of bacterial sialic acid binding adhesins?

    Abstract of Distinction at Nationwide Children's Annual Research Retreat. Third place in Biomedical Sciences at the Denman Undergraduate Research Forum. Platelet binding is a critical step in the development of infective endocarditis (IE), an infection of the endocardium. However, the mechanisms used by IE-causing species to bind platelets remain understudied. Subacute IE is associated with previously damaged heart valves and is often caused by the bacterial species Streptococcus oralis. S. oralis binds sialic acid, a host carbohydrate found on the surface of platelets. A novel sialic acid-binding adhesin, AsaA, was identified in S. oralis. We identified orthologs of AsaA in two other bacterial species that cause IE, Gemella haemolysans and Granulicatella elegans. Our hypothesis is that AsaA mediates adherence of multiple species. G. haemolysans was selected for further study; its AsaA ortholog shares 62% predicted amino acid identity with the non-repeat region of AsaA from S. oralis. G. haemolysans has rarely been studied, so we began by optimizing growth conditions. Binding of the species to platelets was consistently low, which prevented direct assessment of the role of G. haemolysans AsaA in adhesion. Given the same sialic acid-binding specificities, the adherence of S. oralis to platelets can be competitively inhibited by recombinant proteins that bind terminal sialic acid. Hence, we recombinantly expressed the binding region of AsaA from G. haemolysans (AsaA_NRGh) and observed the impact of the protein on the adherence of S. oralis to platelets. AsaA_NRGh competitively inhibited adhesion of S. oralis to platelets in an AsaA-dependent manner. This finding supports the hypothesis that G. haemolysans AsaA acts as an adhesin. If AsaA is a conserved mechanism of adhesion, a single treatment or preventative measure may target multiple IE-causing species. AHA Grant to SJK. No embargo. Academic Major: Microbiolog

    On generative models of T-cell receptor sequences

    T-cell receptors (TCRs) are key proteins of the adaptive immune system, generated randomly in each individual, whose diversity underlies our ability to recognize infections and malignancies. Modeling the distribution of TCR sequences is of key importance for immunology and medical applications. Here, we compare two inference methods trained on high-throughput sequencing data: a knowledge-guided approach, which accounts for the details of sequence generation, supplemented by a physics-inspired model of selection; and a knowledge-free Variational Auto-Encoder based on deep artificial neural networks. We show that the knowledge-guided model outperforms the deep network approach at predicting TCR probabilities, while being more interpretable and computationally cheaper.

    Fatty acid bioconversion in harpacticoid copepods in a changing environment : a transcriptomic approach

    By 2100, global warming is predicted to significantly reduce the capacity of marine primary producers for long-chain polyunsaturated fatty acid (LC-PUFA) synthesis. Primary consumers such as harpacticoid copepods (Crustacea) might mitigate the resulting adverse effects on the food web by increased LC-PUFA bioconversion. Here, we present a high-quality de novo transcriptome assembly of the copepod Platychelipus littoralis, exposed to changes in both temperature (+3 °C) and dietary LC-PUFA availability. Using this transcriptome, we detected multiple transcripts putatively coding for LC-PUFA-bioconverting front-end fatty acid (FA) desaturases and elongases, and performed phylogenetic analyses to identify their relationship with sequences of other (crustacean) taxa. While temperature affected the absolute FA concentrations in copepods, LC-PUFA levels remained unaltered even when copepods were fed an LC-PUFA-deficient diet. While this suggests plasticity of LC-PUFA bioconversion within P. littoralis, none of the putative front-end desaturase or elongase transcripts was differentially expressed under the applied treatments. Nevertheless, the transcriptome presented here provides a sound basis for future ecophysiological research on harpacticoid copepods. This article is part of the theme issue 'The next horizons for lipids as 'trophic biomarkers': evidence and significance of consumer modification of dietary fatty acids'.