169 research outputs found

    Biocurators and Biocuration: surveying the 21st century challenges

    Curated databases are an integral part of the tool set that researchers use on a daily basis for their work. For most users, however, how databases are maintained, and by whom, is rather obscure. The International Society for Biocuration (ISB) represents biocurators, software engineers, developers and researchers with an interest in biocuration. Its goals include fostering communication between biocurators, promoting and describing their work, and highlighting the added value of biocuration to the world. The ISB recently conducted a survey of biocurators to better understand their educational and scientific backgrounds, their motivations for choosing a curatorial job and their career goals. The results are reported here. From the responses received, it is evident that biocuration is performed by highly trained scientists and perceived to be a stimulating career, offering both intellectual challenges and the satisfaction of performing work essential to the modern scientific community. It is also apparent that the ISB has at least a dual role to play to facilitate biocurators’ work: (i) to promote biocuration as a career within the greater scientific community; (ii) to aid the development of resources for biomedical research through promotion of nomenclature and data-sharing standards that will allow interconnection of biological databases and better exploit the pivotal contributions that biocurators are making.

    Data sharing and ontology use among agricultural genetics, genomics, and breeding databases and resources of the AgBioData Consortium

    Over the last several decades, there has been rapid growth in the number and scope of agricultural genetics, genomics and breeding (GGB) databases and resources. The AgBioData Consortium (https://www.agbiodata.org/) currently represents 44 databases and resources covering model or crop plant and animal GGB data, ontologies, pathways, genetic variation and breeding platforms (referred to as 'databases' throughout). One of the goals of the Consortium is to facilitate FAIR (Findable, Accessible, Interoperable, and Reusable) data management and the integration of datasets which requires data sharing, along with structured vocabularies and/or ontologies. Two AgBioData working groups, focused on Data Sharing and Ontologies, conducted a survey to assess the status and future needs of the members in those areas. A total of 33 researchers responded to the survey, representing 37 databases. Results suggest that data sharing practices by AgBioData databases are in a healthy state, but it is not clear whether this is true for all metadata and data types across all databases; and that ontology use has not substantially changed since a similar survey was conducted in 2017. We recommend 1) providing training for database personnel in specific data sharing techniques, as well as in ontology use; 2) further study on what metadata is shared, and how well it is shared among databases; 3) promoting an understanding of data sharing and ontologies in the stakeholder community; 4) improving data sharing and ontologies for specific phenotypic data types and formats; and 5) lowering specific barriers to data sharing and ontology use, by identifying sustainability solutions, and the identification, promotion, or development of data standards. Combined, these improvements are likely to help AgBioData databases increase development efforts towards improved ontology use, and data sharing via programmatic means.
    Comment: 17 pages, 8 figures

    MetWAMer: eukaryotic translation initiation site prediction

    Background: Translation initiation site (TIS) identification is an important aspect of the gene annotation process, requisite for the accurate delineation of protein sequences from transcript data. We have developed the MetWAMer package for TIS prediction in eukaryotic open reading frames of non-viral origin. MetWAMer can be used as a stand-alone, third-party tool for post-processing gene structure annotations generated by external computational programs and/or pipelines, or directly integrated into gene structure prediction software implementations. Results: MetWAMer currently implements five distinct methods for TIS prediction, the most accurate of which is a routine that combines weighted, signal-based translation initiation site scores and the contrast in coding potential of sequences flanking TISs using a perceptron. Also, our program implements clustering capabilities through use of the k-medoids algorithm, thereby enabling cluster-specific TIS parameter utilization. In practice, our static weight array matrix-based indexing method for parameter set lookup can be used with good results in data sets exhibiting moderate levels of 5'-complete coverage. Conclusion: We demonstrate that improvements in statistically-based models for TIS prediction can be achieved by taking the class of each potential start-methionine into account pending certain testing conditions, and that our perceptron-based model is suitable for the TIS identification task. MetWAMer represents a well-documented, extensible, and freely available software system that can be readily re-trained for differing target applications and/or extended with existing and novel TIS prediction methods, to support further research efforts in this area.

    Expressed sequence tag analysis of khat (Catha edulis) provides a putative molecular biochemical basis for the biosynthesis of phenylpropylamino alkaloids

    Khat (Catha edulis Forsk.) is a flowering perennial shrub cultivated for its neurostimulant properties resulting mainly from the occurrence of (S)-cathinone in young leaves. The biosynthesis of (S)-cathinone and the related phenylpropylamino alkaloids (1S,2S)-cathine and (1R,2S)-norephedrine is not well characterized in plants. We prepared a cDNA library from young khat leaves and sequenced 4,896 random clones, generating an expressed sequence tag (EST) library of 3,293 unigenes. Putative functions were assigned to > 98% of the ESTs, providing a key resource for gene discovery. Candidates potentially involved at various stages of phenylpropylamino alkaloid biosynthesis from L-phenylalanine to (1S,2S)-cathine were identified.

    Annotation of gene product function from high-throughput studies using the Gene Ontology

    High-throughput studies constitute an essential and valued source of information for researchers. However, high-throughput experimental workflows are often complex, with multiple data sets that may contain large numbers of false positives. The representation of high-throughput data in the Gene Ontology (GO) therefore presents a challenging annotation problem, when the overarching goal of GO curation is to provide the most precise view of a gene's role in biology. To address this, representatives from annotation teams within the GO Consortium reviewed high-throughput data annotation practices. We present an annotation framework for high-throughput studies that will facilitate good standards in GO curation and, through the use of new high-throughput evidence codes, increase the visibility of these annotations to the research community.

    EMF1 and PRC2 Cooperate to Repress Key Regulators of Arabidopsis Development

    EMBRYONIC FLOWER1 (EMF1) is a plant-specific gene crucial to Arabidopsis vegetative development. Loss-of-function mutants in the EMF1 gene mimic the phenotype caused by mutations in Polycomb Group protein (PcG) genes, which encode epigenetic repressors that regulate many aspects of eukaryotic development. In Arabidopsis, Polycomb Repressor Complex 2 (PRC2), made of PcG proteins, catalyzes trimethylation of lysine 27 on histone H3 (H3K27me3) and PRC1-like proteins catalyze H2AK119 ubiquitination. Despite functional similarity to PcG proteins, EMF1 lacks sequence homology with known PcG proteins; thus, its role in the PcG mechanism is unclear. To study the EMF1 functions and its mechanism of action, we performed genome-wide mapping of EMF1 binding and H3K27me3 modification sites in Arabidopsis seedlings. The EMF1 binding pattern is similar to that of H3K27me3 modification on the chromosomal and genic level. ChIPOTLe peak finding and clustering analyses both show that the highly trimethylated genes also have high enrichment levels of EMF1 binding, termed EMF1_K27 genes. EMF1 interacts with regulatory genes, which are silenced to allow vegetative growth, and with genes specifying cell fates during growth and differentiation. H3K27me3 marks not only these genes but also some genes that are involved in endosperm development and maternal effects. Transcriptome analysis, coupled with the H3K27me3 pattern, of EMF1_K27 genes in emf1 and PRC2 mutants showed that EMF1 represses gene activities via diverse mechanisms and plays a novel role in the PcG mechanism.

    Network Analysis Identifies ELF3 as a QTL for the Shade Avoidance Response in Arabidopsis

    Quantitative Trait Loci (QTL) analyses in immortal populations are a powerful method for exploring the genetic mechanisms that control interactions of organisms with their environment. However, QTL analyses frequently do not culminate in the identification of a causal gene due to the large chromosomal regions often underlying QTLs. A reasonable approach to inform the process of causal gene identification is to incorporate additional genome-wide information, which is becoming increasingly accessible. In this work, we perform QTL analysis of the shade avoidance response in the Bayreuth-0 (Bay-0, CS954) x Shahdara (Sha, CS929) recombinant inbred line population of Arabidopsis. We take advantage of the complex pleiotropic nature of this trait to perform network analysis using co-expression, eQTL and functional classification from publicly available datasets to help us find good candidate genes for our strongest QTL, SAR2. This novel network analysis detected EARLY FLOWERING 3 (ELF3; AT2G25930) as the most likely candidate gene affecting the shade avoidance response in our population. Further genetic and transgenic experiments confirmed ELF3 as the causative gene for SAR2. The Bay-0 and Sha alleles of ELF3 differentially regulate developmental time and circadian clock period length in Arabidopsis, and the extent of this regulation is dependent on the light environment. This is the first time that ELF3 has been implicated in the shade avoidance response and that different natural alleles of this gene are shown to have phenotypic effects. In summary, we show that development of networks to inform candidate gene identification for QTLs is a promising technique that can significantly accelerate the process of QTL cloning.

    Pyrosequencing of the Camptotheca acuminata transcriptome reveals putative genes involved in camptothecin biosynthesis and transport

    Background: Camptotheca acuminata is a Nyssaceae plant, often called the "happy tree", which is indigenous to Southern China. C. acuminata produces the terpenoid indole alkaloid camptothecin (CPT), which exhibits clinical effects in various cancer treatments. Despite its importance, little is known about the transcriptome of C. acuminata and the mechanism of CPT biosynthesis, as only a few nucleotide sequences are included in the GenBank database. Results: From a constructed cDNA library of young C. acuminata leaves, a total of 30,358 unigenes, with an average length of 403 bp, were obtained after assembly of 74,858 high quality reads using GS De Novo assembler software. Through functional annotation, a total of 21,213 unigenes were annotated at least once against the NCBI nucleotide (Nt), non-redundant protein (Nr), Uniprot/SwissProt, Kyoto Encyclopedia of Genes and Genomes (KEGG), and Arabidopsis thaliana proteome (TAIR) databases. Further analysis identified 521 ESTs representing 20 enzyme genes that are involved in the backbone of the CPT biosynthetic pathway in the library. Three putative genes in the upstream pathway, including genes for geraniol-10-hydroxylase (CaPG10H), secologanin synthase (CaPSCS), and strictosidine synthase (CaPSTR), were cloned and analyzed. The expression level of the three genes was also detected using qRT-PCR in C. acuminata. With respect to the branch pathway of CPT synthesis, six cytochrome P450 transcripts were selected as candidate transcripts by detection of transcript expression in different tissues using qRT-PCR. In addition, one glucosidase gene was identified that might participate in CPT biosynthesis. For CPT transport, three of 21 transcripts for multidrug resistance protein (MDR) transporters were also screened from the dataset by their annotation result and gene expression analysis. Conclusion: This study produced a large amount of transcriptome data from C. acuminata by 454 pyrosequencing. According to EST annotation, catalytic features prediction, and expression analysis, novel putative transcripts involved in CPT biosynthesis and transport were discovered in C. acuminata. This study will facilitate further identification of key enzymes and transporter genes in C. acuminata.

    Investigating the validity of current network analysis on static conglomerate networks by protein network stratification

    Background: A molecular network perspective forms the foundation of systems biology. A common practice in analyzing protein-protein interaction (PPI) networks is to perform network analysis on a conglomerate network that is an assembly of all available binary interactions in a given organism from diverse data sources. Recent studies on network dynamics suggested that this approach might have ignored the dynamic nature of context-dependent molecular systems. Results: In this study, we employed a network stratification strategy to investigate the validity of the current network analysis on conglomerate PPI networks. Using the genome-scale tissue- and condition-specific proteomics data in Arabidopsis thaliana, we present here the first systematic investigation into this question. We stratified a conglomerate A. thaliana PPI network into three levels of context-dependent subnetworks. We then focused on three types of most commonly conducted network analyses, i.e., topological, functional and modular analyses, and compared the results from these network analyses on the conglomerate network and five stratified context-dependent subnetworks corresponding to specific tissues. Conclusions: We found that the results based on the conglomerate PPI network are often significantly different from those of context-dependent subnetworks corresponding to specific tissues or conditions. This conclusion depends neither on relatively arbitrary cutoffs (such as those defining network hubs or bottlenecks), nor on specific network clustering algorithms for module extraction, nor on the possible high false positive rates of binary interactions in PPI networks. We also found that our conclusions are likely to be valid in human PPI networks. Furthermore, network stratification may help resolve many controversies in current research of systems biology.