    Quality Matters: Biocuration Experts on the Impact of Duplication and Other Data Quality Issues in Biological Databases

    Biological databases represent an extraordinary collective volume of work. Diligently built up over decades and comprising many millions of contributions from the biomedical research community, biological databases provide worldwide access to a massive number of records (also known as entries) [1]. Starting from individual laboratories, genomes are sequenced, assembled, annotated, and ultimately submitted to primary nucleotide databases such as GenBank [2], European Nucleotide Archive (ENA) [3], and DNA Data Bank of Japan (DDBJ) [4] (collectively known as the International Nucleotide Sequence Database Collaboration, INSDC). Protein records, which are the translations of these nucleotide records, are deposited into central protein databases such as the UniProt KnowledgeBase (UniProtKB) [5] and the Protein Data Bank (PDB) [6]. Sequence records are further accumulated into different databases for more specialized purposes: RFam [7] and PFam [8] for RNA and protein families, respectively; DictyBase [9] and PomBase [10] for model organisms; as well as ArrayExpress [11] and Gene Expression Omnibus (GEO) [12] for gene expression profiles. These databases are selected as examples; the list is not intended to be exhaustive. However, they are representative of biological databases that have been named in the “golden set” of the 24th Nucleic Acids Research database issue (in 2016). The introduction of that issue highlights the databases that “consistently served as authoritative, comprehensive, and convenient data resources widely used by the entire community and offer some lessons on what makes a successful database” [13]. In addition, the associated information about sequences is also propagated into non-sequence databases, such as PubMed (https://www.ncbi.nlm.nih.gov/pubmed/) for scientific literature or Gene Ontology (GO) [14] for function annotations. 
These databases in turn benefit individual studies, many of which use these publicly available records as the basis for their own research.

    Patterns of genomic differentiation between two Lake Victoria cichlid species, Haplochromis pyrrhocephalus and H. sp. ‘macula’

    Background: The molecular basis of the incipient stage of speciation is still poorly understood. Cichlid fish species in Lake Victoria are a prime example of recent speciation events and a suitable system to study the adaptation and reproductive isolation of species. Results: Here, we report the pattern of genomic differentiation between two Lake Victoria cichlid species collected in sympatry, Haplochromis pyrrhocephalus and H. sp. ‘macula,’ based on the pooled genome sequences of 20 individuals of each species. Despite their ecological differences, population genomics analyses demonstrate that the two species are very close to a single panmictic population due to extensive gene flow. However, we identified 21 highly differentiated short genomic regions with fixed nucleotide differences. At least 15 of these regions contained genes with predicted roles in adaptation and reproductive isolation, such as visual adaptation, circadian clock, developmental processes, adaptation to hypoxia, and sexual selection. The nonsynonymous fixed differences in one of these genes, LWS, were reported as substitutions causing a shift in the absorption spectra of LWS pigments. Fixed differences were found in the promoter regions of four other differentially expressed genes, indicating that these substitutions may alter gene expression levels. Conclusions: These diverged short genomic regions may have contributed to the differentiation of two ecologically different species. Moreover, the origins of adaptive variants within the differentiated regions predate the geological formation of Lake Victoria; thus, Lake Victoria cichlid species diversified via selection on standing genetic variation.