201 research outputs found

    XenDB: Full length cDNA prediction and cross species mapping in Xenopus laevis

    Get PDF
    BACKGROUND: Research using the model system Xenopus laevis has provided critical insights into the mechanisms of early vertebrate development and cell biology. Large scale sequencing efforts have provided an increasingly important resource for researchers. To provide full advantage of the available sequence, we have analyzed 350,468 Xenopus laevis Expressed Sequence Tags (ESTs) both to identify full length protein encoding sequences and to develop a unique database system to support comparative approaches between X. laevis and other model systems. DESCRIPTION: Using a suffix array based clustering approach, we have identified 25,971 clusters and 40,877 singleton sequences. Generation of a consensus sequence for each cluster resulted in 31,353 tentative contig and 4,801 singleton sequences. Using both BLASTX and FASTY comparison to five model organisms and the NR protein database, more than 15,000 sequences are predicted to encode full length proteins and these have been matched to publicly available IMAGE clones when available. Each sequence has been compared to the KOG database and ~67% of the sequences have been assigned a putative functional category. Based on sequence homology to mouse and human, putative GO annotations have been determined. CONCLUSION: The results of the analysis have been stored in a publicly available database XenDB . A unique capability of the database is the ability to batch upload cross species queries to identify potential Xenopus homologues and their associated full length clones. Examples are provided including mapping of microarray results and application of 'in silico' analysis. The ability to quickly translate the results of various species into 'Xenopus-centric' information should greatly enhance comparative embryological approaches. Supplementary material can be found at

    MeCorS: Metagenome-enabled error correction of single cell sequencing reads

    Get PDF
    Bremges A, Singer E, Woyke T, Sczyrba A. MeCorS: Metagenome-enabled error correction of single cell sequencing reads. Bioinformatics. 2016;32(14):2199-2201.UNLABELLED: We present a new tool, MeCorS, to correct chimeric reads and sequencing errors in Illumina data generated from single amplified genomes (SAGs). It uses sequence information derived from accompanying metagenome sequencing to accurately correct errors in SAG reads, even from ultra-low coverage regions. In evaluations on real data, we show that MeCorS outperforms BayesHammer, the most widely used state-of-the-art approach. MeCorS performs particularly well in correcting chimeric reads, which greatly improves both accuracy and contiguity of de novo SAG assemblies. AVAILABILITY AND IMPLEMENTATION: https://github.com/metagenomics/MeCorS CONTACT: [email protected] SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. The Author 2016. Published by Oxford University Press

    IsoSVM – Distinguishing isoforms and paralogs on the protein level

    Get PDF
    BACKGROUND: Recent progress in cDNA and EST sequencing is yielding a deluge of sequence data. Like database search results and proteome databases, this data gives rise to inferred protein sequences without ready access to the underlying genomic data. Analysis of this information (e.g. for EST clustering or phylogenetic reconstruction from proteome data) is hampered because it is not known if two protein sequences are isoforms (splice variants) or not (i.e. paralogs/orthologs). However, even without knowing the intron/exon structure, visual analysis of the pattern of similarity across the alignment of the two protein sequences is usually helpful since paralogs and orthologs feature substitutions with respect to each other, as opposed to isoforms, which do not. RESULTS: The IsoSVM tool introduces an automated approach to identifying isoforms on the protein level using a support vector machine (SVM) classifier. Based on three specific features used as input of the SVM classifier, it is possible to automatically identify isoforms with little effort and with an accuracy of more than 97%. We show that the SVM is superior to a radial basis function network and to a linear classifier. As an example application we use IsoSVM to estimate that a set of Xenopus laevis EST clusters consists of approximately 81% cases where sequences are each other's paralogs and 19% cases where sequences are each other's isoforms. The number of isoforms and paralogs in this allotetraploid species is of interest in the study of evolution. CONCLUSION: We developed an SVM classifier that can be used to distinguish isoforms from paralogs with high accuracy and without access to the genomic data. It can be used to analyze, for example, EST data and database search results. Our software is freely available on the Web, under the name IsoSVM

    Bioboxes: standardised containers for interchangeable bioinformatics software

    Get PDF
    Belmann P, Dröge J, Bremges A, McHardy AC, Sczyrba A, Barton MD. Bioboxes: standardised containers for interchangeable bioinformatics software. GigaScience. 2015;4(1): 47.Software is now both central and essential to modern biology, yet lack of availability, difficult installations, and complex user interfaces make software hard to obtain and use. Containerisation, as exemplified by the Docker platform, has the potential to solve the problems associated with sharing software. We propose bioboxes: containers with standardised interfaces to make bioinformatics software interchangeable

    Single-cell genomics reveal low recombination frequencies in freshwater bacteria of the SAR11 clade

    Get PDF
    Zaremba-Niedzwiedzka K, Viklund J, Zhao W, et al. Single-cell genomics reveal low recombination frequencies in freshwater bacteria of the SAR11 clade. Genome biology. 2013;14(11): R130.BACKGROUND: The SAR11 group of Alphaproteobacteria is highly abundant in the oceans. It contains a recently diverged freshwater clade, which offers the opportunity to compare adaptations to salt- and freshwaters in a monophyletic bacterial group. However, there are no cultivated members of the freshwater SAR11 group and no genomes have been sequenced yet. RESULTS: We isolated ten single SAR11 cells from three freshwater lakes and sequenced and assembled their genomes. A phylogeny based on 57 proteins indicates that the cells are organized into distinct microclusters. We show that the freshwater genomes have evolved primarily by the accumulation of nucleotide substitutions and that they have among the lowest ratio of recombination to mutation estimated for bacteria. In contrast, members of the marine SAR11 clade have one of the highest ratios. Additional metagenome reads from six lakes confirm low recombination frequencies for the genome overall and reveal lake-specific variations in microcluster abundances. We identify hypervariable regions with gene contents broadly similar to those in the hypervariable regions of the marine isolates, containing genes putatively coding for cell surface molecules. CONCLUSIONS: We conclude that recombination rates differ dramatically in phylogenetic sister groups of the SAR11 clade adapted to freshwater and marine ecosystems. The results suggest that the transition from marine to freshwater systems has purged diversity and resulted in reduced opportunities for recombination with divergent members of the clade. The low recombination frequencies of the LD12 clade resemble the low genetic divergence of host-restricted pathogens that have recently shifted to a new host

    The novel oligopeptide utilizing species Anaeropeptidivorans aminofermentans M3/9T, its role in anaerobic digestion and occurrence as deduced from large-scale fragment recruitment analyses

    Get PDF
    Research on biogas-producing microbial communities aims at elucidation of correlations and dependencies between the anaerobic digestion (AD) process and the corresponding microbiome composition in order to optimize the performance of the process and the biogas output. Previously, Lachnospiraceae species were frequently detected in mesophilic to moderately thermophilic biogas reactors. To analyze adaptive genome features of a representative Lachnospiraceae strain, Anaeropeptidivorans aminofermentans M3/9T was isolated from a mesophilic laboratory-scale biogas plant and its genome was sequenced and analyzed in detail. Strain M3/9T possesses a number of genes encoding enzymes for degradation of proteins, oligo- and dipeptides. Moreover, genes encoding enzymes participating in fermentation of amino acids released from peptide hydrolysis were also identified. Based on further findings obtained from metabolic pathway reconstruction, M3/9T was predicted to participate in acidogenesis within the AD process. To understand the genomic diversity between the biogas isolate M3/9T and closely related Anaerotignum type strains, genome sequence comparisons were performed. M3/9T harbors 1,693 strain-specific genes among others encoding different peptidases, a phosphotransferase system (PTS) for sugar uptake, but also proteins involved in extracellular solute binding and import, sporulation and flagellar biosynthesis. In order to determine the occurrence of M3/9T in other environments, large-scale fragment recruitments with the M3/9T genome as a template and publicly available metagenomes representing different environments was performed. The strain was detected in the intestine of mammals, being most abundant in goat feces, occasionally used as a substrate for biogas production.Peer Reviewe

    Characterization of Bathyarchaeota genomes assembled from metagenomes of biofilms residing in mesophilic and thermophilic biogas reactors

    Get PDF
    Maus I, Rumming M, Bergmann I, et al. Characterization of Bathyarchaeota genomes assembled from metagenomes of biofilms residing in mesophilic and thermophilic biogas reactors. Biotechnology for Biofuels. 2018;11(1): 167

    Common ELIXIR Service for Researcher Authentication and Authorisation

    Get PDF
    Linden M, Prochazka M, Lappalainen I, et al. Common ELIXIR Service for Researcher Authentication and Authorisation. F1000Research. 2018;7: 1199.A common Authentication and Authorisation Infrastructure (AAI) that would allow single sign-on to services has been identified as a key enabler for European bioinformatics. ELIXIR AAI is an ELIXIR service portfolio for authenticating researchers to ELIXIR services and assisting these services on user privileges during research usage. It relieves the scientific service providers from managing the user identities and authorisation themselves, enables the researcher to have a single set of credentials to all ELIXIR services and supports meeting the requirements imposed by the data protection laws. ELIXIR AAI was launched in late 2016 and is part of the ELIXIR Compute platform portfolio. By the end of 2017 the number of users reached 1000, while the number of relying scientific services was 36. This paper presents the requirements and design of the ELIXIR AAI and the policies related to its use, and how it can be used for serving some example services, such as document management, social media, data discovery, human data access, cloud compute and training services

    acdc – Automated Contamination Detection and Confidence estimation for single-cell genome data

    Get PDF
    Lux M, Krüger J, Rinke C, et al. acdc – Automated Contamination Detection and Confidence estimation for single-cell genome data. BMC Bioinformatics. 2016;17(1): 543.Background A major obstacle in single-cell sequencing is sample contamination with foreign DNA. To guarantee clean genome assemblies and to prevent the introduction of contamination into public databases, considerable quality control efforts are put into post-sequencing analysis. Contamination screening generally relies on reference-based methods such as database alignment or marker gene search, which limits the set of detectable contaminants to organisms with closely related reference species. As genomic coverage in the tree of life is highly fragmented, there is an urgent need for a reference-free methodology for contaminant identification in sequence data. Results We present acdc, a tool specifically developed to aid the quality control process of genomic sequence data. By combining supervised and unsupervised methods, it reliably detects both known and de novo contaminants. First, 16S rRNA gene prediction and the inclusion of ultrafast exact alignment techniques allow sequence classification using existing knowledge from databases. Second, reference-free inspection is enabled by the use of state-of-the-art machine learning techniques that include fast, non-linear dimensionality reduction of oligonucleotide signatures and subsequent clustering algorithms that automatically estimate the number of clusters. The latter also enables the removal of any contaminant, yielding a clean sample. Furthermore, given the data complexity and the ill-posedness of clustering, acdc employs bootstrapping techniques to provide statistically profound confidence values. Tested on a large number of samples from diverse sequencing projects, our software is able to quickly and accurately identify contamination. Results are displayed in an interactive user interface. Acdc can be run from the web as well as a dedicated command line application, which allows easy integration into large sequencing project analysis workflows. Conclusions Acdc can reliably detect contamination in single-cell genome data. In addition to database-driven detection, it complements existing tools by its unsupervised techniques, which allow for the detection of de novo contaminants. Our contribution has the potential to drastically reduce the amount of resources put into these processes, particularly in the context of limited availability of reference species. As single-cell genome data continues to grow rapidly, acdc adds to the toolkit of crucial quality assurance tools
    • …
    corecore