101 research outputs found

    Cloud-based Bioinformatics Framework for Next-Generation Sequencing Data

    Huang L. Cloud-based Bioinformatics Framework for Next-Generation Sequencing Data. Bielefeld: Universität Bielefeld; 2019. The increasing amount of next-generation sequencing data poses a fundamental challenge for large-scale genomic analytics. Storing and processing large amounts of sequencing data requires considerable hardware resources and efficient software that can fully utilize these resources. Nowadays, both industrial enterprises and nonprofit institutes provide robust and easily accessible cloud services for studies in life science. To facilitate genomic data analyses on such powerful computing resources, distributed bioinformatics tools are needed. However, most existing tools scale poorly on the distributed computing cloud. Thus, in this thesis, I developed a cloud-based bioinformatics framework that mainly addresses two computational challenges: (i) the run-time-intensive sequence mapping process and (ii) the memory-intensive de novo genome assembly process. For sequence mapping, I have natively implemented an Apache Spark based distributed sequence mapping tool called Sparkhit. It uses the q-gram filter and the pigeonhole principle to accelerate fragment recruitment and short read mapping. These algorithms are implemented in the Spark extended MapReduce model. Sparkhit runs 92–157 times faster than MetaSpark on metagenomic fragment recruitment and 18–32 times faster than Crossbow on data pre-processing. For de novo genome assembly, I have invented a new data structure called Reflexible Distributed K-mer (RDK) and natively implemented a distributed genome assembler called Reflexiv. Reflexiv is built on top of the Apache Spark platform, uses Spark Resilient Distributed Datasets (RDDs) to distribute large amounts of k-mers across the cluster, and assembles the genome recursively. As a result, Reflexiv runs 8–17 times faster than the Ray assembler and 5–18 times faster than the ABySS assembler on the clusters deployed at the de.NBI cloud. In addition, I have incorporated a variety of analytical methods into the framework. I have also developed a tool wrapper to distribute external tools and Docker containers on the Spark cluster. As a large-scale genomic use case, my framework processed 100 terabytes of data across four genomic projects on the Amazon cloud in 21 hours. Furthermore, the application to the entire Human Microbiome Project shotgun sequencing data was completed in 2 hours, presenting an approach to easily associate large amounts of public datasets with reference data. Thus, my work contributes to the interdisciplinary research of life science and distributed cloud computing by improving existing methods with a new data structure, new algorithms, and robust distributed implementations.
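    The q-gram filter and pigeonhole principle named in the abstract can be illustrated with a small single-machine sketch. It is not Sparkhit's implementation: the function names (`build_qgram_index`, `pigeonhole_candidates`, `verify`, `map_read`) are hypothetical, only substitutions are considered (no indels), and nothing here is distributed. The idea is that a read matching with at most k mismatches, when cut into k + 1 disjoint seeds, must contain at least one seed that matches exactly, so only positions hit by an exact seed need full verification.

```python
from collections import defaultdict


def build_qgram_index(reference, q):
    """Map every q-gram of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - q + 1):
        index[reference[i:i + q]].append(i)
    return index


def pigeonhole_candidates(read, index, q, max_mismatches):
    """Collect candidate start positions using k + 1 disjoint exact seeds."""
    segments = max_mismatches + 1
    if segments * q > len(read):
        raise ValueError("read too short for k + 1 disjoint seeds of length q")
    candidates = set()
    for s in range(segments):
        offset = s * q
        seed = read[offset:offset + q]
        for pos in index.get(seed, ()):
            start = pos - offset  # where the whole read would begin
            if start >= 0:
                candidates.add(start)
    return candidates


def verify(read, reference, start, max_mismatches):
    """Count substitutions between the read and the reference window."""
    window = reference[start:start + len(read)]
    if len(window) < len(read):
        return False
    mismatches = sum(a != b for a, b in zip(read, window))
    return mismatches <= max_mismatches


def map_read(read, reference, q, max_mismatches):
    """Filter with exact seeds first, then verify only the survivors."""
    index = build_qgram_index(reference, q)
    hits = pigeonhole_candidates(read, index, q, max_mismatches)
    return sorted(p for p in hits if verify(read, reference, p, max_mismatches))
```

    In a real mapper the index would be built once and shared across reads; rebuilding it inside `map_read` only keeps the sketch short.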

    Molecular conversions of cyclic sulfur compounds utilizing the features of sulfur

    Thesis--University of Tsukuba, D.Sc.(A), no. 649, 1989. 3. 2

    The Devil of Face Recognition is in the Noise

    The growing scale of face recognition datasets empowers us to train strong convolutional networks for face recognition. While a variety of architectures and loss functions have been devised, we still have a limited understanding of the source and consequence of label noise inherent in existing datasets. We make the following contributions: 1) We contribute cleaned subsets of popular face databases, i.e., MegaFace and MS-Celeb-1M datasets, and build a new large-scale noise-controlled IMDb-Face dataset. 2) With the original datasets and cleaned subsets, we profile and analyze label noise properties of MegaFace and MS-Celeb-1M. We show that a few orders of magnitude more samples are needed to achieve the same accuracy yielded by a clean subset. 3) We study the association between different types of noise, i.e., label flips and outliers, and the accuracy of face recognition models. 4) We investigate ways to improve data cleanliness, including a comprehensive user study on the influence of data labeling strategies on annotation accuracy. The IMDb-Face dataset has been released on https://github.com/fwang91/IMDb-Face. Comment: accepted to ECCV'1

    The novel oligopeptide utilizing species Anaeropeptidivorans aminofermentans M3/9T, its role in anaerobic digestion and occurrence as deduced from large-scale fragment recruitment analyses

    Research on biogas-producing microbial communities aims to elucidate correlations and dependencies between the anaerobic digestion (AD) process and the corresponding microbiome composition in order to optimize the performance of the process and the biogas output. Previously, Lachnospiraceae species were frequently detected in mesophilic to moderately thermophilic biogas reactors. To analyze adaptive genome features of a representative Lachnospiraceae strain, Anaeropeptidivorans aminofermentans M3/9T was isolated from a mesophilic laboratory-scale biogas plant and its genome was sequenced and analyzed in detail. Strain M3/9T possesses a number of genes encoding enzymes for the degradation of proteins, oligopeptides and dipeptides. Moreover, genes encoding enzymes participating in the fermentation of amino acids released from peptide hydrolysis were also identified. Based on further findings obtained from metabolic pathway reconstruction, M3/9T was predicted to participate in acidogenesis within the AD process. To understand the genomic diversity between the biogas isolate M3/9T and closely related Anaerotignum type strains, genome sequence comparisons were performed. M3/9T harbors 1,693 strain-specific genes, encoding, among others, different peptidases, a phosphotransferase system (PTS) for sugar uptake, and proteins involved in extracellular solute binding and import, sporulation and flagellar biosynthesis. In order to determine the occurrence of M3/9T in other environments, large-scale fragment recruitments with the M3/9T genome as a template and publicly available metagenomes representing different environments were performed. The strain was detected in the intestine of mammals, being most abundant in goat feces, which are occasionally used as a substrate for biogas production. Peer reviewed.

    Simultaneous Metabarcoding and Quantification of Neocallimastigomycetes from Environmental Samples: Insights into Community Composition and Novel Lineages

    Anaerobic fungi from the herbivore digestive tract (Neocallimastigomycetes) are primary lignocellulose modifiers and hold promise for biotechnological applications. Their molecular detection is currently difficult due to the non-specificity of published primer pairs, which impairs evolutionary and ecological research with environmental samples. We developed and validated a Neocallimastigomycetes-specific PCR primer pair targeting the D2 region of the ribosomal large subunit suitable for screening, quantifying, and sequencing. We evaluated this primer pair in silico on sequences from all known genera, in vitro with pure cultures covering 16 of the 20 known genera, and on environmental samples with highly diverse microbiomes. The amplified region allowed phylogenetic differentiation of all known genera and most species. The amplicon is about 350 bp long, suitable for short-read high-throughput sequencing as well as qPCR assays. Sequencing of herbivore fecal samples verified the specificity of the primer pair and recovered highly diverse and so far unknown anaerobic gut fungal taxa. As the chosen barcoding region can be easily aligned and is taxonomically informative, the sequences can be used for classification and phylogenetic inferences. Several new Neocallimastigomycetes clades were obtained, some of which represent putative novel lineages such as a clade from feces of the rodent Dolichotis patagonum (mara)
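    The in silico evaluation mentioned in the abstract can be pictured with a minimal sketch: locate the forward primer on a template sequence, locate the reverse complement of the reverse primer downstream of it, and report the enclosed amplicon. This is an illustration only, not the authors' pipeline. The function names and the toy sequences are hypothetical, and the sketch assumes exact matches, whereas practical tools usually tolerate mismatches and degenerate bases.

```python
def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def in_silico_amplicon(template, forward, reverse):
    """Return the amplicon bounded by both primers, or None if either is absent."""
    start = template.find(forward)
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so its reverse
    # complement is what appears on the template strand.
    reverse_site = reverse_complement(reverse)
    end = template.find(reverse_site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(reverse_site)]


# Toy sequences, not real primers.
amplicon = in_silico_amplicon("TTTACGTGGGGGGCCATAAA", "ACGT", "ATGG")
print(amplicon, len(amplicon))
```

    Running such a check over reference sequences from each genus would show which taxa the pair is expected to amplify and whether the product length stays near the intended size.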
