14 research outputs found
ProDeGe: a computational protocol for fully automated decontamination of genomes
Single amplified genomes and genomes assembled from metagenomes have enabled the exploration of uncultured microorganisms at an unprecedented scale. However, both these types of products are plagued by contamination. Since these genomes are now being generated in a high-throughput manner and sequences from them are propagating into public databases to drive novel scientific discoveries, rigorous quality controls and decontamination protocols are urgently needed. Here, we present ProDeGe (Protocol for fully automated Decontamination of Genomes), the first computational protocol for fully automated decontamination of draft genomes. ProDeGe classifies sequences into two classes—clean and contaminant—using a combination of homology and feature-based methodologies. On average, 84% of sequence from the non-target organism is removed from the data set (specificity) and 84% of the sequence from the target organism is retained (sensitivity). The procedure operates successfully at a rate of ~0.30 CPU core hours per megabase of sequence and can be applied to any type of genome sequence
Rare and low-frequency coding variants alter human adult height
Height is a highly heritable, classic polygenic trait with ~700 common associated variants identified so far through genome - wide association studies . Here , we report 83 height - associated coding variants with lower minor allele frequenc ies ( range of 0.1 - 4.8% ) and effects of up to 2 16 cm /allele ( e.g. in IHH , STC2 , AR and CRISPLD2 ) , >10 times the average effect of common variants . In functional follow - up studies, rare height - increasing alleles of STC2 (+1 - 2 cm/allele) compromise d proteolytic inhibition of PAPP - A and increased cleavage of IGFBP - 4 in vitro , resulting in higher bioavailability of insulin - like growth factors . The se 83 height - associated variants overlap genes mutated in monogenic growth disorders and highlight new biological candidates ( e.g. ADAMTS3, IL11RA, NOX4 ) and pathways ( e.g . proteoglycan/ glycosaminoglycan synthesis ) involved in growth . Our results demonstrate that sufficiently large sample sizes can uncover rare and low - frequency variants of moderate to large effect associated with polygenic human phenotypes , and that these variants implicate relevant genes and pathways
Recommended from our members
Automated Single Cell Data Decontamination Pipeline
Recent technological advancements in single-cell genomics have encouraged the classification and functional assessment of microorganisms from a wide span of the biospheres phylogeny.1,2 Environmental processes of interest to the DOE, such as bioremediation and carbon cycling, can be elucidated through the genomic lens of these unculturable microbes. However, contamination can occur at various stages of the single-cell sequencing process. Contaminated data can lead to wasted time and effort on meaningless analyses, inaccurate or erroneous conclusions, and pollution of public databases. A fully automated decontamination tool is necessary to prevent these instances and increase the throughput of the single-cell sequencing proces
Clusters and superclusters of phased small RNAs in the developing inflorescence of rice
To address the role of small regulatory RNAs in rice development, we generated a large data set of small RNAs from mature leaves and developing roots, shoots, and inflorescences. Using a spatial clustering algorithm, we identified 36,780 genomic groups of small RNAs. Most consisted of 24-nt RNAs that are expressed in all four tissues and enriched in repeat regions of the genome; 1029 clusters were composed primarily of 21-nt small RNAs and, strikingly, 831 of these contained phased RNAs and were preferentially expressed in developing inflorescences. Thirty-eight of the 24-mer clusters were also phased and preferentially expressed in inflorescences. The phased 21-mer clusters derive from nonprotein coding, nonrepeat regions of the genome and are grouped together into superclusters containing 10–46 clusters. The majority of these 21-mer clusters (705/831) are flanked by a degenerate 22-nt motif that is offset by 12 nt from the main phase of the cluster. Small RNAs complementary to these flanking 22-nt motifs define a new miRNA family, which is conserved in maize and expressed in developing reproductive tissues in both plants. These results suggest that the biogenesis of phased inflorescence RNAs resembles that of tasiRNAs and raise the possibility that these novel small RNAs function in early reproductive development in rice and other monocots
Recommended from our members
The standard operating procedure of the DOE-JGI Metagenome Annotation Pipeline (MAP v.4).
The DOE-JGI Metagenome Annotation Pipeline (MAP v.4) performs structural and functional annotation for metagenomic sequences that are submitted to the Integrated Microbial Genomes with Microbiomes (IMG/M) system for comparative analysis. The pipeline runs on nucleotide sequences provided via the IMG submission site. Users must first define their analysis projects in GOLD and then submit the associated sequence datasets consisting of scaffolds/contigs with optional coverage information and/or unassembled reads in fasta and fastq file formats. The MAP processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNAs, as well as CRISPR elements. Structural annotation is followed by functional annotation including assignment of protein product names and connection to various protein family databases
The standard operating procedure of the DOE-JGI Metagenome Annotation Pipeline (MAP v.4).
The DOE-JGI Metagenome Annotation Pipeline (MAP v.4) performs structural and functional annotation for metagenomic sequences that are submitted to the Integrated Microbial Genomes with Microbiomes (IMG/M) system for comparative analysis. The pipeline runs on nucleotide sequences provided via the IMG submission site. Users must first define their analysis projects in GOLD and then submit the associated sequence datasets consisting of scaffolds/contigs with optional coverage information and/or unassembled reads in fasta and fastq file formats. The MAP processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNAs, as well as CRISPR elements. Structural annotation is followed by functional annotation including assignment of protein product names and connection to various protein family databases
IMG/M: integrated genome and metagenome comparative data analysis system.
The Integrated Microbial Genomes with Microbiome Samples (IMG/M: https://img.jgi.doe.gov/m/) system contains annotated DNA and RNA sequence data of (i) archaeal, bacterial, eukaryotic and viral genomes from cultured organisms, (ii) single cell genomes (SCG) and genomes from metagenomes (GFM) from uncultured archaea, bacteria and viruses and (iii) metagenomes from environmental, host associated and engineered microbiome samples. Sequence data are generated by DOE's Joint Genome Institute (JGI), submitted by individual scientists, or collected from public sequence data archives. Structural and functional annotation is carried out by JGI's genome and metagenome annotation pipelines. A variety of analytical and visualization tools provide support for examining and comparing IMG/M's datasets. IMG/M allows open access interactive analysis of publicly available datasets, while manual curation, submission and access to private datasets and computationally intensive workspace-based analysis require login/password access to its expert review (ER) companion system (IMG/M ER: https://img.jgi.doe.gov/mer/). Since the last report published in the 2014 NAR Database Issue, IMG/M's dataset content has tripled in terms of number of datasets and overall protein coding genes, while its analysis tools have been extended to cope with the rapid growth in the number and size of datasets handled by the system
IMG/M 4 version of the integrated metagenome comparative analysis system
IMG/M (http://img.jgi.doe.gov/m) provides support for comparative analysis of microbial community aggregate genomes (metagenomes) in the context of a comprehensive set of reference genomes from all three domains of life, as well as plasmids, viruses and genome fragments. IMG/M’s data content and analytical tools have expanded continuously since its first version was released in 2007. Since the last report published in the 2012 NAR Database Issue, IMG/M’s database architecture, annotation and data integration pipelines and analysis tools have been extended to copewith the rapid growth in the number and size of metagenome data sets handled by the system. IMG/M data marts provide support for the analysis of publicly available genomes, expert review of metagenome annotations (IMG/M ER: http://img.jgi.doe.gov/mer) and Human Microbiome Project (HMP)-specific metagenome samples (IMG/M HMP: http://img.jgi.doe.gov/imgm_hmp)