59 research outputs found

    Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets

    BACKGROUND: The scale and diversity of metagenomic sequencing projects challenge both our technical and conceptual approaches in gene and genome annotations. The recent Sorcerer II Global Ocean Sampling (GOS) expedition yielded millions of predicted protein sequences, which significantly altered the landscape of known protein space by more than doubling its size and adding thousands of new families (Yooseph et al., 2007 PLoS Biol 5, e16). Such datasets, not only by their sheer size, but also by many other features, defy conventional analysis and annotation methods. METHODOLOGY/PRINCIPAL FINDINGS: In this study, we describe an approach for rapid analysis of the sequence diversity and the internal structure of such very large datasets by advanced clustering strategies using the newly modified CD-HIT algorithm. We performed a hierarchical clustering analysis on the 17.4 million Open Reading Frames (ORFs) identified from the GOS study and found over 33 thousand large predicted protein clusters comprising nearly 6 million sequences. Twenty percent of these clusters did not match known protein families by sequence similarity search and might represent novel protein families. Distributions of the large clusters were illustrated on organism composition, functional class, and sample locations. CONCLUSION/SIGNIFICANCE: Our clustering took about two orders of magnitude less computational effort than the similar protein family analysis of original GOS study. This approach will help to analyze other large metagenomic datasets in the future. A Web server with our clustering results and annotations of predicted protein clusters is available online at http://tools.camera.calit2.net/gos under the CAMERA project

    A population-based study of asthma, quality of life, and occupation among elderly Hispanic and non-Hispanic whites: a cross-sectional investigation

    BACKGROUND: The U.S. population is aging and is expected to double by the year 2030. The current study evaluated the prevalence of asthma and its correlates in the elderly Hispanic and non-Hispanic white population. METHODS: Data from a sample of 3021 Hispanics and non-Hispanic White subjects, 65 years and older, interviewed as part of an ongoing cross-sectional study of the elderly in west Texas, were analyzed. The outcome variable was categorized into: no asthma (reference category), current asthma, and probable asthma. Polytomous logistic regression analysis was used to assess the relationship between the outcome variable and various socio-demographic measures, self-rated health, asthma symptoms, quality of life measures (SF-12), and various occupations. RESULTS: The estimated prevalence of current asthma and probable asthma were 6.3% (95%CI: 5.3–7.2) and 9.0% (95%CI: 7.8–10.1) respectively. The majority of subjects with current asthma (Mean SF-12 score 35.8, 95%CI: 34.2–37.4) or probable asthma (35.3, 34.0–36.6) had significantly worse physical health-related quality of life as compared to subjects without asthma (42.6, 42.1–43.1). In multiple logistic regression analyses, women had a 1.64 times greater odds of current asthma (95%CI: 1.12–2.38) as compared to men. Hay fever was a strong predictor of both current and probable asthma. The odds of current asthma were 1.78 times (95%CI: 1.24–2.55) greater among past smokers; whereas the odds of probable asthma were 2.73 times (95%CI: 1.77–4.21) greater among current smokers as compared to non-smokers. Similarly fair/poor self rated health and complaints of severe pain were independently associated with current and probable asthma. The odds of current and probable asthma were almost two fold greater for obesity. When stratified by gender, the odds were significantly greater among females (p-value for interaction term = 0.038). 
The odds of current asthma were significantly greater for farm-related occupations (adjusted OR = 2.09, 95%CI: 1.00–4.39); whereas the odds were significantly lower among those who reported teaching as their longest held occupation (adjusted OR = 0.36, 95%CI = 0.18–0.74). CONCLUSION: This study found that asthma is a common medical condition in the elderly and it significantly impacts quality of life and general health status. Results support adopting an integrated approach in identifying and controlling asthma in this population

    Contribution of Exogenous Genetic Elements to the Group A Streptococcus Metagenome

    Variation in gene content among strains of a bacterial species contributes to biomedically relevant differences in phenotypes such as virulence and antimicrobial resistance. Group A Streptococcus (GAS) causes a diverse array of human infections and sequelae, and exhibits a complex pathogenic behavior. To enhance our understanding of genotype-phenotype relationships in this important pathogen, we determined the complete genome sequences of four GAS strains expressing M protein serotypes (M2, M4, and 2 M12) that commonly cause noninvasive and invasive infections. These sequences were compared with eight previously determined GAS genomes and regions of variably present gene content were assessed. Consistent with the previously determined genomes, each of the new genomes is ∼1.9 Mb in size, with ∼10% of the gene content of each encoded on variably present exogenous genetic elements. Like the other GAS genomes, these four genomes are polylysogenic and prophage encode the majority of the variably present gene content of each. In contrast to most of the previously determined genomes, multiple exogenous integrated conjugative elements (ICEs) with characteristics of conjugative transposons and plasmids are present in these new genomes. Cumulatively, 242 new GAS metagenome genes were identified that were not present in the previously sequenced genomes. Importantly, ICEs accounted for 41% of the new GAS metagenome gene content identified in these four genomes. Two large ICEs, designated 2096-RD.2 (63 kb) and 10750-RD.2 (49 kb), have multiple genes encoding resistance to antimicrobial agents, including tetracycline and erythromycin, respectively. Also resident on these ICEs are three genes encoding inferred extracellular proteins of unknown function, including a predicted cell surface protein that is only present in the genome of the serotype M12 strain cultured from a patient with acute poststreptococcal glomerulonephritis. 
The data provide new information about the GAS metagenome and will assist studies of pathogenesis, antimicrobial resistance, and population genomics

    Strengths and limitations of period estimation methods for circadian data

    A key step in the analysis of circadian data is to make an accurate estimate of the underlying period. There are many different techniques and algorithms for determining period, all with different assumptions and with differing levels of complexity. Choosing which algorithm, which implementation and which measures of accuracy to use can offer many pitfalls, especially for the non-expert. We have developed the BioDare system, an online service allowing data-sharing (including public dissemination), data-processing and analysis. Circadian experiments are the main focus of BioDare hence performing period analysis is a major feature of the system. Six methods have been incorporated into BioDare: Enright and Lomb-Scargle periodograms, FFT-NLLS, mFourfit, MESA and Spectrum Resampling. Here we review those six techniques, explain the principles behind each algorithm and evaluate their performance. In order to quantify the methods' accuracy, we examine the algorithms against artificial mathematical test signals and model-generated mRNA data. Our re-implementation of each method in Java allows meaningful comparisons of the computational complexity and computing time associated with each algorithm. Finally, we provide guidelines on which algorithms are most appropriate for which data types, and recommendations on experimental design to extract optimal data for analysis

    Genome Stability of Lyme Disease Spirochetes: Comparative Genomics of Borrelia burgdorferi Plasmids

    Lyme disease is the most common tick-borne human illness in North America. In order to understand the molecular pathogenesis, natural diversity, population structure and epizootic spread of the North American Lyme agent, Borrelia burgdorferi sensu stricto, a much better understanding of the natural diversity of its genome will be required. Towards this end we present a comparative analysis of the nucleotide sequences of the numerous plasmids of B. burgdorferi isolates B31, N40, JD1 and 297. These strains were chosen because they include the three most commonly studied laboratory strains, and because they represent different major genetic lineages and so are informative regarding the genetic diversity and evolution of this organism. A unique feature of Borrelia genomes is that they carry a large number of linear and circular plasmids, and this work shows that strains N40, JD1, 297 and B31 carry related but non-identical sets of 16, 20, 19 and 21 plasmids, respectively, that comprise 33–40% of their genomes. We deduce that there are at least 28 plasmid compatibility types among the four strains. The B. burgdorferi ∼900 Kbp linear chromosomes are evolutionarily exceptionally stable, except for a short ≤20 Kbp plasmid-like section at the right end. A few of the plasmids, including the linear lp54 and circular cp26, are also very stable. We show here that the other plasmids, especially the linear ones, are considerably more variable. Nearly all of the linear plasmids have undergone one or more substantial inter-plasmid rearrangements since their last common ancestor. In spite of these rearrangements and differences in plasmid contents, the overall gene complement of the different isolates has remained relatively constant

    The discovery, distribution, and evolution of viruses associated with Drosophila melanogaster

    Drosophila melanogaster is a valuable invertebrate model for viral infection and antiviral immunity, and is a focus for studies of insect-virus coevolution. Here we use a metagenomic approach to identify more than 20 previously undetected RNA viruses and a DNA virus associated with wild D. melanogaster. These viruses not only include distant relatives of known insect pathogens, but also novel groups of insect-infecting viruses. By sequencing virus-derived small RNAs we show that the viruses represent active infections of Drosophila. We find that the RNA viruses differ in the number and properties of their small RNAs, and we detect both siRNAs and a novel miRNA from the DNA virus. Analysis of small RNAs also allows us to identify putative viral sequences that lack detectable sequence similarity to known viruses. By surveying >2000 individually collected wild adult Drosophila we show that more than 30% of D. melanogaster carry a detectable virus, and more than 6% carry multiple viruses. However, despite a high prevalence of the Wolbachia endosymbiont, which is known to be protective against virus infections in Drosophila, we were unable to detect any relationship between the presence of Wolbachia and the presence of any virus. Using publicly available RNA-seq datasets we show that the community of viruses in Drosophila laboratories is very different from that seen in the wild, but that some of the newly discovered viruses are nevertheless widespread in laboratory lines and are ubiquitous in cell culture. By sequencing viruses from individual wild-collected flies we show that some viruses are shared between D. melanogaster and D. simulans. Our results provide an essential evolutionary and ecological context for host-virus interaction in Drosophila, and the newly reported viral sequences will help develop D. melanogaster further as a model for molecular and evolutionary virus research