
    Nephele: genotyping via complete composition vectors and MapReduce

    Background: Current sequencing technology makes it practical to sequence many samples of a given organism, raising new challenges for the processing and interpretation of large genomics data sets with associated metadata. Traditional computational phylogenetic methods are ideal for studying the evolution of gene/protein families and using those to infer the evolution of an organism, but are less than ideal for the study of the whole organism, mainly due to the presence of insertions/deletions/rearrangements. These methods provide the researcher with the ability to group a set of samples into distinct genotypic groups based on sequence similarity, which can then be associated with metadata, such as host information, pathogenicity, and time or location of occurrence. Genotyping is critical to understanding, at a genomic level, the origin and spread of infectious diseases. Increasingly, genotyping is coming into use for disease surveillance activities, as well as for microbial forensics. The classic genotyping approach has been based on phylogenetic analysis, starting with a multiple sequence alignment. Genotypes are then established by expert examination of phylogenetic trees. However, these traditional single-processor methods are suboptimal for the rapidly growing sequence datasets being generated by next-generation DNA sequencing machines, because their computational complexity increases quickly with the number of sequences.
    Results: Nephele is a suite of tools that uses the complete composition vector algorithm to represent each sequence in the dataset as a vector derived from its constituent k-mers, bypassing the need for multiple sequence alignment, and affinity propagation clustering to group the sequences into genotypes based on a distance measure over the vectors. Our methods produce results that correlate well with expert-defined clades or genotypes, at a fraction of the computational cost of traditional phylogenetic methods run on traditional hardware. Nephele can use the open-source Hadoop implementation of MapReduce to parallelize execution using multiple compute nodes. We were able to generate a neighbour-joined tree of over 10,000 16S samples in less than 2 hours.
    Conclusions: We conclude that using Nephele can substantially decrease the processing time required for generating genotype trees of tens to hundreds of organisms at genome scale sequence coverage.
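
The idea of representing each sequence as a vector of its k-mers and comparing vectors instead of alignments can be sketched as follows. This is a simplified stand-in, not Nephele's implementation: it uses plain relative k-mer frequencies rather than the complete composition vector algorithm, and the cosine distance, the function names, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
import math


def kmer_vector(seq, k=3, alphabet="ACGT"):
    """Relative k-mer frequencies in a fixed order over all possible k-mers."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Only k-mers made entirely of alphabet letters are counted.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def cosine_distance(u, v):
    """1 - cosine similarity; an assumed choice of distance over the vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical toy sequences, not data from the paper.
samples = {
    "s1": "ACGTACGTACGTAAGG",
    "s2": "ACGTACGTACGTAAGC",
    "s3": "TTTTGGGGCCCCTTTT",
}
vectors = {name: kmer_vector(seq) for name, seq in samples.items()}
d_close = cosine_distance(vectors["s1"], vectors["s2"])
d_far = cosine_distance(vectors["s1"], vectors["s3"])
print(d_close, d_far)
```

No alignment is computed at any point: each sequence is processed independently, which is what makes the vector step easy to distribute across compute nodes. A matrix of such pairwise distances is the kind of input a clustering step would then consume.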

    Fast Statistical Alignment

    We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments which are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment—previously available only with alignment programs which use computationally-expensive Markov Chain Monte Carlo approaches—yet can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/

    Fine-Scale Bacterial Beta Diversity within a Complex Ecosystem (Zodletone Spring, OK, USA): The Role of the Rare Biosphere

    The adaptation of pyrosequencing technologies for use in culture-independent diversity surveys allowed for deeper sampling of ecosystems of interest. One extremely well suited area of interest for pyrosequencing-based diversity surveys that has received surprisingly little attention so far is examining fine scale (e.g. micrometer to millimeter) beta diversity in complex microbial ecosystems. We examined the patterns of fine scale beta diversity in four adjacent sediment samples (1 mm apart) from the source of an anaerobic sulfide- and sulfur-rich spring (Zodletone spring) in southwestern Oklahoma, USA. Using pyrosequencing, a total of 292,130 16S rRNA gene sequences were obtained. The beta diversity patterns within the four datasets were examined using various qualitative and quantitative similarity indices. Low levels of beta diversity (high similarity indices) were observed between the four samples at the phylum level. However, at a putative species (OTU(0.03)) level, higher levels of beta diversity (lower similarity indices) were observed. Further examination of beta diversity patterns within dominant and rare members of the community indicated that at the putative species level, beta diversity is much higher within rare members of the community. Finally, sub-classification of rare members of the Zodletone spring community based on patterns of novelty and uniqueness, and further examination of fine scale beta diversity of each of these subgroups, indicated that members of the community that are unique but non-novel showed the highest beta diversity within these subgroups of the rare biosphere. The results demonstrate the occurrence of high inter-sample diversity within seemingly identical samples from a complex habitat. We reason that such unexpected diversity should be taken into consideration when exploring gamma diversity of various ecosystems, as well as when planning sequencing-intensive metagenomic surveys of highly complex ecosystems.
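
The difference between qualitative (presence/absence) and quantitative (abundance-weighted) similarity indices can be illustrated with a small sketch. The abstract does not name the indices used; Jaccard and Bray-Curtis are chosen here only as common examples of each kind, and the OTU counts are invented.

```python
def jaccard_similarity(a, b):
    """Qualitative index: shared OTUs divided by all OTUs observed in either sample."""
    present_a = {otu for otu, n in a.items() if n > 0}
    present_b = {otu for otu, n in b.items() if n > 0}
    union = present_a | present_b
    if not union:
        return 1.0
    return len(present_a & present_b) / len(union)


def bray_curtis_similarity(a, b):
    """Quantitative index: twice the shared abundance divided by total abundance."""
    otus = set(a) | set(b)
    shared = sum(min(a.get(o, 0), b.get(o, 0)) for o in otus)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 1.0
    return 2 * shared / total


# Hypothetical OTU count tables for two adjacent samples (not data from the study).
sample_1 = {"otu_a": 90, "otu_b": 10, "otu_c": 1}
sample_2 = {"otu_a": 85, "otu_b": 12, "otu_d": 2}

print(jaccard_similarity(sample_1, sample_2))      # 0.5
print(bray_curtis_similarity(sample_1, sample_2))  # 0.95
```

In this toy case the two samples share their dominant OTUs, so the abundance-weighted index is high, while the rare OTUs found in only one sample pull the presence/absence index down. That is the kind of pattern the abstract describes for rare community members.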

    Human oral viruses are personal, persistent and gender-consistent.

    Viruses are the most abundant members of the human oral microbiome, yet relatively little is known about their biodiversity in humans. To improve our understanding of the DNA viruses that inhabit the human oral cavity, we examined saliva from a cohort of eight unrelated subjects over a 60-day period. Each subject was examined at 11 time points to characterize longitudinal differences in human oral viruses. Our primary goals were to determine whether oral viruses were specific to individuals and whether viral genotypes persisted over time. We found a subset of homologous viral genotypes across all subjects and time points studied, suggesting that certain genotypes may be ubiquitous among healthy human subjects. We also found significant associations between viral genotypes and individual subjects, indicating that viruses are a highly personalized feature of the healthy human oral microbiome. Many of these oral viruses were not transient members of the oral ecosystem, as demonstrated by the persistence of certain viruses throughout the entire 60-day study period. As has previously been demonstrated for bacteria and fungi, membership in the oral viral community was significantly associated with the sex of each subject. Similar characteristics of personalized, sex-specific microflora could not be identified for oral bacterial communities based on 16S rRNA. Our findings that many viruses are stable and individual-specific members of the oral ecosystem suggest that viruses have an important role in the human oral ecosystem

    Integrated metatranscriptomic and metagenomic analyses of stratified microbial assemblages in the open ocean

    As part of an ongoing survey of microbial community gene expression in the ocean, we sequenced and compared ~38 Mbp of community transcriptomes and ~157 Mbp of community genomes from four bacterioplankton samples, along a defined depth profile at Station ALOHA in the North Pacific subtropical gyre (NPSG). Taxonomic analysis suggested that the samples were dominated by three taxa: Prochlorales, Consistiales and Cenarchaeales, which comprised 36–69% and 29–63% of the annotated sequences in the four DNA and four cDNA libraries, respectively. The relative abundance of these taxonomic groups was sometimes very different in the DNA and cDNA libraries, suggesting differential relative transcriptional activities per cell. For example, the 125 m sample genomic library was dominated by Pelagibacter (~36% of sequence reads), which contributed fewer sequences to the community transcriptome (~11%). Functional characterization of highly expressed genes suggested taxon-specific contributions to specific biogeochemical processes. Examples included Roseobacter relatives involved in aerobic anoxygenic phototrophy at 75 m, and an unexpected contribution of low abundance Crenarchaea to ammonia oxidation at 125 m. Read recruitment using reference microbial genomes indicated depth-specific partitioning of coexisting microbial populations, highlighted by a transcriptionally active high-light-like Prochlorococcus population in the bottom of the photic zone. Additionally, nutrient-uptake genes dominated Pelagibacter transcripts, with apparent enrichment for certain transporter types (for example, the C4-dicarboxylate transport system) over others (for example, phosphate transporters). In total, the data support the utility of coupled DNA and cDNA analyses for describing taxonomic and functional attributes of microbial communities in their natural habitats.
    Funding: Gordon and Betty Moore Foundation; United States. Dept. of Energy; National Science Foundation (U.S.) (Science and Technology Center Award EF0424599)

    Active inference, sensory attenuation and illusions.

    Active inference provides a simple and neurobiologically plausible account of how action and perception are coupled in producing (Bayes) optimal behaviour. This can be seen most easily as minimising prediction error: we can either change our predictions to explain sensory input through perception or actively change sensory input to fulfil our predictions. In active inference, this action is mediated by classical reflex arcs that minimise proprioceptive prediction error created by descending proprioceptive predictions. However, this creates a conflict between action and perception, in that self-generated movements require predictions to override the sensory evidence that one is not actually moving. Ignoring sensory evidence, however, means that externally generated sensations will not be perceived. Conversely, attending to (proprioceptive and somatosensory) sensations enables the detection of externally generated events but precludes generation of actions. This conflict can be resolved by attenuating the precision of sensory evidence during movement or, equivalently, attending away from the consequences of self-made acts. We propose that this Bayes optimal withdrawal of precise sensory evidence during movement is the cause of psychophysical sensory attenuation. Furthermore, it explains the force-matching illusion and reproduces empirical results almost exactly. Finally, if attenuation is removed, the force-matching illusion disappears and false (delusional) inferences about agency emerge. This is important, given the negative correlation between sensory attenuation and delusional beliefs in normal subjects and the reduction in the magnitude of the illusion in schizophrenia. Active inference therefore links the neuromodulatory optimisation of precision to sensory attenuation and illusory phenomena during the attribution of agency in normal subjects. It also provides a functional account of deficits in syndromes characterised by false inference and impaired movement (like schizophrenia and Parkinsonism), syndromes that implicate abnormal modulatory neurotransmission.

    Do physician outcome judgments and judgment biases contribute to inappropriate use of treatments? Study protocol

    Background: There are many examples of physicians using treatments inappropriately, despite clear evidence about the circumstances under which the benefits of such treatments outweigh their harms. When such over- or under-use of treatments occurs for common diseases, the burden to the healthcare system and risks to patients can be substantial. We propose that a major contributor to inappropriate treatment may be how clinicians judge the likelihood of important treatment outcomes, and how these judgments influence their treatment decisions. The current study will examine the role of judged outcome probabilities and other cognitive factors in the context of two clinical treatment decisions: 1) prescription of antibiotics for sore throat, where we hypothesize that overestimation of benefit and underestimation of harm leads to over-prescription of antibiotics; and 2) initiation of anticoagulation for patients with atrial fibrillation (AF), where we hypothesize that underestimation of benefit and overestimation of harm leads to under-prescription of warfarin.
    Methods: For each of the two conditions, we will administer surveys of two types (Type 1 and Type 2) to different samples of Canadian physicians. The primary goal of the Type 1 survey is to assess physicians' perceived outcome probabilities (both good and bad outcomes) for the target treatment. Type 1 surveys will assess judged outcome probabilities in the context of a representative patient, and include questions about how physicians currently treat such cases, the recollection of rare or vivid outcomes, as well as practice and demographic details. The primary goal of the Type 2 surveys is to measure the specific factors that drive individual clinical judgments and treatment decisions, using a 'clinical judgment analysis' or 'lens modeling' approach. This survey will manipulate eight clinical variables across a series of sixteen realistic case vignettes. Based on the survey responses, we will be able to identify which variables have the greatest effect on physician judgments, and whether judgments are affected by inappropriate cues or incorrect weighting of appropriate cues. We will send antibiotics surveys to family physicians (300 per survey), and warfarin surveys to both family physicians and internal medicine specialists (300 per group per survey), for a total of 1,800 physicians. Each Type 1 survey will be two to four pages in length and take about fifteen minutes to complete, while each Type 2 survey will be eight to ten pages in length and take about thirty minutes to complete.
    Discussion: This work will provide insight into the extent to which clinicians' judgments about the likelihood of important treatment outcomes explain inappropriate treatment decisions. This work will also provide information necessary for the development of an individualized feedback tool designed to improve treatment decisions. The techniques developed here have the potential to be applicable to a wide range of clinical areas where inappropriate utilization stems from biased judgments.
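
The stated total of 1,800 physicians follows from the sampling design if each combination of condition, physician group, and survey type receives its own sample of 300. The short check below makes that arithmetic explicit; the variable names are illustrative only.

```python
PER_SAMPLE = 300   # physicians per group per survey, as stated in the abstract
SURVEY_TYPES = 2   # Type 1 and Type 2, sent to different samples

# Physician groups surveyed for each condition, as described in the abstract.
groups = {
    "antibiotics": ["family physicians"],
    "warfarin": ["family physicians", "internal medicine specialists"],
}

per_condition = {
    condition: len(group_list) * SURVEY_TYPES * PER_SAMPLE
    for condition, group_list in groups.items()
}
total = sum(per_condition.values())

print(per_condition)  # {'antibiotics': 600, 'warfarin': 1200}
print(total)          # 1800
```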

    Factors That Affect Large Subunit Ribosomal DNA Amplicon Sequencing Studies of Fungal Communities: Classification Method, Primer Choice, and Error

    Nuclear large subunit ribosomal DNA is widely used in fungal phylogenetics and, to an increasing extent, in amplicon-based environmental sequencing. The relatively short reads produced by next-generation sequencing, however, make primer choice and sequence error important variables for obtaining accurate taxonomic classifications. In this simulation study we tested the performance of three classification methods: 1) a similarity-based method (BLAST + Metagenomic Analyzer, MEGAN); 2) a composition-based method (Ribosomal Database Project naïve Bayesian classifier, NBC); and 3) a phylogeny-based method (Statistical Assignment Package, SAP). We also tested the effects of sequence length, primer choice, and sequence error on classification accuracy and perceived community composition. Using a leave-one-out cross validation approach, results for classifications to the genus rank were as follows: BLAST + MEGAN had the lowest error rate and was particularly robust to sequence error; SAP accuracy was highest when long LSU query sequences were classified; and NBC ran significantly faster than the other tested methods. All methods performed poorly with the shortest 50–100 bp sequences. Increasing simulated sequence error reduced classification accuracy. Community shifts were detected due to sequence error and primer selection even though there was no change in the underlying community composition. Short read datasets from individual primers, as well as pooled datasets, appear to only approximate the true community composition. We hope this work informs investigators of some of the factors that affect the quality and interpretation of their environmental gene surveys.
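
Leave-one-out cross validation, as used above to score genus-level classifications, can be sketched as follows. The classifier here is a toy shared-k-mer nearest-reference rule invented for illustration; it is not BLAST + MEGAN, NBC, or SAP, and the reference sequences and genus labels are hypothetical.

```python
def kmer_set(seq, k=4):
    """All distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify(query, references, k=4):
    """Toy classifier: return the genus of the reference sharing the most k-mers."""
    query_kmers = kmer_set(query, k)
    best = max(references, key=lambda ref: len(query_kmers & kmer_set(ref[0], k)))
    return best[1]


def leave_one_out_error(references, k=4):
    """Classify each reference against all the others and return the error rate."""
    errors = 0
    for i, (seq, genus) in enumerate(references):
        remaining = references[:i] + references[i + 1:]  # hold out sequence i
        if classify(seq, remaining, k) != genus:
            errors += 1
    return errors / len(references)


# Hypothetical reference set: (sequence, genus) pairs, not real LSU data.
references = [
    ("ACGTACGTACGT", "GenusA"),
    ("ACGTACGTACGA", "GenusA"),
    ("TTGGCCAATTGG", "GenusB"),
    ("TTGGCCAATTGC", "GenusB"),
    ("GGGGCCCCGGGG", "GenusC"),  # only member of its genus
]

print(leave_one_out_error(references))  # 0.2
```

The single GenusC sequence is always misclassified once it is held out, because no other member of its genus remains in the reference set. Any leave-one-out evaluation has this property for taxa represented by a single reference.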