13 research outputs found

    ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition.

    MOTIVATION: Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging. RESULTS: There has been a recent surge of interest in using compressed-sensing-inspired and convex-optimization-based methods to solve the estimation problem for bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order k-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing approach where we use a standard K-means clustering algorithm that partitions a large set of reads into subsets at reasonable computational cost, providing several vectors of first-order statistics instead of only a single statistical summarization in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via a mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity. AVAILABILITY: An open source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware.
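The aggregation step described in this abstract can be sketched as follows. This is a minimal pure-Python illustration, not the authors' Julia or Matlab implementation: the function names, the choice of k = 2, and the plain Lloyd's K-means with random initial centroids are all assumptions made for the example. It turns each read into a k-mer frequency vector, clusters the vectors, and returns one mean vector per cluster together with the fraction of reads in that cluster.

```python
import random
from itertools import product


def kmer_frequencies(read, k=2, alphabet="ACGT"):
    """Normalized k-mer frequency vector of a single read."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0.0] * len(kmers)
    for start in range(len(read) - k + 1):
        j = index.get(read[start:start + k])
        if j is not None:  # skip k-mers with ambiguous bases such as N
            counts[j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's K-means; returns one cluster label per vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, n_clusters)  # random initial centroids
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every vector.
        labels = [
            min(range(n_clusters), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: recompute each centroid from its members.
        for c in range(n_clusters):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return labels


def aggregate_reads(reads, n_clusters, k=2):
    """Return (weight, mean k-mer vector) for every non-empty cluster."""
    vectors = [kmer_frequencies(r, k) for r in reads]
    labels = kmeans(vectors, n_clusters)
    summaries = []
    for c in range(n_clusters):
        members = [v for v, label in zip(vectors, labels) if label == c]
        if members:
            summaries.append((len(members) / len(vectors), mean_vector(members)))
    return summaries


reads = ["ACACACAC", "ACACACAA", "GTGTGTGT", "GTGTGTGG"]
for weight, profile in aggregate_reads(reads, n_clusters=2):
    print(round(weight, 2), [round(p, 2) for p in profile])
```

Under the mixture reading of the abstract, each cluster mean would then be handed to the underlying estimator (for example Quikr or SEK) and the per-cluster estimates combined using the cluster weights. That combination rule is an interpretation of the "mixture density formulation" and is not spelled out in the abstract itself.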

    Accurate and fast taxonomic profiling of microbial communities

    With the advent of next-generation sequencing there has been an explosion in the size of the data that needs to be processed, with next-generation sequencing yielding base pairs of DNA in the millions. The rate at which the size of the data increases outpaces Moore's law, so there is a huge demand for methods to find meaningful labels for sequenced data. Studying the microbial diversity of a sample is one such challenge in the field of metagenomics. Finding the distribution of a bacterial community has many uses, for example obesity control. Existing methods often resort to read-by-read classification, which can take several days of computing time in a regular desktop environment, excluding genomic scientists without access to huge clusters of computational units. By using sparsity-enforcing methods from the general sparse signal processing field (such as compressed sensing), solutions have been found to the bacterial community composition estimation problem by a simultaneous assignment of all sample reads to a pre-processed reference database. The inference task is reduced to a general statistical model based on kernel density estimation techniques, which is solved by existing convex optimization tools. The objective is to offer a reasonably fast community composition estimation method. This report proposes clustering as a means of aggregating data to improve the run time and biological fidelity of existing techniques. The use of convex optimization tools to increase the accuracy of mixture model parameters is also explored and tested. The work is concluded by experimentation on the proposed improvements, with satisfactory results. The use of Dirichlet mixtures is explored as a parametric model of the sample distribution; it is deemed that the Dirichlet is a good choice for aggregation of k-mer feature vectors, but that Expectation Maximization is unfit for parameter estimation of bacterial 16S rRNA samples. Finally, a semi-supervised learning method founded on distance-based classification of taxa has been implemented and tested on real biological data with high biological fidelity.

    New DNA sequencing techniques have led to an explosion in the amount of available data. Next-generation DNA sequencing generates base pairs numbering in the millions, and the amount of data grows at an exponential rate, which is why there is a great need for new scalable methods that can analyze quantitative data to extract relevant information. The bacterial species distribution of a sample is one such problem in metagenomics, with several application areas such as studies of obesity. At present, the most common way to obtain the species distribution is to classify the bacterial DNA strands, a time-consuming solution that can take up to a day to process high-resolution data. A fast and reliable solution would therefore allow more researchers to take part in next-generation sequencing and analyze its data, which in turn would lead to more innovation in the field. Alternative solutions inspired by signal processing have been found that exploit the sparse nature of the problem through the use of compressed sensing. Answers are found by simultaneously assigning strands to a pre-processed reference database. The problem has been simplified to a statistical model of the sample with nonparametric estimation, in order to implicitly obtain the distribution of bacterial species with the help of convex optimization. This report proposes the use of clustering to aggregate data in order to improve the reliability of the answers and reduce the time needed to compute them. The use of parametric models, namely the Dirichlet distribution, has been explored; the report concludes that the assumptions about its suitability as a means of aggregating k-mer vectors are reasonable, but that parameter estimation with Expectation Maximization does not work well with the Dirichlet, and that a reparametrization would be needed in the vector space spanned by the 16S rRNA gene. Finally, distance-based assignment of bacteria has been tested on data from a real biological context with very high accuracy.
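The idea of assigning all sample reads to a reference database at once can be illustrated with a small non-negative least-squares sketch. It assumes a hypothetical reference matrix `A` whose columns are k-mer frequency profiles of reference taxa and a sample profile `y`, and solves for non-negative abundances with projected gradient descent. This is a generic stand-in, not the formulation used in the report or in Quikr/SEK: it omits any explicit sparsity penalty, and the function name and toy numbers are made up for the example.

```python
def estimate_composition(A, y, n_iter=500):
    """Projected-gradient non-negative least squares, then normalization.

    A[i][j]: frequency of k-mer i in reference taxon j (hypothetical database).
    y[i]:    frequency of k-mer i in the sample.
    """
    n_rows, n_cols = len(A), len(A[0])
    # The squared Frobenius norm bounds the largest eigenvalue of A^T A,
    # so 1 / lipschitz is a safe step size.
    lipschitz = sum(a * a for row in A for a in row)
    x = [1.0 / n_cols] * n_cols  # start from a uniform composition
    for _ in range(n_iter):
        residual = [
            sum(A[i][j] * x[j] for j in range(n_cols)) - y[i]
            for i in range(n_rows)
        ]
        gradient = [
            sum(A[i][j] * residual[i] for i in range(n_rows))
            for j in range(n_cols)
        ]
        # Gradient step followed by projection onto x >= 0.
        x = [max(0.0, x[j] - gradient[j] / lipschitz) for j in range(n_cols)]
    total = sum(x)
    return [v / total for v in x] if total else x


# Toy database: three k-mers (rows), two reference taxa (columns).
A = [[0.6, 0.1],
     [0.3, 0.2],
     [0.1, 0.7]]
# Sample built as a 70/30 mixture of the two reference profiles.
y = [0.7 * row[0] + 0.3 * row[1] for row in A]
print(estimate_composition(A, y))
```

Because a single solve handles the whole sample, the cost depends on the size of the k-mer summary and the database rather than on the number of reads, which is the contrast with read-by-read classification drawn in the abstract.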

    Comparison of the underlying algorithms with and without ARK.

    Results are for the random K-means clustering on the simulated data when fixing the number of clusters to 75. Boxplot of the individual simulated sample execution times. Mean execution times for Quikr and ARK Quikr were 1.75 seconds and 4.71 minutes, while for SEK and ARK SEK they were 21.26 seconds and 19.21 minutes respectively. Mean execution time for RDP’s NBC was 38.19 minutes.
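Since the caption mixes seconds and minutes, the short sketch below converts the reported means to seconds and computes the ratio between each method with and without ARK. The figures are copied from the caption; the dictionary layout and the helper name `slowdown` are illustrative choices only.

```python
SECONDS_PER_MINUTE = 60.0

# Mean execution times from the caption, converted to seconds.
mean_seconds = {
    "Quikr": 1.75,
    "ARK Quikr": 4.71 * SECONDS_PER_MINUTE,
    "SEK": 21.26,
    "ARK SEK": 19.21 * SECONDS_PER_MINUTE,
    "RDP NBC": 38.19 * SECONDS_PER_MINUTE,
}


def slowdown(with_ark, without_ark):
    """Ratio of mean run time with ARK to mean run time without it."""
    return mean_seconds[with_ark] / mean_seconds[without_ark]


for name, seconds in mean_seconds.items():
    print(f"{name:10s} {seconds:9.2f} s")
print(f"ARK Quikr / Quikr: {slowdown('ARK Quikr', 'Quikr'):.1f}x")
print(f"ARK SEK / SEK:     {slowdown('ARK SEK', 'SEK'):.1f}x")
```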

    Comparison of the underlying algorithms with and without ARK.

    Results are for the random K-means clustering on the simulated data when fixing the number of clusters to 75. Mean VD error at the genus level. Included for comparison are results for RDP’s NBC (compare to Fig 2(b) of [3], http://www.plosone.org/article/info:doi/10.1371/journal.pone.0140644#pone.0140644.ref003).