440 research outputs found

    ExaBayes: Massively Parallel Bayesian Tree Inference for the Whole-Genome Era

    Characterizing and Accelerating Bioinformatics Workloads on Modern Microarchitectures

    Bioinformatics, the use of computer techniques to analyze biological data, has been a particularly active research field in the last two decades. Advances in this field have contributed to the collection of enormous amounts of data, and the sheer amount of available data has started to overtake the processing capability of current computer systems. Clearly, computer architects need a better understanding of how bioinformatics applications work and what kind of architectural techniques could be used to accelerate these important scientific workloads on future processors. In this dissertation, we develop a bioinformatics benchmark suite and provide a detailed characterization of these applications in common use today from a computer architect's point of view. We analyze a wide range of detailed execution characteristics, including instruction mix, IPC measurements, and L1 and L2 cache misses on a real architecture, and then analyze the workloads' memory access characteristics. We then concentrate on accelerating a particularly computationally intensive bioinformatics workload on the novel Cell Broadband Engine multiprocessor architecture. The HMMER workload is used for protein profile searching using hidden Markov models, and most of its execution time is spent running the Viterbi algorithm. We parallelize and partition the HMMER application to implement it on the Cell Broadband Engine. In order to run the Viterbi algorithm on the 256KB local stores of the Cell BE synergistic processing units (SPEs), we present a method to develop a fast SIMD implementation of the Viterbi algorithm that reduces the storage requirements significantly. Our HMMER implementation for the Cell BE architecture, Cell-HMMER, exploits the multiple levels of parallelism inherent in this application, and can run protein profile searches up to 27.98 times faster than a modern dual-core x86 microprocessor.
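
The abstract above notes that the Viterbi algorithm had to fit in small local stores. One generic way to cut Viterbi storage, sketched here purely as an illustration, is to keep only the previous row of scores when just the best path score is needed. This is not the dissertation's SIMD implementation; the function name `viterbi_score` and the toy two-state model (`STATES`, `LOG_START`, `LOG_TRANS`, `LOG_EMIT`) are assumptions made for this example.

```python
import math

# Toy two-state HMM in log space (hypothetical values, for illustration only).
STATES = ("fair", "biased")
LOG_START = {"fair": math.log(0.5), "biased": math.log(0.5)}
LOG_TRANS = {
    "fair": {"fair": math.log(0.9), "biased": math.log(0.1)},
    "biased": {"fair": math.log(0.2), "biased": math.log(0.8)},
}
LOG_EMIT = {
    "fair": {"H": math.log(0.5), "T": math.log(0.5)},
    "biased": {"H": math.log(0.8), "T": math.log(0.2)},
}


def viterbi_score(obs, states, log_start, log_trans, log_emit):
    """Return the log-probability of the best state path for `obs`.

    Only one row of scores (one value per state) is kept at a time, so
    memory grows with the number of states, not with the sequence length.
    The trade-off is that the path itself cannot be traced back.
    """
    prev = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        cur = {}
        for s in states:
            # Best way to arrive in state s from any previous state r.
            best = max(prev[r] + log_trans[r][s] for r in states)
            cur[s] = best + log_emit[s][symbol]
        prev = cur  # discard the older row
    return max(prev.values())


score = viterbi_score("HTH", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
```

A score-only pass like this suits database searches where most candidates are rejected by score alone; recovering the alignment for a hit would need a second pass that stores traceback information.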

    Galaxy based BLAST submission to distributed national high throughput computing resources

    To assist the bioinformatics community in leveraging the national cyberinfrastructure, the National Center for Genomic Analysis Support (NCGAS), along with Indiana University's High Throughput Computing (HTC) group, has engineered a method to use Galaxy to submit BLAST jobs to the Open Science Grid (OSG). OSG is a collaboration of resource providers that utilize opportunistic cycles at more than 100 universities and research centers in the US. BLAST jobs make up a significant portion of the research conducted on NCGAS resources. Moving jobs that are conducive to an HTC environment to the national cyberinfrastructure would alleviate load on resources at NCGAS and provide a cost-effective way of obtaining more cycles to reduce the unmet needs of bioinformatics researchers. To this point, researchers have tackled this issue by purchasing additional resources or enlisting collaborators doing the same type of research, while HTC experts have focused on expanding the number of resources available to historically HTC-friendly science workflows. In this paper, we bring together expertise from both areas to address how a bioinformatics researcher using their normal interface, Galaxy, can seamlessly access the OSG, which routinely supplies researchers with millions of compute hours daily. Efficient use of these results will supply additional compute time to researchers and help meet an as-yet unmet need for BLAST computing cycles. This material is based upon work supported by the National Science Foundation under Grant No. ABI-1062432, Craig Stewart, PI. William Barnett, Matthew Hahn, and Michael Lynch, co-PIs. This work was supported in part by the Lilly Endowment, Inc. and the Indiana University Pervasive Technology Institute. Any opinions presented here are those of the presenter(s) and do not necessarily represent the opinions of the National Science Foundation or any other funding agencies.

    Detection of RNA from a Novel West Nile-like Virus and High Prevalence of an Insect-specific Flavivirus in Mosquitoes in the Yucatan Peninsula of Mexico

    As part of our ongoing surveillance efforts for West Nile virus (WNV) in the Yucatan Peninsula of Mexico, 96,687 mosquitoes collected from January through December 2007 were assayed by virus isolation in mammalian cells. Three mosquito pools caused cytopathic effect. Two isolates were orthobunyaviruses (Cache Valley virus and Kairi virus) and the identity of the third infectious agent was not determined. A subset of mosquitoes was also tested by reverse transcription-polymerase chain reaction (RT-PCR) using WNV-, flavivirus-, alphavirus-, and orthobunyavirus-specific primers. A total of 7,009 Culex quinquefasciatus in 210 pools were analyzed. Flavivirus RNA was detected in 146 (70%) pools, and all PCR products were sequenced. The nucleotide sequence of one PCR product was most closely related (71-73% identity) to homologous regions of several other flaviviruses, including WNV, St. Louis encephalitis virus, and Ilheus virus. These data suggest that a novel flavivirus (tentatively named T'Ho virus) is present in Mexico. The other 145 PCR products correspond to Culex flavivirus, an insect-specific flavivirus first isolated in Japan in 2003. Culex flavivirus was isolated in mosquito cells from approximately one in four homogenates tested. The genomic sequence of one isolate was determined. Surprisingly, heterogeneous sequences were identified at the distal end of the 5' untranslated region.

    Load-Balance and Fault-Tolerance for Massively Parallel Phylogenetic Inference

    An Improved Metagenome Classification Method Based on an Exact Sequence-Matching Technique and an In-Memory Core Gene Database

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Natural Sciences, Interdisciplinary Program in Bioinformatics, 2020. 8. Chun, Jongsik. Shotgun metagenomics is of great importance for understanding the microbial community composition of a sample and the impact it has on its host. The proper identification and quantification of bacterial species is a key component of any microbiome research based on metagenomic samples. In the last decade, several algorithms and databases have been developed; however, differences between the references and the type of algorithm used for classification make comparisons among them unfair and biased. 
The contents of the reference database, including genome sequences of type strains or reference genomes of uncultured species, have a great impact on the classification results for metagenomic samples. Another significant factor in shotgun metagenomics is classification speed, as most current bioinformatics tools lack computational and memory optimization. Here, I propose several enhancements to a well-known method, exact match k-mer classification, in order to increase the overall speed of metagenomic classification. This method was further improved by the use of Up-to-date Bacterial Core Gene (UBCG) sequences to provide faster and more accurate shotgun metagenomic profiling. To demonstrate the efficiency of the method, I built two UBCG-based reference databases: one containing UBCG sequences of validly named species, and a second containing UBCG sequences of all validly named species and genomospecies in the EzBioCloud database. Three datasets containing Streptococcus species were used to evaluate the improved method against the MetaPhlAn2 tool, which is the most widely used open-source shotgun metagenomic classifier: (i) synthetic metagenomic samples, (ii) clinical sputum samples from patients with chronic obstructive pulmonary disease (COPD), and (iii) clinical samples of a bloodstream infection. In this analysis, I demonstrated that UBCG sequences can be used as references for metagenomic classification, showing that they are easy to extract from genome sequences and accurate when predicting relative abundance. I also showed that the inclusion of genomospecies in the reference databases significantly improves the classification accuracy of bacterial species within a metagenomic sample. 
Finally, I showed that while publicly available pipelines and databases are easily accessible, accurate and reliable taxonomic classification requires an updated database with proper taxonomic and genomic curation. The method devised in this work is then applied to profile Bacteroides species, among the most abundant members of the human gut microbiome, in over 4,000 shotgun metagenomic samples. This task cannot be accomplished using conventional tools such as MetaPhlAn2 because of the long processing time they require. The results of this study showed that Bacteroides is highly abundant in human samples from urban areas while being of low abundance in humans from rural areas, particularly African and South American tribes. Individual countries showed dominance of a specific Bacteroides species, but this could also be explained by the type of study the samples came from. Mouse samples showed the greatest diversity of Bacteroides, which can be attributed to the number of bacterial references isolated from this organism. House cat and dog samples correlated with each other, which may be attributed to similarities in their lifestyle and diet. This study shows the importance of having a large number of samples for any given metagenomic analysis; even though we have profiled thousands of samples, more might be needed in the future. The method proposed in this thesis demonstrates that core genes are reliable reference sequences for shotgun metagenomics. Their use as reference sequences in metagenomic databases improves the accuracy of the abundance prediction for any given sample. Additionally, with the use of a k-mer approach, this method's running time outperforms that of the most popular shotgun metagenomic tools. The work presented in this thesis aims to help microbial research by providing faster and more accurate metagenomic taxonomic predictions. 
Finally, the ability to update a metagenomic database with ease will help researchers obtain the most up-to-date results when looking for potential diagnoses or treatments for diseases associated with human microbial communities.
Chapter 1. General Introduction
1.1. Introduction to metagenomics
1.2. 16S rRNA sequencing
1.3. Shotgun metagenomic sequencing
1.3.1. History
1.3.2. Sample extraction
1.3.3. Library preparation
1.3.4. Sequencing
1.4. Shotgun metagenomic classification
1.4.1. Homology-based approaches
1.4.2. Exact match K-mer approaches
Chapter 2. An exact match k-mer algorithm
2.1. An exact match k-mer classification approach
2.1.1. Definition of the problem
2.1.2. Building a k-mer reference database
2.1.2.1. K-mer counting
2.1.2.2. K-mer mapping
2.1.3. Classification of a metagenomic read
2.1.3.1. K-mer search
2.1.3.2. Scoring a metagenomic read
2.1.4. Calculating the metagenome profile
2.1.4.1. Normalization for LCA-assigned reads
2.1.4.2. Normalization for cell count relative abundance
2.2. RAM memory usage
2.3. Quality Control
2.3.1. Read Trimming
2.3.2. Host read removal
Chapter 3. Revealing unrecognized species in the genus Streptococcus
3.1. A brief history of Streptococcus in clinical metagenomics
3.2. Results and Discussion
3.2.1. Building a core gene reference database
3.2.2. Evaluation of Pipelines using Synthetic Metagenomes
3.2.3. Chronic obstructive pulmonary disease samples
3.2.3. Evaluating the value of genomospecies references in a metagenomic database
3.2.4. Identifying accurately a Streptococcal infection using clinical data
3.2.5. Effects of different ANI thresholds on the classification of genomospecies
3.3. Materials and Methods
3.3.1. Selecting the reference genomes
3.3.2. Average nucleotide identity and hierarchical clustering
3.3.3. Synthetic and Real metagenomic samples
3.3.4. Extracting the core genes
3.3.5. Taxonomic profiling
3.3.6. Biomarker discovery
3.4. Conclusions
Chapter 4. A large-scale shotgun metagenomic analysis on Bacteroides
4.1. Introduction
4.2. Bacteroides on the human gut
4.2.1. Collecting the samples
4.2.2. Methods
4.2.2.1. Reference Genomes
4.2.2.2. Metagenome profiling
4.2.3. Results
4.3. Bacteroides on Animal Species
4.3.1. Methods
4.3.2. Results
4.4. Discussion and conclusions
General Conclusion
References
Appendix I. A list of genomes from the genus Streptococcus used in the Chapter 3 analysis.
Abstract in Korean
Docto
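
The exact match k-mer classification named in the abstract above can be illustrated with a minimal sketch. It assumes a plain dictionary as the in-memory index and simply skips k-mers shared by several taxa, which is a simplification standing in for the LCA assignment listed in the thesis outline. The names `build_index` and `classify`, the toy reference sequences, and `k = 4` are all hypothetical choices made for this example, not the thesis pipeline.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(references, k):
    """Map each k-mer to the set of taxa whose reference contains it."""
    index = {}
    for taxon, seq in references.items():
        for km in set(kmers(seq, k)):
            index.setdefault(km, set()).add(taxon)
    return index


def classify(read, index, k):
    """Assign a read to the taxon with the most exact k-mer hits.

    k-mers found in more than one taxon are ignored (an assumption made
    here for brevity). Returns None when no unambiguous k-mer matches.
    """
    votes = Counter()
    for km in kmers(read, k):
        taxa = index.get(km)
        if taxa and len(taxa) == 1:
            votes[next(iter(taxa))] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy references (made-up sequences, not real core genes).
REFERENCES = {"taxon_a": "ACGTACGTTAGC", "taxon_b": "TTGACCATGGCA"}
INDEX = build_index(REFERENCES, k=4)
label = classify("GTACGTTA", INDEX, k=4)
```

Because each lookup is an exact dictionary hit rather than an alignment, the per-read cost scales with read length, which is the usual reason exact k-mer matching is faster than homology-based search.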

    Mechanisms to improve the efficiency of hardware data prefetchers

    A well-known performance bottleneck in computer architecture is the so-called memory wall. This term refers to the huge disparity between on-chip and off-chip access latencies. Historically speaking, the operating frequency of processors has increased at a steady pace, while most past advances in memory technology have been in density, not speed. Nowadays, the trend for ever increasing processor operating frequencies has been replaced by an increasing number of CPU cores per chip. This will continue to exacerbate the memory wall problem, as several cores now have to compete for off-chip data access. As multi-core systems pack more and more cores, it is expected that the access latency as observed by each core will continue to increase. Although the causes of the memory wall have changed, it is, and will continue to be in the near future, a very significant challenge in terms of computer architecture design. Prefetching has been an important technique to amortize the effect of the memory wall. With prefetching, data or instructions that are expected to be used in the near future are speculatively moved up in the memory hierarchy, where the access latency is smaller. This dissertation focuses on hardware data prefetching at the last cache level before memory (last level cache, LLC). Prefetching at the LLC usually offers the best performance increase, as this is where the disparity between hit and miss latencies is the largest. Hardware prefetchers operate by examining the miss address stream generated by the cache and identifying patterns and correlations between the misses. Most prefetchers divide the global miss stream into several sub-streams, according to some pre-specified criteria. This process is known as localization. The benefits of localization are well established: it increases the accuracy of the predictions and helps filter out spurious, non-predictable misses. 
However, localization has one important drawback: since the misses are classified into different sub-streams, important chronological information is lost. A consequence of this is that most localizing prefetchers issue prefetches in an untimely manner, fetching data too far in advance. This behavior promotes data pollution in the cache. The first part of this thesis proposes a new class of prefetchers based on the novel concept of Stream Chaining. With Stream Chaining, the prefetcher tries to reconstruct the chronological information lost in the process of localization, while at the same time keeping its benefits. We describe two novel Stream Chaining prefetching algorithms based on two state-of-the-art localizing prefetchers: PC/DC and C/DC. We show how both prefetchers issue prefetches in a more timely manner than their non-chaining counterparts, increasing performance by as much as 55% (10% on average) on a suite of sequential benchmarks, while consuming roughly the same amount of memory bandwidth. In order to hide the effects of the memory wall, hardware prefetchers are usually configured to aggressively prefetch as much data as possible. However, a highly aggressive prefetcher can have negative effects on performance. Factors such as prefetching accuracy, cache pollution and memory bandwidth consumption have to be taken into account. This is especially important in the context of multi-core systems, where typically each core has its own prefetching engine and there is high competition for accessing memory. Several prefetch throttling and filtering mechanisms have been proposed to maximize the effect of prefetching in multi-core systems. The general strategy behind these heuristics is to promote prefetches that are more likely to be used and cause less interference. Traditionally these methods operate at the source level, i.e., directly in the prefetch engine they are assigned to control. 
In multi-core systems, all prefetches are aggregated in a FIFO-like data structure called the Prefetch Request Queue (PRQ), where they wait to be dispatched to memory. The second part of this thesis shows that a traditional FIFO PRQ does not promote timely prefetching behavior and usually hinders part of the performance benefits achieved by throttling heuristics. We propose a novel approach to prefetch aggressiveness control in multi-cores that performs throttling at the PRQ (i.e., global) level, using global knowledge of the metrics of all prefetchers and information about the global state of the PRQ. To do this, we introduce the Resizable Prefetching Heap (RPH), a data structure modeled after a binary heap that promotes timely dispatch of prefetches as well as fairness in the distribution of prefetching bandwidth. The RPH is designed as a drop-in replacement for traditional FIFO PRQs. We compare our proposal against a state-of-the-art source-level throttling algorithm (HPAC) in an 8-core system. Unlike previous research, we evaluate both multiprogrammed and multithreaded (parallel) workloads, using a modern prefetching algorithm (C/DC). Our experimental results show that RPH-based throttling increases the throttling performance benefits obtained by HPAC by as much as 148% (53.8% average) in multiprogrammed workloads and as much as 237% (22.5% average) in parallel benchmarks, while consuming roughly the same amount of memory bandwidth. When comparing the speedup over fixed degree prefetching, RPH increased the average speedup of HPAC from 7.1% to 10.9% in multiprogrammed workloads, and from 5.1% to 7.9% in parallel benchmarks.
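
The contrast the abstract above draws between a FIFO request queue and a heap-ordered one can be sketched in software. This is only a toy model under stated assumptions: the abstract does not say how the RPH ranks requests, so the `priority` value here is a hypothetical stand-in (for instance, a per-prefetcher usefulness estimate), and the class names `FifoQueue` and `HeapQueue` and the helper `drain` are invented for the example.

```python
import heapq
from collections import deque
from itertools import count


class FifoQueue:
    """Dispatches requests strictly in arrival order; priority is ignored."""

    def __init__(self):
        self._items = deque()

    def push(self, address, priority):
        self._items.append(address)

    def pop(self):
        return self._items.popleft()


class HeapQueue:
    """Dispatches the highest-priority request first (binary heap).

    A sequence number breaks ties so equal-priority requests keep their
    arrival order.
    """

    def __init__(self):
        self._heap = []
        self._seq = count()

    def push(self, address, priority):
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._seq), address))

    def pop(self):
        return heapq.heappop(self._heap)[2]


def drain(queue, requests):
    """Push all (address, priority) pairs, then pop them in dispatch order."""
    for address, priority in requests:
        queue.push(address, priority)
    return [queue.pop() for _ in requests]


# Hypothetical requests: (block address, assumed usefulness score).
REQUESTS = [(0x100, 0.2), (0x140, 0.9), (0x180, 0.5)]
fifo_order = drain(FifoQueue(), REQUESTS)
heap_order = drain(HeapQueue(), REQUESTS)
```

Both classes expose the same `push` and `pop` interface, mirroring the abstract's point that the heap is meant as a drop-in replacement; only the dispatch order changes.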