15 research outputs found

    Acceleration of Big Data/Hadoop applications using GPU's

    The goal of this work is to see what difference graphical processing units can make when working with big data; creating a real-world application is not part of this goal. Data mining and big data analysis allow organizations to find useful information and insights in huge amounts of structured or unstructured data. This information can be used to improve all kinds of services to customers or to gain an edge over competitors. In the Information Age, data mining is becoming more and more popular because more data is available. This is where Hadoop comes in: Hadoop provides a scalable and free framework that enables data analysis and operations on big data, and it is very innovative in the field of big data. GPUs are highly parallel processors; by executing in parallel an algorithm that would otherwise run sequentially, an increase in performance can be obtained. The Hadoop framework will be combined with NVIDIA's CUDA to complete the task of data analysis: Hadoop will call CUDA from its mapper or reducer, moving the computationally intensive task away from Java to CUDA, where it is done in parallel. This will be done by porting two algorithms to CUDA and running them on all the compute nodes of the Hadoop cluster. The two algorithms are the sum of the first n integers and factorial n. Both will be written in Java and in CUDA in order to compare execution times, and both implementations will be executed for millions of different values of n. The CUDA implementations will compute multiple factorials or sums at the same time, while the Java implementations will compute them all sequentially. This work will show that not every algorithm is a viable option: an algorithm has to meet certain criteria for an effective CUDA port. Being time-consuming without needing large amounts of input or output data is the most important criterion, and obviously the algorithm has to allow a parallel version.
If the algorithm needs the result from the previous input value, parallelizing it will be a lot harder. This is shown by the factorial n algorithm: it is computationally intensive but produces a lot of output data. Copying data to and from the GPU is a time-consuming task; when a lot of input data is required or a lot of output data is generated, copying this data may have such an impact that there is no longer any gain in execution time from using CUDA. This work proves that certain algorithms which handle big data can be ported to CUDA with a decrease in execution time as a result, while also showing that others do not meet the necessary requirements.
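    The output-size criterion described above can be sketched in Python (an illustrative stand-in; the thesis's actual benchmarks were written in Java and CUDA). Both toy kernels are compute-bound loops, but the sum returns one small number per input while factorial's result grows with n, inflating the host-to-device and device-to-host copies a CUDA port would need.

```python
# Illustrative sketch, not the thesis's code: two compute-bound kernels with
# very different output sizes per input value.

def sum_first_n(n):
    """Sum of the first n integers; one small, fixed-size output per input."""
    total = 0
    for i in range(1, n + 1):  # deliberately iterative, like the ported kernel
        total += i
    return total

def factorial(n):
    """Factorial of n; the result's bit-length grows with n, so the output
    that must be copied back from the GPU grows with every input value."""
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

print(sum_first_n(1000))             # 500500 -- fits in a single machine word
print(factorial(1000).bit_length())  # thousands of bits per result
```

    A result that fits in one word per input satisfies the "small output" criterion; factorial's ever-growing results illustrate why a computationally intensive algorithm can still gain little from CUDA once transfer time is counted.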

    A treatment recommender clinical decision support system for personalized medicine: method development and proof-of-concept for drug resistant tuberculosis

    Background Personalized medicine tailors care based on the patient’s or pathogen’s genotypic and phenotypic characteristics. An automated Clinical Decision Support System (CDSS) could help translate these genotypic and phenotypic characteristics into optimal treatment and thus facilitate the implementation of individualized treatment by less experienced physicians. Methods We developed a hybrid knowledge- and data-driven treatment recommender CDSS. Stakeholders and experts first define the knowledge base by identifying and quantifying drug and regimen features for the prototype model input. In an iterative manner, feedback from experts is harvested to generate model training datasets, machine learning methods are applied to identify complex relations and patterns in the data, and model performance is assessed by estimating the precision at 1, mean reciprocal rank and mean average precision. Once model performance no longer increases between iterations, a validation dataset is used to assess model overfitting. Results We applied the novel methodology to develop a treatment recommender CDSS for individualized treatment of drug-resistant tuberculosis as a proof of concept. Using input from stakeholders and three rounds of expert feedback on a dataset of 355 patients with 129 unique drug resistance profiles, the model had a 95% precision at 1, indicating that the highest-ranked treatment regimen was considered appropriate by the experts in 95% of cases. Use of a validation dataset, however, suggested substantial model overfitting, with a reduction in precision at 1 to 78%. Conclusion Our novel and flexible hybrid knowledge- and data-driven treatment recommender CDSS is a first step towards the automation of individualized treatment for personalized medicine. Further research should assess its value in fields other than drug-resistant tuberculosis, develop solid statistical approaches to assess model performance, and evaluate its accuracy in real-life clinical settings.
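    The three ranking metrics named in the abstract can be sketched as follows. This is an illustrative Python sketch, not the authors' code, and the regimen names are invented; for each patient, the model's ranked regimen list is scored against the set of regimens the experts deemed appropriate.

```python
# Hypothetical sketch of precision at 1, mean reciprocal rank (MRR) and
# (mean) average precision for a treatment recommender's ranked output.

def precision_at_1(ranked, relevant):
    """1.0 if the top-ranked regimen is expert-approved, else 0.0."""
    return 1.0 if ranked[0] in relevant else 0.0

def reciprocal_rank(ranked, relevant):
    """1/rank of the first expert-approved regimen (0.0 if none appears)."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranked, relevant):
    """Average of the precision values at each rank holding an approved regimen."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# One patient: invented model ranking vs. expert-approved regimen set.
ranked = ["regimen_A", "regimen_B", "regimen_C"]
relevant = {"regimen_B", "regimen_C"}
print(precision_at_1(ranked, relevant))   # 0.0 -- top regimen not approved
print(reciprocal_rank(ranked, relevant))  # 0.5 -- first hit at rank 2
print(average_precision(ranked, relevant))
```

    Averaging each metric over all patients gives the cohort-level figures; a 95% precision at 1 then means the top-ranked regimen was approved for 95% of patients.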

    Comprehensive and accurate genetic variant identification from contaminated and low-coverage Mycobacterium tuberculosis whole genome sequencing data

    Improved understanding of the genomic variants that allow Mycobacterium tuberculosis (Mtb) to acquire drug resistance or tolerance and increase its virulence is an important factor in controlling the current tuberculosis epidemic. Current approaches to Mtb sequencing, however, cannot reveal Mtb’s full genomic diversity due to the strict requirements of low contamination levels, high Mtb sequence coverage and elimination of complex regions. We have developed the XBS (compleX Bacterial Samples) bioinformatics pipeline, which implements joint calling and machine-learning-based variant filtering tools to specifically improve variant detection in the important Mtb samples that do not meet these criteria, such as those from unbiased sputum samples. Using novel simulated datasets, which permit exact accuracy verification, XBS was compared to the UVP and MTBseq pipelines. Accuracy statistics showed that all three pipelines performed equally well for sequence data that resemble those obtained from culture isolates with high depth of coverage and low-level contamination. In the complex genomic regions, however, XBS accurately identified 9.0% more SNPs and 8.1% more single nucleotide insertions and deletions than the WHO-endorsed unified analysis variant pipeline. XBS also had superior accuracy for sequence data that resemble those obtained directly from sputum samples, where depth of coverage is typically very low and contamination levels are high. XBS was the only pipeline not affected by low depth of coverage (5–10×), type of contamination or excessive contamination levels (>50%). Simulation results were confirmed using whole genome sequencing (WGS) data from clinical samples, confirming the superior performance of XBS, with a higher sensitivity (98.8%) when analysing culture isolates and identification of 13.9% more variable sites in WGS data from sputum samples as compared to MTBseq, without evidence for false positive variants when rRNA regions were excluded.
The XBS pipeline facilitates sequencing of less-than-perfect Mtb samples. These advances will benefit future clinical applications of Mtb sequencing, especially WGS directly from clinical specimens, thereby avoiding in vitro biases and making many more samples available for drug resistance and other genomic analyses. The additional genetic resolution and increased sample success rate will improve genome-wide association studies and sequence-based transmission studies.
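    As a rough illustration of how accuracy statistics like those quoted above are derived from simulated data, where the ground-truth variants are known exactly, one can compare a pipeline's called variant set against the truth set. The genome positions below are invented for illustration, not taken from the paper.

```python
# Hedged sketch: sensitivity and precision of a variant caller against a
# simulated truth set. Variants are (chromosome, position, ref, alt) tuples;
# the positions are invented.

truth = {("NC_000962.3", 7570, "C", "T"),
         ("NC_000962.3", 761155, "C", "G"),
         ("NC_000962.3", 2155168, "A", "G")}
called = {("NC_000962.3", 7570, "C", "T"),
          ("NC_000962.3", 2155168, "A", "G"),
          ("NC_000962.3", 4247431, "G", "A")}  # one false positive, one miss

tp = len(truth & called)   # true variants the pipeline recovered
fp = len(called - truth)   # calls with no counterpart in the truth set
fn = len(truth - called)   # true variants the pipeline missed

sensitivity = tp / (tp + fn)  # fraction of true variants recovered
precision = tp / (tp + fp)    # fraction of calls that are genuine
print(f"sensitivity={sensitivity:.2f} precision={precision:.2f}")
```

    Simulation makes the truth set exact, which is precisely why the abstract stresses that simulated datasets "permit exact accuracy verification"; with real clinical samples the truth set can only be approximated.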

    TBProfiler for automated calling of the association with drug resistance of variants in Mycobacterium tuberculosis. S2 File

    Data associated with the paper, "TBProfiler for automated calling of the association with drug resistance of variants in Mycobacterium tuberculosis"

    TBProfiler for automated calling of the association with drug resistance of variants in Mycobacterium tuberculosis.

    Following a huge global effort, the first World Health Organization (WHO)-endorsed catalogue of 17,356 variants in the Mycobacterium tuberculosis complex, along with their classification as associated with resistance (interim), not associated with resistance (interim) or of uncertain significance, was made public in June 2021. This marks a critical step towards the application of next generation sequencing (NGS) data for clinical care. Unfortunately, the variant format used makes it difficult to look up variants when NGS data are generated by other bioinformatics pipelines. Furthermore, the large number of variants of uncertain significance in the catalogue hampers its usability in clinical practice. We successfully converted 98.3% of variants from the WHO catalogue format to the standardized HGVS format. We also created TBProfiler version 4.4.0 to automate the calling of all variants located in the tier 1 and 2 candidate resistance genes, along with their classification when listed in the WHO catalogue. Using a representative sample of 339 clinical isolates from South Africa containing 691 variants in a tier 1 or 2 gene, TBProfiler classified 105 (15%) variants as conferring resistance, 72 (10%) as not conferring resistance and 514 (74%) as unclassified, with an average of 29 unclassified variants per isolate. Using a second cohort of 56 clinical isolates from a TB outbreak in Spain containing 21 variants in the tier 1 and 2 genes, TBProfiler classified 13 (61.9%) as unclassified, 7 (33.3%) as not conferring resistance, and a single variant (4.8%) as conferring resistance. Continued global efforts using standardized methods for genotyping, phenotyping and bioinformatic analyses will be essential to ensure that knowledge on genomic variants translates into improved patient care.
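    The catalogue-lookup step described above can be sketched as a dictionary lookup: every variant found in a tier 1 or 2 gene is reported with its WHO classification when listed, and as unclassified otherwise. This is a toy illustration, not TBProfiler's implementation, and the miniature catalogue and isolate below are invented.

```python
# Toy sketch of catalogue-based classification. The real WHO catalogue has
# 17,356 entries; this three-entry stand-in is invented for illustration.

WHO_CATALOGUE = {
    "rpoB_p.Ser450Leu": "conferring resistance",
    "katG_p.Ser315Thr": "conferring resistance",
    "rpoB_p.Ala286Val": "not conferring resistance",
}

def classify(variants):
    """Map each variant to its catalogue call, defaulting to 'unclassified'."""
    return {v: WHO_CATALOGUE.get(v, "unclassified") for v in variants}

# Variants detected in one hypothetical isolate's tier 1/2 genes:
isolate = ["rpoB_p.Ser450Leu", "rpoB_p.Ala286Val", "pncA_p.His57Asp"]
for variant, call in classify(isolate).items():
    print(variant, "->", call)
```

    The abstract's figures follow directly from this kind of tally: of 691 variants seen in the South African cohort, 105 resolved to "conferring resistance", 72 to "not conferring resistance", and the remaining 514 fell through to "unclassified".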

    TBProfiler for automated calling of the association with drug resistance of variants in Mycobacterium tuberculosis. S1 File

    Data associated with the paper, "TBProfiler for automated calling of the association with drug resistance of variants in Mycobacterium tuberculosis"

    The MAGMA pipeline for comprehensive genomic analyses of clinical Mycobacterium tuberculosis samples

    Abstract: Background Whole genome sequencing (WGS) holds great potential for the management and control of tuberculosis. Accurate analysis of samples with low mycobacterial burden, which are characterized by low Mtb coverage and high (>40%) levels of contamination, is challenging. We created the MAGMA (Maximum Accessible Genome for Mtb Analysis) bioinformatics pipeline for the analysis of clinical Mtb samples. Methods and results High-accuracy variant calling is achieved by using a long seed length during read mapping to filter out contaminants, variant quality score recalibration with machine learning to identify genuine genomic variants, and joint variant calling for low-coverage Mtb genomes. MAGMA automatically generates a standardized and comprehensive output of drug resistance information and resistance classification based on the WHO catalogue of Mtb mutations, as well as phylogenetic trees with drug resistance annotations and trees that visualize the presence of clusters. Drug resistance and phylogeny outputs from sequencing data of 79 primary liquid cultures were compared between the MAGMA and MTBseq pipelines. The MTBseq pipeline reported only a proportion of the variants in candidate drug resistance genes that were reported by MAGMA. Notable differences were in structural variants, variants in the highly conserved rrs and rrl genes, and variants in candidate resistance genes for bedaquiline, clofazimine, and delamanid. Phylogeny results were similar between pipelines, but only MAGMA visualized clusters. Conclusion The MAGMA pipeline could facilitate the integration of WGS into clinical care as it generates clinically relevant data on drug resistance and phylogeny in an automated, standardized, and reproducible manner.
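    The cluster detection that such phylogeny outputs visualize can be sketched as single-linkage grouping of isolates by pairwise SNP distance. This is a hedged illustration, not MAGMA's implementation: the 12-SNP threshold is a commonly used cut-off for putative transmission clusters, and the distance matrix below is invented.

```python
# Sketch of transmission-cluster detection: isolates whose pairwise SNP
# distance is at or below a threshold are merged (single linkage, union-find).
# Threshold and distances are assumptions for illustration.

THRESHOLD = 12  # SNPs; a commonly used cut-off, not taken from the paper

dist = {  # pairwise SNP distances between hypothetical isolates
    ("iso1", "iso2"): 3,
    ("iso1", "iso3"): 40,
    ("iso2", "iso3"): 41,
    ("iso3", "iso4"): 5,
}

def clusters(isolates, dist, threshold):
    """Return groups of >=2 isolates linked by distances <= threshold."""
    parent = {i: i for i in isolates}

    def find(x):  # find the group representative, with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in dist.items():
        if d <= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for i in isolates:
        groups.setdefault(find(i), []).append(i)
    return [sorted(g) for g in groups.values() if len(g) > 1]

print(clusters(["iso1", "iso2", "iso3", "iso4"], dist, THRESHOLD))
# [['iso1', 'iso2'], ['iso3', 'iso4']]
```

    Each printed group corresponds to one cluster a tree visualization would highlight; singletons (isolates linked to nothing within the threshold) are omitted.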