3 research outputs found

    Evaluation and Optimization of Bioinformatic Tools for the Detection of Human Foodborne Pathogens in Complex Metagenomic Datasets

    Foodborne human pathogens pose a significant risk to human health: each year, one in six Americans becomes ill from one of more than 31 known human foodborne pathogens. Because of differences in their growth requirements, current detection assays can detect only one to a few of these pathogens per assay. Metagenomics, an emerging field, allows an entire community of organisms to be analyzed from DNA or RNA sequence data generated from a single sample, and therefore has the potential to detect any and all foodborne pathogens present in a single complex matrix. However, currently available bioinformatic pipelines for metagenomic sequence analysis require extensive time and substantial computing power, often with unreliable results. The objectives of this study were 1) to evaluate community-profiling bioinformatic pipelines, mapping pipelines, and a novel pipeline created at Oklahoma State University, E-probe Diagnostic Nucleic-acid Analysis (EDNA), for the detection of S. enterica (as a model foodborne pathogen) in metagenomic data; 2) to optimize the EDNA pipeline for sensitive detection of S. enterica in metagenomic data; and 3) to simultaneously detect multiple foodborne pathogens from a single metagenomic sample. EDNA detected S. enterica in metagenomic data in approximately five minutes, compared with the 2 to 500 hours required by the other pipelines. The optimized parameters for the EDNA pipeline were limited to using cleaned Illumina data with a read depth of one. The minimum BLAST E-value was set to 10^-3 for curation. For detection, the minimum percent identity was set to 95% and the minimum query coverage to 90%, with an E-probe length of 80 nt. These new parameters significantly improved the sensitivity of the assay 100-fold, from the 10^3 S. enterica cells detected by the original EDNA pipeline to just 10 cells.
In the simultaneous detection of multiple foodborne pathogens, EDNA detected three additional pathogens, Listeria monocytogenes, Campylobacter jejuni, and Shiga toxin-producing Escherichia coli, at ten contamination levels in less than ten minutes, and provided new insights into how read abundance corresponds to pathogen cell numbers.
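The optimized detection thresholds above can be expressed as a simple filter over BLAST hits. This is a minimal sketch: the helper name, the dictionary layout, and the probe-length-based coverage calculation are assumptions for illustration, not the EDNA implementation.

```python
# Filtering BLAST tabular hits with the detection thresholds reported above.
# The helper name, dict layout, and probe-length-based query coverage are
# illustrative assumptions, not the EDNA code.

MIN_PIDENT = 95.0    # minimum percent identity (%)
MIN_QCOV = 90.0      # minimum query coverage (%)
MAX_EVALUE = 1e-3    # curation E-value cutoff
PROBE_LEN = 80       # e-probe length (nt)

def passes_detection(hit, probe_len=PROBE_LEN):
    # hit: one BLAST outfmt-6 row with percent identity, alignment length,
    # and E-value; query coverage is computed against the e-probe length
    qcov = 100.0 * hit["length"] / probe_len
    return (hit["evalue"] <= MAX_EVALUE
            and hit["pident"] >= MIN_PIDENT
            and qcov >= MIN_QCOV)

hits = [
    {"pident": 98.8, "length": 76, "evalue": 1e-20},  # passes all thresholds
    {"pident": 91.0, "length": 80, "evalue": 1e-5},   # identity below 95%
]
print([passes_detection(h) for h in hits])  # [True, False]
```

Applying all three cutoffs jointly is what makes the filter conservative: a hit must be long enough, similar enough, and statistically significant before it counts toward detection.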

    Statistical modeling and inference for complex-structured count data with applications in genomics and social science

    2020 Spring. Includes bibliographical references. This dissertation describes models, estimation methods, and testing procedures for count data that build upon classic generalized linear models, including Gaussian, Poisson, and negative binomial regression. The methodological extensions proposed in this dissertation are motivated by complex structures for count data arising in three important classes of scientific problems, from both genomics and sociological contexts. Complexities include large scale, temporal dependence, zero-inflation and other mixture features, and group structure. The first class of problems involves count data collected from longitudinal RNA sequencing (RNA-seq) experiments, where the data consist of tens of thousands of short time series of counts, with replicate time series under treatment and under control. In order to determine whether the time course differs between treatment and control, we consider two questions: 1) whether the treatment affects the geometric attributes of the temporal profiles, and 2) whether any treatment effect varies over time. To answer the first question, we determine whether there has been a fundamental change in shape by modeling the transformed count data for genes at each time point using a Gaussian distribution, with the mean temporal profile generated by spline models, and introduce a measurement that quantifies the average minimum squared distance between the locations of peaks (or valleys) of each gene's temporal profile across experimental conditions. We then develop a testing framework based on a permutation procedure. Via simulation studies, we show that the proposed test achieves good power while controlling the false discovery rate. We also apply the test to data collected from a light physiology experiment on maize.
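The peak-location permutation test described above can be sketched as follows. This is a toy illustration: the statistic (mean squared distance between peak indices of paired profiles), the label-permutation scheme, and the toy data are assumptions, not the dissertation's spline-based implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def peak_time(profile):
    # Index of the maximum of a gene's temporal profile
    return int(np.argmax(profile))

def peak_shift_stat(treat, control):
    # Average squared distance between peak locations across paired replicates
    return float(np.mean([(peak_time(t) - peak_time(c)) ** 2
                          for t, c in zip(treat, control)]))

def permutation_pvalue(treat, control, n_perm=499):
    # Shuffle condition labels over the pooled replicates to build a null
    observed = peak_shift_stat(treat, control)
    pooled = list(treat) + list(control)
    n = len(treat)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        perm = [pooled[i] for i in idx]
        if peak_shift_stat(perm[:n], perm[n:]) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)  # add-one correction

# Toy example: control profiles peak mid-course, treatment profiles peak later
x = np.linspace(0, np.pi, 10)
control = [np.sin(x) + rng.normal(0, 0.01, 10) for _ in range(3)]
treat = [np.sin(x - 0.8) + rng.normal(0, 0.01, 10) for _ in range(3)]
print(permutation_pvalue(treat, control))
```

Because the null distribution is built by relabeling the observed replicates, the test makes no parametric assumption about how peak locations are distributed.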
To answer the second question, we model the time series of counts for each gene with a Gaussian-Negative Binomial model and introduce a new testing procedure that enjoys the optimality property of maximum average power. The test allows not only identification of traditional differentially expressed genes but also testing of a variety of composite hypotheses of biological interest. We establish the identifiability of the proposed model, implement the proposed method via efficient algorithms, and demonstrate its good performance via simulation studies. The procedure reveals interesting biological insights when applied to data from an experiment that examines the effect of varying light environments on the fundamental physiology of a marine diatom. The second class of problems involves analyzing group-structured sRNA data that consist of independent replicates of counts for each sRNA across experimental conditions. Most existing methods, for both normalization and differential expression, are designed for non-group-structured data. These methods may fail to provide correct normalization factors or to control the FDR; they may lack power and may be unable to make inference on group effects. To address these challenges simultaneously, we introduce an inferential procedure using a group-based negative binomial model and a bootstrap testing method. This procedure not only provides a group-based normalization factor but also enables group-based differential expression analysis. Our method shows good performance in both simulation studies and the analysis of experimental data on roundworm. The last class of problems is motivated by the study of sensitive behaviors. These problems involve mixture-distributed count data collected by a quantitative randomized response technique (QRRT), which guarantees respondent anonymity. We propose a Poisson regression method based on maximum likelihood estimation computed via the EM algorithm.
This method allows assessment of the importance of potential drivers of different quantities of non-compliant behavior. The method is illustrated with a case study examining potential drivers of non-compliance with hunting regulations in Sierra Leone.
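The EM idea behind QRRT count models can be illustrated with a toy mixture design. Everything here is an assumption for illustration: the truth-reporting probability, the uniform noise distribution, and the intercept-only Poisson model stand in for the dissertation's full Poisson regression with covariates.

```python
import math

def em_qrrt_poisson(z, p_truth=0.7, noise_pmf=None, n_iter=200, lam0=1.0):
    # Toy EM for a QRRT design: each reported count is the respondent's true
    # Poisson(lam) count with probability p_truth, and otherwise a draw from
    # a known noise distribution (here uniform on 0..5). All design values
    # are illustrative assumptions.
    if noise_pmf is None:
        noise_pmf = lambda k: 1.0 / 6.0 if 0 <= k <= 5 else 0.0
    lam = lam0
    for _ in range(n_iter):
        # E-step: posterior probability that each response is truthful
        w = []
        for zi in z:
            pois = math.exp(-lam) * lam ** zi / math.factorial(zi)
            num = p_truth * pois
            den = num + (1.0 - p_truth) * noise_pmf(zi)
            w.append(num / den)
        # M-step: weighted Poisson MLE (intercept-only model)
        lam = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
    return lam

reports = [0, 1, 2, 3, 2, 1, 4, 0, 2, 3]
print(round(em_qrrt_poisson(reports), 3))
```

The E-step down-weights responses that look more like noise than like Poisson counts, so the M-step estimates the behavior rate from the responses most likely to be truthful, which is what allows inference despite the anonymity-preserving scrambling.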

    Development of innovative bioinformatics approaches for the integration of longitudinal multi-omics data

    New high-throughput "omics" technologies, including genomics, epigenomics, transcriptomics, proteomics, metabolomics, and metagenomics, have expanded considerably in recent years. Independently, each omics technology is an essential source of knowledge for the study of the human genome, epigenome, transcriptome, proteome, metabolome, and microbiota, making it possible to identify biomarkers of disease, determine therapeutic targets, establish preventive diagnoses, and deepen our knowledge of living organisms. Falling costs and the ease of multi-omics data acquisition have enabled new time-series experimental designs in which the same biological sample is sequenced, measured, and quantified at several time points. By combining omics technologies with time series, it is possible to capture the expression changes that take place in a dynamic system for each molecule and to obtain a comprehensive view of multi-omics interactions that is inaccessible to a standard single-omics approach. However, handling this volume of multi-omics data raises new challenges: continuous technological evolution, the large volume of data produced, its heterogeneity, the variety of omics data types, and the interpretability of integration results require new analysis methods and innovative tools capable of identifying the useful elements within this multitude of information. With this in mind, we propose several tools and methods to address the challenges of integrating and interpreting these particular multi-omics data. 
Finally, the integration of longitudinal multi-omics data offers prospects in fields such as precision medicine, as well as in environmental and industrial applications. The democratisation of multi-omics analyses and the development of innovative integration and interpretation methods will certainly lead to a deeper understanding of biological ecosystems.