Implementation, adaptation and evaluation of statistical analysis techniques for next generation sequencing data

Abstract

Deep sequencing is a new high‐throughput sequencing technology intended to lower the cost of DNA sequencing beyond what was previously thought possible using standard methods. Analysis of gene expression data, such as SAGE (serial analysis of gene expression) and microarray data, has been a popular area of research in recent years. The rapid development of these technologies and the variety of data they produce have stressed the need for efficient analysis techniques. Various methods for the analysis of such data have been developed in recent years, both for SAGE data, which is discrete, and for microarray data, which is continuous. These include simple analysis techniques, hierarchical clustering techniques (both Bayesian and frequentist) and various methods for finding differential expression between groups of samples. These methods range from simple comparison techniques to more complicated computational methods, which attempt to isolate the more subtle dissimilarities in the data. In this thesis, several of these techniques are applied to unpublished deep sequencing data. The analysis was approached in three parts. The first examined clustering techniques previously developed for SAGE data, namely the Poisson C / Poisson L algorithm and a Bayesian hierarchical clustering algorithm, and evaluated and adapted them for use on the deep sequencing data. The second examined methods for finding differentially expressed tags in the dataset. These tags are of interest because finding tags that are significantly up‐ or down‐regulated across groups of samples could potentially be useful in the treatment of certain diseases. Finally, owing to the lack of published data, a simulation study was constructed, using various models to simulate the data, in order to assess the techniques mentioned above on data with pre‐defined sample groupings and differentially expressed tags.
The main goals of the simulation study were to validate the analysis techniques discussed above and to estimate false positive rates for this type of large, sparse dataset. The Bayesian algorithm yielded surprising results: it produced no hierarchy, suggesting no evidence of clustering. However, promising results were obtained for the adapted Poisson C / Poisson L algorithm when applied with various models to fit the data and various measures of similarity. Further investigation is needed to confirm whether it is suitable for clustering deep sequencing data in general, especially where there are three or more groups of interest. The results of the differential expression analysis suggest that the overdispersed log linear method is the most reliable, particularly when compared with simple tests such as the two‐sample t‐test and the Wilcoxon signed rank test. This conclusion is based on its overlap with the other methods and on the more reasonable number of differentially expressed tags it detected, in contrast to the number detected using the adapted log ratio method. However, none of this can be confirmed, as no information was known about the tags in either dataset. The success of the Poisson C / Poisson L algorithm on both the Poisson and truncated Poisson simulated datasets suggests that the method of simulation is acceptable for assessing clustering algorithms developed for use on sequencing data. However, evaluation of the differential expression analysis performed on the simulated data indicates that further work is needed on the method of simulation to increase its reliability. The algorithms presented can be adapted for use on any form of discrete data. From the work done here, there is evidence that the adapted Poisson C / Poisson L algorithm is a promising technique for the analysis of deep sequencing data.
