25 research outputs found

    Parallelization of Mapping Algorithms for Next Generation Sequencing Applications

    With the advent of next-generation high-throughput sequencing instruments, large volumes of short sequence data are generated at an unprecedented rate. Processing and analyzing these massive data sets requires overcoming several challenges. A particular challenge addressed in this abstract is the mapping of short sequences (reads) to a reference genome while allowing mismatches. This is a highly time-consuming combinatorial problem in many applications, including whole-genome resequencing, targeted sequencing, transcriptome/small RNA, DNA methylation and ChIP sequencing, and it takes time on the order of days using existing sequential techniques on large-scale datasets. In this work, we introduce six parallelization methods, each having different scalability characteristics, to speed up short sequence mapping. We also address an associated load balancing problem that involves grouping nodes of a tree from different levels. This problem arises due to a trade-off between computational cost and granularity while partitioning the workload. We compare the proposed parallelization methods and give theoretical cost models for each of them. Experimental results on real datasets demonstrate the effectiveness of the methods and indicate that they reduce the execution time from the order of days to a few hours for large datasets. To the best of our knowledge, this is the first study on parallelization of the short sequence mapping problem.

    The Reproducibility of Lists of Differentially Expressed Genes in Microarray Studies

    Reproducibility is a fundamental requirement in scientific experiments and clinical contexts. Recent publications raise concerns about the reliability of microarray technology because of the apparent lack of agreement between lists of differentially expressed genes (DEGs). In this study we demonstrate that (1) such discordance may stem from ranking and selecting DEGs solely by statistical significance (P) derived from widely used simple t-tests; (2) when fold change (FC) is used as the ranking criterion, the lists become much more reproducible, especially when fewer genes are selected; and (3) the instability of short DEG lists based on P cutoffs is an expected mathematical consequence of the high variability of the t-values. We recommend the use of FC ranking plus a non-stringent P cutoff as a baseline practice in order to generate more reproducible DEG lists. The FC criterion enhances reproducibility, while the P criterion balances sensitivity and specificity.

    Effect of various normalization methods on Applied Biosystems expression array system data

    BACKGROUND: DNA microarray technology provides a powerful tool for characterizing gene expression on a genome scale. While the technology has been widely used in discovery-based medical and basic biological research, its direct application in clinical practice and regulatory decision-making has been questioned. A few key issues, including the reproducibility, reliability, compatibility and standardization of microarray analysis and results, must be critically addressed before any routine usage of microarrays in clinical laboratory and regulated areas can occur. In this study we investigate some of these issues for the Applied Biosystems Human Genome Survey Microarrays. RESULTS: We analyzed the gene expression profiles of two samples: brain and universal human reference (UHR), a mixture of RNAs from 10 cancer cell lines, using the Applied Biosystems Human Genome Survey Microarrays. Five technical replicates in three different sites were performed on the same total RNA samples according to the manufacturer's standard protocols. Five different methods (quantile, median, scale, VSN and cyclic loess) were used to normalize AB microarray data within each site. 1,000 genes spanning a wide dynamic range in gene expression levels were selected for real-time PCR validation. Using the TaqMan® assays data set as the reference set, the performance of the five normalization methods was evaluated focusing on the following criteria: (1) Sensitivity and reproducibility in detection of expression; (2) Fold change correlation with real-time PCR data; (3) Sensitivity and specificity in detection of differential expression; (4) Reproducibility of differentially expressed gene lists. CONCLUSION: Our results showed a high level of concordance between these normalization methods. This is true regardless of whether signal, detection, variation, fold change measurements or reproducibility were interrogated. Furthermore, we used TaqMan® assays as a reference to generate TPR and FDR plots for the various normalization methods across the assay range. Little impact is observed on the TP and FP rates in detection of differentially expressed genes. Additionally, the various normalization methods had little effect on the statistical approaches analyzed, which indicates a certain robustness of the analysis methods currently in use in the field, particularly when used in conjunction with the Applied Biosystems Gene Expression System.

    The balance of reproducibility, sensitivity, and specificity of lists of differentially expressed genes in microarray studies

    BACKGROUND: Reproducibility is a fundamental requirement in scientific experiments. Some recent publications have claimed that microarrays are unreliable because lists of differentially expressed genes (DEGs) are not reproducible in similar experiments. Meanwhile, new statistical methods for identifying DEGs continue to appear in the scientific literature. The resultant variety of existing and emerging methods exacerbates confusion and continuing debate in the microarray community on the appropriate choice of methods for identifying reliable DEG lists. RESULTS: Using the data sets generated by the MicroArray Quality Control (MAQC) project, we investigated the impact on the reproducibility of DEG lists of a few widely used gene selection procedures. We present comprehensive results from inter-site comparisons using the same microarray platform, cross-platform comparisons using multiple microarray platforms, and comparisons between microarray results and those from TaqMan, the widely regarded "standard" gene expression platform. Our results demonstrate that (1) previously reported discordance between DEG lists could simply result from ranking and selecting DEGs solely by statistical significance (P) derived from widely used simple t-tests; (2) when fold change (FC) is used as the ranking criterion with a non-stringent P-value cutoff filtering, the DEG lists become much more reproducible, especially when fewer genes are selected as differentially expressed, as is the case in most microarray studies; and (3) the instability of short DEG lists solely based on P-value ranking is an expected mathematical consequence of the high variability of the t-values; the more stringent the P-value threshold, the less reproducible the DEG list is. These observations are also consistent with results from extensive simulation calculations. CONCLUSION: We recommend the use of FC-ranking plus a non-stringent P cutoff as a straightforward and baseline practice in order to generate more reproducible DEG lists. Specifically, the P-value cutoff should not be stringent (too small) and FC should be as large as possible. Our results provide practical guidance to choose the appropriate FC and P-value cutoffs when selecting a given number of DEGs. The FC criterion enhances reproducibility, whereas the P criterion balances sensitivity and specificity.
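
The selection rule recommended in this abstract (filter by a non-stringent P cutoff, then rank by fold change) can be illustrated with a minimal Python sketch. This is not the authors' code: the `GeneStat` record, the `select_degs` function, the example values, and the 0.05 default cutoff are all hypothetical, since the abstract does not state a specific threshold. P-values are assumed to be precomputed.

```python
from typing import List, NamedTuple


class GeneStat(NamedTuple):
    """Per-gene summary (hypothetical record, not from the paper)."""
    gene: str
    log2_fc: float   # log2 fold change between the two conditions
    p_value: float   # precomputed P-value, e.g. from a simple t-test


def select_degs(stats: List[GeneStat], n_genes: int,
                p_cutoff: float = 0.05) -> List[str]:
    """Keep genes passing a non-stringent P cutoff, then rank by |log2 FC|.

    The 0.05 default is an assumption for illustration only.
    """
    passing = [s for s in stats if s.p_value < p_cutoff]
    ranked = sorted(passing, key=lambda s: abs(s.log2_fc), reverse=True)
    return [s.gene for s in ranked[:n_genes]]


# Made-up example values, not data from the study.
example = [
    GeneStat("geneA", 3.1, 0.020),
    GeneStat("geneB", -2.4, 0.001),
    GeneStat("geneC", 0.3, 0.00001),  # very small P, but tiny fold change
    GeneStat("geneD", 4.0, 0.300),    # large fold change, fails the P filter
    GeneStat("geneE", 1.2, 0.040),
]

selected = select_degs(example, n_genes=2)
print(selected)
```

In this toy input, ranking by P alone would put geneC first despite its small fold change, whereas the FC-ranked list with a P filter selects geneA and geneB.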

    RNA-Seq Mapping and Detection of Gene Fusions with a Suffix Array Algorithm

    High-throughput RNA sequencing enables quantification of transcripts (both known and novel), exon/exon junctions and fusions of exons from different genes. Discovery of gene fusions, particularly those expressed with low abundance, is a challenge with short- and medium-length sequencing reads. To address this challenge, we implemented an RNA-Seq mapping pipeline within the LifeScope software. We introduced new features including filter and junction mapping, annotation-aided pairing rescue and accurate mapping quality values. We combined this pipeline with a Suffix Array Spliced Read (SASR) aligner to detect chimeric transcripts. Performing paired-end RNA-Seq of the breast cancer cell line MCF-7 using the SOLiD system, we called 40 gene fusions among over 120,000 splicing junctions. We validated 36 of these 40 fusions with TaqMan assays, of which 25 were expressed in MCF-7 but not the Human Brain Reference. An intra-chromosomal gene fusion involving the estrogen receptor alpha gene ESR1, and another involving RPS6KB1 (ribosomal protein S6 kinase beta-1), were recurrently expressed in a number of breast tumor cell lines and a clinical tumor sample.

    Gene Expression Signature in Peripheral Blood Detects Thoracic Aortic Aneurysm

    BACKGROUND: Thoracic aortic aneurysm (TAA) is usually asymptomatic and associated with high mortality. Adverse clinical outcome of TAA is preventable by elective surgical repair; however, identifying at-risk individuals is difficult. We hypothesized that gene expression patterns in peripheral blood cells may correlate with TAA disease status. Our goal was to identify a distinct gene expression signature in peripheral blood that may identify individuals at risk for TAA. METHODS AND FINDINGS: Whole genome gene expression profiles from 94 peripheral blood samples (collected from 58 individuals with TAA and 36 controls) were analyzed. Significance Analysis of Microarray (SAM) identified potential signature genes characterizing TAA vs. normal, ascending vs. descending TAA, and sporadic vs. familial TAA. Using a training set containing 36 TAA patients and 25 controls, a 41-gene classification model was constructed for detecting TAA status and an overall accuracy of 78 ± 6% was achieved. Testing this classifier on an independent validation set containing 22 TAA samples and 11 controls yielded an overall classification accuracy of 78%. These 41 classifier genes were further validated by TaqMan real-time PCR assays. Classification based on the TaqMan data replicated the microarray results and achieved 80% classification accuracy on the testing set. CONCLUSIONS: This study identified informative gene expression signatures in peripheral blood cells that can characterize TAA status and subtypes of TAA. Moreover, a 41-gene classifier based on expression signature can identify TAA patients with high accuracy. The transcriptional programs in peripheral blood leading to the identification of these markers also provide insights into the mechanism of development of aortic aneurysms and highlight potential targets for therapeutic intervention. The classifier genes identified in this study, and validated by TaqMan real-time PCR, define a set of promising potential diagnostic markers, setting the stage for a blood-based gene expression test to facilitate early detection of TAA.

    Hierarchical Cluster Analysis of Myoepithelial/basal Cell Markers in Adenoid Cystic Carcinoma and Polymorphous Low-Grade Adenocarcinoma

    Distinguishing adenoid cystic carcinoma from polymorphous low-grade adenocarcinoma of the salivary glands is important for their management. We studied the expression of several myoepithelial and basal/stem cell markers (smooth muscle actin, calponin, smooth muscle myosin heavy chain, metallothionein, maspin, and p63) by immunohistochemistry in 23 adenoid cystic carcinoma and 24 polymorphous low-grade adenocarcinoma, to identify the most useful marker or combination of markers that may help their diagnoses. The results were analyzed using hierarchical cluster analysis and χ² test for trend. We noted diffuse expression of smooth muscle actin in 20 adenoid cystic carcinoma vs one polymorphous low-grade adenocarcinoma (P[...]) [...] vs one polymorphous low-grade adenocarcinoma (P[...]) [...] vs one polymorphous low-grade adenocarcinoma (P=0.001), metallothionein in 22 adenoid cystic carcinoma vs eight polymorphous low-grade adenocarcinoma (P[...]) [...] vs 14 polymorphous low-grade adenocarcinoma, and p63 in 21 adenoid cystic carcinoma vs 14 polymorphous low-grade adenocarcinoma. Hierarchical clustering of smooth muscle actin, calponin, smooth muscle myosin heavy chain, and metallothionein was virtually identical (κ≤0.0035), suggesting no significant advantage to using them in combination over using them individually. Diffuse smooth muscle actin expression showed the highest accuracy (91.5%) and positive predictive value (95.2%) for adenoid cystic carcinoma. Thus, diffuse expression of smooth muscle actin, calponin, smooth muscle myosin heavy chain, and metallothionein was highly predictive of adenoid cystic carcinoma, whereas maspin and p63 were frequently expressed in both tumors. In differentiating adenoid cystic carcinoma from polymorphous low-grade adenocarcinoma, smooth muscle actin as a single ancillary test in support of the histological findings appears to be as efficient as multiple immunohistochemical tests.
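
The accuracy and positive predictive value quoted for smooth muscle actin can be checked from the counts given in this abstract. The short Python sketch below assumes that diffuse smooth muscle actin expression is treated as a positive test for adenoid cystic carcinoma, so that 20 of 23 adenoid cystic carcinomas are true positives and 1 of 24 polymorphous low-grade adenocarcinomas is a false positive. The function name is hypothetical.

```python
def accuracy_and_ppv(tp: int, fp: int, fn: int, tn: int) -> tuple:
    """Return (accuracy, positive predictive value) from a 2x2 table."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    ppv = tp / (tp + fp)
    return accuracy, ppv


# Counts derived from the abstract under the stated assumption:
# 20 of 23 adenoid cystic carcinomas positive -> tp=20, fn=3
# 1 of 24 polymorphous low-grade adenocarcinomas positive -> fp=1, tn=23
acc, ppv = accuracy_and_ppv(tp=20, fp=1, fn=3, tn=23)
print(f"accuracy = {acc:.1%}, PPV = {ppv:.1%}")
```

Under that assumption the result is 43/47 for accuracy and 20/21 for positive predictive value, which match the 91.5% and 95.2% reported in the abstract.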