36 research outputs found
Parallelization of Mapping Algorithms for Next Generation Sequencing Applications
With the advent of next-generation high-throughput sequencing
instruments, large volumes of short sequence data are generated at an
unprecedented rate. Processing and analyzing these massive data
requires overcoming several challenges. A particular challenge
addressed in this abstract is the mapping of short sequences (reads)
to a reference genome while allowing mismatches. This is a highly
time-consuming combinatorial problem in many applications, including
whole-genome resequencing, targeted sequencing, transcriptome/small
RNA, DNA methylation, and ChIP sequencing, and takes time on the order
of days using existing sequential techniques on large-scale
datasets. In this work, we introduce six parallelization methods, each
having different scalability characteristics, to speed up short sequence
mapping. We also address an associated load balancing problem that
involves grouping nodes of a tree from different levels. This problem
arises due to a trade-off between computational cost and granularity
while partitioning the workload. We comparatively present the
proposed parallelization methods and give theoretical cost models for
each of them. Experimental results on real datasets demonstrate the
effectiveness of the methods and indicate that they are successful at
reducing the execution time from the order of days to just a few
hours for large datasets.
To the best of our knowledge, this is the first study on the
parallelization of the short sequence mapping problem.
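The abstract does not name its six parallelization methods, so the following is only a minimal sketch of the underlying problem: mapping reads to a reference while tolerating a bounded number of mismatches, with the reads split across workers. The brute-force scan, the round-robin read partitioning, and all function names are assumptions made for illustration and are not taken from the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def hamming_within(a, b, limit):
    """Return the mismatch count of two equal-length strings, or None if it exceeds limit."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > limit:
                return None
    return mismatches


def map_read(read, reference, max_mismatches):
    """Brute-force scan: list (position, mismatches) for every acceptable alignment."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        distance = hamming_within(read, window, max_mismatches)
        if distance is not None:
            hits.append((pos, distance))
    return hits


def map_reads_partitioned(reads, reference, max_mismatches=1, workers=2):
    """Split the reads round-robin into `workers` chunks and map each chunk independently.

    Hypothetical helper: identical reads share one entry in the result dict.
    """
    chunks = [reads[i::workers] for i in range(workers)]

    def map_chunk(chunk):
        return {read: map_read(read, reference, max_mismatches) for read in chunk}

    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(map_chunk, chunks):
            results.update(partial)
    return results
```

Threads are used here only to keep the sketch short; in CPython they will not speed up this CPU-bound loop, and a real implementation would more likely use processes or distributed nodes.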
The Reproducibility of Lists of Differentially Expressed Genes in Microarray Studies
Reproducibility is a fundamental requirement in scientific experiments and clinical contexts. Recent publications raise concerns about the reliability of microarray technology because of the apparent lack of agreement between lists of differentially expressed genes (DEGs). In this study we demonstrate that (1) such discordance may stem from ranking and selecting DEGs solely by statistical significance (P) derived from widely used simple t-tests; (2) when fold change (FC) is used as the ranking criterion, the lists become much more reproducible, especially when fewer genes are selected; and (3) the instability of short DEG lists based on P cutoffs is an expected mathematical consequence of the high variability of the t-values. We recommend the use of FC ranking plus a non-stringent P cutoff as a baseline practice in order to generate more reproducible DEG lists. The FC criterion enhances reproducibility while the P criterion balances sensitivity and specificity.
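The recommended selection rule can be sketched in a few lines. This is an illustrative sketch, not the authors' code: it assumes the log2 fold changes and P values have already been computed elsewhere, and the function names, the 0.05 default cutoff, and the tie-breaking by gene name are all assumptions.

```python
def select_degs(stats, n_top, p_cutoff=0.05):
    """Filter by a non-stringent P cutoff, then rank by absolute fold change.

    stats maps gene -> (log2_fold_change, p_value); both are assumed precomputed.
    """
    passing = [(gene, lfc) for gene, (lfc, p) in stats.items() if p < p_cutoff]
    # Largest absolute fold change first; gene name breaks ties deterministically.
    passing.sort(key=lambda item: (-abs(item[1]), item[0]))
    return [gene for gene, _ in passing[:n_top]]


def select_by_p_only(stats, n_top):
    """Baseline for comparison: rank solely by P value, smallest first."""
    ranked = sorted(stats, key=lambda gene: (stats[gene][1], gene))
    return ranked[:n_top]
```

With a toy input, the two rules can return different lists: a gene with a tiny P value but a negligible fold change ranks first under P-only ranking, yet falls to the bottom under FC ranking.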
Effect of various normalization methods on Applied Biosystems expression array system data
BACKGROUND: DNA microarray technology provides a powerful tool for characterizing gene expression on a genome scale. While the technology has been widely used in discovery-based medical and basic biological research, its direct application in clinical practice and regulatory decision-making has been questioned. A few key issues, including the reproducibility, reliability, compatibility and standardization of microarray analysis and results, must be critically addressed before any routine usage of microarrays in clinical laboratory and regulated areas can occur. In this study we investigate some of these issues for the Applied Biosystems Human Genome Survey Microarrays. RESULTS: We analyzed the gene expression profiles of two samples: brain and universal human reference (UHR), a mixture of RNAs from 10 cancer cell lines, using the Applied Biosystems Human Genome Survey Microarrays. Five technical replicates at three different sites were performed on the same total RNA samples according to the manufacturer's standard protocols. Five different methods (quantile, median, scale, VSN and cyclic loess) were used to normalize AB microarray data within each site. A total of 1,000 genes spanning a wide dynamic range in gene expression levels were selected for real-time PCR validation. Using the TaqMan® assays data set as the reference set, the performance of the five normalization methods was evaluated focusing on the following criteria: (1) Sensitivity and reproducibility in detection of expression; (2) Fold change correlation with real-time PCR data; (3) Sensitivity and specificity in detection of differential expression; (4) Reproducibility of differentially expressed gene lists. CONCLUSION: Our results showed a high level of concordance between these normalization methods. This held regardless of whether signal, detection, variation, fold change measurements or reproducibility were interrogated.
Furthermore, we used TaqMan® assays as a reference to generate TPR and FDR plots for the various normalization methods across the assay range. Little impact was observed on the TP and FP rates in detection of differentially expressed genes. Additionally, the various normalization methods had little effect on the statistical approaches analyzed, which indicates a certain robustness of the analysis methods currently in use in the field, particularly when used in conjunction with the Applied Biosystems Gene Expression System.
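Of the five normalization methods named above, quantile normalization is the easiest to show compactly. The sketch below is a textbook version under simplifying assumptions (equal-length arrays, no missing values, ties broken by position) and is not the implementation used in the study; the function name is hypothetical.

```python
def quantile_normalize(arrays):
    """Force every array to share the same intensity distribution.

    arrays: list of equal-length lists, one per microarray, indexed by gene.
    Returns new lists; the inputs are not modified.
    """
    n_genes = len(arrays[0])
    if any(len(a) != n_genes for a in arrays):
        raise ValueError("all arrays must have the same number of genes")

    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [
        sum(col[k] for col in sorted_arrays) / len(arrays) for k in range(n_genes)
    ]

    normalized = []
    for a in arrays:
        # Gene indices ordered by ascending intensity within this array.
        order = sorted(range(n_genes), key=lambda i: a[i])
        out = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            out[gene_index] = reference[rank]
        normalized.append(out)
    return normalized
```

After normalization each array contains exactly the same set of values, and only the assignment of values to genes differs between arrays.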
A tissue-specific landscape of sense/antisense transcription in the mouse intestine
Background: The intestinal mucosa is characterized by complex metabolic and immunological processes driven by highly dynamic gene expression programs. With the advent of next-generation sequencing and its use for the analysis of the RNA sequence space, the level of detail on the global architecture of the transcriptome has reached a new order of magnitude compared to microarrays. Results: We report the ultra-deep characterization of the polyadenylated transcriptome in two closely related, yet distinct regions of the mouse intestinal tract (small intestine and colon). We assessed tissue-specific transcriptomal architecture and the presence of novel transcriptionally active regions (nTARs). In the first step, signatures of 20,541 NCBI RefSeq transcripts could be identified in the intestine (74.1% of annotated genes), of which 16,742 are common to both tissues. Although the majority of reads could be linked to annotated genes, 27,543 nTARs not consistent with current gene annotations in RefSeq or ENSEMBL were identified. By use of a second independent strand-specific RNA-Seq protocol, 20,966 of these nTARs were confirmed, most of them in the vicinity of known genes. We further categorized our findings by their relative adjacency to described exonic elements and investigated regional differences of novel transcribed elements in small intestine and colon. Conclusions: The current study demonstrates the complexity of an archetypal mammalian intestinal mRNA transcriptome in high resolution and identifies novel transcriptionally active regions at strand-specific, single-base resolution. Our analysis for the first time shows a strand-specific comparative picture of nTARs in two tissues and represents a resource for further investigating the transcriptional processes that contribute to tissue identity.
The balance of reproducibility, sensitivity, and specificity of lists of differentially expressed genes in microarray studies
Background: Reproducibility is a fundamental requirement in scientific experiments. Some recent publications have claimed that microarrays are unreliable because lists of differentially expressed genes (DEGs) are not reproducible in similar experiments. Meanwhile, new statistical methods for identifying DEGs continue to appear in the scientific literature. The resultant variety of existing and emerging methods exacerbates confusion and continuing debate in the microarray community on the appropriate choice of methods for identifying reliable DEG lists. Results: Using the data sets generated by the MicroArray Quality Control (MAQC) project, we investigated the impact on the reproducibility of DEG lists of a few widely used gene selection procedures. We present comprehensive results from inter-site comparisons using the same microarray platform, cross-platform comparisons using multiple microarray platforms, and comparisons between microarray results and those from TaqMan, the widely regarded "standard" gene expression platform. Our results demonstrate that (1) previously reported discordance between DEG lists could simply result from ranking and selecting DEGs solely by statistical significance (P) derived from widely used simple t-tests; (2) when fold change (FC) is used as the ranking criterion with a non-stringent P-value cutoff filtering, the DEG lists become much more reproducible, especially when fewer genes are selected as differentially expressed, as is the case in most microarray studies; and (3) the instability of short DEG lists solely based on P-value ranking is an expected mathematical consequence of the high variability of the t-values; the more stringent the P-value threshold, the less reproducible the DEG list is.
These observations are also consistent with results from extensive simulation calculations. Conclusion: We recommend the use of FC-ranking plus a non-stringent P cutoff as a straightforward and baseline practice in order to generate more reproducible DEG lists. Specifically, the P-value cutoff should not be stringent (too small) and FC should be as large as possible. Our results provide practical guidance to choose the appropriate FC and P-value cutoffs when selecting a given number of DEGs. The FC criterion enhances reproducibility, whereas the P criterion balances sensitivity and specificity.
Evaluating methods for ranking differentially expressed genes applied to microArray quality control data
Background: Statistical methods for ranking differentially expressed genes (DEGs) from gene expression data should be evaluated with regard to high sensitivity, specificity, and reproducibility. In our previous studies, we evaluated eight gene ranking methods applied to only Affymetrix GeneChip data. A more general evaluation that also includes other microarray platforms, such as the Agilent or Illumina systems, is desirable for determining which methods are suitable for each platform and which method has better inter-platform reproducibility. Results: We compared the eight gene ranking methods using the MicroArray Quality Control (MAQC) datasets produced by five manufacturers: Affymetrix, Applied Biosystems, Agilent, GE Healthcare, and Illumina. The area under the curve (AUC) was used as a measure for both sensitivity and specificity. Although the highest AUC values can vary with the definition of "true" DEGs, the best methods were, in most cases, either the weighted average difference (WAD), rank products (RP), or intensity-based moderated t statistic (ibmT). The percentages of overlapping genes (POGs) across different test sites were mainly evaluated as a measure for both intra- and inter-platform reproducibility. The POG values for WAD were the highest overall, irrespective of the choice of microarray platform. The high intra- and inter-platform reproducibility of WAD was also observed at a higher biological function level. Conclusion: These results for the five microarray platforms were consistent with our previous ones based on 36 real experimental datasets measured using the Affymetrix platform. Thus, recommendations made using the MAQC benchmark data might be universally applicable.
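The percentage of overlapping genes (POG) used as the reproducibility measure can be computed as follows. This sketch assumes the common definition (genes shared by the top-N lists from two sites, divided by N); the abstract does not spell out the exact formula, and the function name is hypothetical.

```python
def percentage_of_overlapping_genes(ranked_a, ranked_b, top_n):
    """POG between the top_n genes of two ranked lists, as a percentage.

    ranked_a and ranked_b are gene identifiers ordered from most to least
    differentially expressed, e.g. as produced at two different test sites.
    """
    if top_n <= 0:
        raise ValueError("top_n must be positive")
    if len(ranked_a) < top_n or len(ranked_b) < top_n:
        raise ValueError("both lists must contain at least top_n genes")
    shared = set(ranked_a[:top_n]) & set(ranked_b[:top_n])
    return 100.0 * len(shared) / top_n
```

Plotting this value against a range of list sizes gives a curve that shows how reproducibility changes as more genes are selected.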
RNA-Seq Mapping and Detection of Gene Fusions with a Suffix Array Algorithm
High-throughput RNA sequencing enables quantification of transcripts (both known and novel), exon/exon junctions and fusions of exons from different genes. Discovery of gene fusions, particularly those expressed at low abundance, is a challenge with short- and medium-length sequencing reads. To address this challenge, we implemented an RNA-Seq mapping pipeline within the LifeScope software. We introduced new features including filter and junction mapping, annotation-aided pairing rescue and accurate mapping quality values. We combined this pipeline with a Suffix Array Spliced Read (SASR) aligner to detect chimeric transcripts. Performing paired-end RNA-Seq of the breast cancer cell line MCF-7 using the SOLiD system, we called 40 gene fusions among over 120,000 splicing junctions. We validated 36 of these 40 fusions with TaqMan assays, of which 25 were expressed in MCF-7 but not the Human Brain Reference. An intra-chromosomal gene fusion involving the estrogen receptor alpha gene ESR1, and another involving RPS6KB1 (ribosomal protein S6 kinase beta-1), were recurrently expressed in a number of breast tumor cell lines and a clinical tumor sample.
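The abstract does not describe the internals of the SASR aligner, so the sketch below shows only the generic data structure it is named after: a suffix array with binary search for exact occurrences of a read fragment. The naive construction and both function names are assumptions for illustration; production aligners use far more efficient construction and handle spliced and inexact matches.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return sorted start positions of pattern in text via two binary searches."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[start:lo])
```

Because all suffixes sharing a prefix are contiguous in the sorted order, each lookup needs only a logarithmic number of string comparisons once the array is built.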
Serological Profiling of a Candida albicans Protein Microarray Reveals Permanent Host-Pathogen Interplay and Stage-Specific Responses during Candidemia
Candida albicans in the immunocompetent host is a benign member of the human microbiota. However, when host physiology is disrupted, this commensal-host interaction can degenerate and lead to an opportunistic infection. Relatively little is known regarding the dynamics of C. albicans colonization and pathogenesis. We developed a C. albicans cell surface protein microarray to profile the immunoglobulin G response during commensal colonization and candidemia. The antibody response from the sera of patients with candidemia and our negative control groups indicates that the immunocompetent host exists in permanent host-pathogen interplay with commensal C. albicans. This report also identifies cell surface antigens that are specific to different phases (i.e. acute, early and mid convalescence) of candidemia. We identified a set of thirteen cell surface antigens capable of distinguishing acute candidemia from healthy individuals and uninfected hospital patients with commensal colonization. Interestingly, a large proportion of these cell surface antigens are involved in either oxidative stress or drug resistance. In addition, we identified 33 antigenic proteins that are enriched in convalescent sera of the candidemia patients. Intriguingly, we found within this subset an increase in antigens associated with heme-associated iron acquisition. These findings have important implications for the mechanisms of C. albicans colonization as well as the development of systemic infection.
Effects of the total replacement of fish-based diet with plant-based diet on the hepatic transcriptome of two European sea bass (Dicentrarchus labrax) half-sibfamilies showing different growth rates with the plant-based diet
Background: Efforts towards utilisation of diets without fish meal (FM) or fish oil (FO) in finfish aquaculture have been made for more than two decades. Metabolic responses to substitution of fishery products have been shown to impact growth performance and immune system of fish as well as their subsequent nutritional value, particularly in marine fish species, which exhibit low capacity for biosynthesis of long-chain poly-unsaturated fatty acids (LC-PUFA). The main objective of the present study was to analyse the effects of a plant-based diet on the hepatic transcriptome of European sea bass (Dicentrarchus labrax). Results: We report the first results obtained using a transcriptomic approach on the liver of two half-sibfamilies of the European sea bass that exhibit similar growth rates when fed a fish-based diet (FD), but significantly different growth rates when fed an all-plant diet (VD). Overall gene expression was analysed using oligo DNA microarrays (GPL9663). Statistical analysis identified 582 unique annotated genes differentially expressed between groups of fish fed the two diets, 199 genes regulated by genetic factors, and 72 genes that exhibited diet-family interactions. The expression of several genes involved in the LC-PUFA and cholesterol biosynthetic pathways was found to be up-regulated in fish fed VD, suggesting a stimulation of the lipogenic pathways. No significant diet-family interaction for the regulation of LC-PUFA biosynthesis pathways could be detected by microarray analysis. This result was in agreement with LC-PUFA profiles, which were found to be similar in the flesh of the two half-sibfamilies. In addition, the combination of our transcriptomic data with an analysis of plasmatic immune parameters revealed a stimulation of complement activity associated with an immunodeficiency in the fish fed VD, and different inflammatory status between the two half-sibfamilies.
Biological processes related to protein catabolism, amino acid transaminations, RNA splicing and blood coagulation were also found to be regulated by diet, while the expression of genes involved in protein and ATP synthesis differed between the half-sibfamilies. Conclusions: Overall, the combined gene expression, compositional and biochemical studies demonstrated a large panel of metabolic and physiological effects induced by total substitution of both FM and FO in the diets of European sea bass and revealed physiological characteristics associated with the two half-sibfamilies.
Gene Expression Signature in Peripheral Blood Detects Thoracic Aortic Aneurysm
BACKGROUND: Thoracic aortic aneurysm (TAA) is usually asymptomatic and associated with high mortality. Adverse clinical outcome of TAA is preventable by elective surgical repair; however, identifying at-risk individuals is difficult. We hypothesized that gene expression patterns in peripheral blood cells may correlate with TAA disease status. Our goal was to identify a distinct gene expression signature in peripheral blood that may identify individuals at risk for TAA. METHODS AND FINDINGS: Whole genome gene expression profiles from 94 peripheral blood samples (collected from 58 individuals with TAA and 36 controls) were analyzed. Significance Analysis of Microarray (SAM) identified potential signature genes characterizing TAA vs. normal, ascending vs. descending TAA, and sporadic vs. familial TAA. Using a training set containing 36 TAA patients and 25 controls, a 41-gene classification model was constructed for detecting TAA status and an overall accuracy of 78 ± 6% was achieved. Testing this classifier on an independent validation set containing 22 TAA samples and 11 controls yielded an overall classification accuracy of 78%. These 41 classifier genes were further validated by TaqMan real-time PCR assays. Classification based on the TaqMan data replicated the microarray results and achieved 80% classification accuracy on the testing set. CONCLUSIONS: This study identified informative gene expression signatures in peripheral blood cells that can characterize TAA status and subtypes of TAA. Moreover, a 41-gene classifier based on expression signature can identify TAA patients with high accuracy. The transcriptional programs in peripheral blood leading to the identification of these markers also provide insights into the mechanism of development of aortic aneurysms and highlight potential targets for therapeutic intervention.
The classifier genes identified in this study, and validated by TaqMan real-time PCR, define a set of promising potential diagnostic markers, setting the stage for a blood-based gene expression test to facilitate early detection of TAA.