A novel approach to minimize false discovery rate in genome-wide data analysis
BACKGROUND: High-throughput technologies, such as DNA microarray, have significantly advanced biological and biomedical research by enabling researchers to carry out genome-wide screens. One critical task in analyzing genome-wide datasets is to control the false discovery rate (FDR) so that the proportion of false positive features among those called significant is restrained. Recently a number of FDR control methods have been proposed and widely practiced, such as the Benjamini-Hochberg approach, the Storey approach and Significance Analysis of Microarrays (SAM). METHODS: This paper presents a straightforward yet powerful FDR control method termed miFDR, which aims to minimize FDR when calling a fixed number of significant features. We theoretically proved that the strategy used by miFDR is able to find the optimal number of significant features when the desired FDR is fixed. RESULTS: We compared miFDR with the BH approach, the Storey approach and SAM on both simulated datasets and public DNA microarray datasets. The results demonstrated that miFDR outperforms the others by identifying more significant features under the same FDR cut-offs. A literature search showed that many genes called only by miFDR are indeed relevant to the underlying biology of interest. CONCLUSIONS: FDR control has been widely applied to the analysis of high-throughput datasets, allowing for rapid discoveries. Under the same FDR threshold, miFDR is capable of identifying more significant features than its competitors at a comparable level of complexity. Therefore, it can potentially have a great impact on biological and biomedical research. AVAILABILITY: If interested, please contact the authors to obtain miFDR.
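The Benjamini-Hochberg approach that the abstract uses as a baseline can be sketched in a few lines. This is a minimal illustration of the standard BH step-up procedure only, not of miFDR itself (for which the abstract directs readers to the authors):

```python
def benjamini_hochberg(pvals, q):
    """Benjamini-Hochberg step-up procedure.

    Sorts the m p-values ascending and finds the largest rank k such
    that p_(k) <= k*q/m; all hypotheses at ranks 1..k are rejected,
    which controls the FDR at level q under independence.
    Returns the original indices of the rejected hypotheses.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k_max = rank  # step-up: keep the largest qualifying rank
    return sorted(order[:k_max])

print(benjamini_hochberg([0.01, 0.02, 0.03, 0.5], 0.05))  # [0, 1, 2]
```

Note that even though p = 0.02 exceeds the rank-1 threshold 0.0125, it is still rejected because a later rank qualifies; that "step-up" behaviour is what distinguishes BH from a simple per-test cutoff.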
Integrating gene and protein expression data with genome-scale metabolic networks to infer functional pathways
This article has been made available through the Brunel Open Access Publishing Fund. Copyright © 2013 Pey et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background: The study of cellular metabolism in the context of high-throughput -omics data has allowed us to decipher novel mechanisms of importance in biotechnology and health. To continue this progress, it is essential to efficiently integrate experimental data into metabolic modeling. Results: We present here an in-silico framework to infer relevant metabolic pathways for a particular phenotype under study based on its gene/protein expression data. This framework is based on the Carbon Flux Path (CFP) approach, a mixed-integer linear program that expands classical path-finding techniques by considering additional biophysical constraints. In particular, the objective function of the CFP approach is amended to account for gene/protein expression data, which influences the paths obtained. This approach is termed integrative Carbon Flux Path (iCFP). We show that gene/protein expression data also influences the stoichiometric balancing of CFPs, which provides a more accurate picture of active metabolic pathways. This is illustrated in both a theoretical and a real scenario. Finally, we apply this approach to find novel pathways relevant in the regulation of acetate overflow metabolism in Escherichia coli. As a result, several targets which could be relevant for a better understanding of the phenomenon leading to impaired acetate overflow are proposed. Conclusions:
A novel mathematical framework that determines functional pathways based on gene/protein expression data is presented and validated. We show that our approach is able to provide new insights into complex biological scenarios such as acetate overflow in Escherichia coli.
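The core idea of expression-guided path finding can be illustrated with a deliberately simplified stand-in: instead of the mixed-integer linear program of the actual CFP/iCFP formulation, the sketch below weights each reaction edge by the inverse expression of its gene and runs an ordinary shortest-path search, so highly expressed reactions are preferred. The graph, gene names, and expression values are hypothetical:

```python
import heapq

def expression_weighted_path(graph, expr, source, target):
    """Dijkstra search over a metabolite graph where each edge
    (reaction) costs 1/expression of its catalyzing gene, so paths
    through highly expressed reactions are cheaper.

    graph: metabolite -> list of (neighbor_metabolite, gene)
    expr:  gene -> expression level (> 0)
    """
    dist = {source: 0.0}
    prev = {}
    pq = [(0.0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, gene in graph.get(u, []):
            nd = d + 1.0 / expr[gene]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(pq, (nd, v))
    # Reconstruct the path back from the target.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

# Hypothetical toy network: two routes from A to D; the g1/g3 route
# is highly expressed, the g2/g4 route is not.
graph = {"A": [("B", "g1"), ("C", "g2")], "B": [("D", "g3")], "C": [("D", "g4")]}
expr = {"g1": 10.0, "g2": 1.0, "g3": 10.0, "g4": 1.0}
print(expression_weighted_path(graph, expr, "A", "D"))  # ['A', 'B', 'D']
```

The real iCFP objective additionally enforces stoichiometric and biophysical constraints that a plain shortest-path search cannot express; this sketch only captures the "expression data steers the chosen path" idea.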
Kevlar: A Mapping-Free Framework for Accurate Discovery of De Novo Variants.
De novo genetic variants are an important source of causative variation in complex genetic disorders. Many methods for variant discovery rely on mapping reads to a reference genome, detecting numerous inherited variants irrelevant to the phenotype of interest. To distinguish between inherited and de novo variation, sequencing of families (parents and siblings) is commonly pursued. However, standard mapping-based approaches tend to have a high false-discovery rate for de novo variant prediction. Kevlar is a mapping-free method for de novo variant discovery, based on direct comparison of sequences between related individuals. Kevlar identifies high-abundance k-mers unique to the individual of interest. Reads containing these k-mers are partitioned into disjoint sets by shared k-mer content for variant calling, and preliminary variant predictions are sorted using a probabilistic score. We evaluated Kevlar on simulated and real datasets, demonstrating its ability to detect both de novo single-nucleotide variants and indels with high accuracy
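The first step described above, identifying high-abundance k-mers unique to the individual of interest, can be sketched directly with stdlib counters. This is a greatly simplified illustration of the mapping-free idea (exact counting over tiny toy reads), not Kevlar's actual implementation:

```python
from collections import Counter

def count_kmers(reads, k):
    """Exact k-mer counts over a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def novel_kmers(child_reads, parent_reads, k=5, min_count=2):
    """K-mers seen at high abundance in the child but absent from the
    parents: candidates for de novo variation rather than inherited
    variants or sequencing errors (which tend to be low-abundance)."""
    child = count_kmers(child_reads, k)
    parents = count_kmers(parent_reads, k)
    return {km for km, n in child.items()
            if n >= min_count and km not in parents}

# Toy example: the child's reads carry a CGT insertion the parent lacks.
child = ["AAAAACGTAAAAA", "AAAAACGTAAAAA"]
parents = ["AAAAAAAAAAAAA"]
print(sorted(novel_kmers(child, parents)))
```

The `min_count` threshold is what separates a genuine de novo signal (supported by many reads) from one-off sequencing errors; the subsequent partitioning of reads by shared k-mer content and the probabilistic ranking of calls are beyond this sketch.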
Methodological Issues in Multistage Genome-Wide Association Studies
Because of the high cost of commercial genotyping chip technologies, many investigations have used a two-stage design for genome-wide association studies, using part of the sample for an initial discovery of "promising" SNPs at a less stringent significance level and the remainder in a joint analysis of just these SNPs using custom genotyping. Typical cost savings of about 50% are possible with this design to obtain comparable levels of overall type I error and power by using about half the sample for stage I and carrying about 0.1% of SNPs forward to the second stage, the optimal design depending primarily upon the ratio of costs per genotype for stages I and II. However, with the rapidly declining costs of the commercial panels, the generally low observed ORs of current studies, and many studies aiming to test multiple hypotheses and multiple endpoints, many investigators are abandoning the two-stage design in favor of simply genotyping all available subjects using a standard high-density panel. Concern is sometimes raised about the absence of a "replication" panel in this approach, as required by some high-profile journals, but it must be appreciated that the two-stage design is not a discovery/replication design but simply a more efficient design for discovery using a joint analysis of the data from both stages. Once a subset of highly significant associations has been discovered, a truly independent "exact replication" study of the same promising SNPs is needed in a similar population using similar methods.
Comment: Published at http://dx.doi.org/10.1214/09-STS288 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org).
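The cost arithmetic behind the "about 50% savings" claim is easy to make concrete. In the sketch below, stage I genotypes a fraction of the sample on the full commercial panel (relative cost 1 per genotype) and stage II genotypes the rest of the sample on only the carried-forward SNPs at a per-genotype cost ratio relative to the panel; the specific ratio of 20 is an illustrative assumption, not a figure from the abstract:

```python
def two_stage_relative_cost(frac_stage1, frac_snps_forward, cost_ratio):
    """Total genotyping cost of a two-stage GWAS design, relative to
    genotyping the full sample on the commercial panel (= 1.0).

    frac_stage1:       fraction of subjects genotyped on the full panel
    frac_snps_forward: fraction of SNPs carried into stage II
    cost_ratio:        stage-II / stage-I cost per genotype
    """
    stage1 = frac_stage1                                  # full panel, part of sample
    stage2 = (1 - frac_stage1) * frac_snps_forward * cost_ratio  # custom genotyping
    return stage1 + stage2

# Abstract's regime: half the sample in stage I, 0.1% of SNPs forward;
# a hypothetical 20x per-genotype cost for custom genotyping.
print(two_stage_relative_cost(0.5, 0.001, 20))  # 0.51 -> ~49% savings
```

Even with custom genotyping assumed much costlier per genotype, so few SNPs survive to stage II that the total is dominated by the stage I term, which is why the savings track the stage I sampling fraction and why the optimum depends chiefly on the stage I/II cost ratio, as the abstract notes.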
ChIP-on-chip significance analysis reveals large-scale binding and regulation by human transcription factor oncogenes
ChIP-on-chip has emerged as a powerful tool to dissect the complex network of regulatory interactions between transcription factors and their targets. However, most ChIP-on-chip analysis methods use conservative approaches aimed at minimizing false-positive transcription factor targets. We present a model with improved sensitivity in detecting binding events from ChIP-on-chip data. Biochemically validated analysis in human T-cells reveals that three transcription factor oncogenes, NOTCH1, MYC, and HES1, bind one order of magnitude more promoters than previously thought. Gene expression profiling upon NOTCH1 inhibition shows broad-scale functional regulation across the entire range of predicted target genes, establishing a closer link between occupancy and regulation. Finally, the resolution of a more complete map of transcriptional targets reveals that MYC binds nearly all promoters bound by NOTCH1. Overall, these results suggest an unappreciated complexity of transcriptional regulatory networks and highlight the fundamental importance of genome-scale analysis to represent transcriptional programs.
GenEpi: gene-based epistasis discovery using machine learning.
Background: Genome-wide association studies (GWAS) provide a powerful means to identify associations between genetic variants and phenotypes. However, GWAS techniques for detecting epistasis, the interactions between genetic variants associated with phenotypes, are still limited. We believe that developing an efficient and effective GWAS method to detect epistasis will be a key to discovering sophisticated pathogenesis, which is especially important for complex diseases such as Alzheimer's disease (AD). Results: In this regard, this study presents GenEpi, a computational package to uncover epistasis associated with phenotypes using the proposed machine learning approach. GenEpi identifies both within-gene and cross-gene epistasis through a two-stage modeling workflow. In both stages, GenEpi adopts two-element combinatorial encoding when producing features and constructs the prediction models by L1-regularized regression with stability selection. The simulated data showed that GenEpi outperforms other widely used methods in detecting the ground-truth epistasis. As far as real data is concerned, this study uses AD as an example to reveal the capability of GenEpi in finding disease-related variants and variant interactions that show both biological meaning and predictive power. Conclusions: The results on simulated data and AD demonstrated that GenEpi has the ability to detect the epistasis associated with phenotypes effectively and efficiently. The released package can be generalized to largely facilitate the studies of many complex diseases in the near future.
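The "two-element combinatorial encoding" mentioned above can be illustrated as one-hot encoding of the joint genotype of a SNP pair. The sketch below assumes the usual 0/1/2 additive genotype coding and a 9-category joint encoding; GenEpi's exact feature construction may differ, so treat this as an interpretation rather than the package's implementation:

```python
def encode_pair(g1, g2):
    """One-hot encode the joint genotype of two SNPs, each coded
    0/1/2, into a 9-dimensional indicator vector. A downstream
    L1-regularized model can then pick out specific genotype
    combinations (epistatic interactions) rather than additive
    effects only."""
    vec = [0] * 9
    vec[g1 * 3 + g2] = 1
    return vec

def pairwise_features(genotypes):
    """Build interaction features for every SNP pair in every sample.

    genotypes: list of per-sample genotype vectors (values in {0,1,2}).
    Returns one row per sample with 9 indicators per SNP pair.
    """
    rows = []
    for sample in genotypes:
        row = []
        n = len(sample)
        for i in range(n):
            for j in range(i + 1, n):
                row.extend(encode_pair(sample[i], sample[j]))
        rows.append(row)
    return rows

print(encode_pair(1, 2))            # 1 at index 1*3+2 = 5
print(len(pairwise_features([[0, 1, 2]])[0]))  # 3 pairs * 9 = 27
```

The quadratic growth in SNP pairs is exactly why GenEpi restricts the first stage to within-gene pairs before combining genes, and why sparse (L1) models with stability selection are used to keep only reproducibly selected combinations.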