476 research outputs found
ProbCD: enrichment analysis accounting for categorization uncertainty
As in many other areas of science, systems biology makes extensive use of statistical association and significance estimates in contingency tables, a type of categorical data analysis known in this field as enrichment (also over-representation or enhancement) analysis. In spite of efforts to create probabilistic annotations, especially in the Gene Ontology context, or to deal with uncertainty in high-throughput datasets, current enrichment methods largely ignore this probabilistic information, since they are mainly based on variants of the Fisher Exact Test. We developed an open-source R package for probabilistic categorical data analysis, ProbCD, that does not require a static contingency table. The contingency table for the enrichment problem is built using the expectation of a Bernoulli Scheme stochastic process given the categorization probabilities. An online interface was created to allow usage by non-programmers and is available at: http://xerad.systemsbiology.net/ProbCD/. We present an analysis framework and software tools to address the issue of uncertainty in categorical data analysis. In particular, concerning enrichment analysis, ProbCD can accommodate: (i) the stochastic nature of high-throughput experimental techniques and (ii) probabilistic gene annotation.
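The idea of building a contingency table from categorization probabilities can be illustrated with a minimal Python sketch. It is not ProbCD's code (ProbCD is an R package). The function name `expected_contingency_table` and the assumption that the two categorizations are independent for each gene are illustrative choices, not details taken from the abstract.

```python
def expected_contingency_table(p_selected, p_annotated):
    """Expected 2x2 table when category membership is probabilistic.

    p_selected[i]  : probability that gene i belongs to the selected list
    p_annotated[i] : probability that gene i carries the annotation
    Assumption (hypothetical): the two memberships are independent per gene,
    so each gene contributes the product of its probabilities to each cell.
    """
    if len(p_selected) != len(p_annotated):
        raise ValueError("both probability vectors must have the same length")
    n11 = n10 = n01 = n00 = 0.0
    for p, q in zip(p_selected, p_annotated):
        n11 += p * q                  # selected and annotated
        n10 += p * (1.0 - q)          # selected, not annotated
        n01 += (1.0 - p) * q          # not selected, annotated
        n00 += (1.0 - p) * (1.0 - q)  # neither
    return [[n11, n10], [n01, n00]]


# With probabilities of exactly 0 or 1 the table reduces to ordinary counts.
table = expected_contingency_table([1, 1, 0, 0], [1, 0, 1, 0])
```

When every probability is 0 or 1, the result is the usual static contingency table, so the probabilistic table can be read as a generalization of the input to a Fisher-type test.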
Patient and public involvement in health literacy interventions: a mapping review
Background: Health literacy is a critical mediating factor that impacts on the health of older adults. Patient and public involvement in health and social care research, policy and design of care delivery is one mechanism that can promote production of better health literacy. This mapping review looks for and describes practices, concepts and methods that have been reported involving patients, public and (non-researcher) professionals in the development and design of health literacy interventions for older people. Methods: Studies that aimed to improve health literacy were identified within a previously created compatible inventory of health behaviour studies for older people. Articles were screened for whether they addressed health literacy and featured involvement of stakeholders other than investigators and patients. Two reviewers independently read each study to identify any patient, public and professional involvement in the research process. We also noted some aspects of outcomes. Results: Twenty-two studies included patient, public and/or professional involvement in at least one research domain: design, management or evaluation. Involvement included volunteers, older people, professionals, patients, and community representatives. All studies were driven by an organisational or biomedical agenda. Conclusions: Patient, public and professional involvement was rarely reported in studies on health literacy interventions for older people. This could help explain why some interventions fail to improve health literacy in older people. Key words: health literacy intervention research, older people, patient and public involvement, mapping review
Classes of Multiple Decision Functions Strongly Controlling FWER and FDR
This paper provides two general classes of multiple decision functions where
each member of the first class strongly controls the family-wise error rate
(FWER), while each member of the second class strongly controls the false
discovery rate (FDR). These classes offer the possibility that an optimal
multiple decision function with respect to a pre-specified criterion, such as
the missed discovery rate (MDR), could be found within these classes. Such
multiple decision functions can be utilized in multiple testing, specifically,
but not limited to, the analysis of high-dimensional microarray data sets. Comment: 19 pages
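For readers unfamiliar with the two error rates, the following Python sketch shows the two textbook procedures most often associated with them: Bonferroni for FWER and Benjamini-Hochberg for FDR. These are standard reference points, not the classes of decision functions constructed in the paper. The function names are illustrative, and the Benjamini-Hochberg guarantee assumes independent (or positively dependent) p-values.

```python
def bonferroni(pvalues, alpha=0.05):
    """Reject H_i when p_i <= alpha / m; controls the FWER at level alpha."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Step-up procedure; controls the FDR at level alpha under independence."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose p-value falls under its threshold
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k = rank
    rejected = [False] * m
    for i in order[:k]:  # reject the k smallest p-values
        rejected[i] = True
    return rejected


pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
fwer_decisions = bonferroni(pvals)
fdr_decisions = benjamini_hochberg(pvals)
```

On this small example the FDR procedure rejects more hypotheses than the FWER procedure, which reflects the usual trade-off between the two criteria.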
Gene set analysis exploiting the topology of a pathway
BACKGROUND: Recently, a great deal of effort in microarray data analysis has been directed towards the study of so-called gene sets. A gene set is defined by genes that are in some way functionally related. For example, genes appearing in a known biological pathway naturally define a gene set. Gene sets are usually identified from a priori biological knowledge, and many bioinformatics resources now store this kind of knowledge (see, for example, the Kyoto Encyclopedia of Genes and Genomes, among others). Although pathway maps carry important information about the structure of correlation among genes that should not be neglected, the currently available multivariate methods for gene set analysis do not fully exploit it.
RESULTS: We propose a novel gene set analysis specifically designed for gene sets defined by pathways. This analysis, based on graphical models, explicitly incorporates the dependence structure among genes highlighted by the topology of pathways. It is designed for overall surveillance of changes in a pathway across experimental conditions: under different circumstances, not only the expression of the genes in a pathway but also the strength of their relations may change. The resulting methods make it possible both to test for variations in the strength of the links and to properly account for heteroscedasticity in the usual tests for differential expression.
CONCLUSIONS: The use of graphical models allows a deeper look at the components of the pathway, which can be tested separately and compared marginally. In this way it is possible to test single components of the pathway and highlight only those involved in its deregulation.
A large scale survey reveals that chromosomal copy-number alterations significantly affect gene modules involved in cancer initiation and progression
Background
Recent observations point towards the existence of a large number of neighborhoods composed of functionally-related gene modules that lie together in the genome. This local component in the distribution of functionality across chromosomes probably affects chromosomal architecture itself by limiting the ways in which genes can be arranged and distributed across the genome. As a direct consequence, diseases such as cancer that harbor DNA copy number alterations (CNAs) can be presumed to have a symptomatology that depends strongly on modules of functionally-related genes rather than on a unique "important" gene.
Methods
We carried out a systematic analysis of more than 140,000 observations of CNAs in cancers and searched for enrichments in gene functional modules associated with high frequencies of losses or gains.
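The abstract does not say which enrichment test was used. As a hedged illustration only, a common choice for asking whether a functional module is over-represented among frequently altered genes is the one-sided hypergeometric test, sketched below in Python. The function name and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric.

    N : genes in the background
    K : background genes that belong to the functional module
    n : genes in the altered (lost or gained) list
    k : altered genes that belong to the module
    """
    total = comb(N, n)
    upper = min(K, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Toy example: 10 background genes, 4 in the module, 3 altered, 2 of them in the module.
p_value = hypergeom_upper_tail(2, 10, 4, 3)
```

A small p-value would suggest that the module appears among the altered genes more often than expected by chance; in a survey across many modules, the resulting p-values would still need multiple-testing correction.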
Results
The analysis of CNAs in cancers clearly demonstrates the existence of a significant pattern of loss of gene modules functionally related to cancer initiation and progression along with the amplification of modules of genes related to unspecific defense against xenobiotics (probably chemotherapeutical agents). With the extension of this analysis to an Array-CGH dataset (glioblastomas) from The Cancer Genome Atlas we demonstrate the validity of this approach to investigate the functional impact of CNAs.
Conclusions
The presented results indicate promising clinical and therapeutic implications. Our findings also point directly to the necessity of adopting a function-centric, rather than a gene-centric, view in the understanding of phenotypes or diseases harboring CNAs.
Spanish Ministry of Science and Innovation (grant BIO2008-04212); Spanish Ministry of Science and Innovation (grant FIS PI 08/0440); GVA-FEDER (PROMETEO/2010/001); Red Temática de Investigación Cooperativa en Cáncer (RTICC) (grant RD06/0020/1019); Instituto de Salud Carlos III (ISCIII); Spanish Ministry of Science and Innovation; Spanish Ministry of Health (FI06/00027
Improving gene-set enrichment analysis of RNA-Seq data with small replicates
Deregulated pathways identified from transcriptome data of two sample groups have played a key role in many genomic studies. Gene-set enrichment analysis (GSEA) has been commonly used for pathway or functional analysis of microarray data, and it is also being applied to RNA-seq data. However, most RNA-seq data so far have only a small number of replicates. This forces the use of the gene-permuting GSEA method (or preranked GSEA), which results in a great number of false positives due to the inter-gene correlation in each gene-set. We demonstrate that incorporating the absolute gene statistic in one-tailed GSEA considerably improves the false-positive control and the overall discriminatory ability of the gene-permuting GSEA methods for RNA-seq data. To test the performance, a simulation method to generate correlated read counts within a gene-set was newly developed, and a dozen currently available RNA-seq enrichment analysis methods were compared; the proposed methods outperformed the others, which do not account for the inter-gene correlation. Analysis of real RNA-seq data also supported the proposed methods in terms of false positive control, ranks of true positives and biological relevance. An efficient R package (AbsFilterGSEA) coded with C++ (Rcpp) is available from CRAN.
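The role of the absolute gene statistic can be sketched in Python as follows. This is a simplified illustration, not the AbsFilterGSEA algorithm: it uses an unweighted running-sum score, takes only the maximum positive deviation (one-tailed), and obtains a p-value by permuting gene order. All function names and the toy data are hypothetical.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """One-tailed, unweighted running-sum score (maximum positive deviation)."""
    hits = sum(1 for g in ranked_genes if g in gene_set)
    misses = len(ranked_genes) - hits
    if hits == 0 or misses == 0:
        return 0.0
    running = best = 0.0
    for g in ranked_genes:
        running += 1.0 / hits if g in gene_set else -1.0 / misses
        best = max(best, running)
    return best


def abs_preranked_gsea(gene_stats, gene_set, n_perm=1000, seed=0):
    """Rank genes by |statistic|, then compare the score with gene permutations."""
    ranked = sorted(gene_stats, key=lambda g: abs(gene_stats[g]), reverse=True)
    observed = enrichment_score(ranked, gene_set)
    rng = random.Random(seed)
    genes = list(ranked)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(genes)  # gene permutation: random ordering of the same genes
        if enrichment_score(genes, gene_set) >= observed:
            exceed += 1
    # Add-one correction keeps the permutation p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)


# Toy data: the set genes "a" and "b" change strongly but in opposite directions.
stats = {"a": 3.0, "b": -2.8, "c": 0.1, "d": -0.2, "e": 0.3, "f": 0.05}
score, p_value = abs_preranked_gsea(stats, {"a", "b"})
```

Ranking by absolute value places both "a" and "b" at the top of the list, whereas a signed ranking would send them to opposite ends and dilute the one-tailed score.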
Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models
Background
Growing interest in biological pathways has called for new statistical methods for modeling and testing a genetic pathway effect on a health outcome. The fact that genes within a pathway tend to interact with each other and relate to the outcome in a complicated way makes nonparametric methods more desirable. The kernel machine method provides a convenient, powerful and unified method for multi-dimensional parametric and nonparametric modeling of the pathway effect.
Results
In this paper we propose a logistic kernel machine regression model for binary outcomes. This model relates the disease risk to covariates parametrically, and to genes within a genetic pathway parametrically or nonparametrically using kernel machines. The nonparametric genetic pathway effect allows for possible interactions among the genes within the same pathway and a complicated relationship between the genetic pathway and the outcome. We show that kernel machine estimation of the model components can be formulated using a logistic mixed model. Estimation can hence proceed within a mixed model framework using standard statistical software. A score test based on a Gaussian process approximation is developed to test for the genetic pathway effect. The methods are illustrated using a prostate cancer data set and evaluated using simulations. An extension to continuous and discrete outcomes using generalized kernel machine models and its connection with generalized linear mixed models is discussed.
Conclusion
Logistic kernel machine regression and its extension, generalized kernel machine regression, provide a novel and flexible statistical tool for modeling pathway effects on discrete and continuous outcomes. Their close connection to mixed models and their attractive performance make them promising for wide application in bioinformatics and other biomedical areas.
Adaptive Cluster Thresholding with Spatial Activation Guarantees Using All-resolutions Inference
Classical cluster inference is hampered by the spatial specificity paradox.
Given the null-hypothesis of no active voxels, the alternative hypothesis
states that there is at least one active voxel in a cluster. Hence, the larger
the cluster the less we know about where activation in the cluster is.
Rosenblatt et al. (2018) proposed a post-hoc inference method, All-resolutions
Inference (ARI), that addresses this paradox by estimating the number of active
voxels of any brain region. ARI allows users to choose arbitrary brain regions
and returns a simultaneous lower confidence bound of the true discovery
proportion (TDP) for each of them, retaining control of the family-wise error
rate. ARI does not, however, guide users to regions with high enough TDP. In
this paper, we propose an efficient algorithm that outputs all maximal
supra-threshold clusters, for which ARI gives a TDP lower confidence bound that
is at least a chosen threshold, for any number of thresholds that need not be
chosen a priori nor all at once. After a preprocessing step in linearithmic
time, the algorithm only takes linear time in the size of its output. We
demonstrate the algorithm with an application to two fMRI datasets. For both
datasets, we found several clusters whose TDP confidently meets or exceeds a
given threshold in less than a second.
Unraveling genetic predisposition to familial or early onset gastric cancer using germline whole-exome sequencing
Recognition of individuals with a genetic predisposition to gastric cancer (GC) enables preventive measures. However, the underlying cause of genetic susceptibility to gastric cancer remains largely unexplained. We performed germline whole-exome sequencing on leukocyte DNA of 54 patients from 53 families with genetically unexplained diffuse-type and intestinal-type GC to identify novel GC-predisposing candidate genes. As young age at diagnosis and familial clustering are hallmarks of genetic tumor susceptibility, we selected patients who were diagnosed below the age of 35, patients from families with two cases of GC at or below age 60 and patients from families with three GC cases at or below age 70. All included individuals tested negative for germline CDH1 mutations before or during the study. Variants that were possibly deleterious according to in silico predictions were filtered using several independent approaches that were based on gene function and gene mutation burden in controls. Despite a rigorous search, no obvious candidate GC predisposition genes were identified. This negative result stresses the importance of future research studies in large, homogeneous cohorts.
Simultaneous confidence intervals for ranks with application to ranking institutions
When a ranking of institutions such as medical centers or universities is based on a numerical measure of performance provided with a standard error, confidence intervals (CIs) should be calculated to assess the uncertainty of these ranks. We present a novel method based on Tukey's honest significant difference test to construct simultaneous CIs for the true ranks. When all the true performances are equal, the probability of coverage of our method attains the nominal level. When the true performance measures have no exact ties, our method is conservative. For this situation, we propose a rescaling method to the nominal level that results in shorter CIs while keeping control of the simultaneous coverage. We also show that a similar rescaling can be applied to correct a recently proposed Monte-Carlo based method, which is anticonservative. After rescaling, the two methods perform very similarly. However, the rescaling of the Monte-Carlo based method is computationally much more demanding and becomes infeasible when the number of institutions is larger than 30-50. We discuss another recently proposed method similar to ours based on simultaneous CIs for the true performance. We show that our method provides uniformly shorter CIs for the same confidence level. We illustrate the superiority of our new methods with a data analysis for travel time to work in the United States and on rankings of 64 hospitals in the Netherlands.
Development and application of statistical models for medical scientific research
