Graph based fusion of high-dimensional gene- and microRNA expression data
One of the main goals in cancer studies including high-throughput microRNA
(miRNA) and mRNA data is to find and assess prognostic signatures capable
of predicting clinical outcome. Both mRNA and miRNA expression changes in
cancer have been reported to reflect clinical characteristics such as staging and
prognosis. Furthermore, miRNA abundance can directly affect target transcripts
and translation in tumor cells. Prediction models are trained to identify either
mRNA or miRNA signatures for patient stratification. With the increasing
number of microarray studies collecting mRNA and miRNA from the same
patient cohort there is a need for statistical methods to integrate or fuse both
kinds of data into one prediction model in order to find a combined signature
that improves the prediction.
Here, we propose a new method to fuse miRNA and mRNA data into one
prediction model. Since miRNAs are known regulators of mRNAs, correlations
between miRNA and mRNA expression data as well as target prediction
information were used to build a bipartite graph representing the relations
between miRNAs and mRNAs.
Feature selection is a critical step when fitting prediction models to
high-dimensional data. Most methods treat features, in this case genes or miRNAs,
as independent, an assumption that does not hold for combined gene and miRNA
expression data. To improve prediction accuracy, a description of the correlation
structure in the data is needed. In this work, the bipartite graph was used to
guide the feature selection in order to improve prediction results and to find a
stable prognostic signature of miRNAs and genes.
The method is evaluated on a prostate cancer data set comprising 98 patient
samples with miRNA and mRNA expression data. Biochemical relapse, an
important event in prostate cancer treatment, was used as the clinical endpoint.
Biochemical relapse denotes a renewed rise in the blood level of a prostate
marker (PSA) after surgical removal of the prostate. The relapse is an indicator
of metastases and is usually the point in clinical practice at which further
treatment is decided.
A boosting approach was used to predict the biochemical relapse. The
bipartite graph, in combination with miRNA and mRNA expression data,
improved prediction performance. Furthermore, the approach improved the
stability of the feature selection and thereby yielded more consistent marker
sets. Naturally, the marker sets produced by this new method contain mRNAs
as well as miRNAs.
The new approach was compared to two state-of-the-art methods suited for
high-dimensional data and showed better prediction performance in both cases.
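The abstract above describes a bipartite graph built from miRNA-mRNA correlations and target prediction information, but it does not say how the two sources are combined. The following is a minimal sketch under one possible reading: an edge is kept when a pair is a predicted target AND its absolute Pearson correlation reaches a threshold. The function names, the `min_abs_corr` parameter, the threshold value and the toy identifiers are all assumptions for illustration, not the authors' method.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def build_bipartite_graph(mirna_expr, mrna_expr, predicted_targets, min_abs_corr=0.5):
    """Return {(mirna, gene): correlation} for the retained edges.

    Assumption: an edge needs both a target prediction and a sufficiently
    strong correlation. The source abstract does not specify this rule.
    """
    edges = {}
    for mirna, gene in predicted_targets:
        if mirna not in mirna_expr or gene not in mrna_expr:
            continue  # no expression data measured for one endpoint
        r = pearson(mirna_expr[mirna], mrna_expr[gene])
        if abs(r) >= min_abs_corr:
            edges[(mirna, gene)] = r
    return edges


# Toy data with hypothetical identifiers (four samples each).
mirna_expr = {"miR-a": [1, 2, 3, 4], "miR-b": [2, 1, 2, 1]}
mrna_expr = {"GENE1": [4, 3, 2, 1]}
predicted_targets = [("miR-a", "GENE1"), ("miR-b", "GENE1"), ("miR-a", "GENE3")]

edges = build_bipartite_graph(mirna_expr, mrna_expr, predicted_targets)
print(edges)
```

In this toy example only the strongly anti-correlated pair survives; the weakly correlated pair and the pair without expression data are dropped. A graph like this could then be used to favour features that are connected to already selected ones.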
Combining techniques for screening and evaluating interaction terms on high-dimensional time-to-event data
BACKGROUND: Molecular data, e.g. arising from microarray technology, are often used for predicting survival probabilities of patients. For multivariate risk prediction models on such high-dimensional data, there are established techniques that combine parameter estimation and variable selection. One major challenge is to incorporate interactions into such prediction models. In this feasibility study, we present building blocks for evaluating and incorporating interaction terms in high-dimensional time-to-event settings, especially for settings in which it is computationally too expensive to check all possible interactions.
RESULTS: We use a boosting technique for estimation of effects and the following building blocks for pre-selecting interactions: (1) resampling, (2) random forests and (3) orthogonalization as a data pre-processing step. In a simulation study, the strategy that uses all building blocks is able to detect true main effects and interactions with high sensitivity in different kinds of scenarios. The main challenge lies in interactions composed of variables that do not represent main effects, but our findings are also promising in this regard. Results on real-world data illustrate that effect sizes of interactions frequently may not be large enough to improve prediction performance, even though the interactions are potentially of biological relevance.
CONCLUSION: Screening interactions through random forests is feasible and useful when one is interested in finding relevant two-way interactions. The other building blocks also contribute considerably to an enhanced pre-selection of interactions. We determined the limits of interaction detection in terms of necessary effect sizes. Our study emphasizes the importance of making full use of existing methods in addition to establishing new ones.
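The abstract above lists orthogonalization as a pre-processing building block without giving details. One common reading, assumed here, is that a candidate interaction term (the product of two covariates) is replaced by its residual after projecting out the intercept and the two main effects, so that the term carries only information beyond the main effects. The helper names below are hypothetical and the sketch is not taken from the paper.

```python
def dot(u, v):
    """Inner product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))


def residualize(target, predictors):
    """Return target minus its least-squares projection on span(1, predictors).

    Uses Gram-Schmidt to build an orthogonal basis of the intercept and the
    predictors, then subtracts the projection of the target on each basis vector.
    """
    n = len(target)
    basis = []
    for vec in [[1.0] * n] + [list(map(float, p)) for p in predictors]:
        for b in basis:
            coef = dot(vec, b) / dot(b, b)
            vec = [v - coef * bi for v, bi in zip(vec, b)]
        if dot(vec, vec) > 1e-12:  # skip vectors already spanned by the basis
            basis.append(vec)
    resid = [float(t) for t in target]
    for b in basis:
        coef = dot(resid, b) / dot(b, b)
        resid = [r - coef * bi for r, bi in zip(resid, b)]
    return resid


# Toy covariates (assumed values) and their raw product term.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
product = [a * b for a, b in zip(x1, x2)]

# Orthogonalized interaction: uncorrelated with both main effects.
interaction = residualize(product, [x1, x2])
print(interaction)
```

The residualized term sums to zero and has zero inner product with both covariates, so a model that selects it is picking up interaction signal rather than a disguised main effect.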
Over-optimism in bioinformatics: an illustration
In statistical bioinformatics research, different optimization mechanisms potentially lead to "over-optimism" in published papers. The present empirical study illustrates these mechanisms through a concrete example from an active research field. The investigated sources of over-optimism include the optimization of the data sets, of the settings, of the competing methods and, most importantly, of the method's characteristics. We consider a "promising" new classification algorithm that turns out to yield disappointing results in terms of error rate, namely linear discriminant analysis incorporating prior knowledge on gene functional groups through an appropriate shrinkage of the within-group covariance matrix. We quantitatively demonstrate that this disappointing method can artificially seem superior to existing approaches if we "fish for significance". We conclude that, if the improvement of a quantitative criterion such as the error rate is the main contribution of a paper, the superiority of new algorithms should be validated using "fresh" validation data sets.
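The mechanism described in the abstract above can be illustrated with a small simulation. This is a sketch under stated assumptions and not the study's own experiment: every method variant is a coin flip with no real signal, the best variant is picked by its error on the data used for tuning, and the same variant is then scored on fresh labels. The function names, sample size, number of variants and seed are arbitrary choices.

```python
import random


def error_rate(predictions, truth):
    """Fraction of predictions that disagree with the true labels."""
    return sum(p != t for p, t in zip(predictions, truth)) / len(truth)


def simulate(n_samples=50, n_variants=200, seed=0):
    """Return (reported_error, fresh_error) of the variant chosen on tuning data."""
    rng = random.Random(seed)
    y_tuning = [rng.randint(0, 1) for _ in range(n_samples)]
    y_fresh = [rng.randint(0, 1) for _ in range(n_samples)]

    tuning_errors = []
    fresh_errors = []
    for _ in range(n_variants):
        # Each variant predicts at random, so its true error rate is 0.5.
        pred_tuning = [rng.randint(0, 1) for _ in range(n_samples)]
        pred_fresh = [rng.randint(0, 1) for _ in range(n_samples)]
        tuning_errors.append(error_rate(pred_tuning, y_tuning))
        fresh_errors.append(error_rate(pred_fresh, y_fresh))

    # "Fishing": report the variant that looks best on the tuning data.
    best = min(range(n_variants), key=tuning_errors.__getitem__)
    return tuning_errors[best], fresh_errors[best]


reported, fresh = simulate()
print(f"error reported after selection: {reported:.2f}")
print(f"error of same variant on fresh data: {fresh:.2f}")
```

Because the reported error is a minimum over many noisy estimates, it falls below the true error rate of 0.5 even though no variant has any predictive value, whereas the fresh-data error is an unbiased estimate for the selected variant.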
Novel image markers for non-small cell lung cancer classification and survival prediction
BACKGROUND: Non-small cell lung cancer (NSCLC), the most common type of lung cancer, is one of the serious diseases causing death in both men and women. Computer-aided diagnosis and survival prediction of NSCLC are of great importance in assisting diagnosis and in personalizing therapy planning for lung cancer patients.
RESULTS: In this paper we propose an integrated framework for NSCLC computer-aided diagnosis and survival analysis using novel image markers. The entire biomedical imaging informatics framework consists of cell detection, segmentation, classification, discovery of image markers, and survival analysis. A robust seed detection-guided cell segmentation algorithm is proposed to accurately segment each individual cell in digital images. Based on the cell segmentation results, an extensive set of cellular morphological features is extracted using efficient feature descriptors. Next, eight different classification techniques that can handle high-dimensional data are evaluated and compared for computer-aided diagnosis. The results show that random forest and AdaBoost offer the best classification performance for NSCLC. Finally, a Cox proportional hazards model is fitted by component-wise likelihood-based boosting. Significant image markers are discovered using bootstrap analysis, and the survival prediction performance of the model is also evaluated.
CONCLUSIONS: The proposed model has been applied to a lung cancer dataset that contains 122 cases with complete clinical information. The classification performance exhibits high correlations between the discovered image markers and the subtypes of NSCLC. The survival analysis demonstrates strong prediction power of the statistical model built from the discovered image markers.
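Component-wise boosting, mentioned in the abstract above, updates only one coefficient per step, which is what makes it suitable for high-dimensional data and gives it built-in variable selection. The sketch below shows the idea with a squared-error loss in place of the Cox partial likelihood used in the paper, so it is a simplified stand-in rather than the authors' procedure. The function name, `n_steps`, `step_size` and the toy data are assumptions.

```python
def componentwise_l2_boost(X, y, n_steps=200, step_size=0.1):
    """Component-wise boosting for a linear model with squared-error loss.

    X is a list of rows, y a list of responses. Returns (intercept, coefs),
    where coefs refer to the mean-centered columns of X.
    """
    n, p = len(y), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    cols = [[row[j] - means[j] for row in X] for j in range(p)]

    intercept = sum(y) / n
    resid = [yi - intercept for yi in y]
    coefs = [0.0] * p

    for _ in range(n_steps):
        best_j, best_beta, best_gain = None, 0.0, 0.0
        for j, col in enumerate(cols):
            ss = sum(c * c for c in col)
            if ss == 0:
                continue  # constant column carries no information
            # Least-squares fit of the current residuals on this single feature.
            beta = sum(c * r for c, r in zip(col, resid)) / ss
            gain = beta * beta * ss  # reduction in residual sum of squares
            if gain > best_gain:
                best_j, best_beta, best_gain = j, beta, gain
        if best_j is None:
            break  # no feature improves the fit any further
        # Shrunken update of the single best coefficient only.
        coefs[best_j] += step_size * best_beta
        resid = [r - step_size * best_beta * c for r, c in zip(resid, cols[best_j])]
    return intercept, coefs


# Toy data (assumed): the response depends only on the first feature.
X = [[1, 5, 0], [2, 3, 1], [3, 6, 0], [4, 2, 1], [5, 4, 0]]
y = [2, 4, 6, 8, 10]
intercept, coefs = componentwise_l2_boost(X, y)
print(intercept, coefs)
```

On this toy data the first coefficient converges to the true slope while the uninformative features are never selected. Stopping after fewer steps would shrink the coefficient toward zero, which is why the number of boosting steps acts as the main tuning parameter in such procedures.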