13 research outputs found

    On consensus biomarker selection

    Background: Recent developments in mass spectrometry technology have enabled the analysis of complex peptide mixtures. Much effort is currently devoted to the identification of biomarkers in human body fluids such as serum or plasma, from which new diagnostic tests for different diseases could be constructed. Various biomarker selection procedures have been exploited in recent studies. It has been noted that they often lead to different biomarker lists and, as a consequence, the patient classification may also vary.
    Results: Here we propose a new approach to the biomarker selection problem: to apply several competing feature ranking procedures and compute a consensus list of features based on their outcomes. We validate our methods on two proteomic datasets for the diagnosis of ovarian and prostate cancer.
    Conclusion: The proposed methodology can improve the classification results and at the same time provide a unified biomarker list for further biological examination and interpretation.
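    The abstract does not state the aggregation rule, so the following is a minimal sketch of one common way to build such a consensus list: average the positions that the competing ranking procedures assign to each feature (Borda-style mean-rank aggregation). The function name and the toy rankings are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def consensus_ranking(rank_lists):
    """Combine several feature rankings into one consensus ranking.

    rank_lists: list of 1-D arrays, each giving the rank (0 = best)
    that one selection procedure assigns to every feature.
    Returns feature indices ordered by their mean rank (Borda-style).
    """
    ranks = np.vstack(rank_lists).astype(float)
    mean_rank = ranks.mean(axis=0)      # average position across procedures
    return np.argsort(mean_rank)        # best consensus features first

# Toy example: three competing procedures ranking five features
r1 = np.array([0, 1, 2, 3, 4])
r2 = np.array([1, 0, 3, 2, 4])
r3 = np.array([0, 2, 1, 4, 3])
print(consensus_ranking([r1, r2, r3]))
```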

    Assessing similarity of feature selection techniques in high-dimensional domains

    Recent research efforts attempt to combine multiple feature selection techniques instead of using a single one. However, this combination is often made on an “ad hoc” basis, depending on the specific problem at hand, without considering the degree of diversity/similarity of the involved methods. Moreover, though it is recognized that different techniques may return quite dissimilar outputs, especially in high-dimensional, small-sample-size domains, few direct comparisons exist that quantify these differences and their implications for classification performance. This paper aims to provide a contribution in this direction by proposing a general methodology for assessing the similarity between the outputs of different feature selection methods in high-dimensional classification problems. Using the genomics domain as a benchmark, an empirical study has been conducted to compare some of the most popular feature selection methods, and useful insight has been obtained about their patterns of agreement.
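    As a concrete illustration of what "similarity between the outputs" can mean, the sketch below computes the Jaccard overlap between the feature subsets returned by two methods. This is only one of several indices used in the literature and is not necessarily the measure adopted in the paper; the toy feature indices are made up.

```python
def jaccard_similarity(selected_a, selected_b):
    """Overlap between two selected feature subsets (1 = identical, 0 = disjoint)."""
    a, b = set(selected_a), set(selected_b)
    return len(a & b) / len(a | b)

# Toy example: indices of the top-5 features chosen by two different methods
print(jaccard_similarity([3, 7, 12, 40, 51], [3, 12, 18, 40, 90]))  # ~0.43
```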

    Stable Feature Selection for Biomarker Discovery

    Feature selection techniques have long been used as the workhorse in biomarker discovery applications. Surprisingly, the stability of feature selection with respect to sampling variations has long been under-considered; only recently has this issue received more attention. In this article, we review existing stable feature selection methods for biomarker discovery using a generic hierarchical framework. We have two objectives: (1) providing an overview of this new yet fast-growing topic as a convenient reference; (2) categorizing existing methods under an expandable framework for future research and development.
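    One standard way to quantify the stability discussed here is to re-run a selector on bootstrap resamples and average the pairwise overlap of the subsets it returns. The sketch below does exactly that with a simple correlation-based selector; the selector, data sizes, and overlap index are assumptions chosen for illustration, not a specific method from the review.

```python
import numpy as np
from itertools import combinations

def selection_stability(X, y, select_top_k, n_boot=20, seed=0):
    """Average pairwise Jaccard overlap of the feature subsets selected
    on bootstrap resamples of (X, y); higher means more stable."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    subsets = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                       # bootstrap resample
        subsets.append(frozenset(select_top_k(X[idx], y[idx])))
    overlaps = [len(a & b) / len(a | b) for a, b in combinations(subsets, 2)]
    return float(np.mean(overlaps))

def top_by_correlation(X, y, k=10):
    """Toy selector: the k features most correlated (in absolute value) with y."""
    scores = np.abs(np.corrcoef(X.T, y)[:-1, -1])
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 200))          # high-dimensional, small-sample-size data
y = X[:, 0] + X[:, 1] + rng.normal(size=60)
print(selection_stability(X, y, top_by_correlation))
```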

    An adaptive ensemble learner function via bagging and rank aggregation with applications to high dimensional data.

    An ensemble consists of a set of individual predictors whose predictions are combined. Generally, different classification and regression models tend to work well for different types of data, and it is usually not known which algorithm will be optimal in any given application. In this thesis an ensemble regression function is presented, adapted from Datta et al. (2010). The ensemble function is constructed by combining bagging and rank aggregation and is capable of adapting its performance to the type of data being used. In the classification approach, the results can be optimized with respect to performance measures such as accuracy, sensitivity, specificity and area under the curve (AUC), whereas in the regression approach they can be optimized with respect to measures such as mean squared error and mean absolute error. The ensemble classifier and ensemble regressor perform at the level of the best individual classifier or regression model. For complex high-dimensional datasets, it may be advisable to combine a number of classification or regression algorithms rather than using one specific algorithm.
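    A minimal sketch of the core idea, assuming scikit-learn regressors: fit several candidate models on a bootstrap resample, rank them on the out-of-bag observations under more than one error measure, and keep the model with the best aggregated rank. The candidate models, measures, and toy data are assumptions for illustration; the ensemble described in the abstract repeats a step like this over many bootstrap samples rather than performing it once.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

def best_model_by_rank_aggregation(models, X, y, seed=0):
    """One bagging step: fit candidates on a bootstrap resample, score them on
    the out-of-bag observations with several error measures, and return the
    model with the best mean rank across those measures."""
    rng = np.random.default_rng(seed)
    n = len(y)
    boot = rng.integers(0, n, n)                     # bootstrap indices
    oob = np.setdiff1d(np.arange(n), boot)           # out-of-bag indices
    errors = []
    for m in models:
        m.fit(X[boot], y[boot])
        pred = m.predict(X[oob])
        errors.append([mean_squared_error(y[oob], pred),
                       mean_absolute_error(y[oob], pred)])
    ranks = np.argsort(np.argsort(np.array(errors), axis=0), axis=0)  # rank per measure
    return models[int(ranks.mean(axis=1).argmin())]

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=120)
best = best_model_by_rank_aggregation(
    [LinearRegression(), DecisionTreeRegressor(), KNeighborsRegressor()], X, y)
print(type(best).__name__)
```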

    Big DNA Datasets Analysis under Push down Automata

    A consensus sequence is a significant aid in identifying unknown information about animals, plants and insects around the globe. It represents a small part of deoxyribonucleic acid (DNA), known as a DNA segment, that carries the information needed for investigation and verification. However, very large datasets are the major challenge in mining the accurate meaning of the experiments, and these datasets are growing exponentially every second. In the present article, a memory-saving consensus-finding approach is presented. Principal component analysis (PCA) and independent component analysis (ICA) are used to pre-process the training datasets, and these approaches are compared with the Apriori algorithm. Furthermore, a pushdown automaton (PDA) is applied for superior memory utilization: it iteratively frees the memory used for storing the targeted consensus by removing all datasets that do not match the consensus. Afterwards, the Apriori algorithm selects the desired consensus from the limited set of values stored by the PDA. Finally, the Gauss-Seidel method is used to verify the consensus mathematically.
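    The key memory-saving step described above is to discard, as early as possible, every dataset that cannot match the targeted consensus. The sketch below illustrates that filtering idea with a simple stack of DNA segments; it is only an illustration of the principle and not the authors' pushdown-automaton construction, Apriori step, or Gauss-Seidel verification.

```python
def filter_segments_by_consensus(segments, consensus):
    """Keep only DNA segments that contain the consensus motif; every other
    segment is dropped as soon as it is examined, so it no longer occupies
    memory during the later mining step."""
    stack = list(segments)               # segments waiting to be examined
    kept = []
    while stack:
        segment = stack.pop()
        if consensus in segment:
            kept.append(segment)         # matches: retain for later mining
        # non-matching segments are simply discarded here
    return kept

print(filter_segments_by_consensus(["ACGTAC", "TTTTGG", "GGACGT"], "ACGT"))
```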

    Addressing the challenges of uncertainty in regression models for high dimensional and heterogeneous data from observational studies

    The lack of replicability of research findings from different scientific disciplines has gained wide attention in the last few years and led to extensive discussions. In this 'replication crisis', different types of uncertainty, which occur at different points of data collection and statistical analysis, play an important role. Nevertheless, their consequences are often ignored in current research practice, at the risk of producing research findings of low credibility and reliability. For the analysis of this problem and the development of solutions to it, we define measurement uncertainty, sampling uncertainty, data pre-processing uncertainty, method uncertainty, and model uncertainty, and investigate them in particular in the context of regression analyses. To this end, we consider data from observational studies with a focus on high dimensionality and heterogeneous variables, characteristics of growing importance. High-dimensional data, i.e., data with more variables than observations, play an important role in medical research, where large amounts of molecular data (omics data) can be collected with ever decreasing expense and effort. Where several types of omics data are available, we are additionally faced with heterogeneity. Moreover, heterogeneous data can be found in many observational studies where data originate from different sources or where variables of different types are collected. This work comprises four contributions with different approaches to this topic and different foci of investigation.
    Contribution 1 can be considered a practical example illustrating data pre-processing and method uncertainty in the context of prediction and variable selection from high-dimensional and heterogeneous data. In the first part of this paper, we introduce the development of priority-Lasso, a hierarchical method for prediction using multi-omics data. Priority-Lasso is based on standard Lasso and assumes a pre-specified priority order of blocks of data. The idea is to successively fit Lasso models on these blocks of data and to take the linear predictor from every fit as an offset in the fit of the block with the next-lowest priority. In the second part, we apply this method in a current study of acute myeloid leukemia (AML) and compare its performance to standard Lasso. We illustrate data pre-processing and method uncertainty caused by different choices of variable definitions and specifications of settings in the application of the method. These choices result in different effect estimates and thus different prediction performances and selected variables.
    In the second contribution, we compare method uncertainty with sampling uncertainty in the context of variable selection and the ranking of omics biomarkers. For this purpose, we develop a user-friendly and versatile framework. We apply this framework to data from AML patients with high-dimensional and heterogeneous characteristics and explore three different scenarios: first, variable selection in multivariable regression based on multi-omics data; second, variable ranking based on variable importance measures from random forests; and third, identification of genes based on differential gene expression analysis.
    In contributions 3 and 4, we apply the vibration of effects framework, which was initially used to analyze model uncertainty in a large epidemiological study (NHANES), to assess and compare different types of uncertainty. Both contributions address in depth the methodological extension of this framework to further types of uncertainty. In contribution 3, we describe the extension of the vibration of effects framework to sampling and data pre-processing uncertainty. As a practical illustration, we take a large dataset from psychological research with a heterogeneous variable structure (SAPA project) and examine sampling, model and data pre-processing uncertainty in the context of logistic regression for varying sample sizes. Beyond the comparison of single types of uncertainty, we introduce a strategy that allows quantifying cumulative model and data pre-processing uncertainty and analyzing their relative contributions to the total uncertainty with a variance decomposition. Finally, in contribution 4 we extend the vibration of effects framework to measurement uncertainty. In a practical example, we conduct a comparison study of sampling, model and measurement uncertainty on the NHANES dataset in the context of survival analysis. We focus on different scenarios of measurement uncertainty that differ in the choice of variables considered to be measured with error. Moreover, we analyze the behavior of the different types of uncertainty with increasing sample size in a large simulation study.
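    Since Contribution 1 is the most algorithmic part of this abstract, here is a minimal sketch of the priority-Lasso idea for a continuous outcome: fit a Lasso on the highest-priority block, then carry its linear predictor forward as an offset (implemented here by subtracting it from the outcome) when fitting the next block. The block names, toy data, and use of scikit-learn's LassoCV are assumptions made for illustration; the published method also covers other outcome types.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def priority_lasso(blocks, y):
    """blocks: list of 2-D arrays ordered from highest to lowest priority."""
    offset = np.zeros_like(y, dtype=float)
    fits = []
    for X_block in blocks:
        fit = LassoCV(cv=5).fit(X_block, y - offset)   # offset via residual outcome
        offset = offset + fit.predict(X_block)         # carried into the next block
        fits.append(fit)
    return fits

rng = np.random.default_rng(0)
clinical = rng.normal(size=(100, 5))     # high-priority block (e.g. clinical variables)
omics = rng.normal(size=(100, 200))      # lower-priority block (e.g. gene expression)
y = clinical[:, 0] - clinical[:, 1] + 0.5 * omics[:, 0] + rng.normal(size=100)
fits = priority_lasso([clinical, omics], y)
print([int((f.coef_ != 0).sum()) for f in fits])   # nonzero coefficients per block
```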

    Dynamic And Quantitative Radiomics Analysis In Interventional Radiology

    Interventional Radiology (IR) is a subspecialty of radiology that performs invasive procedures guided by diagnostic imaging for predictive and therapeutic purposes. The development of artificial intelligence (AI) has revolutionized the field of IR. Researchers have created sophisticated models backed by machine learning algorithms and optimization methodologies for image registration, cellular structure detection, and computer-aided disease diagnosis and prognosis prediction. However, because of the incapacity of the human eye to detect tiny structural characteristics and because of inter-radiologist heterogeneity, conventional experience-based visual evaluation in IR may have drawbacks. Radiomics, a technique that utilizes machine learning, offers a practical and quantifiable solution to this issue. It has been used to evaluate tumor heterogeneity that is difficult to detect by the human eye by creating an automated pipeline for the extraction and analysis of high-throughput computational imaging characteristics from radiological medical images. However, it is a demanding task to put radiomics directly into application in IR because of the heterogeneity and complexity of medical imaging data. Furthermore, recent radiomics studies are based on static images, while many clinical applications (such as detecting the occurrence and development of tumors and assessing patient response to chemotherapy and immunotherapy) are dynamic processes. Merely incorporating static features cannot comprehensively reflect the metabolic characteristics and dynamic processes of tumors or soft tissues. To address these issues, we propose a robust feature selection framework to manage high-dimensional, small-sample-size data, and we explore and propose a descriptor, informed by computer vision and physiology, that integrates static radiomics features with time-varying information on tumor dynamics.
    The major contributions of this study are as follows. Firstly, we construct a result-driven feature selection framework that efficiently reduces the dimension of the original feature set. The framework integrates different feature selection techniques to ensure the distinctiveness, uniqueness, and generalization ability of the output feature set. In the task of classifying hepatocellular carcinoma (HCC) versus intrahepatic cholangiocarcinoma (ICC) in primary liver cancer, only three radiomics features (chosen by the proposed framework from more than 1,800 features) achieve an AUC of 0.83 on an independent dataset. We also analyze the features' patterns and contributions to the results, enhancing the clinical interpretability of radiomics biomarkers. Secondly, we explore and build a pulmonary perfusion descriptor based on 18F-FDG whole-body dynamic PET images. Our major novelties include: 1) we propose a descriptor construction framework, interpretable in terms of both physiology and computer vision, that decomposes spatiotemporal information into three dimensions: grey levels, textures, and dynamics; 2) the spatiotemporal comparison of the pulmonary descriptor within and between patients is feasible, making it a potential auxiliary diagnostic tool in pulmonary function assessment; 3) compared with traditional PET metabolic biomarker analysis, the proposed descriptor incorporates the image's temporal information, which enables a better understanding of time-varying mechanisms and the detection of visual perfusion abnormalities across patients; 4) the proposed descriptor eliminates the impact of vascular branching structure and the gravity effect by utilizing time warping algorithms. Our experimental results show that the proposed framework and descriptor are promising tools for medical imaging analysis.
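    The descriptor above relies on time warping to compare dynamic curves between patients. To make that ingredient concrete, the sketch below computes a generic dynamic time warping (DTW) distance between two 1-D time-activity curves; the curves are toy data and this is not the authors' full descriptor pipeline.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D curves, tolerating
    differences in timing/speed (e.g. tracer arrival times between patients)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

curve_1 = [0.0, 0.2, 0.9, 1.0, 0.7, 0.4]
curve_2 = [0.0, 0.1, 0.3, 0.9, 1.0, 0.6, 0.4]
print(dtw_distance(curve_1, curve_2))
```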

    Integrative Data Mining and Meta Analysis of Disease-Specific Large-Scale Genomic,Transcriptomic and Proteomic Data

    During the past decades, large-scale microarray technologies have been applied in genomics, transcriptomics and proteomics. DNA microarrays and mass spectrometry have been used as tools for identifying changes in gene and protein expression and genomic alterations that can be linked to various stages of tumor development. Although these technologies have generated a deluge of data, bioinformatic algorithms still need to be improved to advance the understanding of many fundamental biological questions. In particular, most bioinformatic strategies are optimized for one of these technologies and only allow a one-dimensional view of the biological question. Within this thesis a bioinformatic tool was developed that combines, in an integrative manner, the multidimensional information obtained when analysing genomic, transcriptomic and proteomic data.
    Neuroblastoma is a malignant pediatric tumor of the nervous system. The tumor is characterized by aberration patterns that correlate with patient outcome. aCGH (array comparative genomic hybridization) and DNA microarray gene expression analysis were chosen as appropriate methods to analyse the impact of DNA copy number variations on gene expression in 81 neuroblastoma samples. Within this thesis a novel bioinformatic strategy was used which identifies chromosomal aberrations that influence the expression of genes located at the same (cis-effects) and at different (trans-effects) chromosomal positions in neuroblastoma. Sample-specific cis-effects were identified for the paired data by a probe-matching procedure, gene expression discretization and a correlation score in combination with one-dimensional hierarchical clustering. The graphical representation revealed that tumors with an amplification of the oncogene MYCN had a gain of chromosome 17, whereas genes in cis-position were downregulated. Simultaneously, a loss of chromosome 1 and a downregulation of the corresponding genes hint towards a cross-relationship between chromosomes 17 and 1. A Bayesian network (BN), as a representation of joint probability distributions, was adopted to detect neuroblastoma-specific cis- and trans-effects. The strength of association between aCGH and gene expression data was represented by Markov blankets, which were built up from mutual information. This gave rise to a graphical network that linked DNA copy number changes with genes as well as gene-gene interactions. This method found chromosomal aberrations on 11q and 17q to have a major impact on neuroblastoma. A prominent trans-effect was identified as a gain of 17q23.2 together with an upregulation of CPT1B, which is located at 22q13.33.
    Further, to identify the effects of gene expression changes on protein expression, the bioinformatic tool was expanded to enable the integration of mass spectrometry and DNA microarray data from a set of 53 patients after lung transplantation. The tool was applied for early diagnosis of Bronchiolitis Obliterans Syndrome (BOS), which often occurs in the second year after lung transplantation and leads to rejection of the lung transplant. Gene expression profiles were translated into virtual spectra and linked to their potential mass spectrometry peaks. The correlation score between the virtual and real spectra did not exhibit significant patterns in relation to BOS. However, the meta-analysis approach yielded 15 genes, such as INSL4, CCL26 and FXYD3, that could not be found in the separate analyses of the two data types. These genes constitute potential biomarkers for the detection of BOS.
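    To make the mutual-information step above concrete, here is a minimal sketch that scores the association between a discretized copy-number status and a discretized expression level. The toy categories are illustrative, and this is not the thesis's Bayesian-network or Markov-blanket construction.

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Mutual information (in bits) between two discrete vectors,
    e.g. copy-number status vs. discretized gene expression."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * np.log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi

copy_number = ["gain", "gain", "normal", "loss", "gain", "normal", "loss", "loss"]
expression  = ["up",   "up",   "mid",    "down", "up",   "mid",    "down", "mid"]
print(round(mutual_information(copy_number, expression), 3))
```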

    Application of Novel Statistical Methods for Biomarker Selection to HIV Infection Data

    The past decade has seen an explosion in the availability and use of biomarker data as a result of innovative discoveries and recent development of new biological and molecular techniques. Biomarkers are essential for at least four key purposes in biomedical research and public health practice: they are used for disease detection, diagnosis, prognosis, to identify patients who are most likely to benefit from selected therapies, and to guide clinical decision making. Determining the predictive and diagnostic value of these biomarkers, singly or in combination, is essential to their being used effectively, and this has spurred the development of new statistical methodologies to assess the relationship between biomarkers and clinical outcomes. One active area of research is the development of variable importance measures, a class of estimators that could reliably capture the effect of a specific biomarker on a clinical outcome. The central question addressed in this dissertation is the following: Given a large set of biomarkers that potentially predict a clinical outcome, how can one make a determination as to which ones are the most important? In the first paper, we estimate a targeted variable importance measure through Van der Laan's theory of targeted maximum likelihood estimation in the point treatment setting and use the same objective function to compute an alternative measure of marginal variable importance based on weights from a flexible propensity score model. Covariate-adjusted targeted variable importance measures are compared to estimates from this alternative methodology and to incremental value estimates from partial ROC curves. In the second paper, we extend the applicability of the TMLE methodology to analyze longitudinal repeated measures data. It addresses the gap caused by the absence of a generally accepted approach for generating a longitudinal variable importance index by proposing an estimator involving both TMLE and computation of the area under or above the LOESS curve. A graphical method is proposed for visual assessment of the longevity of a biomarker in terms of its predictive power, information that could be used to determine when repeated measures of a biomarker should be taken. Finally, in the third paper we take right censoring in the outcome variable into consideration and achieve biomarker selection in the presence of confounding and potential informative censoring through the use of stabilized weights in a time-dependent Cox proportional hazards model. A dataset from the Hormonal Contraception and HIV Genital Shedding and Disease Progression Study that includes longitudinal HIV infection data on a sample of 306 HIV-infected adult women from Uganda and Zimbabwe was used to develop and evaluate the methods discussed in the three papers. This study collected information on a number of biomarkers related to HIV infection, including plasma viral load, HIV subtype, CD4 and CD8 lymphocyte counts, hemoglobin level, and herpes simplex virus 2 (HSV-2). The relationships of these biomarkers with changes in CD4 cell counts were considered in three different contexts: cross-sectional, longitudinal and survival. In short, baseline CD4 cell counts, HIV subtype, and HSV-2 were found to be important biomarkers for the outcome variable studied
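    As a small illustration of one ingredient of the second paper, the sketch below summarizes a biomarker's importance over follow-up as the area under a LOESS-smoothed curve of time-specific estimates, integrated with the trapezoidal rule. The toy importance values and the use of statsmodels' lowess are assumptions for illustration; the dissertation pairs this idea with TMLE-based variable importance estimates.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def area_under_loess(times, estimates, frac=0.6):
    """Smooth time-specific importance estimates with LOESS and return the
    area under the smoothed curve (trapezoidal rule)."""
    smoothed = lowess(estimates, times, frac=frac, return_sorted=True)
    return np.trapz(smoothed[:, 1], smoothed[:, 0])

months = np.arange(0, 24, 3)                                        # follow-up visits
toy_importance = np.array([0.9, 0.8, 0.75, 0.6, 0.5, 0.45, 0.4, 0.35])
print(round(float(area_under_loess(months, toy_importance)), 2))
```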