Stable Feature Selection for Biomarker Discovery
Feature selection techniques have long been the workhorse of biomarker
discovery applications. Surprisingly, the stability of feature selection with
respect to sampling variations has long been under-considered, and only
recently has the issue received growing attention. In this article, we review
existing stable feature selection methods for biomarker discovery using a
generic hierarchical framework. We have two objectives: (1) providing an
overview of this new yet fast-growing topic for convenient reference; (2)
categorizing existing methods under an expandable framework for future
research and development.
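To make "stability with respect to sampling variations" concrete, here is a minimal sketch of one way it could be quantified: the mean pairwise Jaccard similarity between the feature subsets selected on different resamples of the same data. The abstract does not name a stability measure, so the choice of Jaccard similarity, the function names and the toy gene labels are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two feature sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def selection_stability(subsets):
    """Mean pairwise Jaccard similarity across selected feature subsets."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two feature subsets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical subsets selected on three resamples of one dataset.
runs = [{"g1", "g2", "g3"}, {"g1", "g2", "g4"}, {"g1", "g2", "g3"}]
stability = selection_stability(runs)
```

A value near 1 means the selector returns nearly the same features whichever samples it sees; a value near 0 means the selected features depend heavily on the particular sample.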
Establishment of an integrative multi-omics expression database CKDdb in the context of chronic kidney disease (CKD)
Complex human traits such as chronic kidney disease (CKD) are a major health and financial burden in modern societies. Currently, CKD onset and progression are still not fully understood at the molecular level. Meanwhile, the prolific use of high-throughput omic technologies in disease biomarker discovery studies has yielded a vast amount of disjointed data that cannot be easily collated. Therefore, we aimed to develop a molecule-centric database featuring CKD-related experiments from the available literature. We established the Chronic Kidney Disease database CKDdb, an integrated and clustered information resource that covers multi-omic studies (microRNAs, genomics, peptidomics, proteomics and metabolomics) of CKD and related disorders by performing literature data mining and manual curation. The CKDdb database contains differential expression data from 49,395 molecule entries (redundant), of which 16,885 are unique molecules (non-redundant), from 377 manually curated studies of 230 publications. This database was intentionally built to allow disease pathway analysis through a systems approach in order to yield biological meaning by integrating all existing information, and it therefore has the potential to unravel, and give an in-depth understanding of, the key molecular events that modulate CKD pathogenesis.
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.
EFSIS: Ensemble Feature Selection Integrating Stability
Ensemble learning that can be used to combine the predictions from multiple
learners has been widely applied in pattern recognition, and has been reported
to be more robust and accurate than the individual learners. This ensemble
logic has recently also been applied increasingly to feature selection. There are
basically two strategies for ensemble feature selection, namely data
perturbation and function perturbation. Data perturbation performs feature
selection on data subsets sampled from the original dataset and then selects
the features consistently ranked highly across those data subsets. This has
been found to improve both the stability of the selector and the prediction
accuracy for a classifier. Function perturbation frees the user from having to
decide on the most appropriate selector for any given situation and works by
aggregating multiple selectors. This has been found to maintain or improve
classification performance. Here we propose a framework, EFSIS, combining these
two strategies. Empirical results indicate that EFSIS gives both high
prediction accuracy and stability.
Comment: 20 pages, 3 figures
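The data perturbation strategy described in this abstract can be sketched as follows. This is not the EFSIS framework itself: it illustrates only the data perturbation half, using bootstrap resampling, a toy selector (absolute difference of class means) and mean-rank aggregation, all of which are assumptions chosen for brevity. The function names and toy data are hypothetical.

```python
import random
from statistics import mean


def score_features(X, y):
    """Toy selector: score each feature by |mean(class 1) - mean(class 0)|."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    return scores


def ranks(scores):
    """Convert scores to ranks, where rank 0 is the highest-scoring feature."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    result = [0] * len(scores)
    for position, j in enumerate(order):
        result[j] = position
    return result


def ensemble_select(X, y, k, n_resamples=50, seed=0):
    """Select the k features with the best summed rank over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(X)
    total_rank = [0] * len(X[0])
    done = 0
    while done < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ys = [y[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # the toy selector needs both classes present
        Xs = [X[i] for i in idx]
        for j, r in enumerate(ranks(score_features(Xs, ys))):
            total_rank[j] += r
        done += 1
    return sorted(range(len(total_rank)), key=lambda j: total_rank[j])[:k]


# Toy data: feature 0 separates the classes, features 1 and 2 are noise.
data_rng = random.Random(1)
y = [0] * 10 + [1] * 10
X = [[10 * label + data_rng.random(), data_rng.random(), data_rng.random()]
     for label in y]
selected = ensemble_select(X, y, k=1)
```

Function perturbation would, under the same assumptions, replace the single `score_features` selector with several different selectors and aggregate their ranks in the same way.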
BcCluster: a bladder cancer database at the molecular level
Background:
Bladder cancer (BC) has two clearly distinct phenotypes. Non-muscle invasive BC has a good prognosis and is treated with tumor resection and intravesical therapy, whereas muscle invasive BC has a poor prognosis and usually requires systemic cisplatin-based chemotherapy either prior to or after radical cystectomy. Neoadjuvant chemotherapy is not often used for patients undergoing cystectomy. High-throughput analytical omics techniques are now available that allow the identification of individual molecular signatures to characterize the invasive phenotype. However, a large amount of the data produced by omics experiments is not easily accessible, since it is often scattered over many publications or stored in supplementary files.
Objective:
To develop a novel open-source database, BcCluster (http://www.bccluster.org/), dedicated to the comprehensive molecular characterization of muscle invasive bladder carcinoma.
Materials:
A database was created containing all reported molecular features significant in invasive BC. The query interface was developed in the Ruby programming language (version 1.9.3) using the web framework Rails (version 4.1.5) (http://rubyonrails.org/).
Results:
BcCluster contains data from 112 published references, providing 1,559 statistically significant features relative to BC invasion. The database also holds 435 protein-protein interaction entries and 92 molecular pathways significant in BC invasion. The database can be used to retrieve binding partners and pathways for any protein of interest. We illustrate this possibility using survivin, a known BC biomarker.
Conclusions:
BcCluster is an online database for retrieving molecular signatures relative to BC invasion. This application offers a comprehensive view of BC invasiveness at the molecular level and allows formulation of research hypotheses relevant to this phenotype.
Quantification and expert evaluation of evidence for chemopredictive biomarkers to personalize cancer treatment.
Predictive biomarkers have the potential to facilitate cancer precision medicine by guiding the optimal choice of therapies for patients. However, clinicians are faced with an enormous volume of often-contradictory evidence regarding the therapeutic context of chemopredictive biomarkers. We extensively surveyed public literature to systematically review the predictive effect of 7 biomarkers claimed to predict response to various chemotherapy drugs: ERCC1-platinums, RRM1-gemcitabine, TYMS-5-fluorouracil/Capecitabine, TUBB3-taxanes, MGMT-temozolomide, TOP1-irinotecan/topotecan, and TOP2A-anthracyclines. We focused on studies that investigated changes in gene or protein expression as predictors of drug sensitivity or resistance. We considered an evidence framework that ranked studies from high level I evidence for randomized controlled trials to low level IV evidence for pre-clinical studies and patient case studies. We found that further in-depth analysis will be required to explore methodological issues, inconsistencies between studies, and tumor-specific effects present even within high evidence level studies. Some of these nuances will lend themselves to automation; others will require manual curation. However, the comprehensive cataloging and analysis of dispersed public data utilizing an evidence framework provides a high-level perspective on the clinical actionability of these protein biomarkers. This framework and perspective will ultimately facilitate clinical trial design as well as therapeutic decision-making for individual patients.
PeptiCKDdb-peptide- and protein-centric database for the investigation of genesis and progression of chronic kidney disease
The peptiCKDdb is a publicly available database platform dedicated to supporting research in the field of chronic kidney disease (CKD) through identification of novel biomarkers and molecular features of this complex pathology. PeptiCKDdb collects peptidomics and proteomics datasets manually extracted from published studies related to CKD. Datasets from peptidomics or proteomics, human case/control studies on CKD and kidney or urine profiling were included. Data from 114 publications (studies of body fluids and kidney tissue: 26 peptidomics and 76 proteomics manuscripts on human CKD, and 12 focusing on healthy proteome profiling) are currently deposited, and the content is updated quarterly. Extracted datasets include information about the experimental setup, clinical study design, discovery-validation sample sizes and the list of differentially expressed proteins (P-value < 0.05). A dedicated interactive web interface, equipped with a multiparametric search engine, data export and visualization tools, enables easy browsing of the data and comprehensive analysis. In conclusion, this repository might serve as a source of data for integrative analysis or a knowledgebase for scientists seeking confirmation of their findings and, as such, is expected to facilitate the modeling of molecular mechanisms underlying CKD and the identification of biologically relevant biomarkers.
Database URL: www.peptickddb.com
Boosting the concordance index for survival data - a unified framework to derive and evaluate biomarker combinations
The development of molecular signatures for the prediction of time-to-event
outcomes is a methodologically challenging task in bioinformatics and
biostatistics. Although there are numerous approaches for the derivation of
marker combinations and their evaluation, the underlying methodology often
suffers from the problem that different optimization criteria are mixed during
the feature selection, estimation and evaluation steps. This might result in
marker combinations that are only suboptimal regarding the evaluation criterion
of interest. To address this issue, we propose a unified framework to derive
and evaluate biomarker combinations. Our approach is based on the concordance
index for time-to-event data, which is a non-parametric measure to quantify the
discriminatory power of a prediction rule. Specifically, we propose a
component-wise boosting algorithm that results in linear biomarker combinations
that are optimal with respect to a smoothed version of the concordance index.
We investigate the performance of our algorithm in a large-scale simulation
study and in two molecular data sets for the prediction of survival in breast
cancer patients. Our numerical results show that the new approach is not only
methodologically sound but can also lead to a higher discriminatory power than
traditional approaches for the derivation of gene signatures.
Comment: revised manuscript - added simulation study, additional results
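For readers unfamiliar with the concordance index for time-to-event data, the following minimal sketch computes a plain, unsmoothed version (the estimator commonly attributed to Harrell). It is not the smoothed criterion or the boosting algorithm proposed in the abstract; the function name, the tie handling and the toy data are assumptions made for illustration.

```python
def concordance_index(times, events, risk):
    """Fraction of comparable pairs in which the higher-risk subject fails first.

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if the subject was censored
    risk   : predicted risk scores (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is only comparable if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: four subjects, the third one censored at time 6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.6, 0.1]
c_index = concordance_index(times, events, risk)  # 4 of 5 comparable pairs agree
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering. Because this quantity is a step function of the risk scores, it cannot be optimized directly by gradient-based methods, which is why the abstract works with a smoothed version.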