753 research outputs found

    The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures

    Motivation: Biomarker discovery from high-dimensional data is a crucial problem with enormous applications in biology and medicine. It is also extremely challenging from a statistical viewpoint, but surprisingly few studies have investigated the relative strengths and weaknesses of the plethora of existing feature selection methods. Methods: We compare 32 feature selection methods on 4 public gene expression datasets for breast cancer prognosis, in terms of predictive performance, stability and functional interpretability of the signatures they produce. Results: We observe that the feature selection method has a significant influence on the accuracy, stability and interpretability of signatures. Simple filter methods generally outperform more complex embedded or wrapper methods, and ensemble feature selection has generally no positive effect. Overall a simple Student's t-test seems to provide the best results. Availability: Code and data are publicly available at http://cbio.ensmp.fr/~ahaury/
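As an illustration of the kind of simple filter method the abstract refers to, the sketch below ranks features by the absolute value of a two-sample Student's t statistic (pooled variance). It is a minimal sketch, not the authors' code: the function names, the toy matrix and the 0/1 class labels are assumptions made for this example.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(a, b):
    """Two-sample Student's t statistic with pooled variance.

    Assumes each group has at least two values and the pooled variance is non-zero.
    """
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


def rank_features(X, y, k):
    """Return the indices of the k features with the largest |t| between classes 0 and 1."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append(abs(t_statistic(a, b)))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


# Hypothetical toy data: 6 samples, 3 "genes"; only gene 1 separates the two classes.
X = [
    [1.0, 0.1, 5.0],
    [1.2, 0.2, 4.0],
    [0.9, 0.0, 6.0],
    [1.1, 2.1, 5.5],
    [1.0, 2.0, 4.5],
    [1.3, 1.9, 5.0],
]
y = [0, 0, 0, 1, 1, 1]
top = rank_features(X, y, k=1)
print(top)
```

A filter of this kind scores each feature independently of any classifier, which is what distinguishes it from the embedded and wrapper methods mentioned in the abstract.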

    Classification of breast cancer grades using physical parameters and K-nearest neighbor method

    Breast cancer is a worldwide health problem, and addressing it requires early detection. The purpose of this study is to classify early breast cancer grades. A combination of physical parameters with the k-nearest neighbor method is proposed to detect early breast cancer grades. The experiments were performed on 87 mammograms, consisting of 12 mammograms of grade 1, 41 mammograms of grade 2 and 34 mammograms of grade 3. The proposed method classified the grades of breast cancer with an accuracy of 64.36%, a sensitivity of 50% and a specificity of 73.5%. Physical parameters can be used to classify grades of breast cancer. The results of this study can be used to complement the diagnosis of breast mammography examination.
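To make the terms in this abstract concrete, here is a minimal sketch of a k-nearest neighbor vote and of how accuracy, sensitivity and specificity are computed from confusion counts. The feature vectors, labels and counts are hypothetical and are not taken from the study; the abstract does not state which physical parameters, distance measure or value of k were used, so Euclidean distance and k = 3 are assumptions.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training points (Euclidean distance)."""
    neighbours = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def binary_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity and specificity from the four confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }


# Hypothetical two-dimensional "physical parameter" vectors with grade labels.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = [1, 1, 1, 3, 3, 3]

low = knn_predict(train_X, train_y, (0.15, 0.15))
high = knn_predict(train_X, train_y, (0.95, 1.0))

# Hypothetical confusion counts, chosen only to show the arithmetic.
metrics = binary_metrics(tp=10, fn=10, fp=5, tn=15)
print(low, high, metrics)
```

For a three-grade problem such as the one described, sensitivity and specificity are usually reported per grade in a one-versus-rest fashion; the abstract does not say which convention was used.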

    Pipeline design to identify key features and classify the chemotherapy response on lung cancer patients using large-scale genetic data

    Background: During the last decade, interest in applying machine learning algorithms to genomic data has increased in many bioinformatics applications. Analyzing this type of data entails difficulties in managing high-dimensional data, handling class imbalance for knowledge extraction, identifying important features and classifying individuals. In this study, we propose a general framework to tackle these challenges with different machine learning algorithms and techniques. We apply the configuration of this framework to lung cancer patients, identifying genetic signatures for classifying response to drug treatment. We intersect these relevant SNPs with the GWAS Catalog of the National Human Genome Research Institute and explore the RegulomeDB and GTEx databases for functional analysis purposes. Results: The machine learning-based solution proposed in this study is a scalable and flexible alternative to the classical uni-variate regression approach to analyze large-scale data. From 36 experiments executed using the machine learning framework design, we obtain good classification performance from the top 5 models with the highest cross-validation score and the smallest standard deviation. One thousand two hundred twenty-four SNPs corresponding to the key features from the top 20 models (cross-validation F1 mean >= 0.65) were compared with the GWAS Catalog, finding no intersection with genome-wide significant reported hits. From these, new genetic signatures in MAE, CEP104, PRKCZ and ADRB2 show relevant biological regulatory functionality related to lung physiology. Conclusions: We have defined a machine learning framework using data with an unbalanced large data-set of SNP-arrays and imputed genotyping data from a pharmacogenomics study in lung cancer patients subjected to first-line platinum-based treatment. 
This approach found genome signals with no genome-wide significance in the uni-variate regression approach (GWAS Catalog) that are valuable for classifying patients, only a few of them with related biological function. The effects of these variants can be explained by the recently proposed omnigenic model hypothesis, which states that complex traits can be influenced not only by the “core genes”, mainly found by the genome-wide significant SNPs, but also by the rest of the genes outside of the “core pathways”, with apparently unrelated biological functionality.
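The model-selection step described in this abstract (keeping models whose cross-validation F1 mean reaches a threshold, then preferring a high mean and a small standard deviation) can be sketched as follows. The model names and fold scores are hypothetical, and the tie-breaking order is an assumption; only the 0.65 threshold comes from the abstract.

```python
from statistics import mean, stdev


def rank_models(cv_scores, min_mean=0.65):
    """Keep models with mean CV F1 >= min_mean; sort by highest mean, then smallest std."""
    summary = [(name, mean(scores), stdev(scores)) for name, scores in cv_scores.items()]
    kept = [row for row in summary if row[1] >= min_mean]
    return sorted(kept, key=lambda row: (-row[1], row[2]))


# Hypothetical per-fold F1 scores for three model configurations.
cv_scores = {
    "model_a": [0.70, 0.68, 0.72, 0.69, 0.71],
    "model_b": [0.66, 0.80, 0.55, 0.75, 0.64],
    "model_c": [0.60, 0.61, 0.59, 0.62, 0.58],
}

ranking = rank_models(cv_scores)
names = [name for name, _, _ in ranking]
print(names)
```

Here the third configuration is dropped by the threshold, and the first ranks above the second because its mean is higher and its fold-to-fold spread is smaller.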

    Predicting breast cancer risk, recurrence and survivability

    This thesis focuses on predicting breast cancer at early stages by using machine learning algorithms based on biological datasets. The accuracy of those algorithms has been improved to enable the physicians to enhance the success of treatment, thus saving lives and avoiding several further medical tests

    A Machine Learning Framework for Identifying Molecular Biomarkers from Transcriptomic Cancer Data

    Cancer is a complex molecular process due to abnormal changes in the genome, such as mutation and copy number variation, and epigenetic aberrations such as dysregulations of long non-coding RNA (lncRNA). These abnormal changes are reflected in the transcriptome by turning oncogenes on and tumor suppressor genes off, which are considered cancer biomarkers. However, transcriptomic data is high dimensional, and finding the best subset of genes (features) related to causing cancer is computationally challenging and expensive. Thus, developing a feature selection framework to discover molecular biomarkers for cancer is critical. Traditional approaches for biomarker discovery calculate the fold change for each gene, comparing expression profiles between tumor and healthy samples, thus failing to capture the combined effect of the whole gene set. Also, these approaches do not always investigate cancer-type prediction capabilities using discovered biomarkers. In this work, we proposed a machine learning-based framework to address all of the above challenges in discovering lncRNA biomarkers. First, we developed a machine learning pipeline that takes lncRNA expression profiles of cancer samples as input and outputs a small set of key lncRNAs that can accurately predict multiple cancer types. A significant innovation of our work is its ability to identify biomarkers without using healthy samples. However, this initial framework cannot identify cancer-specific lncRNAs. Second, we extended our framework to identify cancer type and subtype-specific lncRNAs. Third, we proposed to use a state-of-the-art deep learning algorithm, the concrete autoencoder (CAE), in an unsupervised setting, which efficiently identifies a subset of the most informative features. However, CAE does not identify reproducible features in different runs due to its stochastic nature. Thus, we proposed a multi-run CAE (mrCAE) to identify a stable set of features to address this issue. 
Our deep learning-based pipeline significantly extended the previous state-of-the-art feature selection techniques. Finally, we showed that discovered biomarkers are biologically relevant using literature review and prognostically significant using survival analyses. The discovered novel biomarkers could be used as a screening tool for different cancer diagnoses and as therapeutic targets
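The idea of stabilizing a stochastic feature selector by running it several times can be sketched as a frequency count over runs. This is one plausible aggregation rule and not necessarily the one used by mrCAE, which the abstract does not specify; the feature names, the per-run selections and the 60% threshold are all hypothetical.

```python
from collections import Counter


def stable_features(runs, min_fraction=0.6):
    """Return the features selected in at least min_fraction of the runs, sorted by name."""
    counts = Counter(feature for selected in runs for feature in set(selected))
    threshold = min_fraction * len(runs)
    return sorted(feature for feature, count in counts.items() if count >= threshold)


# Hypothetical output of five runs of a stochastic selector, three features per run.
runs = [
    ["lnc1", "lnc2", "lnc3"],
    ["lnc1", "lnc3", "lnc4"],
    ["lnc1", "lnc2", "lnc5"],
    ["lnc1", "lnc3", "lnc6"],
    ["lnc1", "lnc2", "lnc3"],
]

stable = stable_features(runs)
print(stable)
```

Features that appear in only one run are discarded, while those that recur across runs are kept, which is the sense in which the selected set becomes reproducible.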

    Statistical Analysis and Deep Learning Associated Modeling for Early stage Detection of Carinoma

    The high death rate and overall complexity of the cancer epidemic is a global health crisis. Progress in cancer prediction based on gene expression has accelerated thanks to rapid advances in modern high-throughput sequencing methods and a wide range of machine learning techniques, bringing insights into efficient and precise treatment decision-making. Therefore, it is of significant interest to create machine learning systems that accurately distinguish cancer patients from healthy people. Although several classification systems have been applied to cancer prediction, no single strategy has proven superior. This research shows how to apply deep learning to an optimization method that uses numerous machine learning models. Statistical analysis is used to choose informative genes, which are then fed to five different classification models. The results from the five different classifiers are ensembled in the next step using a deep learning technique. The three most common types of adenocarcinoma are those of the lungs, stomach, and breasts. The suggested deep learning-based inter-ensemble model was tested with deep learning-based algorithms on carcinoma data. The results of the tests show that, relative to using only one set of classifiers or the simple consensus algorithm, it improves the precision of cancer prognosis in every analyzed carcinoma dataset. The suggested deep learning-based inter-ensemble approach is demonstrated to be reliable and efficient for cancer diagnosis by making full use of diverse classifiers.
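The contrast drawn in this abstract between a simple consensus of classifiers and a learned combination of their outputs can be sketched as below. A single-layer logistic meta-learner stands in for the deep network, which is a deliberate simplification; the base-classifier probabilities and labels are hypothetical, and the meta-learner is fitted and evaluated on the same toy data, so the result illustrates the mechanism only and says nothing about generalization.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def majority_vote(row):
    """Simple consensus: predict 1 if more than half of the base classifiers say > 0.5."""
    return int(sum(p > 0.5 for p in row) * 2 > len(row))


def fit_meta(P, y, lr=0.5, epochs=2000):
    """Fit a logistic meta-learner on base-classifier outputs by batch gradient descent."""
    n, m = len(P), len(P[0])
    w, b = [0.0] * m, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * m, 0.0
        for row, label in zip(P, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) - label
            for j in range(m):
                grad_w[j] += err * row[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def meta_predict(w, b, row):
    return int(sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) > 0.5)


# Hypothetical predicted probabilities from three base classifiers.
# Classifier 0 tracks the label; classifiers 1 and 2 are systematically wrong.
P = [
    [0.9, 0.2, 0.3],
    [0.8, 0.4, 0.1],
    [0.7, 0.3, 0.2],
    [0.2, 0.8, 0.7],
    [0.1, 0.6, 0.9],
    [0.3, 0.7, 0.6],
]
y = [1, 1, 1, 0, 0, 0]

w, b = fit_meta(P, y)
vote_acc = sum(majority_vote(row) == label for row, label in zip(P, y)) / len(y)
meta_acc = sum(meta_predict(w, b, row) == label for row, label in zip(P, y)) / len(y)
print(vote_acc, meta_acc)
```

On this toy data the consensus is outvoted by the two unreliable classifiers, whereas the learned combination can assign them negative weight.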