585 research outputs found

    Simple and Effective Visual Models for Gene Expression Cancer Diagnostics

    In this paper we show that diagnostic classes in cancer gene expression data sets, which most often include thousands of features (genes), may be effectively separated with simple two-dimensional plots such as the scatterplot and the radviz graph. The principal innovation proposed in the paper is a method called VizRank, which scores candidate projections and identifies the best among possibly millions of them for visualization. Compared with techniques that have recently been widely applied in cancer genomics, including neural networks, support vector machines and various ensemble-based approaches, VizRank is fast and finds visualization models that can be easily examined and interpreted by domain experts. Our experiments on a number of gene expression data sets show that VizRank was always able to find data visualizations with a small number of genes (two to seven) and excellent class separation. In addition to providing grounds for gene expression cancer diagnosis, VizRank and its visualizations also identify small sets of relevant genes, uncover interesting gene interactions and point to outliers and potential misclassifications in cancer data sets.
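The idea of scoring candidate projections can be illustrated with a minimal sketch. The abstract does not say how VizRank computes its score, so the scoring rule used here (leave-one-out k-nearest-neighbour accuracy on each two-gene scatterplot) is an assumption, and the names `loo_knn_accuracy` and `rank_gene_pairs` are hypothetical. It is not the authors' implementation.

```python
from collections import Counter
from itertools import combinations
from math import dist


def loo_knn_accuracy(points, labels, k=3):
    """Leave-one-out k-NN accuracy for samples in a low-dimensional projection."""
    correct = 0
    for i, p in enumerate(points):
        # Rank every other sample by Euclidean distance to sample i.
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )[:k]
        # Majority vote; ties go to the label seen first (the nearest neighbour).
        predicted = Counter(labels[j] for j in neighbours).most_common(1)[0][0]
        correct += predicted == labels[i]
    return correct / len(points)


def rank_gene_pairs(expression, labels, k=3):
    """Score every two-gene scatterplot; return (score, gene_a, gene_b), best first.

    `expression` maps a gene name to its list of per-sample values.
    Assumed scoring rule: leave-one-out k-NN accuracy (not taken from the abstract).
    """
    scored = []
    for a, b in combinations(sorted(expression), 2):
        points = list(zip(expression[a], expression[b]))
        scored.append((loo_knn_accuracy(points, labels, k), a, b))
    return sorted(scored, key=lambda t: -t[0])
```

Exhaustive enumeration is only practical for a small gene set. With thousands of genes, some pre-filtering or heuristic search over pairs would be needed, which this sketch does not attempt.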

    Intelligent techniques using molecular data analysis in leukaemia: an opportunity for personalized medicine support system

    The use of intelligent techniques in medicine has brought a ray of hope for treating leukaemia patients. Personalized treatment uses a patient's genetic profile to select a mode of treatment. This process makes use of molecular technology and machine learning to determine the most suitable approach to treating a leukaemia patient. Until now, no reviews have been published from a computational perspective concerning the development of intelligent techniques for personalized medicine in leukaemia patients using molecular data analysis. This review studies the published empirical research on personalized medicine in leukaemia and synthesizes findings across studies related to intelligent techniques in leukaemia, with specific attention to particular categories of these studies, to help identify opportunities for further research into personalized medicine support systems in chronic myeloid leukaemia. A systematic search was carried out to identify studies using intelligent techniques in leukaemia and to categorize these studies by leukaemia type and by the task, data source, and purpose of each study. Most studies used molecular data analysis for personalized medicine, but future advancement for leukaemia patients requires molecular models that use advanced machine-learning methods to automate decision-making in treatment management and to deliver supportive medical information to the patient in clinical practice. Haneen Banjar, David Adelson, Fred Brown, and Naeem Chaudhr

    Accurate molecular classification of cancer using simple rules

    Background: One intractable problem with using microarray data analysis for cancer classification is how to reduce the extremely high-dimensionality gene feature data to remove the effects of noise. Feature selection is often used to address this problem by selecting informative genes from among thousands or tens of thousands of genes. However, most of the existing methods of microarray-based cancer classification utilize too many genes to achieve accurate classification, which often hampers the interpretability of the models. For a better understanding of the classification results, it is desirable to develop simpler rule-based models with as few marker genes as possible. Methods: We screened a small number of informative single genes and gene pairs on the basis of their depended degrees proposed in rough sets. Applying the decision rules induced by the selected genes or gene pairs, we constructed cancer classifiers. We tested the efficacy of the classifiers by leave-one-out cross-validation (LOOCV) of training sets and classification of independent test sets. Results: We applied our methods to five cancerous gene expression datasets: leukemia (acute lymphoblastic leukemia [ALL] vs. acute myeloid leukemia [AML]), lung cancer, prostate cancer, breast cancer, and leukemia (ALL vs. mixed-lineage leukemia [MLL] vs. AML). Accurate classification outcomes were obtained by utilizing just one or two genes. Some genes that correlated closely with the pathogenesis of relevant cancers were identified. In terms of both classification performance and algorithm simplicity, our approach outperformed or at least matched existing methods. Conclusion: In cancerous gene expression datasets, a small number of genes, even one or two if selected correctly, is capable of achieving an ideal cancer classification effect. This finding also means that very simple rules may perform well for cancerous class prediction.
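The "depended degree" used to screen genes corresponds to what rough set theory usually calls the dependency degree: the fraction of samples whose values on the chosen attributes determine their class unambiguously. The sketch below shows that standard definition. It assumes expression values have already been discretized (the abstract does not say how), and the function name `dependency_degree` and the data layout are hypothetical rather than the authors' code.

```python
from collections import defaultdict


def dependency_degree(samples, labels, attributes):
    """Fraction of samples whose values on `attributes` determine the class label.

    `samples` is a list of dicts mapping attribute (gene) name -> discretized value.
    Assumption: values are already discretized, e.g. "low" / "high".
    """
    # Group samples into equivalence classes by their values on the chosen attributes.
    labels_seen = defaultdict(set)
    sizes = defaultdict(int)
    for sample, label in zip(samples, labels):
        key = tuple(sample[a] for a in attributes)
        labels_seen[key].add(label)
        sizes[key] += 1
    # Positive region: equivalence classes that contain a single decision class.
    positive = sum(sizes[key] for key, seen in labels_seen.items() if len(seen) == 1)
    return positive / len(samples)
```

A value of 1.0 for a single gene or a gene pair means that every combination of its discretized values maps to exactly one class, which is the situation in which a one- or two-gene decision rule can classify the training samples without error.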

    Hematological image analysis for acute lymphoblastic leukemia detection and classification

    Microscopic analysis of peripheral blood smears is a critical step in the detection of leukemia. However, this type of light microscopic assessment is time consuming, inherently subjective, and governed by the hematopathologist's clinical acumen and experience. To circumvent such problems, an efficient computer-aided methodology for quantitative analysis of peripheral blood samples is required. In this thesis, efforts are therefore made to devise methodologies for automated detection and subclassification of Acute Lymphoblastic Leukemia (ALL) using image processing and machine learning methods. The choice of an appropriate segmentation scheme plays a vital role in the automated disease recognition process. Accordingly, novel schemes are proposed to segment normal mature lymphocyte and malignant lymphoblast images into their constituent morphological regions. To make the proposed schemes viable from a practical and real-time standpoint, the segmentation problem is addressed in both supervised and unsupervised frameworks. The proposed methods are based on neural networks, feature space clustering, and Markov random field modeling, where the segmentation problem is formulated as a pixel classification, pixel clustering, and pixel labeling problem, respectively. A comprehensive validation analysis is presented to evaluate the performance of four proposed lymphocyte image segmentation schemes against manual segmentation results provided by a panel of hematopathologists. It is observed that morphological components of normal and malignant lymphocytes differ significantly. To automatically recognize lymphoblasts and detect ALL in peripheral blood samples, an efficient methodology is proposed. Morphological, textural and color features are extracted from the segmented nucleus and cytoplasm regions of the lymphocyte images.
An ensemble of classifiers, denoted EOC3 and comprising three classifiers, achieves the highest classification accuracy (94.73%) compared with its individual members. The subclassification of ALL based on French-American-British (FAB) and World Health Organization (WHO) criteria is essential for prognosis and treatment planning. Accordingly, two independent methodologies are proposed for automated classification of malignant lymphocyte (lymphoblast) images based on morphology and phenotype. These methods include lymphoblast image segmentation, nucleus and cytoplasm feature extraction, and efficient classification.

    Optimization Based Tumor Classification from Microarray Gene Expression Data

    An important use of data obtained from microarray measurements is the classification of tumor types with respect to genes that are either up- or down-regulated in specific cancer types. A number of algorithms have been proposed to obtain such classifications. These algorithms usually require parameter optimization to obtain accurate results, depending on the type of data. Additionally, it is critical to find an optimal set of markers among those up- or down-regulated genes that can be clinically utilized to build assays for the diagnosis of specific cancer types or for following their progression. In this paper, we employ a mixed integer programming based classification algorithm named the hyper-box enclosure (HBE) method for the classification of some cancer types with a minimal set of predictor genes. This optimization-based method, which is a user-friendly and efficient classifier, may allow clinicians to diagnose and follow the progression of certain cancer types. We apply the HBE algorithm to some well-known data sets, such as leukemia, prostate cancer, diffuse large B-cell lymphoma (DLBCL), and small round blue cell tumors (SRBCT), to find predictor genes that can be utilized for diagnosis and prognosis in a robust manner with high accuracy. Our approach does not require any modification or parameter optimization for each data set. Additionally, the information gain attribute evaluator, the relief attribute evaluator and correlation-based feature selection methods are employed for gene selection. The results are compared with those from other studies, and the biological roles of the selected genes in the corresponding cancer types are described. Overall, the performance of our algorithm was better than that of the other algorithms reported in the literature and of the classifiers found in the WEKA data-mining package.
Since it does not require parameter optimization and consistently achieves very high prediction rates on different types of data sets, the HBE method is an effective and consistent tool for cancer type prediction with a small number of gene markers.

    Cancer proteogenomics: connecting genotype to molecular phenotype

    The central dogma of molecular biology describes the one-way road from DNA to RNA and finally to protein. Yet, how this flow of information encoded in DNA as genes (genotype) is regulated in order to produce the observable traits of an individual (phenotype) remains unanswered. Recent advances in high-throughput data, i.e., 'omics', have allowed the quantification of DNA, RNA and protein levels, leading to integrative analyses that essentially probe the central dogma along all of its constituent molecules. Evidence from these analyses suggests that mRNA abundances are at best a moderate proxy for proteins, which are the main functional units of cells and thus closer to the phenotype. Cancer proteogenomic studies consider the ensemble of proteins, the so-called proteome, as the readout of the functional molecular phenotype and investigate how it is influenced by upstream events, for example DNA copy number alterations. In typical proteogenomic studies, however, the identified proteome is a simplification of its actual composition, as these studies methodologically disregard events such as splicing, proteolytic cleavage and post-translational modifications that generate unique protein species, known as proteoforms. The scope of this thesis is to study proteome diversity in terms of: a) the complex genetic background of three tumor types, i.e. breast cancer, childhood acute lymphoblastic leukemia and lung cancer, and b) the proteoform composition, describing a computational method for detecting protein species based on their distinct quantitative profiles. In Paper I, we present a proteogenomic landscape of 45 breast cancer samples representative of the five PAM50 intrinsic subtypes. We studied the effect of copy number alterations (CNA) on mRNA and protein levels, overlaying a public dataset of drug-perturbed protein degradation.
In Paper II, we describe a proteogenomic analysis of 27 B-cell precursor acute lymphoblastic leukemia clinical samples that compares high hyperdiploid versus ETV6/RUNX1-positive cases. We examined the impact of the amplified chromosomes on mRNA and protein abundance, specifically the linear trend between the amplification level and the dosage effect. Moreover, we investigated mRNA-protein quantitative discrepancies with regard to post-transcriptional and post-translational effects such as mRNA/protein stability and miRNA targeting. In Paper III, we describe a proteogenomic cohort of 141 non-small cell lung cancer clinical samples. We used clustering methods to identify six distinct proteome-based subtypes. We integrated the protein abundances in pathways using protein-protein correlation networks, bioinformatically deconvoluted the immune composition and characterized the neoantigen burden. In Paper IV, we developed a pipeline for proteoform detection from bottom-up mass-spectrometry-based proteomics. Using an in-depth proteomics dataset of 18 cancer cell lines, we identified proteoforms related to splice variant peptides supported by RNA-seq data. This thesis adds to the previous literature of proteogenomic studies by analyzing the tumor proteome and its regulation along the flow of the central dogma of molecular biology. It is anticipated that some of these findings will lead to novel insights about tumor biology and set the stage for clinical applications to improve current cancer patient care.

    Discovery of dominant and dormant genes from expression data using a novel generalization of SNR for multi-class problems

    Background: The Signal-to-Noise-Ratio (SNR) is often used for identification of biomarkers for two-class problems, but no formal and useful generalization of SNR is available for multiclass problems. We propose innovative generalizations of SNR for multiclass cancer discrimination through introduction of two indices, Gene Dominant Index and Gene Dormant Index (GDIs). These two indices lead to the concepts of dominant and dormant genes with biological significance. We use these indices to develop methodologies for discovery of dominant and dormant biomarkers with interesting biological significance. The dominancy and dormancy of the identified biomarkers and their excellent discriminating power are also demonstrated pictorially using the scatterplot of individual genes and a 2-D Sammon's projection of the selected set of genes. Using information from the literature, we have shown that the GDI based method can identify dominant and dormant genes that play significant roles in cancer biology. These biomarkers are also used to design diagnostic prediction systems. Results and discussion: To evaluate the effectiveness of the GDIs, we have used four multiclass cancer data sets (Small Round Blue Cell Tumors, Leukemia, Central Nervous System Tumors, and Lung Cancer). For each data set we demonstrate that the new indices can find biologically meaningful genes that can act as biomarkers. We then use six machine learning tools: Nearest Neighbor Classifier (NNC), Nearest Mean Classifier (NMC), Support Vector Machine (SVM) classifier with linear kernel, and SVM classifier with Gaussian kernel, where both SVMs are used in conjunction with one-vs-all (OVA) and one-vs-one (OVO) strategies. We found GDIs to be very effective in identifying biomarkers with strong class specific signatures.
With all six tools and for all data sets we could achieve better or comparable prediction accuracies, usually with fewer marker genes than results reported in the literature using the same computational protocols. The dominant genes are usually easy to find, while good dormant genes may not always be available, as dormant genes require stronger constraints to be satisfied; but when they are available, they can be used for authentication of diagnosis. Conclusion: Since GDI based schemes can find a small set of dominant/dormant biomarkers that is adequate to design diagnostic prediction systems, they open up the possibility of using real-time qPCR assays or antibody based methods such as ELISA for easy and low cost diagnosis of diseases. The dominant and dormant genes found by GDIs can be used in different ways to design more reliable diagnostic prediction systems.
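For orientation, the two-class SNR that the GDIs generalize can be sketched as follows. The abstract does not define SNR or the GDIs, so the formula below (difference of class means divided by the sum of class standard deviations, the form commonly used in gene expression studies) is an assumption, as is the use of the sample standard deviation. The function names are hypothetical, and the sketch does not implement the GDIs themselves.

```python
from statistics import mean, stdev


def signal_to_noise(class_a, class_b):
    """Two-class SNR for one gene: (mean_a - mean_b) / (std_a + std_b).

    Assumption: sample standard deviation; each class needs at least two values,
    and the two standard deviations must not both be zero.
    """
    return (mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))


def rank_genes_by_snr(expression, labels, positive):
    """Return (gene, snr) pairs sorted by absolute SNR, largest first.

    `expression` maps a gene name to its per-sample values; `positive` is the
    label treated as the first class, and all other samples form the second.
    """
    scores = {}
    for gene, values in expression.items():
        a = [v for v, lab in zip(values, labels) if lab == positive]
        b = [v for v, lab in zip(values, labels) if lab != positive]
        scores[gene] = signal_to_noise(a, b)
    return sorted(scores.items(), key=lambda item: -abs(item[1]))
```

A large positive value marks a gene expressed more strongly in the first class, and a large negative value marks one expressed more strongly in the second, which is why the ranking uses the absolute value.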

    CLINICAL AND BIOLOGICALLY-BASED APPROACHES FOR CLASSIFYING AND PREDICTING EARLY OUTCOMES OF CHRONIC CHILDHOOD ARTHRITIS

    Background: Juvenile idiopathic arthritis (JIA) comprises a heterogeneous group of conditions that share chronic arthritis as a common characteristic. Current classification criteria for chronic childhood arthritis have limitations. Despite new treatment strategies and medications, some patients continue to have persistently active and disabling disease as adults. Few predictors of poor outcomes have been identified. Objectives: This thesis comprises two complementary studies. The objective of the first study was to identify discrete clusters comprising clinical features and inflammatory biomarkers in children with JIA and to compare them with the current JIA categories that have been proposed by the International League of Associations for Rheumatology. The second study aimed to identify predictors of short-term arthritis activity based on clinical and biomarker profiles in JIA patients. Methods: For both studies we utilized data that were collected in a Canadian nation-wide, prospective, longitudinal cohort study titled Biologically-Based Outcome Predictors in JIA. Clustering and classification algorithms were applied to the data to accomplish both study objectives. Results: This research identified three clusters of patients at visit 1 (enrolment) and five clusters at visit 2 (6 months). The clusters revealed in this analysis exposed different and more homogeneous subgroups compared to the seven conventional JIA categories. In the second study, the presence or absence of active joints, physician global assessments, and Wallace criteria were chosen as outcome variables 18 months post-enrolment. Among 112 variables, 17 were selected as the best predictors of 18-month outcomes. The panel predicted presence or absence of active arthritis, physician global assessment, and Wallace criteria of inactive disease 18 months after diagnosis with 79%, 82%, and 71% accuracy and 0.83, 0.86, and 0.82 area under the curve (AUC), respectively.
The accuracy and AUC values were higher than when only clinical features were used for prediction. Conclusion: Results of this study suggest that certain groups of patients within different JIA categories are more aligned pathobiologically than their separate clinical categorizations suggest. Further, the research found that a small number of clinical and inflammatory variables at diagnosis can predict short-term arthritis activity in JIA more accurately than clinical characteristics alone.

    Features of the intratumoral T cell receptor repertoire associated with antigen exposure in cancer patients

    The clinical success of immunotherapies demonstrates the importance of the immune system in tumour control, but response rates remain low and many biological mechanisms underlying how these therapies work are still uncharacterised. In particular, the specificity of the anti-tumour immune response pre-existing in treatment-naive patients or induced by treatment remains poorly described. In this thesis, I explore how T cell receptor (TCR) sequencing data in multi-omics contexts can be utilised to identify features associated with antigen exposure in cancer patients. In treatment-naive non-small cell lung cancer (NSCLC) patients, multi-region TCR sequencing revealed a pattern of heterogeneity in the TCR repertoire resembling the heterogeneity observed in the mutational profile of these tumours, and a range of clonotype frequency values associated with tumour specificity. A novel method was built to identify distinct TCR populations that spatially follow the pattern of the well-established clonal/subclonal mutational dichotomy. The impact of immune checkpoint blockade therapy on the TCR repertoire distribution was assessed in advanced renal cell carcinoma in the context of anti-PD1 treatment. TCRs with frequency distribution characteristics similar to those observed in NSCLC were maintained upon treatment and associated with clinical response. In addition, RNA-sequencing analysis identified a gene expression profile consistent with specific activation of T cells through TCR signalling. Finally, the same methodology was applied to bone marrow samples harvested from B cell acute lymphoblastic leukaemia (B-ALL) patients. A statistical framework was developed to efficiently distinguish leukaemic re-arrangements from the non-leukaemic TCR repertoire of B-ALL patients.
Subsequently, longitudinal analysis revealed TCR distributions that suggested the presence of cytotoxic T cells, which was further characterised in matched single-cell RNA sequencing data.