Search CORE

eScholarship - University of California

Integrated smoothed location model and data reduction approaches for multi variables classification

Author: Hashibah Hamid
Publication venue
Publication date: 01/01/2014
Field of study

Smoothed Location Model is a classification rule that deals with mixture of continuous variables and binary variables simultaneously. This rule discriminates groups in a parametric form using conditional distribution of the continuous variables given each pattern of the binary variables. To conduct a practical classification analysis, the objects must first be sorted into the cells of a multinomial table generated from the binary variables. Then, the parameters in each cell will be estimated using the sorted objects. However, in many situations, the estimated parameters are poor if the number of binary is large relative to the size of sample. Large binary variables will create too many multinomial cells which are empty, leading to high sparsity problem and finally give exceedingly poor performance for the constructed rule. In the worst case scenario, the rule cannot be constructed. To overcome such shortcomings, this study proposes new strategies to extract adequate variables that contribute to optimum performance of the rule. Combinations of two extraction techniques are introduced, namely 2PCA and PCA+MCA with new cutpoints of eigenvalue and total variance explained, to determine adequate extracted variables which lead to minimum misclassification rate. The outcomes from these extraction techniques are used to construct the smoothed location models, which then produce two new approaches of classification called 2PCALM and 2DLM. Numerical evidence from simulation studies demonstrates that the computed misclassification rate indicates no significant difference between the extraction techniques in normal and non-normal data. Nevertheless, both proposed approaches are slightly affected for non-normal data and severely affected for highly overlapping groups. Investigations on some real data sets show that the two approaches are competitive with, and better than other existing classification methods. The overall findings reveal that both proposed approaches can be considered as improvement to the location model, and alternatives to other classification methods particularly in handling mixed variables with large binary size

Universiti Utara Malaysia: UUM eTheses

Accurate and Reliable Cancer Classification Based on Probabilistic Inference of Pathway Activity

Author: Byung-Jun Yoon
Edward R. Dougherty
Gustavo Stolovitzky
Junjie Su
Publication venue: Public Library of Science
Publication date: 01/01/2009
Field of study

With the advent of high-throughput technologies for measuring genome-wide expression profiles, a large number of methods have been proposed for discovering diagnostic markers that can accurately discriminate between different classes of a disease. However, factors such as the small sample size of typical clinical data, the inherent noise in high-throughput measurements, and the heterogeneity across different samples, often make it difficult to find reliable gene markers. To overcome this problem, several studies have proposed the use of pathway-based markers, instead of individual gene markers, for building the classifier. Given a set of known pathways, these methods estimate the activity level of each pathway by summarizing the expression values of its member genes, and use the pathway activities for classification. It has been shown that pathway-based classifiers typically yield more reliable results compared to traditional gene-based classifiers. In this paper, we propose a new classification method based on probabilistic inference of pathway activities. For a given sample, we compute the log-likelihood ratio between different disease phenotypes based on the expression level of each gene. The activity of a given pathway is then inferred by combining the log-likelihood ratios of the constituent genes. We apply the proposed method to the classification of breast cancer metastasis, and show that it achieves higher accuracy and identifies more reproducible pathway markers compared to several existing pathway activity inference methods

CiteSeerX

Public Library of Science (PLOS)

Springer - Publisher Connector

Texas A&M Repository

Identification of Biomarkers That Distinguish Chemical Contaminants Based on Gene Expression Profiles

Author: Ai Junmei
Ang Choo Y
Deng Youping
Guan Xin
Johnson David R.
Perkins Edward J.
Wei Xiaomou
Zhang Chaoyang
Publication venue: The Aquila Digital Community
Publication date: 01/01/2014
Field of study

Background: High throughput transcriptomics profiles such as those generated using microarrays have been useful in identifying biomarkers for different classification and toxicity prediction purposes. Here, we investigated the use of microarrays to predict chemical toxicants and their possible mechanisms of action. Results: In this study, in vitro cultures of primary rat hepatocytes were exposed to 105 chemicals and vehicle controls, representing 14 compound classes. We comprehensively compared various normalization of gene expression profiles, feature selection and classification algorithms for the classification of these 105 chemicals into14 compound classes. We found that normalization had little effect on the averaged classification accuracy. Two support vector machine (SVM) methods, LibSVM and sequential minimal optimization, had better classification performance than other methods. SVM recursive feature selection (SVM-RFE) had the highest overfitting rate when an independent dataset was used for a prediction. Therefore, we developed a new feature selection algorithm called gradient method that had a relatively high training classification as well as prediction accuracy with the lowest overfitting rate of the methods tested. Analysis of biomarkers that distinguished the 14 classes of compounds identified a group of genes principally involved in cell cycle function that were significantly downregulated by metal and inflammatory compounds, but were induced by anti-microbial, cancer related drugs, pesticides, and PXR mediators. Conclusions: Our results indicate that using microarrays and a supervised machine learning approach to predict chemical toxicants, their potential toxicity and mechanisms of action is practical and efficient. Choosing the right feature and classification algorithms for this multiple category classification and prediction is critical

Aquila Digital Community

Louisiana State University

Gene set based ensemble methods for cancer classification

Author: Duncan William Evans
Publication venue: LSU Digital Commons
Publication date: 01/01/2013
Field of study

Diagnosis of cancer very often depends on conclusions drawn after both clinical and microscopic examinations of tissues to study the manifestation of the disease in order to place tumors in known categories. One factor which determines the categorization of cancer is the tissue from which the tumor originates. Information gathered from clinical exams may be partial or not completely predictive of a specific category of cancer. Further complicating the problem of categorizing various tumors is that the histological classification of the cancer tissue and description of its course of development may be atypical. Gene expression data gleaned from micro-array analysis provides tremendous promise for more accurate cancer diagnosis. One hurdle in the classification of tumors based on gene expression data is that the data space is ultra-dimensional with relatively few points; that is, there are a small number of examples with a large number of genes. A second hurdle is expression bias caused by the correlation of genes. Analysis of subsets of genes, known as gene set analysis, provides a mechanism by which groups of differentially expressed genes can be identified. We propose an ensemble of classifiers whose base classifiers are ℓ1-regularized logistic regression models with restriction of the feature space to biologically relevant genes. Some researchers have already explored the use of ensemble classifiers to classify cancer but the effect of the underlying base classifiers in conjunction with biologically-derived gene sets on cancer classification has not been explored

A voting approach to identify a small number of highly predictive genes using multiple classifiers

Author: A Subramanian
B Vural
BH Park
C Ambroise
C Ritz
C Sotiriou
C Trastour
D Bamber
E Huang
E Ogier-Denis
G Alexe
Geoff Macintyre
H Mamitsuka
IB Jeffery
IJ Rantala
JA Hanley
James Bailey
Joshua WK Ho
JP Egan
Kotagiri Ramamohanarao
L Breiman
L Simi
LJ van't Veer
M Dettling
M Maruf Hossain
M Pilkinton
M Schena
M Uemura
Md Rafiul Hassan
MJ Heller
MJvan de Vijver
N Landwehr
O Gevaert
R Kohavi
RA Gupta
RD Loberg
S Carreira
S Gagrica
S Henikoff
S Loi
T Fawcett
T Sjoblom
T Sørlie
T Sørlie
XJ Ma
Y Freund
Y Liu
Y Sun
Y Wu
Z Liu
Z Madjd
Publication venue: BioMed Central
Publication date: 01/01/2009
Field of study

Abstract Background Microarray gene expression profiling has provided extensive datasets that can describe characteristics of cancer patients. An important challenge for this type of data is the discovery of gene sets which can be used as the basis of developing a clinical predictor for cancer. It is desirable that such gene sets be compact, give accurate predictions across many classifiers, be biologically relevant and have good biological process coverage. Results By using a new type of multiple classifier voting approach, we have identified gene sets that can predict breast cancer prognosis accurately, for a range of classification algorithms. Unlike a wrapper approach, our method is not specialised towards a single classification technique. Experimental analysis demonstrates higher prediction accuracies for our sets of genes compared to previous work in the area. Moreover, our sets of genes are generally more compact than those previously proposed. Taking a biological viewpoint, from the literature, most of the genes in our sets are known to be strongly related to cancer. Conclusion We show that it is possible to obtain superior classification accuracy with our approach and obtain a compact gene set that is also biologically relevant and has good coverage of different biological processes.</p

Springer - Publisher Connector

University of Melbourne Institutional Repository

Development of Artificial Intelligence systems as a prediction tool in ovarian cancer

Author: Enshaei Amir
Publication venue: Newcastle University
Publication date: 01/01/2012
Field of study

PhD ThesisOvarian cancer is the 5th most common cancer in females and the UK has one of the highest incident rates in Europe. In the UK only 36% of patients will live for at least 5 years after diagnosis. The number of prognostic markers, treatments and the sequences of treatments in ovarian cancer are rising. Therefore, it is getting more difficult for the human brain to perform clinical decision making. There is a need for an expert computer system (e.g. Artificial Intelligence (AI)), which is capable of investigating the possible outcomes for each marker, treatment and sequence of treatment. Such expert systems may provide a tool which could help clinicians to analyse and predict outcome using different treatment pathways. Whilst prediction of overall survival of a patient is difficult there may be some benefits, as this not only is useful information for the patient but may also determine treatment modality. In this project a dataset was constructed of 352 patients who had been treated at a single centre. Clinical data were extracted from the health records. Expert systems were then investigated to determine the optimum model to predict overall survival of a patient. The five year survival period (a standard survival outcome measure in cancer research) was investigated; in addition, the system was developed with the flexibility to predict patient survival rates for many other categories. Comparisons with currently used prognostic models in ovarian cancer demonstrated a significant improvement in performance for the AI model (Area under the Curve (AUC) of 0.72 for AI and AUC of 0.62 for the statistical model). Using various methods, the most important variables in this prediction were identified as: FIGO stage, outcome of the surgery and CA125. This research investigated the effects of increasing the number of cases in prediction models. Results indicated that by increasing the number of cases, the prediction performance improved. Categorization of continuous data did not improve the prediction performance. The project next investigated the possibility of predicting surgical outcomes in ovarian cancer using AI, based on the variables that are available for clinicians prior to the surgery. Such a tool could have direct clinical relevance. Diverse models that can predict the outcome of the surgery were investigated and developed. The developed AI models were also compared against the standard statistical prediction model, which demonstrated that the AI model outperformed the statistical prediction model: the prediction of all outcomes (complete or optimal or suboptimal) (AUC of AI: 0.71 and AUC of statistical model: 0.51), the prediction of complete or optimal cytoreduction versus suboptimal cytoreduction (AUC of AI: 0.73 and AUC of statistical model: 0.50) and finally the prediction of complete cytoreduction versus optimal or suboptimal cytoreduction (AUC of AI: 0.79 and AUC of statistical model: 0.47). The most important variables for this prediction were identified as: FIGO stage, tumour grade and histology. The application of transcriptomic analysis to cancer research raises the question of which genes are significantly involved in a particular cancer and which genes can accurately predict survival outcomes in a given cancer. Therefore, AI techniques were employed to identify the most important genes for the prediction of Homologous Recombination (HR), an important DNA repair pathway in ovarian cancer, identifying LIG1 and POLD3 as novel prognostic biomarkers. Finally, AI models were used to predict the HR status for any given patient (AUC: 0.87). This project has demonstrated that AI may have an important role in ovarian cancer. AI systems may provide tools to help clinicians and research in ovarian cancer and may allow more informed decisions resulting in better management of this cancer

Newcastle University eTheses

Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems

Author: A Antoniadis
A Boulesteix
A Boulesteix
A Hoerl
A Höskuldsson
B Ding
B Marx
C Chih-Yu Wang
D Chung
D Nguyen
D Nguyen
D Witten
E Bair
E Parkhomenko
F Bach
G Fort
H Chun
H Shen
H Wold
H Zou
I Guyon
I Guyon
I Jolliffe
J Dai
J Khan
K Liu
KA Lê Cao
KA Lê Cao
KA Lê Cao
KA Lê Cao
KA Lê Cao
KA Lê Cao
Kim-Anh Lê Cao
L Breiman
L Breiman
M Ahdesmäki
M Ashburner
M Barker
M Jakobsson
N Meinshausen
Philippe Besse
R Tibshirani
R Tibshirani
S Dudoit
S Pomeroy
S Ramaswamy
S Wold
Simon Boitard
T Golub
T Yang
VN Vapnik
X Huang
X Huang
X Qiao
X Zhou
Y Tan
Publication venue: BioMed Central
Publication date: 01/01/2011
Field of study

Background: Variable selection on high throughput biological data, such as gene expression or single nucleotide polymorphisms (SNPs), becomes inevitable to select relevant information and, therefore, to better characterize diseases or assess genetic structure. There are different ways to perform variable selection in large data sets. Statistical tests are commonly used to identify differentially expressed features for explanatory purposes, whereas Machine Learning wrapper approaches can be used for predictive purposes. In the case of multiple highly correlated variables, another option is to use multivariate exploratory approaches to give more insight into cell biology, biological pathways or complex traits.Results: A simple extension of a sparse PLS exploratory approach is proposed to perform variable selection in a multiclass classification framework.Conclusions: sPLS-DA has a classification performance similar to other wrapper or sparse discriminant analysis approaches on public microarray and SNP data sets. More importantly, sPLS-DA is clearly competitive in terms of computational efficiency and superior in terms of interpretability of the results via valuable graphical outputs. sPLS-DA is available in the R package mixOmics, which is dedicated to the analysis of large biological data sets

Scientific Publications of the University of Toulouse II Le Mirail

Springer - Publisher Connector

University of Melbourne Institutional Repository

HAL-INSA Toulouse

ProdInra

University of Queensland eSpace

Inverse Projection Representation and Category Contribution Rate for Robust Tumor Recognition

Author: Chen Yun-Mei
Tian Li
Wu Wen-Ming
Xu Shuang
Yang Li-Jun
Yang Xiao-Hui
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2018
Field of study

Sparse representation based classification (SRC) methods have achieved remarkable results. SRC, however, still suffer from requiring enough training samples, insufficient use of test samples and instability of representation. In this paper, a stable inverse projection representation based classification (IPRC) is presented to tackle these problems by effectively using test samples. An IPR is firstly proposed and its feasibility and stability are analyzed. A classification criterion named category contribution rate is constructed to match the IPR and complete classification. Moreover, a statistical measure is introduced to quantify the stability of representation-based classification methods. Based on the IPRC technique, a robust tumor recognition framework is presented by interpreting microarray gene expression data, where a two-stage hybrid gene selection method is introduced to select informative genes. Finally, the functional analysis of candidate's pathogenicity-related genes is given. Extensive experiments on six public tumor microarray gene expression datasets demonstrate the proposed technique is competitive with state-of-the-art methods.Comment: 14 pages, 19 figures, 10 table

arXiv.org e-Print Archive