41 research outputs found

    Study on Ensemble Algorithm for Multi-class Gene Microarray Datasets

    Ensemble learning is a current research focus in machine learning. For multi-class problems, it uses a set of diverse classifiers to jointly solve the original multi-class problem and then fuses the classifiers' outputs through strategies such as majority voting. Ensemble multi-class algorithms are often more accurate and more stable than a single strong classifier, and they generalize better. Error-correcting output codes (ECOC) provide a flexible and efficient framework for multi-class problems; the key idea is to convert the multi-class problem into several binary problems. In addition, genetic programming can be used for binary classification, obtaining accurate classification rules through evolutionary computation. Building on existing research, this thesis explores ensemble multi-class learning on gene microarray data in both theory and practice. The thesis focuses on ensemble multi-class algorithms applied to gene microarray data... Degree: Master of Engineering. Department and major: School of Software, Software Engineering. Student ID: 2432012115229
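
As a rough illustration of the ECOC idea described in this abstract (turning one multi-class problem into several binary ones and decoding by nearest codeword), here is a minimal sketch. The code matrix, the class names `A` to `D`, and the helper names are hypothetical and are not taken from the thesis; the binary classifiers themselves are left out and only their 0/1 outputs are decoded.

```python
from typing import Dict, List, Sequence

# Hypothetical code matrix: one 6-bit codeword per class, one binary task per bit.
# Any two codewords differ in at least 3 bits, so a single wrong bit is correctable.
CODE_MATRIX: Dict[str, List[int]] = {
    "A": [0, 0, 0, 0, 0, 0],
    "B": [1, 1, 1, 0, 0, 0],
    "C": [1, 0, 0, 1, 1, 0],
    "D": [0, 1, 0, 1, 0, 1],
}


def binary_targets(labels: Sequence[str], bit: int) -> List[int]:
    """Relabel multi-class labels as the 0/1 targets of binary task `bit`."""
    return [CODE_MATRIX[label][bit] for label in labels]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions where two bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(bits: Sequence[int]) -> str:
    """Return the class whose codeword is closest to the predicted bits."""
    return min(CODE_MATRIX, key=lambda label: hamming(CODE_MATRIX[label], bits))


# Outputs of the six binary classifiers for one sample, with the second bit wrong.
predicted_class = decode([1, 0, 1, 0, 0, 0])
```

With this matrix the sample is still decoded as `B`, because its codeword is the only one within Hamming distance 1 of the predicted bits.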

    Multiclass classification of microarray data samples with a reduced number of genes

    Background: Multiclass classification of microarray data samples with a reduced number of genes is a rich and challenging problem in Bioinformatics research. The problem gets harder as the number of classes is increased. In addition, the performance of most classifiers is tightly linked to the effectiveness of mandatory gene selection methods. Critical to gene selection is the availability of estimates about the maximum number of genes that can be handled by any classification algorithm. Lack of such estimates may lead to either computationally demanding explorations of a search space with thousands of dimensions or classification models based on gene sets of unrestricted size. In the former case, unbiased but possibly overfitted classification models may arise. In the latter case, biased classification models unable to support statistically significant findings may be obtained. Results: A novel bound on the maximum number of genes that can be handled by binary classifiers in binary mediated multiclass classification algorithms of microarray data samples is presented. The bound suggests that high-dimensional binary output domains might favor the existence of accurate and sparse binary mediated multiclass classifiers for microarray data samples. Conclusions: A comprehensive experimental work shows that the bound is indeed useful to induce accurate and sparse multiclass classifiers for microarray data samples.

    Machine learning applications for the topology prediction of transmembrane beta-barrel proteins

    This PhD thesis focuses on the topology prediction of beta-barrel transmembrane proteins. Transmembrane proteins adopt various conformations that are related to the functions they provide. The two most predominant classes are alpha-helix bundles and beta-barrel transmembrane proteins. Alpha-helix proteins are present in larger numbers than beta-barrel transmembrane proteins in structure databases. Therefore, there is a need to find computational tools that can predict and detect the structure of beta-barrel transmembrane proteins. Transmembrane proteins are used for active transport across the membrane or signal transduction. Given the importance of their roles, it is essential to understand the structures of these proteins. Transmembrane proteins are also a significant focus for new drug discovery. Transmembrane beta-barrel proteins play critical roles in the translocation machinery, pore formation, membrane anchoring, and ion exchange. In bioinformatics, many years of research have been spent on the topology prediction of transmembrane alpha-helices. Efforts on TMB (transmembrane beta-barrel) protein topology prediction have been overshadowed, and the prediction accuracy could be improved with further research. Various methodologies have been developed in the past to predict TMB protein topology. Methods available in the literature include turn identification, hydrophobicity profiles, rule-based prediction, HMM (Hidden Markov Model), ANN (Artificial Neural Network), radial basis function networks, and combinations of these methods. The use of a cascading classifier has never been fully explored. This research presents and evaluates approaches such as ANN, KNN (K-Nearest Neighbors), and SVM (Support Vector Machines), together with a novel approach to TMB topology prediction that uses a cascading classifier. Computer simulations have been implemented in MATLAB, and the results have been evaluated.
Data were collected from various datasets and pre-processed for each machine learning technique. A deep neural network was built with an input layer, hidden layers, and an output layer. The cascading classifier was optimised mainly by optimising each machine learning algorithm used and by starting from the parameters that gave the best results for each algorithm. The cascading classifier results show that the proposed methodology predicts transmembrane beta-barrel protein topologies with high accuracy for randomly selected proteins. Using the cascading classifier approach, the best overall accuracy for TMB topology prediction is 76.3%, with a precision of 0.831 and a recall (probability of detection) of 0.799. This accuracy is achieved using a two-layer cascading classifier. By constructing and using various machine-learning frameworks, systems were developed to analyse TMB topologies with significant robustness. We have presented several experimental findings that may be useful for future research. The cascading classifier provides a novel approach to the topology prediction of TMB proteins.

    Comparison of Two Output-Coding Strategies for Multi-Class Tumor Classification Using Gene Expression Data and Latent Variable Model as Binary Classifier

    Multi-class cancer classification based on microarray data is described. A generalized output-coding scheme based on One Versus One (OVO) combined with a Latent Variable Model (LVM) is used. Results from the proposed OVO output-coding strategy are compared with those obtained from the generalized One Versus All (OVA) method, and the efficiency of both strategies for multi-class tumor classification is studied. This comparative study was done using two microarray gene expression datasets: the Global Cancer Map (GCM) dataset and a brain cancer (BC) dataset. Primary feature selection was based on fold change and penalized t-statistics. Evaluation was conducted with varying feature numbers. The OVO coding strategy worked quite well with the BC data, while OVO and OVA results were similar for the GCM data. The choice of output-coding method for combining binary classifiers in multi-class tumor classification depends on the number of tumor types considered, the discrepancies between the tumor samples used for training, and the heterogeneity of expression within the cancer subtypes used as training data.
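
To make the OVO versus OVA comparison concrete, the sketch below enumerates the binary tasks each strategy creates and combines OVO outputs by simple vote counting. It is only an illustration: the function names and the `tumor_i` labels are hypothetical, the class count of 14 is the GCM figure quoted in another abstract of this listing, and the Latent Variable Model used as the binary classifier in the paper is not reproduced.

```python
from collections import Counter
from itertools import combinations
from typing import List, Sequence, Tuple


def ova_tasks(classes: Sequence[str]) -> List[Tuple[str, str]]:
    """One-versus-all: one binary task per class (that class against the rest)."""
    return [(c, "rest") for c in classes]


def ovo_tasks(classes: Sequence[str]) -> List[Tuple[str, str]]:
    """One-versus-one: one binary task per unordered pair of classes."""
    return list(combinations(classes, 2))


def ovo_predict(pair_winners: Sequence[str]) -> str:
    """Combine OVO outputs by counting how often each class wins its pairings."""
    return Counter(pair_winners).most_common(1)[0][0]


classes = [f"tumor_{i}" for i in range(14)]  # hypothetical labels for 14 tumor types
n_ova = len(ova_tasks(classes))  # K binary problems
n_ovo = len(ovo_tasks(classes))  # K * (K - 1) / 2 binary problems
```

For K = 14 classes this gives 14 OVA tasks against 91 OVO tasks, so OVO trains many more, but smaller and often better balanced, binary problems.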

    A jackknife-like method for classification and uncertainty assessment of multi-category tumor samples using gene expression information

    Background: The use of gene expression profiling for the classification of human cancer tumors has been widely investigated. Previous studies were successful in distinguishing several tumor types in binary problems. As there are over a hundred types of cancers, and potentially even more subtypes, it is essential to develop multi-category methodologies for molecular classification for any meaningful practical application. Results: A jackknife-based supervised learning method called the paired-samples test algorithm (PST), coupled with a binary classification model based on linear regression, was proposed and applied to two well-known and challenging datasets consisting of 14 (GCM dataset) and 9 (NCI60 dataset) tumor types. The results showed that the proposed method improved the prediction accuracy of the test samples for the GCM dataset, especially when the t-statistic was used in the primary feature selection. For the NCI60 dataset, the application of PST improved prediction accuracy when the number of genes used was relatively small (100 or 200). These improvements made the binary classification method more robust to the gene selection mechanism and to the number of genes used. The overall prediction accuracies were competitive with the most accurate results obtained by several previous studies on the same datasets with other methods. Furthermore, the relative confidence R(T) provided a unique insight into the sources of the uncertainty shown in the statistical classification and the potential variants within the same tumor type. Conclusion: We proposed a novel bagging method for the classification and uncertainty assessment of multi-category tumor samples using gene expression information. The strengths were demonstrated in the application to two benchmark datasets.

    Hierarchical information representation and efficient classification of gene expression microarray data

    In the field of computational biology, microarrays are used to measure the activity of thousands of genes at once and create a global picture of cellular function. Microarrays allow scientists to analyze the expression of many genes in a single experiment quickly and efficiently. Even though microarrays are now a consolidated research technology and the trends in high-throughput data analysis are shifting towards new technologies like Next Generation Sequencing (NGS), an optimum method for sample classification has not been found yet. Microarray classification is a complicated task, not only due to the high dimensionality of the feature set, but also to an apparent lack of data structure. This characteristic limits the applicability of processing techniques, such as wavelet filtering or other filtering techniques that take advantage of known structural relations. On the other hand, it is well known that genes are not expressed independently from each other: genes have a high interdependence related to the regulating biological process involved. This thesis aims to improve the current state of the art in microarray classification and to contribute to understanding how signal processing techniques can be developed and applied to analyze microarray data. Building a classification framework requires exploratory work in which algorithms are constantly tried and adapted to the analyzed data. The algorithms and classification frameworks developed in this thesis tackle the problem with two essential building blocks. The first one deals with the lack of a priori structure by inferring a data-driven structure with unsupervised hierarchical clustering tools. The second key element is a proper feature selection tool to produce a precise classifier as an output and to reduce the overfitting risk. The main focus of this thesis is binary classification, a field in which we obtained relevant improvements to the state of the art.
The first key element is the data-driven structure, obtained by modifying hierarchical clustering algorithms derived from the Treelets algorithm from the literature. Several alternatives to the original reference algorithm have been tested, changing either the similarity metric used to merge features or the way two features are merged. Moreover, the possibility of including external sources of information from publicly available biological knowledge and ontologies to improve the structure generation has been studied too. For feature selection, two alternative approaches have been studied: the first is a modification of the IFFS algorithm as a wrapper feature selection method, while the second is based on ensemble learning. To obtain good results, the IFFS algorithm has been adapted to the data characteristics by introducing new elements into the selection process, such as a reliability measure and a scoring system to better select the best feature at each iteration. The second feature selection approach is based on ensemble learning, taking advantage of the feature abundance of microarrays to implement a different selection scheme. New algorithms have been studied in this field, improving state-of-the-art algorithms for the small sample sizes and large feature counts characteristic of microarray data. In addition to the binary classification problem, the multiclass case has been addressed too. A new algorithm combining multiple binary classifiers has been evaluated, exploiting the redundancy offered by multiple classifiers to obtain better predictions. All the algorithms studied throughout this thesis have been evaluated using high-quality publicly available data, following established testing protocols from the literature to offer a proper benchmarking against the state of the art.
Whenever possible, multiple Monte Carlo simulations have been performed to increase the robustness of the obtained results.

    An improved multiple classifier combination scheme for pattern classification

    Combining multiple classifiers is considered a new direction in pattern recognition for improving classification performance. The main problem of multiple classifier combination is that there is no standard guideline for constructing an accurate and diverse classifier ensemble. This is due to the difficulty of identifying the number of homogeneous classifiers and of deciding how to combine the classifier outputs. The most commonly used ensemble method is the random strategy, with majority voting as the combiner. However, the random strategy cannot determine the number of classifiers, and majority voting does not consider the strength of each classifier, which results in low classification accuracy. In this study, an improved multiple classifier combination scheme is proposed. The ant system (AS) algorithm is used to partition the feature set into feature subsets, whose number determines the number of classifiers. A compactness measure is introduced as a parameter in constructing an accurate and diverse classifier ensemble. A weighted voting technique is used to combine the classifier outputs by considering the strength of the classifiers prior to voting. Experiments were performed using four base classifiers, namely Nearest Mean Classifier (NMC), Naive Bayes Classifier (NBC), k-Nearest Neighbour (k-NN) and Linear Discriminant Analysis (LDA), on benchmark datasets to test the credibility of the proposed multiple classifier combination scheme. The average classification accuracies of the homogeneous NMC, NBC, k-NN and LDA ensembles are 97.91%, 98.06%, 98.09% and 98.12%, respectively. These accuracies are higher than those obtained with other approaches to multiple classifier combination. The proposed scheme will help in developing other multiple classifier combinations for pattern recognition and classification.
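
The difference between majority voting and weighted voting mentioned in this abstract can be sketched as follows. This is a generic illustration, not the paper's scheme: the weights stand in for classifier strength and are assumed here to be something like validation accuracy, and the ant system partitioning and compactness measure are not shown.

```python
from collections import Counter, defaultdict
from typing import Dict, Sequence


def majority_vote(predictions: Sequence[str]) -> str:
    """Each classifier counts equally; the most frequent label wins."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions: Sequence[str], weights: Sequence[float]) -> str:
    """Each classifier's vote is scaled by its weight (assumed strength)."""
    scores: Dict[str, float] = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=lambda label: scores[label])


# Hypothetical ensemble of three classifiers: one strong, two weak.
predictions = ["x", "y", "y"]
weights = [0.9, 0.4, 0.4]  # assumed strengths, e.g. validation accuracies

by_majority = majority_vote(predictions)
by_weight = weighted_vote(predictions, weights)
```

Here majority voting returns `y`, while weighted voting returns `x` because the single strong classifier outweighs the two weak ones (0.9 against 0.8).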

    Identification of pathway and gene markers using enhanced directed random walk for multiclass cancer expression data

    Cancer markers play a significant role in the diagnosis of the origin of cancers and in the detection of cancers from initial treatments. This is a challenging task owing to the heterogeneous nature of cancers. Identification of these markers could help improve the survival rate of cancer patients, since dedicated treatment, or even prevention, can be provided according to the diagnosis. Previous investigations show that the use of pathway topology information could help in the detection of cancer markers from gene expression. Such analysis reduces the complexity from thousands of genes to a few hundred pathways. However, most of the existing methods group different cancer subtypes into a single set of disease samples and assume that all pathways contribute equally to the analysis. Meanwhile, the interaction between multiple genes and the genes with missing edges has been ignored in several other methods, which could lead to poor performance in identifying cancer markers from gene expression. Thus, this research proposes an enhanced directed random walk to identify pathway and gene markers for multiclass cancer gene expression data. Firstly, an improved pathway selection with analysis of variance (ANOVA) that enables the consideration of multiple cancer subtypes is performed; subsequently, k-means clustering and the average silhouette method are integrated into the directed random walk to take the interaction of multiple genes into account. The proposed methods are tested on benchmark gene expression datasets (breast, lung, and skin cancers) and biological pathways. The performance of the proposed methods is then measured and compared in terms of classification accuracy and area under the receiver operating characteristic curve (AUC). The results indicate that the proposed methods are able to identify a list of pathway and gene markers from the datasets with better classification accuracy and AUC.
The proposed methods improved classification performance by between 1% and 35% compared with existing methods. The cell cycle and p53 signaling pathways were found to be significantly associated with breast, lung, and skin cancers, while the cell cycle was highly enriched in squamous cell carcinoma and adenocarcinoma.
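
The average silhouette method named in this abstract is commonly used to choose the number of clusters for k-means. The sketch below computes the average silhouette width for one-dimensional points and picks the better of two candidate labellings. It is a simplified illustration under stated assumptions: the points, the labellings (which stand in for k-means outputs), and the function name are hypothetical, and the directed random walk itself is not shown.

```python
from statistics import mean
from typing import List, Sequence


def average_silhouette(points: Sequence[float], labels: Sequence[int]) -> float:
    """Mean silhouette width for 1-D points under a given cluster labelling."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    widths: List[float] = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Distances to the other members of the point's own cluster.
        same = [abs(p - q) for j, (q, lab) in enumerate(zip(points, labels))
                if lab == own and j != i]
        if not same:
            widths.append(0.0)  # convention: a singleton cluster scores 0
            continue
        a = mean(same)  # mean distance within the own cluster
        b = min(        # mean distance to the nearest other cluster
            mean(abs(p - q) for q, lab in zip(points, labels) if lab == other)
            for other in clusters - {own}
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(widths)


points = [0.0, 1.0, 10.0, 11.0]
# Hypothetical labellings, standing in for k-means results with k = 2 and k = 3.
candidates = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
best_k = max(candidates, key=lambda k: average_silhouette(points, candidates[k]))
```

For these points the two-cluster labelling has the higher average silhouette, so `best_k` is 2.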