27 research outputs found

    Gene expression-based prediction of malignancies

    Molecular classification of malignancies can potentially stratify patients into distinct subclasses not detectable using traditional classification of tumors, opening new perspectives on the diagnosis and personalized therapy of polygenic diseases. In this paper we present a brief overview of our work on gene expression-based prediction of malignancies, starting from the dichotomous classification problem of normal versus tumoral tissues and moving on to multiclass cancer diagnosis, functional class discovery, and gene selection problems. The last part of this work presents preliminary results on the application of ensembles of SVMs, based on a bias-variance decomposition of the error, to the analysis of gene expression data from malignant tissues.

    Recognition of Multiple Imbalanced Cancer Types Based on DNA Microarray Data Using Ensemble Classifiers

    Hierarchical information representation and efficient classification of gene expression microarray data

    In the field of computational biology, microarrays are used to measure the activity of thousands of genes at once and create a global picture of cellular function. Microarrays allow scientists to analyze the expression of many genes in a single experiment quickly and efficiently. Even though microarrays are now a consolidated research technology and the trends in high-throughput data analysis are shifting towards new technologies like Next Generation Sequencing (NGS), an optimal method for sample classification has not been found yet. Microarray classification is a complicated task, not only because of the high dimensionality of the feature set, but also because of an apparent lack of data structure. This characteristic limits the applicability of processing techniques, such as wavelet filtering or other filtering techniques, that take advantage of known structural relations. On the other hand, it is well known that genes are not expressed independently of each other: genes are highly interdependent, according to the regulating biological process involved. This thesis aims to improve the current state of the art in microarray classification and to contribute to understanding how signal processing techniques can be developed and applied to analyze microarray data. Building a classification framework requires exploratory work in which algorithms are constantly tried and adapted to the analyzed data. The algorithms and classification frameworks developed in this thesis tackle the problem with two essential building blocks. The first one deals with the lack of a priori structure by inferring a data-driven structure with unsupervised hierarchical clustering tools. The second key element is a proper feature selection tool to produce a precise classifier as output and to reduce the risk of overfitting. The main focus of this thesis is binary classification, a field in which we obtained relevant improvements over the state of the art.
The first key element is the data-driven structure, obtained by modifying hierarchical clustering algorithms derived from the Treelets algorithm from the literature. Several alternatives to the original reference algorithm have been tested, changing either the similarity metric used to merge features or the way two features are merged. Moreover, the possibility of including external sources of information from publicly available biological knowledge and ontologies to improve the structure generation has also been studied. For feature selection, two alternative approaches have been studied: the first is a modification of the IFFS algorithm used as a wrapper feature selection method, while the second is based on ensemble learning. To obtain good results, the IFFS algorithm has been adapted to the data characteristics by introducing new elements into the selection process, such as a reliability measure and a scoring system, to better select the best feature at each iteration. The second feature selection approach is based on ensemble learning and takes advantage of the abundance of features in microarrays to implement a different selection scheme. New algorithms have been studied in this field, adapting state-of-the-art algorithms to the small sample sizes and high feature counts characteristic of microarray data. In addition to the binary classification problem, the multiclass case has also been addressed. A new algorithm combining multiple binary classifiers has been evaluated, exploiting the redundancy offered by multiple classifiers to obtain better predictions. All the algorithms studied throughout this thesis have been evaluated using high-quality publicly available data, following established testing protocols from the literature to offer a proper benchmark against the state of the art.
Whenever possible, multiple Monte Carlo simulations have been performed to increase the robustness of the obtained results.
In the field of computational biology, microarrays are used to measure the activity of thousands of genes at once and to produce a global picture of cellular function. Microarrays make it possible to analyze the expression of many genes in a single experiment, quickly and efficiently. Although microarrays are now a consolidated research technology and the trend is towards newer technologies such as Next Generation Sequencing (NGS), an optimal method for sample classification has not yet been found. Classifying microarray samples is a complicated task, owing to the high number of variables and the lack of structure in the data. This characteristic prevents the application of processing techniques that rely on structural relations, such as wavelet filtering or other filtering techniques. On the other hand, genes are not expressed independently of one another: they are interrelated according to the biological process that regulates them. The aim of this thesis is to improve the state of the art in microarray classification and to contribute to understanding how signal processing techniques can be designed and applied to analyze microarrays. Building a classification algorithm requires a study in which existing algorithms are tested and adapted to the analyzed data. The algorithms developed in this thesis tackle the problem with two essential building blocks. The first addresses the lack of structure by deriving a binary tree with unsupervised clustering tools. The second fundamental element, needed to obtain accurate classifiers while reducing the risk of overfitting, is a feature selection step. The main task in this thesis is binary classification, in which we obtained relevant improvements over the state of the art.
The first step is the generation of a structure, for which the Treelets algorithm available in the literature was used. Several alternatives to this original algorithm have been proposed and evaluated, changing the similarity metrics or the merging rules used during the process. In addition, the possibility of using external sources of information, such as ontologies of biological knowledge, to improve the inference of the structure has been studied. Two different approaches to feature selection have been studied: the first is a modification of the IFFS algorithm and the second uses an ensemble learning scheme. The IFFS algorithm has been adapted to the characteristics of microarrays to obtain better results, adding elements such as a reliability measure and a scoring system to select the best feature at each iteration. The ensemble-based method takes advantage of the abundance of features in microarrays to implement a different selection scheme. Several algorithms have been studied in this field, adapting existing alternatives to the small number of samples and the high number of variables typical of microarrays. The problem of classification with more than two classes has also been addressed by studying a new algorithm that combines multiple binary classifiers. The proposed algorithm exploits the redundancy offered by multiple classifiers to obtain more reliable predictions. All the algorithms proposed in this thesis have been evaluated with high-quality public data, following protocols established in the literature in order to offer a reliable comparison with the state of the art. Whenever possible, Monte Carlo simulations have been applied to improve the robustness of the results.
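
The abstract above mentions combining multiple binary classifiers to handle the multiclass case, without giving details. As a rough illustration only, the following sketch uses a standard one-vs-one majority vote over nearest-centroid binary classifiers. This is an assumed, generic scheme, not the thesis's algorithm; the function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_one_vs_one(samples, labels):
    """Fit one nearest-centroid binary classifier per pair of classes."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    centroids = {c: centroid(rows) for c, rows in by_class.items()}
    # Each pairwise model is simply the two class centroids it separates.
    return [((a, centroids[a]), (b, centroids[b]))
            for a, b in combinations(sorted(centroids), 2)]


def predict(models, x):
    """Let every pairwise classifier vote and return the most voted class."""
    votes = Counter()
    for (a, ca), (b, cb) in models:
        votes[a if sq_dist(x, ca) <= sq_dist(x, cb) else b] += 1
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): three well-separated classes in two dimensions.
samples = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [0.1, 5.0], [0.0, 5.2]]
labels = ["A", "A", "B", "B", "C", "C"]
models = train_one_vs_one(samples, labels)
```

With k classes this trains k(k-1)/2 pairwise models, so a single wrong binary decision can be outvoted by the others, which is one simple way to exploit classifier redundancy.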

    Machine learning applications for the topology prediction of transmembrane beta-barrel proteins

    The research topic for this PhD thesis focuses on the topology prediction of beta-barrel transmembrane proteins. Transmembrane proteins adopt various conformations that are related to the functions they provide. The two most predominant classes are alpha-helix bundles and beta-barrel transmembrane proteins. Alpha-helix proteins are present in larger numbers than beta-barrel transmembrane proteins in structure databases. Therefore, there is a need to find computational tools that can predict and detect the structure of beta-barrel transmembrane proteins. Transmembrane proteins are used for active transport across the membrane or signal transduction. Given the importance of their roles, it becomes essential to understand the structures of these proteins. Transmembrane proteins are also a significant focus for new drug discovery. Transmembrane beta-barrel proteins play critical roles in the translocation machinery, pore formation, membrane anchoring, and ion exchange. In bioinformatics, many years of research have been spent on the topology prediction of transmembrane alpha-helices. Efforts on the topology prediction of transmembrane beta-barrel (TMB) proteins have been overshadowed, and the prediction accuracy could be improved with further research. Various methodologies have been developed in the past to predict TMB protein topology. Methods available in the literature include turn identification, hydrophobicity profiles, rule-based prediction, HMMs (hidden Markov models), ANNs (artificial neural networks), radial basis function networks, and combinations of these methods. The use of a cascading classifier has never been fully explored. This research presents and evaluates approaches such as ANN (Artificial Neural Networks), KNN (K-Nearest Neighbors), and SVM (Support Vector Machines), together with a novel approach to TMB topology prediction that uses a cascading classifier. Computer simulations have been implemented in MATLAB, and the results have been evaluated.
Data were collected from various datasets and pre-processed for each machine learning technique. A deep neural network was built with an input layer, hidden layers, and an output layer. The cascading classifier was optimised mainly by optimising each machine learning algorithm used, starting from the parameters that gave the best results for each algorithm. The cascading classifier results show that the proposed methodology predicts transmembrane beta-barrel protein topologies with high accuracy for randomly selected proteins. Using the cascading classifier approach, the best overall accuracy for TMB topology prediction is 76.3%, with a precision of 0.831 and a recall (probability of detection) of 0.799. The accuracy of 76.3% is achieved using a two-layer cascading classifier. By constructing and using various machine-learning frameworks, systems were developed to analyse TMB topologies with significant robustness. We have presented several experimental findings that may be useful for future research. The use of a cascading classifier is a novel approach to the topology prediction of TMB proteins.
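
The abstract does not describe how the two layers of the cascade interact. One common arrangement, sketched below purely as an assumption, lets the first stage decide only when it is confident and defers the remaining samples to the next stage. The stage functions, the 0.8 threshold, and the toy data are all hypothetical; the metric function shows how accuracy, precision, and recall are computed from the predictions.

```python
def cascade_predict(x, stages, threshold=0.8):
    """Return the label from the first stage whose confidence reaches threshold.

    Each stage is a callable returning (label, confidence). The last stage
    always decides, whatever its confidence.
    """
    for stage in stages[:-1]:
        label, confidence = stage(x)
        if confidence >= threshold:
            return label
    return stages[-1](x)[0]


def precision_recall_accuracy(y_true, y_pred, positive=1):
    """Compute precision, recall, and accuracy for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, correct / len(pairs)


def stage_one(x):
    """Hypothetical first stage: thresholds feature 0, confident far from 0.5."""
    return (1 if x[0] >= 0.5 else 0), abs(x[0] - 0.5) * 2


def stage_two(x):
    """Hypothetical fallback stage: thresholds feature 1, always decides."""
    return (1 if x[1] >= 0.5 else 0), 1.0


# Toy samples (hypothetical) with two numeric features each.
xs = [(0.95, 0.2), (0.05, 0.9), (0.55, 0.9), (0.45, 0.1), (0.6, 0.2)]
y_true = [1, 0, 1, 0, 1]
y_pred = [cascade_predict(x, [stage_one, stage_two]) for x in xs]
precision, recall, accuracy = precision_recall_accuracy(y_true, y_pred)
```

In this toy run the first two samples are settled by the first stage and the other three fall through to the second, which misclassifies the last one.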

    Identification of pathway and gene markers using enhanced directed random walk for multiclass cancer expression data

    Cancer markers play a significant role in the diagnosis of the origin of cancers and in the detection of cancers from initial treatments. This is a challenging task owing to the heterogeneous nature of cancers. Identification of these markers could help improve the survival rate of cancer patients, since dedicated treatment, or even prevention, can be provided according to the diagnosis. Previous investigations show that the use of pathway topology information could help in the detection of cancer markers from gene expression. Such analysis reduces the complexity from thousands of genes to a few hundred pathways. However, most existing methods group different cancer subtypes together as generic disease samples and assume that all pathways contribute equally to the analysis. Meanwhile, several other methods ignore the interaction between multiple genes and the genes with missing edges, which could lead to poor performance in identifying cancer markers from gene expression. Thus, this research proposes an enhanced directed random walk to identify pathway and gene markers for multiclass cancer gene expression data. Firstly, an improved pathway selection using analysis of variance (ANOVA), which enables the consideration of multiple cancer subtypes, is performed. Subsequently, k-means clustering and the average silhouette method are integrated into the directed random walk so that the interaction of multiple genes is considered. The proposed methods are tested on benchmark gene expression datasets (breast, lung, and skin cancers) and biological pathways. The performance of the proposed methods is then measured and compared in terms of classification accuracy and area under the receiver operating characteristic curve (AUC). The results indicate that the proposed methods are able to identify a list of pathway and gene markers from the datasets with better classification accuracy and AUC.
The proposed methods improved classification performance by between 1% and 35% compared with existing methods. The cell cycle and p53 signaling pathways were found to be significantly associated with breast, lung, and skin cancers, while the cell cycle pathway was highly enriched in squamous cell carcinoma and adenocarcinoma.
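
To illustrate how ANOVA can rank pathways across more than two cancer subtypes, the sketch below computes the one-way ANOVA F-statistic for a per-pathway activity score grouped by subtype. It is a minimal illustration of the statistic itself, not the paper's pipeline; the pathway names and scores are hypothetical.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic for a list of groups of numeric values."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the samples around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical activity scores: one list per cancer subtype for each pathway.
pathway_scores = {
    "pathway_x": [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]],
    "pathway_y": [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]],
}
f_scores = {name: anova_f(groups) for name, groups in pathway_scores.items()}
# Pathways whose activity differs most between subtypes come first.
ranked = sorted(f_scores, key=f_scores.get, reverse=True)
```

A larger F means the between-subtype variation dominates the within-subtype variation, so such a pathway is a stronger candidate for discriminating subtypes. The sketch assumes each pathway has nonzero within-group variation.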

    Identification of protein functions using a machine-learning approach based on sequence-derived properties

    Background: Predicting the function of an unknown protein is an essential goal in bioinformatics. Sequence similarity-based approaches are widely used for function prediction; however, they are often inadequate in the absence of similar sequences or when the sequence similarity among known protein sequences is statistically weak. This study aimed to develop an accurate prediction method for identifying protein function, irrespective of sequence and structural similarities.
    Results: A highly accurate prediction method capable of identifying protein function, based solely on protein sequence properties, is described. This method analyses and identifies specific features of the protein sequence that are highly correlated with certain protein functions and determines the combination of protein sequence features that best characterises protein function. Thirty-three features that represent subtle differences in local regions and full regions of the protein sequences were introduced. On the basis of 484 features extracted solely from the protein sequence, models were built to predict the functions of 11 different proteins from a broad range of cellular components, molecular functions, and biological processes. The accuracy of protein function prediction using random forests with feature selection ranged from 94.23% to 100%. The local sequence information was found to have a broad range of applicability in predicting protein function.
    Conclusion: We present an accurate prediction method using a machine-learning approach based solely on protein sequence properties. The primary contribution of this paper is to propose new PNPRD features representing global and/or local differences in sequences, based on positively and/or negatively charged residues, to assist in predicting protein function. In addition, we identified a compact and useful feature subset for predicting the function of various proteins. Our results indicate that sequence-based classifiers can provide good results among a broad range of proteins, that the proposed features are useful in predicting several functions, and that the combination of our and traditional features may support the creation of a discriminative feature set for specific protein functions.