118 research outputs found

    Large-scale protein function prediction using heterogeneous ensembles [version 1; referees: 2 approved]

    Heterogeneous ensembles are an effective approach in scenarios where the ideal data type and/or individual predictor are unclear for a given problem. These ensembles have shown promise for protein function prediction (PFP), but their ability to improve PFP at a large scale is unclear. The overall goal of this study is to critically assess this ability for a variety of heterogeneous ensemble methods across a multitude of functional terms, proteins and organisms. Our results show that these methods, especially stacking using logistic regression, indeed produce more accurate predictions for a variety of Gene Ontology terms differing in size and specificity. To enable the application of these methods to other related problems, we have publicly shared the HPC-enabled code underlying this work as LargeGOPred (https://github.com/GauravPandeyLab/LargeGOPred).
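    As an illustration of the core technique, here is a minimal sketch of stacking heterogeneous base classifiers with a logistic regression meta-learner, in the spirit of the ensembles assessed here; the synthetic data, the particular base learners and all settings are illustrative assumptions, not the LargeGOPred pipeline itself.

```python
# A sketch of stacking with a logistic regression meta-learner, assuming
# synthetic stand-in data for protein feature vectors labeled with one GO term.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=40, random_state=0)

# Heterogeneous base predictors: each learner has a different inductive bias.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
    ("nb", GaussianNB()),
]

# Logistic regression combines the base predictions, fitted via internal CV.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)

print("Stacking AUC: %.3f" % cross_val_score(stack, X, y,
                                             scoring="roc_auc", cv=5).mean())
```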

    Statistical Issues in Machine Learning

    Recursive partitioning methods from machine learning are widely applied in scientific fields such as genetics and bioinformatics. The present work is concerned, from a statistical point of view, with the two main problems that arise in recursive partitioning: instability and biased variable selection. With respect to the first issue, instability, this work covers the entire scope of methods, from standard classification trees through robustified classification trees to ensemble methods such as TWIX, bagging and random forests. While ensemble methods prove to be much more stable than single trees, they also lose most of their interpretability. Therefore, an adaptive cutpoint selection scheme is suggested with which a TWIX ensemble reduces to a single tree if the partition is sufficiently stable. With respect to the second issue, variable selection bias, the statistical sources of this artifact in single trees are investigated, along with a new form of bias inherent in ensemble methods based on bootstrap samples. For single trees, one unbiased split selection criterion is evaluated and another is newly introduced here. Based on the results for single trees and further findings on the effects of bootstrap sampling on association measures, it is shown that, in addition to using an unbiased split selection criterion, subsampling instead of bootstrap sampling should be employed in ensemble methods so that the variable importance scores of predictor variables of different types can be compared reliably. The statistical properties and the null hypothesis of a test for the random forest variable importance are critically investigated. Finally, a new, conditional importance measure is suggested that allows for a fair comparison in the case of correlated predictor variables and better reflects the null hypothesis of interest.
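    The subsampling recommendation can be sketched as follows: a bagged-tree ensemble drawn by subsampling without replacement, with permutation-based importances computed on held-out data. This is a hypothetical Python illustration; the conditional importance measure itself is available elsewhere (e.g., in the R package party), and the 0.632 subsample fraction is an assumption.

```python
# A sketch of subsampling instead of bootstrapping in a tree ensemble,
# with permutation importances; data and settings are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, n_informative=3,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# bootstrap=False with max_samples < 1.0 draws subsamples *without*
# replacement for each tree (the default base estimator is a decision
# tree), avoiding the bootstrap-induced bias discussed above.
ensemble = BaggingClassifier(n_estimators=200, bootstrap=False,
                             max_samples=0.632, random_state=1)
ensemble.fit(X_train, y_train)

# Permutation importance on held-out data allows a fairer comparison of
# predictor variables than impurity-based scores.
imp = permutation_importance(ensemble, X_test, y_test, n_repeats=20,
                             random_state=1)
for i in np.argsort(imp.importances_mean)[::-1][:5]:
    print(f"feature {i}: {imp.importances_mean[i]:.3f}")
```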

    Individual and ensemble functional link neural networks for data classification

    This study investigated the Functional Link Neural Network (FLNN) for solving data classification problems. FLNN-based models were developed using evolutionary methods as well as ensemble methods. The outcomes of experiments covering benchmark classification problems positively demonstrated the efficacy of the proposed models for data classification.
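    A minimal sketch of the FLNN idea follows, assuming a trigonometric functional expansion and a single logistic output unit trained by gradient descent; the expansion order, learning rate and toy data are illustrative, not the study's exact configuration (which also builds evolutionary and ensemble variants).

```python
# A sketch of a Functional Link Neural Network: the inputs are expanded
# with trigonometric basis functions, and only a single linear output
# layer is trained on the expansion.
import numpy as np

def functional_expansion(X):
    """Expand each feature x with sin/cos terms: [x, sin(k*pi*x), cos(k*pi*x)], k=1,2."""
    parts = [X]
    for k in (1, 2):
        parts.append(np.sin(k * np.pi * X))
        parts.append(np.cos(k * np.pi * X))
    return np.hstack(parts)

def train_flnn(X, y, lr=0.5, epochs=500):
    """Train the single linear output layer of the FLNN with gradient descent."""
    H = functional_expansion(X)
    w, b = np.zeros(H.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(H @ w + b)))  # logistic output
        grad = p - y                            # cross-entropy gradient
        w -= lr * H.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

# Toy usage: points inside a circle vs outside, a nonlinear boundary that
# the linear layer can separate only thanks to the functional expansion.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(float)
w, b = train_flnn(X, y)
pred = 1.0 / (1.0 + np.exp(-(functional_expansion(X) @ w + b))) > 0.5
print("training accuracy:", (pred == y.astype(bool)).mean())
```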

    Methods to Improve the Prediction Accuracy and Performance of Ensemble Models

    The application of ensemble predictive models has been an important research area in medical diagnostics, engineering diagnostics, and related smart devices and technologies. Most current predictive models are complex and unreliable despite numerous past efforts by the research community. The performance accuracy of predictive models has not always been realised, owing to factors such as complexity and class imbalance. There is therefore a need to improve the predictive accuracy of current ensemble models and to enhance their application and reliability as non-invasive predictive tools. The research work presented in this thesis adopts a pragmatic, phased approach to propose and develop new ensemble models using multiple methods, validating them through rigorous testing and implementation in different phases. The first phase comprises empirical investigations of standalone and ensemble algorithms, carried out to ascertain the performance effects of classifier complexity and simplicity. The second phase comprises an improved ensemble model based on the integration of the Extended Kalman Filter (EKF), the Radial Basis Function Network (RBFN) and the AdaBoost algorithm. The third phase comprises an extended model based on early-stopping concepts, the AdaBoost algorithm, and the statistical performance of the training samples, designed to minimize overfitting of the proposed model. The fourth phase comprises an enhanced analytical multivariate logistic regression predictive model developed to reduce complexity and improve the prediction accuracy of the logistic regression model. To facilitate the practical application of the proposed models, an ensemble non-invasive analytical tool is proposed and developed. The tool bridges the gap between theoretical concepts and their practical application in predicting breast cancer survivability. The empirical findings suggest that: (1) increasing the complexity and topology of algorithms does not necessarily lead to better algorithmic performance; (2) boosting by resampling performs slightly better than boosting by reweighting; (3) the proposed ensemble EKF-RBFN-AdaBoost model achieves better prediction accuracy than several established ensemble models; (4) the proposed early-stopped model converges faster and minimizes overfitting better than other models; (5) the proposed multivariate logistic regression concept reduces model complexity; and (6) the proposed analytical non-invasive tool performs comparatively better than many of the benchmark analytical tools used in predicting breast cancer and diabetic ailments. The research contributions to ensemble practice are: (1) the integration and development of the EKF, RBFN and AdaBoost algorithms as an ensemble model; (2) the development and validation of an ensemble model based on early-stopping concepts, AdaBoost, and statistical properties of the training samples; (3) the development and validation of a predictive logistic regression model for breast cancer; and (4) the development and validation of non-invasive breast cancer analytic tools based on the predictive models proposed and developed in this thesis. To validate the prediction accuracy of the ensemble models, the proposed models were applied to modelling breast cancer survivability and diabetes diagnostic tasks. In comparison with other established models, the simulation results showed improved predictive accuracy.
    The research outlines the benefits of the proposed models and proposes new directions for future work that could further extend and improve the models discussed in this thesis.
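    The early-stopping idea of the third phase can be sketched generically: train a boosted ensemble, then keep the boosting round that minimizes validation error using staged predictions. This is a plain scikit-learn AdaBoost illustration on a public breast cancer dataset, not the thesis's EKF-RBFN-AdaBoost model or its statistical stopping rule.

```python
# A sketch of early stopping for AdaBoost: inspect the validation error
# after every boosting round and keep the best round.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)

booster = AdaBoostClassifier(n_estimators=300, random_state=0)
booster.fit(X_train, y_train)

# staged_predict yields predictions after each boosting round, so the
# whole validation curve is obtained without refitting.
val_errors = [np.mean(pred != y_val)
              for pred in booster.staged_predict(X_val)]
best_round = int(np.argmin(val_errors)) + 1
print(f"stop at round {best_round}, "
      f"validation error {val_errors[best_round - 1]:.3f}")
```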

    Simple but Not Simplistic: Reducing the Complexity of Machine Learning Methods

    Programa Oficial de Doutoramento en Computación. 5009V01. [Abstract] The advent of Big Data and the explosion of the Internet of Things have brought unprecedented challenges to Machine Learning researchers, making the learning process even more complex. Real-world machine learning problems usually have inherent complexities, such as the intrinsic characteristics of the data, a large number of instances, high input dimensionality, dataset shift, etc. All these aspects matter and require new models that can confront these situations. In this thesis, all these problems are addressed, with the aim of simplifying the machine learning process in the current scenario. First, a complexity analysis is carried out to observe how complexity influences the classification task, and whether a prior feature selection step can reduce it. Then, the machine learning phase is simplified through the divide-and-conquer philosophy, using a distributed approach. Next, the same philosophy is applied to the feature selection process. Finally, a different approach is taken following the Edge Computing philosophy, which allows the data produced by Internet of Things devices to be processed closer to where they were created. The proposed approaches have demonstrated their capability to reduce the complexity of traditional machine learning methods, and it is therefore expected that the contribution of this thesis will open the doors to the development of new machine learning methods that are simpler, more robust, and more computationally efficient.
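    The divide-and-conquer idea can be sketched for feature selection: split the samples into partitions (which could live on different nodes), score the features on each partition with a filter, and merge the partial scores. The partition count and the mutual-information filter are illustrative assumptions, not the thesis's exact procedure.

```python
# A sketch of divide-and-conquer feature selection: per-partition filter
# scores are averaged into a global ranking.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=2000, n_features=50, n_informative=5,
                           random_state=0)

def distributed_feature_scores(X, y, n_partitions=4):
    """Score features on disjoint sample partitions and average the scores."""
    idx = np.array_split(np.random.default_rng(0).permutation(len(y)),
                         n_partitions)
    # Each partition could run on a different node; here it is a loop.
    partial = [mutual_info_classif(X[rows], y[rows], random_state=0)
               for rows in idx]
    return np.mean(partial, axis=0)

scores = distributed_feature_scores(X, y)
selected = np.argsort(scores)[::-1][:5]
print("selected features:", selected)
```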

    Statistical learning in complex and temporal data: distances, two-sample testing, clustering, classification and Big Data

    Programa Oficial de Doutoramento en Estatística e Investigación Operativa. 555V01. [Abstract] This thesis deals with the problem of statistical learning in complex objects, with an emphasis on time series data. The problem is approached by facilitating the introduction of domain knowledge about the underlying phenomena by means of distances and features. A distance-based two-sample test is proposed, and its performance is studied under a wide range of scenarios. Distances for time series classification and clustering are also shown to increase statistical power when applied to two-sample testing. Our test compares favorably to other methods thanks to its flexibility against different alternatives. A new distance between time series is defined through an innovative way of comparing the lagged distributions of the series. This distance inherits the good empirical performance of existing methods while removing some of their limitations. A forecasting method based on time series features is proposed. The method combines individual standard forecasting algorithms using a weighted average, whose weights come from a learning model fitted on a large training set. A distributed classification algorithm is proposed, based on comparing, using a distance, the empirical distribution function of the data received by each computing node with that of a common test set.
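    A distance-based two-sample permutation test can be sketched as follows, using the well-known energy statistic as a stand-in for the thesis's own test statistic; the sample sizes and permutation count are illustrative.

```python
# A sketch of a distance-based two-sample permutation test with the
# energy statistic (a stand-in, not the thesis's exact statistic).
import numpy as np
from scipy.spatial.distance import cdist

def energy_statistic(x, y):
    """Energy distance between samples x (n, d) and y (m, d)."""
    n, m = len(x), len(y)
    return (2.0 * cdist(x, y).mean()
            - cdist(x, x).sum() / (n * n)
            - cdist(y, y).sum() / (m * m))

def two_sample_test(x, y, n_perm=500, seed=0):
    """Permutation p-value for H0: both samples come from one distribution."""
    rng = np.random.default_rng(seed)
    observed = energy_statistic(x, y)
    pooled = np.vstack([x, y])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        xs, ys = pooled[perm[:len(x)]], pooled[perm[len(x):]]
        hits += energy_statistic(xs, ys) >= observed
    return (hits + 1) / (n_perm + 1)

# Toy usage: two Gaussian samples with different means, so H0 should be rejected.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=(60, 3))
y = rng.normal(0.5, 1.0, size=(60, 3))
print("p-value:", two_sample_test(x, y))
```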

    Understanding Random Forests: From Theory to Practice

    Data analysis and machine learning have become an integral part of the modern scientific methodology, offering automated procedures for predicting a phenomenon based on past observations, unraveling underlying patterns in data and providing insights about the problem. Yet machine learning should not be used as a black-box tool, but rather considered as a methodology, with a rational thought process that is entirely dependent on the problem under study. In particular, using an algorithm should ideally require a reasonable understanding of its mechanisms, properties and limitations, in order to better apprehend and interpret its results. Accordingly, the goal of this thesis is to provide an in-depth analysis of random forests, consistently calling into question each and every part of the algorithm, in order to shed new light on its learning capabilities, inner workings and interpretability. The first part of this work studies the induction of decision trees and the construction of ensembles of randomized trees, motivating their design and purpose whenever possible. Our contributions follow with an original complexity analysis of random forests, showing their good computational performance and scalability, along with an in-depth discussion of their implementation details, as contributed within Scikit-Learn. In the second part of this work, we analyse and discuss the interpretability of random forests through the lens of variable importance measures. The core of our contributions rests in the theoretical characterization of the Mean Decrease of Impurity variable importance measure, from which we prove and derive some of its properties in the case of multiway totally randomized trees and in asymptotic conditions. In consequence of this work, our analysis demonstrates that variable importances [...]
    Comment: PhD thesis. Source code available at https://github.com/glouppe/phd-thesi
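    A minimal sketch of the Mean Decrease of Impurity (MDI) importance as exposed by scikit-learn's random forests, the implementation discussed in this thesis; the data and forest settings are illustrative.

```python
# A sketch of MDI variable importance from a scikit-learn random forest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                random_state=0).fit(X, y)

# feature_importances_ averages, over all trees, the impurity decrease at
# each split weighted by the fraction of samples reaching that node.
for i in np.argsort(forest.feature_importances_)[::-1]:
    print(f"X{i}: MDI importance {forest.feature_importances_[i]:.3f}")
```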

    Cancer proteogenomics: connecting genotype to molecular phenotype

    The central dogma of molecular biology describes the one-way road from DNA to RNA and finally to protein. Yet how this flow of information encoded in DNA as genes (genotype) is regulated in order to produce the observable traits of an individual (phenotype) remains unanswered. Recent advances in high-throughput data, i.e., 'omics', have allowed the quantification of DNA, RNA and protein levels, leading to integrative analyses that essentially probe the central dogma along all of its constituent molecules. Evidence from these analyses suggests that mRNA abundances are at best a moderate proxy for proteins, which are the main functional units of cells and thus closer to the phenotype. Cancer proteogenomic studies consider the ensemble of proteins, the so-called proteome, as the readout of the functional molecular phenotype, and investigate how it is influenced by upstream events, for example DNA copy number alterations. In typical proteogenomic studies, however, the identified proteome is a simplification of its actual composition, as these studies methodologically disregard events such as splicing, proteolytic cleavage and post-translational modifications that generate unique protein species, or proteoforms. The scope of this thesis is to study proteome diversity in terms of: a) the complex genetic background of three tumor types, i.e., breast cancer, childhood acute lymphoblastic leukemia and lung cancer; and b) the proteoform composition, describing a computational method for detecting protein species based on their distinct quantitative profiles. In Paper I, we present a proteogenomic landscape of 45 breast cancer samples representative of the five PAM50 intrinsic subtypes. We studied the effect of copy number alterations (CNA) on mRNA and protein levels, overlaying a public dataset of drug-perturbed protein degradation. In Paper II, we describe a proteogenomic analysis of 27 B-cell precursor acute lymphoblastic leukemia clinical samples that compares high hyperdiploid versus ETV6/RUNX1-positive cases. We examined the impact of the amplified chromosomes on mRNA and protein abundance, specifically the linear trend between the amplification level and the dosage effect. Moreover, we investigated mRNA-protein quantitative discrepancies with regard to post-transcriptional and post-translational effects such as mRNA/protein stability and miRNA targeting. In Paper III, we describe a proteogenomic cohort of 141 non-small cell lung cancer clinical samples. We used clustering methods to identify six distinct proteome-based subtypes. We integrated the protein abundances in pathways using protein-protein correlation networks, bioinformatically deconvoluted the immune composition and characterized the neoantigen burden. In Paper IV, we developed a pipeline for proteoform detection from bottom-up mass-spectrometry-based proteomics. Using an in-depth proteomics dataset of 18 cancer cell lines, we identified proteoforms related to splice variant peptides supported by RNA-seq data. This thesis adds to the previous literature of proteogenomic studies by analyzing the tumor proteome and its regulation along the flow of the central dogma of molecular biology. It is anticipated that some of these findings will lead to novel insights about tumor biology and set the stage for clinical applications that improve current cancer patient care.
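    The idea behind detecting proteoforms from quantitative profiles can be sketched, under loose assumptions, as clustering the peptides of one protein by the correlation of their profiles across samples: weakly correlated peptide groups hint at distinct proteoforms. The simulated intensities, the 1 - Pearson distance and the clustering threshold below are illustrative, not the actual parameters of the Paper IV pipeline.

```python
# A sketch of proteoform candidate detection: hierarchical clustering of
# peptide quantitative profiles within one protein.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
n_samples = 18  # e.g., one quantitative value per cell line

# Simulated peptide intensities for one protein: two groups follow
# different profiles, as could happen if a splice variant removes the
# exons covered by some of the peptides.
base_a = rng.normal(size=n_samples)
base_b = rng.normal(size=n_samples)
peptides = np.vstack(
    [base_a + 0.2 * rng.normal(size=n_samples) for _ in range(4)]
    + [base_b + 0.2 * rng.normal(size=n_samples) for _ in range(3)])

# Distance between peptides = 1 - Pearson correlation of their profiles,
# in the condensed form expected by scipy's linkage.
corr = np.corrcoef(peptides)
dist = 1.0 - corr[np.triu_indices_from(corr, k=1)]

clusters = fcluster(linkage(dist, method="average"), t=0.5,
                    criterion="distance")
print("peptide cluster labels:", clusters)  # two groups -> proteoform candidates
```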