13 research outputs found

    Using Biological Networks and Gene-Expression Profiles for the Analysis of Diseases

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    Avoid Oversimplifications in Machine Learning: Going beyond the Class-Prediction Accuracy.

    Full text link
    Class-prediction accuracy provides a quick but superficial way of determining classifier performance. It does not inform on the reproducibility of the findings or whether the selected or constructed features used are meaningful and specific. Furthermore, the class-prediction accuracy oversummarizes and does not inform on how training and learning have been accomplished: two classifiers providing the same performance in one validation can disagree on many future validations. It does not provide explainability in its decision-making process and is not objective, as its value is also affected by class proportions in the validation set. Despite these issues, this does not mean we should omit the class-prediction accuracy. Instead, it needs to be enriched with accompanying evidence and tests that supplement and contextualize the reported accuracy. This additional evidence serves as augmentations and can help us perform machine learning better while avoiding naive reliance on oversimplified metrics

    Incorporating Pathway Information into Feature Selection Towards Better Performed Gene Signatures

    Get PDF
    To analyze gene expression data with sophisticated grouping structures and to extract hidden patterns from such data, feature selection is of critical importance. It is well known that genes do not function in isolation but rather work together within various metabolic, regulatory, and signaling pathways. If the biological knowledge contained within these pathways is taken into account, the resulting method is a pathway-based algorithm. Studies have demonstrated that a pathway-based method usually outperforms its gene-based counterpart in which no biological knowledge is considered. In this article, a pathway-based feature selection is firstly divided into three major categories, namely, pathway-level selection, bilevel selection, and pathway-guided gene selection. With bilevel selection methods being regarded as a special case of pathway-guided gene selection process, we discuss pathway-guided gene selection methods in detail and the importance of penalization in such methods. Last, we point out the potential utilizations of pathway-guided gene selection in one active research avenue, namely, to analyze longitudinal gene expression data. We believe this article provides valuable insights for computational biologists and biostatisticians so that they can make biology more computable

    Weighted-SAMGSR: Combining Significance Analysis of Microarray-Gene Set Reduction Algorithm with Pathway Topology-Based Weights to Select Relevant Genes

    Get PDF
    Background: It has been demonstrated that a pathway-based feature selection method that incorporates biological information within pathways during the process of feature selection usually outperforms a gene-based feature selection algorithm in terms of predictive accuracy and stability. Significance analysis of microarray-gene set reduction algorithm (SAMGSR), an extension to a gene set analysis method with further reduction of the selected pathways to their respective core subsets, can be regarded as a pathway-based feature selection method. Methods: In SAMGSR, whether a gene is selected is mainly determined by its expression difference between the phenotypes, and partially by the number of pathways to which this gene belongs. It ignores the topology information among pathways. In this study, we propose a weighted version of the SAMGSR algorithm by constructing weights based on the connectivity among genes and then combing these weights with the test statistics. Results: Using both simulated and real-world data, we evaluate the performance of the proposed SAMGSR extension and demonstrate that the weighted version outperforms its original version. Conclusions: To conclude, the additional gene connectivity information does faciliatate feature selection

    Identification of Prognostic Genes and Gene Sets for Early-Stage Non-Small Cell Lung Cancer Using Bi-Level Selection Methods

    Get PDF
    In contrast to feature selection and gene set analysis, bi-level selection is a process of selecting not only important gene sets but also important genes within those gene sets. Depending on the order of selections, a bi-level selection method can be classified into three categories – forward selection, which first selects relevant gene sets followed by the selection of relevant individual genes; backward selection which takes the reversed order; and simultaneous selection, which performs the two tasks simultaneously usually with the aids of a penalized regression model. To test the existence of subtype-specific prognostic genes for non-small cell lung cancer (NSCLC), we had previously proposed the Cox-filter method that examines the association between patients’ survival time after diagnosis with one specific gene, the disease subtypes, and their interaction terms. In this study, we further extend it to carry out forward and backward bi-level selection. Using simulations and a NSCLC application, we demonstrate that the forward selection outperforms the backward selection and other relevant algorithms in our setting. Both proposed methods are readily understandable and interpretable. Therefore, they represent useful tools for the researchers who are interested in exploring the prognostic value of gene expression data for specific subtypes or stages of a disease

    Computational Proteomics Using Network-Based Strategies

    Get PDF
    This thesis examines the productive application of networks towards proteomics, with a specific biological focus on liver cancer. Contempory proteomics (shot- gun) is plagued by coverage and consistency issues. These can be resolved via network-based approaches. The application of 3 classes of network-based approaches are examined: A traditional cluster based approach termed Proteomics Expansion Pipeline), a generalization of PEP termed Maxlink and a feature-based approach termed Proteomics Signature Profiling. PEP is an improvement on prevailing cluster-based approaches. It uses a state- of-the-art cluster identification algorithm as well as network-cleaning approaches to identify the critical network regions indicated by the liver cancer data set. The top PARP1 associated-cluster was identified and independently validated. Maxlink allows identification of undetected proteins based on the number of links to identified differential proteins. It is more sensitive than PEP due to more relaxed requirements. Here, the novel roles of ARRB1/2 and ACTB are identified and discussed in the context of liver cancer. Both PEP and Maxlink are unable to deal with consistency issues, PSP is the first method able to deal with both, and is termed feature-based since the network- based clusters it uses are predicted independently of the data. It is also capable of using real complexes or predicted pathway subnets. By combining pathways and complexes, a novel basis of liver cancer progression implicating nucleotide pool imbalance aggravated by mutations of key DNA repair complexes was identified. Finally, comparative evaluations suggested that pure network-based methods are vastly outperformed by feature-based network methods utilizing real complexes. This is indicative that the quality of current networks are insufficient to provide strong biological rigor for data analysis, and should be carefully evaluated before further validations.Open Acces
    corecore