
    A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics

    The combination of multiple classifiers using ensemble methods is increasingly important for making progress in a variety of difficult prediction problems. We present a comparative analysis of several ensemble methods through two case studies in genomics, namely the prediction of genetic interactions and protein functions, to demonstrate their efficacy on real-world datasets and draw useful conclusions about their behavior. These methods include simple aggregation, meta-learning, cluster-based meta-learning, and ensemble selection using heterogeneous classifiers trained on resampled data to improve the diversity of their predictions. We present a detailed analysis of these methods across 4 genomics datasets and find that the best of these methods offer statistically significant improvements over the state of the art in their respective domains. In addition, we establish a novel connection between ensemble selection and meta-learning, demonstrating how both of these disparate methods establish a balance between ensemble diversity and performance. Comment: 10 pages, 3 figures, 8 tables, to appear in Proceedings of the 2013 International Conference on Data Minin
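To make two of the methods named in this abstract concrete, here is a minimal sketch of simple aggregation (averaging predicted probabilities) and of greedy forward ensemble selection with replacement. It is not the authors' implementation: the classifier names, the toy validation probabilities, and the use of accuracy as the selection metric are all assumptions made for illustration.

```python
def average_probs(prob_lists):
    """Simple aggregation: mean of the members' predicted probabilities."""
    n_samples = len(prob_lists[0])
    return [sum(p[i] for p in prob_lists) / len(prob_lists) for i in range(n_samples)]


def accuracy(probs, labels, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(probs, labels))
    return hits / len(labels)


def greedy_ensemble_selection(candidates, labels, max_size=5):
    """Forward selection with replacement on held-out validation predictions.

    At each step, add the candidate whose inclusion gives the best averaged
    ensemble; stop as soon as no candidate strictly improves the score.
    """
    chosen, best_score = [], -1.0
    for _ in range(max_size):
        scored = [
            (accuracy(average_probs([candidates[c] for c in chosen + [name]]), labels), name)
            for name in sorted(candidates)  # sorted for deterministic tie-breaking
        ]
        score, name = max(scored, key=lambda t: t[0])
        if score <= best_score:
            break
        chosen.append(name)
        best_score = score
    return chosen, best_score


# Hypothetical validation-set probabilities from three base classifiers.
labels = [1, 0, 1, 0]
candidates = {
    "svm": [0.9, 0.7, 0.4, 0.1],
    "rf": [0.3, 0.1, 0.9, 0.6],
    "nb": [0.8, 0.2, 0.45, 0.3],
}
selected, score = greedy_ensemble_selection(candidates, labels)
print(selected, score)
```

In this toy data the two selected members make different errors, so their average is better than either alone, which is the diversity/performance trade-off the abstract refers to.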

    In silico comparative genomics analysis of Plasmodium falciparum for the identification of putative essential genes and therapeutic candidates.

    A sequence of computational methods was used to predict novel drug targets against the drug-resistant malaria parasite Plasmodium falciparum. Comparative genomics, orthologous protein analysis among the same and other malaria parasites, and a protein-protein interaction study provided new insights into determining the essential genes and novel therapeutic candidates. Among the predicted list of 21 essential proteins from unique pathways, 11 proteins were prioritized as anti-malarial drug targets. As a case study, we built homology models of two uncharacterized proteins using MODELLER v9.13 software from possible templates. Functional annotation of these proteins was done using the InterPro databases and the ProBiS server by comparison of predicted binding site residues. The models were subjected to an in silico docking study with screened potent lead compounds from the ZINC database by Dock Blaster software using AutoDock 4. Results from this study facilitate the selection of proteins and putative inhibitors for entry into drug design production pipelines.

    A convex pseudo-likelihood framework for high dimensional partial correlation estimation with convergence guarantees

    Sparse high dimensional graphical model selection is a topic of much interest in modern-day statistics. A popular approach is to apply l1-penalties to either (1) parametric likelihoods or (2) regularized regression/pseudo-likelihoods, with the latter having the distinct advantage that they do not explicitly assume Gaussianity. As none of the popular methods proposed for solving pseudo-likelihood based objective functions have provable convergence guarantees, it is not clear if corresponding estimators exist or are even computable, or if they actually yield correct partial correlation graphs. This paper proposes a new pseudo-likelihood based graphical model selection method that aims to overcome some of the shortcomings of current methods while retaining all their respective strengths. In particular, we introduce a novel framework that leads to a convex formulation of the partial covariance regression graph problem, resulting in an objective function comprised of quadratic forms. The objective is then optimized via a coordinate-wise approach. The specific functional form of the objective function facilitates rigorous convergence analysis leading to convergence guarantees, an important property that cannot be established using standard results when the dimension is larger than the sample size, as is often the case in high dimensional applications. These convergence guarantees ensure that estimators are well-defined under very general conditions and are always computable. In addition, the approach yields estimators that have good large sample properties and also respect symmetry. Furthermore, application to simulated/real data, timing comparisons, and numerical convergence are demonstrated. We also present a novel unifying framework that places all graphical pseudo-likelihood methods as special cases of a more general formulation, leading to important insights.
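The abstract does not give the estimator itself, so the sketch below does not reproduce the proposed method. It only illustrates the object being estimated: given a precision (inverse covariance) matrix with entries ω_ij, the partial correlation between variables i and j is −ω_ij / sqrt(ω_ii ω_jj), and the partial correlation graph has an edge wherever that entry is nonzero. The example matrix `omega` is a made-up assumption.

```python
import math


def partial_correlations(precision):
    """Convert a precision matrix (nested lists) to a partial correlation matrix."""
    p = len(precision)
    out = [[1.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            if i != j:
                scale = math.sqrt(precision[i][i] * precision[j][j])
                out[i][j] = -precision[i][j] / scale
    return out


def graph_edges(precision, tol=1e-12):
    """Edges of the partial correlation graph: nonzero off-diagonal entries."""
    p = len(precision)
    return [(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(precision[i][j]) > tol]


# Hypothetical sparse precision matrix for a 3-variable chain 0 - 1 - 2.
omega = [
    [2.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 2.0],
]
print(partial_correlations(omega))
print(graph_edges(omega))
```

Variables 0 and 2 have zero partial correlation here, so the graph has no edge between them even though they are marginally correlated through variable 1.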

    Genetic heterogeneity analysis using genetic algorithm and network science

    Through genome-wide association studies (GWAS), disease-susceptible genetic variables can be identified by comparing the genetic data of individuals with and without a specific disease. However, the discovery of these associations poses a significant challenge due to genetic heterogeneity and feature interactions. Genetic variables intertwined with these effects often exhibit lower effect sizes, and thus can be difficult to detect using machine learning feature selection methods. To address these challenges, this paper introduces a novel feature selection mechanism for GWAS, named Feature Co-selection Network (FCS-Net). FCS-Net is designed to extract heterogeneous subsets of genetic variables from a network constructed from multiple independent feature selection runs based on a genetic algorithm (GA), an evolutionary learning algorithm. We employ a non-linear machine learning algorithm to detect feature interaction. We introduce the Community Risk Score (CRS), a synthetic feature designed to quantify the collective disease association of each variable subset. Our experiment showcases the effectiveness of the utilized GA-based feature selection method in identifying feature interactions through synthetic data analysis. Furthermore, we apply our novel approach to a case-control colorectal cancer GWAS dataset. The resulting synthetic features are then used to explain the genetic heterogeneity in an additional case-only GWAS dataset.

    Distributed Quantile Regression Analysis and a Group Variable Selection Method

    This dissertation develops novel methodologies for distributed quantile regression analysis for big data by utilizing a distributed optimization algorithm called the alternating direction method of multipliers (ADMM). Specifically, we first write the penalized quantile regression in a specific form that can be solved by the ADMM and propose numerical algorithms for solving the ADMM subproblems. This results in the distributed QR-ADMM algorithm. Then, to further reduce the computational time, we formulate the penalized quantile regression in another equivalent ADMM form in which all the subproblems have exact closed-form solutions and hence avoid iterative numerical methods. This results in the single-loop QPADM algorithm, which further improves on the computational efficiency of the QR-ADMM. Both QR-ADMM and QPADM enjoy flexible parallelization by enabling data splitting across both sample space and feature space, which makes them especially appealing when both the sample size n and the feature dimension p are large. Besides the QR-ADMM and QPADM algorithms for penalized quantile regression, we also develop a group variable selection method by approximating the Bayesian information criterion. Unlike existing penalization methods for feature selection, our proposed gMIC algorithm is free of parameter tuning and hence enjoys greater computational efficiency. Although the current version of gMIC focuses on the generalized linear model, it can be naturally extended to quantile regression for feature selection. We provide theoretical analysis for our proposed methods. Specifically, we conduct numerical convergence analysis for the QR-ADMM and QPADM algorithms, and provide asymptotic theory and the oracle property of feature selection for the gMIC method. All our methods are evaluated with simulation studies and real data analysis.
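As a hedged illustration of what a "closed-form subproblem" can look like in this setting, the sketch below implements the quantile (check) loss and its proximal operator, a standard building block when ADMM is applied to quantile regression. The abstract does not state the exact QR-ADMM or QPADM updates, so this is not the dissertation's algorithm; the function names and the parameter `sigma` (the ADMM penalty parameter) are assumptions.

```python
def check_loss(r, tau):
    """Quantile (pinball) loss: tau * r if r >= 0, else (tau - 1) * r."""
    return r * (tau - (1.0 if r < 0 else 0.0))


def prox_check_loss(v, tau, sigma):
    """Closed-form minimizer over r of check_loss(r, tau) + (sigma / 2) * (r - v) ** 2.

    The solution shifts v toward zero by tau / sigma on the positive side and
    by (1 - tau) / sigma on the negative side, and is exactly zero in between.
    """
    upper = tau / sigma
    lower = -(1.0 - tau) / sigma
    if v > upper:
        return v - upper
    if v < lower:
        return v - lower
    return 0.0


# Residual updates for a few hypothetical values at the median (tau = 0.5).
for v in (1.0, 0.2, -1.0):
    print(v, prox_check_loss(v, tau=0.5, sigma=1.0))
```

Because this update needs no inner iterations, an ADMM loop that uses it can apply it elementwise to each block of residuals, which is what makes splitting the data across workers straightforward.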

    Using Pythagorean Fuzzy Sets (PFS) in Multiple Criteria Group Decision Making (MCGDM) Methods for Engineering Materials Selection Applications

    The materials selection process is critical during the initial stages of designing manufactured products. Inefficient decision-making in the material selection process could result in poor product quality and unnecessary costs. In the last century, numerous materials have been developed for manufacturing mechanical components in different industries. Many of these new materials are similar in their properties and performance, which makes it challenging for designers and engineers to make accurate selections. Our main objective in this work is to assist decision makers (DMs) within the manufacturing field in evaluating material alternatives and selecting the best alternative for specific manufacturing purposes. In this research, new hybrid fuzzy Multiple Criteria Group Decision Making (MCGDM) methods are proposed for the material selection problem. The proposed methods tackle some challenges associated with the material selection decision-making process, such as aggregating DMs' decisions appropriately and modeling uncertainty. In the proposed hybrid models, a novel aggregation approach is developed to convert DMs' crisp decisions to Pythagorean fuzzy sets (PFS). This approach gives DMs more flexibility to express their opinions than traditional fuzzy sets and intuitionistic fuzzy sets (IFS). Then, the proposed aggregation approach is integrated with a ranking method to solve the Pythagorean Fuzzy Multiple Criteria Group Decision Making (PFMCGDM) problem and rank the material alternatives. The ranking methods used in the hybrid models are the Pythagorean Fuzzy TOPSIS (The Technique for Order of Preference by Similarity to Ideal Solution) and Pythagorean Fuzzy COPRAS (COmplex PRoportional Assessment). TOPSIS and COPRAS are selected based on their effectiveness and practicality in dealing with the nature of material selection problems.
In the aggregation approach, the Sugeno fuzzy measure and the Shapley value are used to fairly distribute the DMs' weights in the Pythagorean fuzzy numbers. Additionally, new functions to calculate uncertainty from DMs' recommendations are developed using the Takagi-Sugeno approach. The literature reveals some work on these methods, but to our knowledge, there are no published works that integrate the proposed aggregation approach with the selected MCDM ranking methods under the Pythagorean fuzzy environment for use in materials selection problems. Furthermore, the proposed methods might be applied, due to their novelty, to any MCDM problem in other areas. A practical validation of the proposed hybrid PFMCGDM methods is carried out through a case study of material selection for high pressure turbine blades in jet engines. The main objectives of the case study were: 1) to investigate the newly developed aggregation approach in converting real DMs' crisp decisions into Pythagorean fuzzy numbers; 2) to test the applicability of both the hybrid PFMCGDM TOPSIS and the hybrid PFMCGDM COPRAS methods in the field of material selection. In this case study, a group of five DMs, faculty members and graduate students from the Materials Science and Engineering Department at the University of Wisconsin-Milwaukee, were selected to participate. Their evaluations fulfilled the first objective of the case study. A computer application for material selection was developed to assist designers and engineers in real-life problems. A comparative analysis was performed to compare the results of both hybrid MCGDM methods. A sensitivity analysis was conducted to show the robustness and reliability of the outcomes obtained from both methods. It is concluded that the proposed hybrid PFMCGDM TOPSIS method is more effective and practical in the material selection process than the proposed hybrid PFMCGDM COPRAS method.
Additionally, recommendations for further research are suggested.
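To illustrate why Pythagorean fuzzy sets give decision makers more room than intuitionistic ones, here is a minimal sketch of a Pythagorean fuzzy number. A PFS requires the squared membership and non-membership degrees to sum to at most 1, whereas an IFS requires the degrees themselves to sum to at most 1. The class name is hypothetical, and the score function shown (membership squared minus non-membership squared) is one commonly used choice, assumed here because the abstract does not specify which is used; the aggregation approach of the abstract is not reproduced.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PythagoreanFuzzyNumber:
    mu: float  # degree of membership
    nu: float  # degree of non-membership

    def __post_init__(self):
        if not (0.0 <= self.mu <= 1.0 and 0.0 <= self.nu <= 1.0):
            raise ValueError("mu and nu must lie in [0, 1]")
        if self.mu ** 2 + self.nu ** 2 > 1.0 + 1e-12:
            raise ValueError("Pythagorean condition mu^2 + nu^2 <= 1 violated")

    def is_intuitionistic(self):
        """True if the pair also satisfies the stricter IFS condition mu + nu <= 1."""
        return self.mu + self.nu <= 1.0

    def hesitancy(self):
        """Degree of indeterminacy: sqrt(1 - mu^2 - nu^2)."""
        return math.sqrt(max(0.0, 1.0 - self.mu ** 2 - self.nu ** 2))

    def score(self):
        """Assumed score function for ranking: mu^2 - nu^2."""
        return self.mu ** 2 - self.nu ** 2


# A rating that an IFS cannot represent (0.8 + 0.5 > 1) but a PFS can.
rating = PythagoreanFuzzyNumber(mu=0.8, nu=0.5)
print(rating.is_intuitionistic(), rating.hesitancy(), rating.score())
```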