A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics
The combination of multiple classifiers using ensemble methods is
increasingly important for making progress in a variety of difficult prediction
problems. We present a comparative analysis of several ensemble methods through
two case studies in genomics, namely the prediction of genetic interactions and
protein functions, to demonstrate their efficacy on real-world datasets and
draw useful conclusions about their behavior. These methods include simple
aggregation, meta-learning, cluster-based meta-learning, and ensemble selection
using heterogeneous classifiers trained on resampled data to improve the
diversity of their predictions. We present a detailed analysis of these methods
across 4 genomics datasets and find the best of these methods offer
statistically significant improvements over the state of the art in their
respective domains. In addition, we establish a novel connection between
ensemble selection and meta-learning, demonstrating how both of these disparate
methods establish a balance between ensemble diversity and performance.
Comment: 10 pages, 3 figures, 8 tables, to appear in Proceedings of the 2013
International Conference on Data Minin
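
To make two of the ensemble methods named in the abstract above concrete, here is a minimal sketch of simple aggregation (averaging base-classifier scores) and greedy ensemble selection with replacement. It is an illustration under stated assumptions, not the paper's implementation: the toy scores, the accuracy metric, the 0.5 threshold, and all function names are hypothetical.

```python
from statistics import mean

def average(preds):
    """Simple aggregation: mean score per example across base classifiers."""
    return [mean(col) for col in zip(*preds)]

def accuracy(scores, labels, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    return mean(int((s >= threshold) == bool(y)) for s, y in zip(scores, labels))

def ensemble_selection(preds, labels, rounds=5):
    """Greedy forward selection with replacement, scored on validation data.

    In each round, add the base classifier whose inclusion gives the best
    accuracy for the averaged ensemble (ties go to the lowest index).
    """
    chosen = []
    for _ in range(rounds):
        best = max(
            range(len(preds)),
            key=lambda i: accuracy(average([preds[j] for j in chosen + [i]]), labels),
        )
        chosen.append(best)
    return chosen

# Toy validation scores from three hypothetical base classifiers.
labels = [1, 0, 1, 0]
preds = [
    [0.9, 0.2, 0.8, 0.1],  # accurate classifier
    [0.6, 0.7, 0.4, 0.3],  # weak classifier
    [0.4, 0.4, 0.6, 0.6],  # weak classifier
]

print("simple aggregation accuracy:", accuracy(average(preds), labels))
print("selected classifier indices:", ensemble_selection(preds, labels, rounds=3))
```

Selecting with replacement lets a strong classifier be picked several times, which effectively raises its weight in the average.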
In silico comparative genomics analysis of Plasmodium falciparum for the identification of putative essential genes and therapeutic candidates.
A sequence of computational methods was used to predict novel drug targets against the drug-resistant malaria parasite Plasmodium falciparum. Comparative genomics, orthologous protein analysis among the same and other malaria parasites, and a protein-protein interaction study provide new insights into determining the essential genes and novel therapeutic candidates. Among the predicted list of 21 essential proteins from unique pathways, 11 proteins were prioritized as anti-malarial drug targets. As a case study, we built homology models of two uncharacterized proteins from possible templates using MODELLER v9.13. These proteins were functionally annotated using the InterPro databases and the ProBiS server, by comparison of predicted binding site residues. The models were subjected to an in silico docking study with potent lead compounds screened from the ZINC database by Dock Blaster, using AutoDock 4. Results from this study facilitate the selection of proteins and putative inhibitors for entry into drug design production pipelines.
A convex pseudo-likelihood framework for high dimensional partial correlation estimation with convergence guarantees
Sparse high dimensional graphical model selection is a topic of much interest
in modern day statistics. A popular approach is to apply l1-penalties to either
(1) parametric likelihoods, or, (2) regularized regression/pseudo-likelihoods,
with the latter having the distinct advantage that they do not explicitly
assume Gaussianity. As none of the popular methods proposed for solving
pseudo-likelihood based objective functions have provable convergence
guarantees, it is not clear if corresponding estimators exist or are even
computable, or if they actually yield correct partial correlation graphs. This
paper proposes a new pseudo-likelihood based graphical model selection method
that aims to overcome some of the shortcomings of current methods, but at the
same time retain all their respective strengths. In particular, we introduce a
novel framework that leads to a convex formulation of the partial covariance
regression graph problem, resulting in an objective function comprised of
quadratic forms. The objective is then optimized via a coordinate-wise
approach. The specific functional form of the objective function facilitates
rigorous convergence analysis leading to convergence guarantees; an important
property that cannot be established using standard results, when the dimension
is larger than the sample size, as is often the case in high dimensional
applications. These convergence guarantees ensure that estimators are
well-defined under very general conditions, and are always computable. In
addition, the approach yields estimators that have good large sample properties
and also respect symmetry. Furthermore, application to simulated/real data,
timing comparisons and numerical convergence are demonstrated. We also present a
novel unifying framework that places all graphical pseudo-likelihood methods as
special cases of a more general formulation, leading to important insights.
Genetic heterogeneity analysis using genetic algorithm and network science
Through genome-wide association studies (GWAS), disease susceptible genetic
variables can be identified by comparing the genetic data of individuals with
and without a specific disease. However, the discovery of these associations
poses a significant challenge due to genetic heterogeneity and feature
interactions. Genetic variables intertwined with these effects often exhibit
lower effect sizes, and thus can be difficult to detect using machine
learning feature selection methods. To address these challenges, this paper
introduces a novel feature selection mechanism for GWAS, named Feature
Co-selection Network (FCS-Net). FCS-Net is designed to extract heterogeneous
subsets of genetic variables from a network constructed from multiple
independent feature selection runs based on a genetic algorithm (GA), an
evolutionary learning algorithm. We employ a non-linear machine learning
algorithm to detect feature interaction. We introduce the Community Risk Score
(CRS), a synthetic feature designed to quantify the collective disease
association of each variable subset. Our experiment showcases the effectiveness
of the utilized GA-based feature selection method in identifying feature
interactions through synthetic data analysis. Furthermore, we apply our novel
approach to a case-control colorectal cancer GWAS dataset. The resulting
synthetic features are then used to explain the genetic heterogeneity in an
additional case-only GWAS dataset.
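
The abstract above describes a network built from multiple independent feature selection runs, from which subsets of variables are extracted. The sketch below shows one way such a co-selection network could be assembled. The edge rule (two features are linked when they are selected together in at least `min_count` runs) and the use of connected components as a stand-in for community detection are assumptions made for illustration; the feature names and run outputs are made up, and none of this is taken from FCS-Net itself.

```python
from collections import Counter
from itertools import combinations

def co_selection_network(runs, min_count=2):
    """Edge weight = number of runs in which two features were co-selected."""
    counts = Counter()
    for selected in runs:
        for a, b in combinations(sorted(set(selected)), 2):
            counts[(a, b)] += 1
    # Keep only pairs that were co-selected often enough (assumed threshold).
    return {edge: n for edge, n in counts.items() if n >= min_count}

def communities(edges):
    """Connected components of the network, found with union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical outputs of four independent feature selection runs.
runs = [
    ["snp1", "snp2", "snp7"],
    ["snp1", "snp2"],
    ["snp4", "snp5"],
    ["snp4", "snp5", "snp2"],
]
network = co_selection_network(runs)
print(network)
print(communities(network))
```

Each resulting group is a candidate variable subset that a score such as the abstract's Community Risk Score could then summarize.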
Distributed Quantile Regression Analysis and a Group Variable Selection Method
This dissertation develops novel methodologies for distributed quantile regression analysis
for big data by utilizing a distributed optimization algorithm called the alternating direction
method of multipliers (ADMM). Specifically, we first write the penalized quantile regression
into a specific form that can be solved by the ADMM and propose numerical algorithms
for solving the ADMM subproblems. This results in the distributed QR-ADMM
algorithm. Then, to further reduce the computational time, we formulate the penalized
quantile regression into another equivalent ADMM form in which all the subproblems have
exact closed-form solutions and hence avoid iterative numerical methods. This results in the
single-loop QPADM algorithm that further improves on the computational efficiency of the
QR-ADMM. Both QR-ADMM and QPADM enjoy flexible parallelization by enabling data
splitting across both sample space and feature space, which makes them especially appealing
for the case when both sample size n and feature dimension p are large.
Besides the QR-ADMM and QPADM algorithms for penalized quantile regression, we
also develop a group variable selection method by approximating the Bayesian information
criterion. Unlike existing penalization methods for feature selection, our proposed gMIC
algorithm is free of parameter tuning and hence enjoys greater computational efficiency.
Although the current version of gMIC focuses on the generalized linear model, it can be
naturally extended to the quantile regression for feature selection.
We provide theoretical analysis for our proposed methods. Specifically, we conduct numerical
convergence analysis for the QR-ADMM and QPADM algorithms, and provide
asymptotic theory and the oracle property of feature selection for the gMIC method. All
our methods are evaluated with simulation studies and real data analysis.
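
The abstract above mentions ADMM subproblems with exact closed-form solutions. One standard closed-form building block in ADMM treatments of quantile regression is the proximal operator of the check (pinball) loss, sketched below. This is a well-known identity shown for illustration; whether QPADM uses exactly this update is an assumption, and the names `check_loss`, `prox_check_loss`, `tau`, and `alpha` are hypothetical.

```python
def check_loss(r, tau):
    """Quantile (pinball) loss: rho_tau(r) = r * (tau - 1[r < 0])."""
    return r * (tau - (1.0 if r < 0 else 0.0))

def prox_check_loss(v, tau, alpha):
    """Closed-form argmin over r of rho_tau(r) + (alpha / 2) * (r - v)**2.

    The solution shifts v toward zero by tau / alpha when v is large and
    positive, by (1 - tau) / alpha when v is large and negative, and is
    exactly zero in between.
    """
    return v - max((tau - 1.0) / alpha, min(v, tau / alpha))

def objective(r, v, tau, alpha):
    """Proximal objective, used only to sanity-check the closed form."""
    return check_loss(r, tau) + 0.5 * alpha * (r - v) ** 2

for v in (2.0, 0.2, -2.0):
    print(v, prox_check_loss(v, tau=0.5, alpha=1.0))
```

Because the update is an elementwise clamp, it needs no inner iterative solver, which is the kind of saving a single-loop scheme relies on.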
Using Pythagorean Fuzzy Sets (PFS) in Multiple Criteria Group Decision Making (MCGDM) Methods for Engineering Materials Selection Applications
The process of material selection is very critical during the initial stages of designing manufactured products. Inefficient decision-making outcomes in the material selection process could result in poor quality of products and unnecessary costs. In the last century, numerous materials have been developed for manufacturing mechanical components in different industries. Many of these new materials are similar in their properties and performances, thus creating great challenges for designers and engineers to make accurate selections. Our main objective in this work is to assist decision makers (DMs) within the manufacturing field to evaluate material alternatives and to select the best alternative for specific manufacturing purposes.
In this research, new hybrid fuzzy Multiple Criteria Group Decision Making (MCGDM) methods are proposed for the material selection problem. The proposed methods tackle some challenges that are associated with the material selection decision making process, such as aggregating DMs’ decisions appropriately and modeling uncertainty. In the proposed hybrid models, a novel aggregation approach is developed to convert DMs’ crisp decisions to Pythagorean fuzzy sets (PFS). This approach gives more flexibility to DMs to express their opinions than traditional fuzzy sets and intuitionistic fuzzy sets (IFS). Then, the proposed aggregation approach is integrated with a ranking method to solve the Pythagorean Fuzzy Multiple Criteria Group Decision Making (PFMCGDM) problem and rank the material alternatives. The ranking methods used in the hybrid models are the Pythagorean Fuzzy TOPSIS (The Technique for Order of Preference by Similarity to Ideal Solution) and Pythagorean Fuzzy COPRAS (COmplex PRoportional ASsessment). TOPSIS and COPRAS are selected based on their effectiveness and practicality in dealing with the nature of material selection problems.
In the aggregation approach, the Sugeno fuzzy measure and the Shapley value are used to fairly distribute the DMs’ weights in the Pythagorean fuzzy numbers. Additionally, new functions to calculate uncertainty from DMs’ recommendations are developed using the Takagi-Sugeno approach. The literature reveals some work on these methods, but to our knowledge, there are no published works that integrate the proposed aggregation approach with the selected MCDM ranking methods under the Pythagorean fuzzy environment for use in material selection problems. Furthermore, the proposed methods might be applied, due to their novelty, to any MCDM problem in other areas.
A practical validation of the proposed hybrid PFMCGDM methods is investigated through a case study of material selection for high pressure turbine blades in jet engines. The main objectives of the case study were: 1) to investigate the newly developed aggregation approach in converting real DMs’ crisp decisions into Pythagorean fuzzy numbers; 2) to test the applicability of both the hybrid PFMCGDM TOPSIS and the hybrid PFMCGDM COPRAS methods in the field of material selection.
In this case study, a group of five faculty members and graduate students from the Materials Science and Engineering Department at the University of Wisconsin-Milwaukee were selected to participate as DMs. Their evaluations fulfilled the first objective of the case study. A computer application for material selection was developed to assist designers and engineers in real life problems. A comparative analysis was performed to compare the results of both hybrid MCGDM methods. A sensitivity analysis was conducted to show the robustness and reliability of the outcomes obtained from both methods. It is concluded that the proposed hybrid PFMCGDM TOPSIS method is more effective and practical in the material selection process than the proposed hybrid PFMCGDM COPRAS method. Additionally, recommendations for further research are suggested.
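
For readers unfamiliar with TOPSIS, the ranking step named in the abstract above can be sketched in its classical crisp form: normalize and weight the decision matrix, locate the ideal and anti-ideal alternatives, and score each alternative by its relative closeness to the ideal. This is the textbook crisp method, not the Pythagorean fuzzy hybrid the thesis proposes. The candidate materials, criteria, weights, and values below are invented for illustration.

```python
from math import sqrt

def topsis(matrix, weights, benefit):
    """Classical crisp TOPSIS closeness scores (higher is better).

    matrix[i][j] is alternative i on criterion j; benefit[j] is True when
    larger values are preferred and False for cost-type criteria.
    Assumes the alternatives are not all identical, so distances are nonzero.
    """
    n_crit = len(weights)
    # Vector-normalize each criterion column, then apply the weights.
    norms = [sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    cols = list(zip(*v))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in v:
        d_pos = sqrt(sum((x - p) ** 2 for x, p in zip(row, ideal)))
        d_neg = sqrt(sum((x - n) ** 2 for x, n in zip(row, anti)))
        scores.append(d_neg / (d_pos + d_neg))
    return scores

# Hypothetical candidates scored on strength (benefit) and cost (cost).
materials = ["alloy A", "alloy B", "alloy C"]
matrix = [[900, 50], [700, 30], [900, 30]]
scores = topsis(matrix, weights=[0.6, 0.4], benefit=[True, False])
for name, score in sorted(zip(materials, scores), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```

In the fuzzy variants, the crisp entries and distances are replaced by fuzzy numbers and fuzzy distance measures, while the closeness-to-ideal ranking idea stays the same.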