
    Evaluation of Feature Selection Techniques for Breast Cancer Risk Prediction

    Breast cancer is a major public health problem. This study evaluates several feature ranking techniques, together with machine-learning classifiers, to identify the factors most relevant to breast cancer risk and to improve the performance of risk prediction models in a healthy population. The dataset, comprising 919 cases and 946 controls, comes from the MCC-Spain study and includes only environmental and genetic features. Our aim is to analyze which factors in the cancer risk prediction model are the most important for breast cancer prediction. Likewise, quantifying the stability of feature selection methods is essential before trying to gain insight into the data. This paper assesses several feature selection algorithms in terms of the performance of a set of predictive models, and quantifies their robustness by analyzing both the similarity between the feature selection rankings and their individual stability. The ranking provided by the SVM-RFE approach leads to the best performance in terms of the area under the ROC curve (AUC): the top 47 ranked features obtained with this approach, fed to a Logistic Regression classifier, achieve an AUC of 0.616, an improvement of 5.8% over the full feature set. Furthermore, the SVM-RFE ranking technique turned out to be highly stable (as did Random Forest), whereas Relief and the wrapper approaches are quite unstable.
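The pipeline described above can be sketched as follows. This is a minimal illustration on synthetic data (not the MCC-Spain dataset): rank features with SVM-RFE, then evaluate a Logistic Regression classifier on the top-k ranked features via AUC. All dataset parameters here are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the case-control data.
X, y = make_classification(n_samples=600, n_features=100,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# SVM-RFE: recursively eliminate features using a linear SVM's weights.
ranker = RFE(LinearSVC(dual=False), n_features_to_select=1, step=1)
ranker.fit(X_tr, y_tr)
order = np.argsort(ranker.ranking_)   # best-ranked features first

# Evaluate AUC using only the top-k ranked features (k = 47 in the study).
k = 47
clf = LogisticRegression(max_iter=1000).fit(X_tr[:, order[:k]], y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te[:, order[:k]])[:, 1])
print(f"AUC with top-{k} features: {auc:.3f}")
```

The same loop over k, repeated for each ranking method, is what allows the performance of the different rankings to be compared on an equal footing.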
This study demonstrates that the stability and the performance of the model should be studied together, as Random Forest and SVM-RFE turned out to be the most stable algorithms, but in terms of model performance SVM-RFE outperforms Random Forest. The study was partially funded by the "Accion Transversal del Cancer", approved by the Spanish Ministry Council on 11 October 2007; by the Instituto de Salud Carlos III-FEDER (PI08/1770, PI08/0533, PI08/1359, PS09/00773, PS09/01286, PS09/01903, PS09/02078, PS09/01662, PI11/01403, PI11/01889, PI11/00226, PI11/01810, PI11/02213, PI12/00488, PI12/00265, PI12/01270, PI12/00715, PI12/00150); by the Fundación Marqués de Valdecilla (API 10/09); by the ICGC International Cancer Genome Consortium CLL; by the Junta de Castilla y León (LE22A10-2); by the Consejería de Salud of the Junta de Andalucía (PI-0571); by the Conselleria de Sanitat of the Generalitat Valenciana (AP 061/10); by Recercaixa (2010ACUP 00310); by the Regional Government of the Basque Country; by European Commission grant FOOD-CT-2006-036224-HIWATE; by the Spanish Association Against Cancer (AECC) Scientific Foundation; and by the Catalan Government DURSI grant 2009SGR1489. Samples: biological samples were stored at the Parc de Salut MAR Biobank (MARBiobanc; Barcelona), which is supported by Instituto de Salud Carlos III-FEDER (RD09/0076/00036), as well as at the Public Health Laboratory of Gipuzkoa and the Basque Biobank. Sample collection was also supported by the Xarxa de Bancs de Tumors de Catalunya, sponsored by the Pla Director d'Oncologia de Catalunya (XBTC). Biological samples were also stored at the "Biobanco La Fe", which is supported by Instituto de Salud Carlos III (RD09/0076/00021), and at FISABIO biobanking, which is supported by Instituto de Salud Carlos III (RD09/0076/00058).

    Prediction of dyslipidemia using gene mutations, family history of diseases and anthropometric indicators in children and adolescents: The CASPIAN-III study

    Dyslipidemia, a disorder of lipoprotein metabolism resulting in a high lipid profile, is an important modifiable risk factor for coronary heart disease and is associated with more than four million deaths worldwide per year. Half of children with dyslipidemia have hyperlipidemia in adulthood, so its prediction and screening are critical. We designed a new dyslipidemia diagnosis system. A sample of 725 subjects (age 14.66 ± 2.61 years; 48% male; dyslipidemia prevalence of 42%) was selected by multistage random cluster sampling in Iran. Single nucleotide polymorphisms (rs1801177, rs708272, rs320, rs328, rs2066718, rs2230808, rs5880, rs5128, rs2893157, rs662799, and Apolipoprotein-E2/E3/E4), along with anthropometric and lifestyle attributes and family history of diseases, were analyzed. A framework for classifying mixed-type data in imbalanced datasets was proposed. It includes internal feature mapping and selection, re-sampling, an optimized group method of data handling (GMDH) using convex and stochastic optimization, a new cost function for imbalanced data, and internal validation. Its performance was assessed using hold-out and 4-fold cross-validation. Four other classifiers, namely support vector machines, decision tree, multilayer perceptron neural network, and multiple logistic regression, were also used. The average sensitivity, specificity, precision, and accuracy of the proposed system in cross-validation were 93%, 94%, 94%, and 92%, respectively. It significantly outperformed the other classifiers and also showed excellent agreement and high correlation with the gold standard. A non-invasive, economical version of the algorithm suitable for low- and middle-income countries was also implemented. The system is thus a promising new tool for the prediction of dyslipidemia.
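The re-sampling step of a framework like the one above can be illustrated with the simplest variant: random oversampling of the minority class before fitting a classifier, then reporting sensitivity and specificity. This is not the authors' optimized GMDH framework, only a hedged sketch of the general idea on synthetic data with a comparable class imbalance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic data with roughly the 58/42 class split reported in the study.
X, y = make_classification(n_samples=725, weights=[0.58, 0.42], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Re-sampling: duplicate random minority-class rows until classes balance.
minority = np.flatnonzero(y_tr == 1)
extra = rng.choice(minority, size=(y_tr == 0).sum() - minority.size, replace=True)
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])

clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
sensitivity, specificity = tp / (tp + fn), tn / (tn + fp)
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

Without the balancing step, a classifier trained on imbalanced data tends to favor the majority class, which is exactly what sensitivity/specificity reporting (rather than plain accuracy) exposes.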

    Designing Data-Driven Learning Algorithms: A Necessity to Ensure Effective Post-Genomic Medicine and Biomedical Research

    Advances in sequencing technology have significantly contributed to shaping the field of genetics and enabled the identification of genetic variants associated with complex traits through genome-wide association studies. This has provided insights into genetic medicine, in which genetic factors influence variability in disease and treatment outcomes. On the other hand, the missing or hidden heritability suggests that the host's quality of life and other environmental factors may also influence differences in disease risk and drug/treatment response in genomic medicine and orient biomedical research, even though this may be highly constrained by genetic capabilities. Combining these different factors is expected to yield a paradigm shift toward personalized medicine and lead to more effective medical treatment. With existing "big data" initiatives and high-performance computing infrastructures, there is a need for data-driven learning algorithms and models that enable the selection and prioritization of relevant genetic variants (post-genomic medicine) and trigger effective translation into clinical practice. In this chapter, we survey and discuss existing machine learning algorithms and post-genomic analysis models supporting the process of identifying valuable markers.

    Feature subset selection using support-vector machines by averaging over probabilistic genotype data

    Despite the grand promises of the postgenomic era, such as personalized prevention, diagnosis, drugs, and treatments, the landscape of biomedicine looks more and more complex. The fulfillment of these promises for diseases significant to public health requires new approaches to induction for statistical and causal inference from observations and interventions. Within the biomedical world, an important response to this challenge is the mapping and relatively cheap measurement of genetic variations such as single nucleotide polymorphisms (SNPs). This mapping has opened a new dimension in postgenomic research at all phenotypic levels (genomic, proteomic, and clinical) and has sparked a series of Genetic Association Studies (GAS) based on the application of machine learning and data mining techniques. These studies face serious problems, most notably the very high dimensionality of the data relative to the number of available samples, and different strategies are being investigated within the research community to overcome them. The aim of this thesis is to contribute to progress in this field by taking a step toward a solution. I have investigated the machine learning and data mining algorithms suitable for this task and the state of the art of their currently available implementations for biomedical research applications. As a result, I have proposed a solution strategy, and chosen and extended the functionality of the Java-ML library, an open source machine learning library written in Java, implementing some missing algorithms and functionality necessary for the proposed approach. This thesis is structured into three main blocks. Section 3, "An approach to the use of machine learning techniques with genotype data", addresses the problem and the proposed solution.
It begins with the definition of some introductory GAS concepts and the description of the solution strategy, and elaborates in subsequent subsections on the theoretical underpinnings of the algorithms making up the solution. Specifically, the first subsection, "The feature selection problem in the bioinformatics domain", justifies the necessity of reducing the dimensionality of data sets in order to achieve acceptable performance when applying machine learning techniques to bioinformatics problems, and establishes a comparative taxonomy of the currently available techniques. The second subsection, "Feature selection using support-vector machines", presents the idea behind support-vector machine classifiers and their application to feature subset selection, while the third subsection, "Ranking fusion as averaging technique: Markov chain based algorithms", describes the ranking fusion algorithms whose implementations have been chosen for combining the feature subsets obtained from different data sets. Section 4, "Analysis of available tools for experimental design", analyzes the tools suitable for experimental design in GAS based on machine learning techniques. Its first subsection, "Advantages of high level languages for machine learning algorithms", discusses the convenience of using high-level languages for this kind of application. The second subsection, "Machine learning algorithms implementations in Java", justifies the choice of the Java language, followed by an analysis of the currently available implementations of machine learning algorithms in this language that are worth considering for our purposes, namely WEKA, RapidMiner, and Java-ML.
Section 5, "Implemented extensions to the Java-ML library", describes the functionality added to provide a framework suitable for designing GAS experiments in order to test the proposed approach. The "Missing values imputation: the dataset.tools package" subsection focuses on data set handling functionality, while the "Averaging through ranking fusion: rankingfusion and rankingfusion.scoring package" subsection details the ranking fusion algorithm implementations. Finally, the "How to use the code" subsection is a tutorial on using both the library and its extension for developing applications. In addition to these main blocks, a final section, "Future Work", reflects on how the developed work can be used by GAS domain experts to evaluate the usefulness of the proposed technique.
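The averaging idea at the heart of the thesis, combining feature rankings obtained from different datasets into one consensus ranking, can be sketched in a few lines. The thesis uses Markov-chain based fusion; the mean-rank (Borda-style) aggregation below is a deliberately simpler stand-in for illustration, and all names and data here are hypothetical.

```python
import numpy as np

def fuse_rankings(rankings):
    """rankings: list of arrays where rankings[i][j] is the rank of
    feature j in dataset i (0 = best). Returns the consensus feature
    order, best first, by averaging ranks across datasets."""
    mean_rank = np.mean(rankings, axis=0)
    return np.argsort(mean_rank)

# Three datasets produce three (partially disagreeing) rankings of 5 features.
r1 = np.array([0, 1, 2, 3, 4])
r2 = np.array([1, 0, 2, 4, 3])
r3 = np.array([0, 2, 1, 3, 4])
consensus = fuse_rankings([r1, r2, r3])
print(consensus)  # feature indices ordered by average rank
```

Markov-chain fusion methods replace the simple mean with the stationary distribution of a chain whose transitions encode pairwise ranking preferences, which makes them more robust to a single outlier ranking; the interface, a list of rankings in and one consensus ranking out, is the same.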

    Tempering the Adversary: An Exploration into the Applications of Game Theoretic Feature Selection and Regression

    Most modern machine learning algorithms take an average-case approach, in which every data point contributes equally to the fit of a model: the per-data-point error (or loss) is averaged into an overall loss, which an objective function typically minimizes. However, this can be insensitive to valuable outliers. Inspired by game theory, the goal of this work is to explore the utility of incorporating an optimally-playing adversary into feature selection and regression frameworks. The adversary assigns weights to the data elements so as to degrade the modeler's performance in an optimal manner, thereby forcing the modeler to construct a more robust solution. A tuning parameter tempers the power wielded by the adversary, allowing us to explore the spectrum between the average case and the worst case. Because our method is formulated as a linear program, it can be solved efficiently and can accommodate sub-population constraints, a feature that related methods cannot easily implement. The need to generate models while understanding the influence of sub-population constraints is particularly prominent in the biomedical literature, and though our method was developed in response to the ubiquity of sub-population data and outliers in this realm, it is generic and can be applied to data sets that are not exclusively biomedical in nature. We additionally explore the implementation of our method as an adversarial regression problem: instead of a single fitted parameter vector, we provide the user with an ensemble of parameters that can be tuned based on sensitivity to outliers and various sub-population constraints.
Finally, to help foster a better understanding of various data sets, we discuss potential automated applications of our method that enable data scientists to explore underlying relationships and sensitivities that may be a consequence of sub-populations and meaningful outliers.
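The tempering idea can be made concrete with a small sketch. Here the adversary maximizes the weighted loss over the probability simplex with a per-point weight cap of 1/(alpha*n): alpha = 1 forces uniform weights (the average case), while small alpha lets the adversary pile all the mass on the worst points. This CVaR-style cap is one standard way to temper an adversary; it is an illustrative assumption, not the thesis' exact linear-program formulation.

```python
import numpy as np

def adversarial_weights(losses, alpha):
    """Adversary's optimal weights: maximize sum(w * losses) subject to
    w on the simplex and each w_i <= 1/(alpha * n). The solution greedily
    assigns the capped weight to the largest losses first."""
    n = len(losses)
    cap = 1.0 / (alpha * n)
    w = np.zeros(n)
    remaining = 1.0
    for i in np.argsort(losses)[::-1]:   # largest losses first
        w[i] = min(cap, remaining)
        remaining -= w[i]
        if remaining <= 0:
            break
    return w

losses = np.array([0.1, 0.5, 2.0, 0.3])
w_avg = adversarial_weights(losses, alpha=1.0)    # uniform: average case
w_adv = adversarial_weights(losses, alpha=0.25)   # all mass on the worst point
print(w_avg, w_adv)
```

Training then alternates between the adversary choosing weights for the current losses and the modeler refitting against the reweighted objective, so the tuning parameter alpha directly controls how much the fit is driven by the hardest points.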

    The computational hardness of feature selection in strict-pure synthetic genetic datasets

    A common task in knowledge discovery is finding a few features correlated with an outcome in a sea of mostly irrelevant data. This task is particularly formidable in genetic datasets containing thousands to millions of Single Nucleotide Polymorphisms (SNPs) per individual; the goal is to find a small subset of SNPs correlated with whether an individual is sick or healthy (the label). Although determining a correlation between any given SNP (genotype) and a disease label (phenotype) is relatively straightforward, detecting subsets of SNPs whose correlation is only apparent when the whole subset is considered seems to be much harder. In this thesis, we study the computational hardness of this problem, in particular for a widely used method of generating synthetic SNP datasets. More specifically, we consider the feature selection problem in datasets generated by "pure and strict" models, such as those produced by the popular GAMETES software. In these datasets, there is a high correlation between a predefined target set of features (SNPs) and a label; however, any proper subset of the target set appears uncorrelated with the outcome. Our main result is a linear-time, parameter-preserving reduction from the well-known Learning Parity with Noise (LPN) problem to feature selection in such pure and strict datasets. This yields a host of consequences for the complexity of feature selection in this setting. First, not only is the problem NP-hard (even to approximate), it is computationally hard on average under a standard cryptographic assumption on the hardness of learning parity with noise; moreover, it is in general as hard for the uniform distribution as for arbitrary distributions, and as hard for random noise as for adversarial noise.
For worst-case complexity, we obtain a tighter parameterized lower bound: even in the noiseless case, finding a parity of Hamming weight at most k is W[1]-hard when the number of samples is relatively small (logarithmic in the number of features). Finally, and most relevant to the development of feature selection heuristics, by the unconditional hardness of LPN in Kearns' statistical query model, no heuristic that only computes statistics about the samples, rather than considering the samples themselves, can successfully perform feature selection in such pure and strict datasets. This eliminates a large class of common approaches to feature selection.
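The "pure and strict" property that makes this hard can be demonstrated with a toy parity construction: the label is the XOR of a hidden target subset of binary features, plus LPN-style label noise. The full subset determines the label almost perfectly, yet each individual feature, even one inside the target set, is statistically uncorrelated with it. The sizes and noise rate below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20000, 10
target = [2, 5, 7]                      # hidden parity set (the "target set")
X = rng.integers(0, 2, size=(n, d))     # binary features (SNP-like)
noise = rng.random(n) < 0.05            # LPN-style label noise, rate 5%
y = (X[:, target].sum(axis=1) % 2) ^ noise

# A single feature from the target set looks uncorrelated with the label...
single = abs(np.corrcoef(X[:, 2], y)[0, 1])
# ...but the parity of the full target set predicts the label well.
full = np.mean((X[:, target].sum(axis=1) % 2) == y)
print(f"|corr(single, y)| = {single:.3f}, full-subset accuracy = {full:.2f}")
```

This is precisely why statistics-only heuristics fail here: every low-order statistic (single-SNP correlation, pairwise counts over proper subsets) carries no signal, so nothing short of examining candidate subsets as a whole can locate the target set.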

    An Overview on Artificial Intelligence Techniques for Diagnosis of Schizophrenia Based on Magnetic Resonance Imaging Modalities: Methods, Challenges, and Future Works

    Schizophrenia (SZ) is a mental disorder that typically emerges in late adolescence or early adulthood and reduces patients' life expectancy by 15 years. Its most significant symptoms include abnormal behavior, altered perception of emotions, impaired social relationships, and distorted perception of reality. Past studies have revealed that SZ affects the temporal and anterior lobes of the hippocampal regions of the brain. Increased cerebrospinal fluid (CSF) volume and decreased white and gray matter volume can also be observed with this disease. Magnetic resonance imaging (MRI) is a popular neuroimaging technique used to explore structural and functional brain abnormalities in SZ owing to its high spatial resolution. Various artificial intelligence (AI) techniques have been employed with advanced image/signal processing methods to obtain an accurate diagnosis of SZ. This paper presents a comprehensive overview of studies on the automated diagnosis of SZ using MRI modalities, describing the main findings, various challenges, and future work in developing automated SZ detection.