8 research outputs found

    Handling Imbalanced Classes: Feature Based Variance Ranking Techniques for Classification

    Obtaining good predictions in the presence of imbalanced classes poses significant challenges for the data science community. Imbalanced class data describes a situation in which the classes or groups in a dataset contain unequal numbers of instances. In most real-life datasets one class is much larger than the others and is called the majority class, while the smaller classes are called minority classes. Even when classification accuracy is very high, the number of correctly classified minority instances is usually small compared with the total number of minority instances in the dataset, and more often than not the minority classes are what is being sought. This work is specifically concerned with providing techniques that improve classification performance by eliminating or reducing the negative effects of class imbalance. Real-life datasets have been found to contain different types of error in combination with class imbalance. While these errors are easily corrected, solutions to class imbalance have remained elusive. Previously, machine learning (ML) techniques have been used to address class imbalance, but notable shortcomings have been identified. Mostly, these techniques involve fine-tuning and changing the parameters of the algorithms, a process that is not standardised because of the countless algorithms and parameters available. In general, the results obtained from these unstandardised ML techniques are very inconsistent and cannot be replicated with similar datasets and algorithms. We present a novel technique for dealing with imbalanced classes, called variance ranking feature selection, that enables machine learning algorithms to classify more of the minority classes correctly, hence reducing the negative effects of class imbalance. Our approach utilises an intrinsic property of the datasets, the variance, which measures the dispersion of the data items within the dataset's vector space. We demonstrate the selection of features at different performance thresholds, thereby providing an opportunity for performance and feature significance to be assessed and correlated at different levels of prediction. In the evaluation we compared our feature selection with some of the best-known feature selection techniques using proximity distance comparison techniques, and verified all the results with different datasets, both binary and multi-class, with varying degrees of class imbalance. In all the experiments, the results showed a significant improvement over previous work on class imbalance.
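The abstract above does not say how the variance ranking is computed, so the following is only a minimal sketch of the general idea of ranking features by variance. The min-max scaling step, the use of overall (not per-class) variance, and the function names `rank_features_by_variance` and `select_top_features` are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

def rank_features_by_variance(X):
    """Return (order, variances): feature indices from highest to lowest variance.

    Assumption: columns are min-max scaled first so that variances are
    comparable across features measured in different units.
    """
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    span = X.max(axis=0) - col_min
    span[span == 0] = 1.0  # constant columns scale to all zeros, variance 0
    scaled = (X - col_min) / span
    variances = scaled.var(axis=0)
    # Stable sort on the negated variances gives a descending ranking.
    order = np.argsort(-variances, kind="stable")
    return order, variances

def select_top_features(X, k):
    """Keep the k highest-variance columns of X (hypothetical helper)."""
    order, _ = rank_features_by_variance(X)
    return np.asarray(X, dtype=float)[:, order[:k]]
```

Varying `k` would be one way to look at performance at different selection thresholds, as the abstract describes, although the thresholds used in the work itself are not given here.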

    Towards the Next Generation of Clinical Decision Support: Overcoming the Integration Challenges of Genomic Data and Electronic Health Records

    The wide adoption of electronic health records (EHRs), the unprecedented abundance of genomic data, and rapid advances in computational methods have paved the way for next generation clinical decision support (NGCDS) systems. NGCDS provides significant opportunities for the prevention, early detection, and personalized treatment of complex diseases. Integrating genomic and EHR data into the NGCDS workflow faces significant challenges because of the high complexity and sheer magnitude of the associated data. This dissertation performs an in-depth investigation to address the computational and algorithmic challenges of integrating genomic and EHR data within the NGCDS workflow. In particular, the dissertation (i) defines the major genomic challenges NGCDS faces and discusses possible resolution directions, (ii) proposes an accelerated method for processing raw genomic data, (iii) introduces a data representation and compression method to store the processed genomic outcomes in a database schema, and finally, (iv) investigates the feasibility of using EHR data to produce accurate disease risk assessments. We hope that the proposed solutions will expedite the adoption of NGCDS and help advance the state of healthcare.

    Detection of somatic variants from genomic data and their role in neurodegenerative diseases

    Somatic mutations are those that arise after the zygote is formed and are therefore inherited by only a fraction of the cells of an individual. Their relevance in certain skin diseases has been known for almost half a century, and cancer, the most common disease caused by somatic mutations, has been extensively studied. Yet their prevalence in healthy individuals, as well as their putative role in other human disorders such as neurodegenerative diseases, are still unanswered questions. Furthermore, accurate detection of somatic variants from bulk sequencing data still poses a technical challenge. This work focuses on detecting and circumventing the biases that hinder their identification. Using this knowledge, we identified somatic point mutations in the exomes of five different tissues from sporadic Parkinson disease patients. We also assessed the detection of somatic copy number variants from array CGH data using two tissues from Alzheimer disease patients. Finally, we participated in the identification of somatic variants in an extensive genomic dataset from a neurotypical individual.

    A MapReduce approach to diminish imbalance parameters for big deoxyribonucleic acid dataset.

    BACKGROUND: In the age of the information superhighway, big data plays a significant role in information processing, extraction, retrieval and management. In computational biology, the continuing challenge is to manage biological data. Data mining techniques are sometimes inadequate for new space and time requirements, yet it is critical to process massive amounts of data to retrieve knowledge. The existing software and automated tools for handling big data sets are not sufficient. As a result, an expandable mining technique that harnesses the large storage and processing capability of distributed or parallel processing platforms is essential. METHOD: In this analysis, a distributed clustering methodology for imbalanced data reduction using a k-nearest neighbor (K-NN) classification approach is introduced. The main objective of this work is to represent real training data sets with a reduced number of elements or instances. These reduced data sets ensure faster data classification and standard storage management with less sensitivity. However, general data reduction methods cannot manage very big data sets. To minimize these difficulties, a MapReduce-oriented framework is designed using various clusters of automated contents, comprising multiple algorithmic approaches. RESULTS: To test the proposed approach, a real DNA (deoxyribonucleic acid) dataset consisting of 90 million pairs was used. The proposed model reduces the imbalanced data sets drawn from large-scale data sets without loss of accuracy. CONCLUSIONS: The results show that the MapReduce-based K-NN classifier provided accurate results for big DNA data.
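The abstract above does not give the details of its reduction algorithm, so the sketch below only illustrates the general map/reduce pattern for instance reduction. It swaps in Hart's condensed nearest neighbour rule (1-NN) as the per-partition reducer and simulates the map and reduce phases in a single process; the function names `condense` and `map_reduce_condense` and the partitioning scheme are assumptions for illustration, not the paper's framework.

```python
import numpy as np

def condense(X, y):
    """Hart-style condensed nearest neighbour on one partition.

    Keeps an instance only if the instances kept so far would
    misclassify it under the 1-NN rule.
    """
    keep = [0]  # seed the kept set with the first instance
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            dists = np.linalg.norm(X[keep] - X[i], axis=1)
            nearest = keep[int(np.argmin(dists))]
            if y[nearest] != y[i]:
                keep.append(i)
                changed = True
    return X[keep], y[keep]

def map_reduce_condense(X, y, n_partitions):
    """Simulated MapReduce: condense each partition, then merge the results."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Map phase: each partition is reduced independently of the others.
    parts = np.array_split(np.arange(len(X)), n_partitions)
    mapped = [condense(X[idx], y[idx]) for idx in parts if len(idx) > 0]
    # Reduce phase: concatenate the per-partition reduced sets.
    X_red = np.concatenate([m[0] for m in mapped])
    y_red = np.concatenate([m[1] for m in mapped])
    return X_red, y_red
```

Because each partition is condensed without seeing the others, the map phase could in principle run on separate workers; the merged set is usually larger than what a single global pass would keep, which is the price paid for that independence.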

    Genetic determinants of clinical heterogeneity in sickle cell disease

    Sickle cell disease is a monogenic disease caused by a mutation in the β-globin locus. Although it is monogenic, it shows high clinical heterogeneity. Environmental and genetic factors are thought to play a role in this heterogeneity. It has been observed that high fetal hemoglobin (HbF) levels correlate with reduced severity and mortality in patients with sickle cell disease. The goal of my project was to identify genetic modifiers of the clinical severity of sickle cell disease. First, I performed fine-mapping of three regions previously associated with HbF levels. Second, I performed genome-wide association studies with two clinical complications of sickle cell disease as well as with HbF levels. Since no new loci reached array-wide significance for HbF levels, I performed a pathway analysis to identify additional HbF loci of smaller effect size that might implicate shared biological processes. Finally, I analysed 19 whole exomes from Jamaican sickle cell disease patients with very mild complications. In conclusion, given the sample size of the replication cohorts available, we do not currently have the means to statistically validate the association signals. However, these results provide good candidate genes for functional studies and for future replication. Our results also suggest that β-hydroxybutyrate at endogenous levels could influence HbF levels. Furthermore, we show that fine-mapping the loci associated in genome-wide association studies can identify additional signals and increase the explained heritable variation.

    Pertanika Journal of Science & Technology


    Pertanika Journal of Science & Technology
