3,287 research outputs found

    Ensemble feature learning of genomic data using support vector machine

    Full text link
    © 2016 Anaissi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. The identification of a subset of genes having the ability to capture the necessary information to distinguish classes of patients is crucial in bioinformatics applications. Ensemble and bagging methods have been shown to work effectively in the process of gene selection and classification. Testament to that is random forest which combines random decision trees with bagging to improve overall feature selection and classification accuracy. Surprisingly, the adoption of these methods in support vector machines has only recently received attention but mostly on classification not gene selection. This paper introduces an ensemble SVM-Recursive Feature Elimination (ESVM-RFE) for gene selection that follows the concepts of ensemble and bagging used in random forest but adopts the backward elimination strategy which is the rationale of RFE algorithm. The rationale behind this is, building ensemble SVM models using randomly drawn bootstrap samples from the training set, will produce different feature rankings which will be subsequently aggregated as one feature ranking. As a result, the decision for elimination of features is based upon the ranking of multiple SVM models instead of choosing one particular model. Moreover, this approach will address the problem of imbalanced datasets by constructing a nearly balanced bootstrap sample. Our experiments show that ESVM-RFE for gene selection substantially increased the classification performance on five microarray datasets compared to state-of-the-art methods. Experiments on the childhood leukaemia dataset show that an average 9% better accuracy is achieved by ESVM-RFE over SVM-RFE, and 5% over random forest based approach. The selected genes by the ESVM-RFE algorithm were further explored with Singular Value Decomposition (SVD) which reveals significant clusters with the selected data

    Clustering and Alignment of Polymorphic Sequences for HLA-DRB1 Genotyping

    Get PDF
    Located on Chromosome 6p21, classical human leukocyte antigen genes are highly polymorphic. HLA alleles associate with a variety of phenotypes, such as narcolepsy, autoimmunity, as well as immunologic response to infectious disease. Moreover, high resolution genotyping of these loci is critical to achieving long-term survival of allogeneic transplants. Development of methods to obtain high resolution analysis of HLA genotypes will lead to improved understanding of how select alleles contribute to human health and disease risk. Genomic DNAs were obtained from a cohort of n = 383 subjects recruited as part of an Ulcerative Colitis study and analyzed for HLA-DRB1. HLA genotypes were determined using sequence specific oligonucleotide probes and by next-generation sequencing using the Roche/454 GSFLX instrument. The Clustering and Alignment of Polymorphic Sequences (CAPSeq) software application was developed to analyze next-generation sequencing data. The application generates HLA sequence specific 6-digit genotype information from next-generation sequencing data using MUMmer to align sequences and the R package diffusionMap to classify sequences into their respective allelic groups. The incorporation of Bootstrap Aggregating, Bagging to aid in sorting of sequences into allele classes resulted in improved genotyping accuracy. Using Bagging iterations equal to 60, the genotyping results obtained using CAPSeq when compared with sequence specific oligonucleotide probe characterized 4-digit genotypes exhibited high rates of concordance, matching at 759 out of 766 (99.1%) alleles. © 2013 Ringquist et al

    Classification hardness for supervised learners on 20 years of intrusion detection data

    Get PDF
    This article consolidates analysis of established (NSL-KDD) and new intrusion detection datasets (ISCXIDS2012, CICIDS2017, CICIDS2018) through the use of supervised machine learning (ML) algorithms. The uniformity in analysis procedure opens up the option to compare the obtained results. It also provides a stronger foundation for the conclusions about the efficacy of supervised learners on the main classification task in network security. This research is motivated in part to address the lack of adoption of these modern datasets. Starting with a broad scope that includes classification by algorithms from different families on both established and new datasets has been done to expand the existing foundation and reveal the most opportune avenues for further inquiry. After obtaining baseline results, the classification task was increased in difficulty, by reducing the available data to learn from, both horizontally and vertically. The data reduction has been included as a stress-test to verify if the very high baseline results hold up under increasingly harsh constraints. Ultimately, this work contains the most comprehensive set of results on the topic of intrusion detection through supervised machine learning. Researchers working on algorithmic improvements can compare their results to this collection, knowing that all results reported here were gathered through a uniform framework. This work's main contributions are the outstanding classification results on the current state of the art datasets for intrusion detection and the conclusion that these methods show remarkable resilience in classification performance even when aggressively reducing the amount of data to learn from

    Predicting Community Evolution in Social Networks

    Full text link
    Nowadays, sustained development of different social media can be observed worldwide. One of the relevant research domains intensively explored recently is analysis of social communities existing in social media as well as prediction of their future evolution taking into account collected historical evolution chains. These evolution chains proposed in the paper contain group states in the previous time frames and its historical transitions that were identified using one out of two methods: Stable Group Changes Identification (SGCI) and Group Evolution Discovery (GED). Based on the observed evolution chains of various length, structural network features are extracted, validated and selected as well as used to learn classification models. The experimental studies were performed on three real datasets with different profile: DBLP, Facebook and Polish blogosphere. The process of group prediction was analysed with respect to different classifiers as well as various descriptive feature sets extracted from evolution chains of different length. The results revealed that, in general, the longer evolution chains the better predictive abilities of the classification models. However, chains of length 3 to 7 enabled the GED-based method to almost reach its maximum possible prediction quality. For SGCI, this value was at the level of 3 to 5 last periods.Comment: Entropy 2015, 17, 1-x manuscripts; doi:10.3390/e170x000x 46 page

    Author Profiling for English and Arabic Emails

    Get PDF
    This paper reports on some aspects of a research project aimed at automating the analysis of texts for the purpose of author profiling and identification. The Text Attribution Tool (TAT) was developed for the purpose of language-independent author profiling and has now been trained on two email corpora, English and Arabic. The complete analysis provides probabilities for the author’s basic demographic traits (gender, age, geographic origin, level of education and native language) as well as for five psychometric traits. The prototype system also provides a probability of a match with other texts, whether from known or unknown authors. A very important part of the project was the data collection and we give an overview of the collection process as well as a detailed description of the corpus of email data which was collected. We describe the overall TAT system and its components before outlining the ways in which the email data is processed and analysed. Because Arabic presents particular challenges for NLP, this paper also describes more specifically the text processing components developed to handle Arabic emails. Finally, we describe the Machine Learning setup used to produce classifiers for the different author traits and we present the experimental results, which are promising for most traits examined.The work presented in this paper was carried out while the authors were working at Appen Pty Ltd., Chatswood NSW 2067, Australi

    Author Profiling for English and Arabic Emails

    Get PDF
    This paper reports on some aspects of a research project aimed at automating the analysis of texts for the purpose of author profiling and identification. The Text Attribution Tool (TAT) was developed for the purpose of language-independent author profiling and has now been trained on two email corpora, English and Arabic. The complete analysis provides probabilities for the author’s basic demographic traits (gender, age, geographic origin, level of education and native language) as well as for five psychometric traits. The prototype system also provides a probability of a match with other texts, whether from known or unknown authors. A very important part of the project was the data collection and we give an overview of the collection process as well as a detailed description of the corpus of email data which was collected. We describe the overall TAT system and its components before outlining the ways in which the email data is processed and analysed. Because Arabic presents particular challenges for NLP, this paper also describes more specifically the text processing components developed to handle Arabic emails. Finally, we describe the Machine Learning setup used to produce classifiers for the different author traits and we present the experimental results, which are promising for most traits examined.The work presented in this paper was carried out while the authors were working at Appen Pty Ltd., Chatswood NSW 2067, Australi
    • …
    corecore