
    Effective selection of informative SNPs and classification on the HapMap genotype data

Background: Since single nucleotide polymorphisms (SNPs) are the genetic variations that distinguish any two unrelated individuals, SNPs can be used to identify the source population of an individual. Efficient population identification with the HapMap genotype data requires as few informative SNPs as possible out of the original 4 million SNPs. Recently, Park et al. (2006) adopted the nearest shrunken centroid method to classify the three populations, i.e., Utah residents with ancestry from Northern and Western Europe (CEU), Yoruba in Ibadan, Nigeria, West Africa (YRI), and Han Chinese in Beijing together with Japanese in Tokyo (CHB+JPT); from the 100,736 SNPs obtained, the top 82 SNPs could completely classify the three populations.

Results: In this paper, we propose to first rank each feature (SNP) using a ranking measure, i.e., a modified t-test or the F-statistic. From the ranking list, we then form feature subsets by sequentially choosing different numbers of top-ranked features (e.g., 1, 2, 3, ..., 100), train and test each subset with a classifier, e.g., the support vector machine (SVM), and thereby find the subset with the highest classification accuracy. Compared with the classification method of Park et al., we obtain a better result, i.e., good classification of the three populations using on average 64 SNPs.

Conclusion: Experimental results show that both the modified t-test and the F-statistic are very effective at ranking SNPs by their classification capability. Combined with the SVM classifier, a desirable feature subset (of minimum size and maximum informativeness) can be found quickly in a greedy manner after ranking all SNPs. Our method is able to identify a very small number of important SNPs that can determine the populations of individuals.
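As an illustration of the rank-then-grow strategy described above, the sketch below (not the authors' code) ranks SNPs with scikit-learn's F-statistic, then evaluates nested top-k subsets with a linear SVM and keeps the smallest subset reaching the best cross-validated accuracy; the genotype matrix X and population labels y are assumed inputs.

```python
# Sketch of the rank-then-grow selection idea (not the authors' code):
# rank SNPs by an F-statistic, then evaluate nested top-k subsets with an
# SVM and keep the smallest subset with the best cross-validated accuracy.
# X (samples x SNPs, numerically encoded genotypes) and y (population
# labels) are assumed inputs.
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def select_top_snps(X, y, max_features=100):
    f_scores, _ = f_classif(X, y)          # per-SNP F-statistic
    ranking = np.argsort(f_scores)[::-1]   # best-ranked SNPs first
    best_k, best_acc = 0, -np.inf
    for k in range(1, max_features + 1):
        subset = ranking[:k]
        acc = cross_val_score(SVC(kernel="linear"), X[:, subset], y, cv=5).mean()
        if acc > best_acc:                 # keep the smallest subset with top accuracy
            best_k, best_acc = k, acc
    return ranking[:best_k], best_acc
```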

    Utilising flow aggregation to classify benign imitating attacks

Cyber-attacks continue to grow, both in terms of volume and sophistication. This is aided by an increase in available computational power, expanding attack surfaces, and advancements in the human understanding of how to make attacks undetectable. Unsurprisingly, machine learning is utilised to defend against these attacks. In many applications, the choice of features is more important than the choice of model. A range of studies have, with varying degrees of success, attempted to discriminate between benign traffic and well-known cyber-attacks. The features used in these studies are broadly similar and have demonstrated their effectiveness in situations where cyber-attacks do not imitate benign behaviour. To overcome this barrier, in this manuscript, we introduce new features based on a higher level of abstraction of network traffic. Specifically, we perform flow aggregation by grouping flows with similarities. This additional level of feature abstraction benefits from cumulative information, thus qualifying the models to classify cyber-attacks that mimic benign traffic. The performance of the new features is evaluated using the benchmark CICIDS2017 dataset, and the results demonstrate their validity and effectiveness. This novel proposal will improve the detection accuracy of cyber-attacks and also build towards a new direction of feature extraction for complex attacks.
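The following sketch illustrates the kind of flow aggregation described above; the grouping key (source IP and destination port) and the derived group-level features are assumptions chosen for illustration, not necessarily the features introduced in the paper.

```python
# Illustrative flow-aggregation sketch (the grouping key and feature names
# are assumptions): flows sharing a key are pooled so that each flow gains
# cumulative, group-level features in addition to its own statistics.
import pandas as pd

def add_aggregated_features(flows: pd.DataFrame) -> pd.DataFrame:
    """flows is assumed to have columns: src_ip, dst_port, duration, bytes, packets."""
    key = ["src_ip", "dst_port"]                     # hypothetical aggregation key
    agg = flows.groupby(key).agg(
        group_flow_count=("bytes", "size"),          # how many flows share the key
        group_total_bytes=("bytes", "sum"),
        group_mean_duration=("duration", "mean"),
        group_mean_packets=("packets", "mean"),
    ).reset_index()
    return flows.merge(agg, on=key, how="left")      # attach group features to each flow
```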

    A random forest approach to the detection of epistatic interactions in case-control studies

Background: Notwithstanding the key roles of epistatic interactions between multiple genetic variants in the pathogenesis of complex diseases, the detection of such interactions remains a great challenge in genome-wide association studies. Although some existing multi-locus approaches have shown success in small-scale case-control data, the "combination explosion" problem prohibits their application to genome-wide analysis. It is therefore indispensable to develop new methods that can reduce the search space for epistatic interactions from an astronomical number of all possible combinations of genetic variants to a manageable set of candidates.

Results: We studied case-control data from the viewpoint of binary classification. More precisely, we treated single nucleotide polymorphism (SNP) markers as categorical features and adopted the random forest to discriminate cases from controls. On the basis of the Gini importance given by the random forest, we designed a sliding window sequential forward feature selection (SWSFS) algorithm to select a small set of candidate SNPs that minimize the classification error, and then statistically tested up to three-way interactions among the candidates. We compared this approach with three existing methods on three simulated disease models and showed that our approach is comparable to, and sometimes more powerful than, the other methods. We applied our approach to a genome-wide case-control dataset for Age-related Macular Degeneration (AMD) and successfully identified two SNPs that have been reported to be associated with this disease.

Conclusion: Beyond existing purely statistical approaches, we demonstrated the feasibility of incorporating machine learning methods into genome-wide case-control studies. The Gini importance offers yet another measure of the associations between SNPs and complex diseases, thereby complementing existing statistical measures to facilitate the identification of epistatic interactions and the understanding of epistasis in the pathogenesis of complex diseases.
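A simplified sketch of the two ingredients described above follows: Gini-importance ranking from a random forest and a window-wise forward selection of candidate SNPs. The window size, stopping rule, and classifier settings are illustrative assumptions rather than the published SWSFS configuration.

```python
# Simplified sketch of ranking SNPs by random-forest Gini importance and
# then growing a candidate set window by window (window size and stopping
# rule here are illustrative assumptions, not the published SWSFS settings).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def swsfs_candidates(X, y, window=5, max_snps=50):
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
    ranking = np.argsort(rf.feature_importances_)[::-1]    # Gini-importance order
    selected, best_err = [], np.inf
    for start in range(0, max_snps, window):
        trial = list(ranking[: start + window])             # extend by one window of SNPs
        err = 1.0 - cross_val_score(
            RandomForestClassifier(n_estimators=200, random_state=0),
            X[:, trial], y, cv=5).mean()
        if err < best_err:                                  # keep the window if error drops
            selected, best_err = trial, err
    return selected, best_err
```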

    Feature Selection Stability and Accuracy of Prediction Models for Genomic Prediction of Residual Feed Intake in Pigs Using Machine Learning

Feature selection (FS, i.e., selection of a subset of predictor variables) is essential in high-dimensional datasets to prevent overfitting of prediction/classification models and reduce computation time and resources. In genomics, FS allows identifying relevant markers and designing low-density SNP chips to evaluate selection candidates. In this research, several univariate and multivariate FS algorithms combined with various parametric and non-parametric learners were applied to the prediction of feed efficiency in growing pigs from high-dimensional genomic data. The objective was to find the best combination of feature selector, SNP subset size, and learner leading to accurate and stable (i.e., less sensitive to changes in the training data) prediction models. Genomic best linear unbiased prediction (GBLUP) without SNP pre-selection was the benchmark. Three types of FS methods were implemented: (i) filter methods: univariate (univ.dtree, spearcor) or multivariate (cforest, mrmr), with random selection as benchmark; (ii) embedded methods: elastic net and least absolute shrinkage and selection operator (LASSO) regression; (iii) combination of filter and embedded methods. Ridge regression, support vector machine (SVM), and gradient boosting (GB) were applied after pre-selection performed with the filter methods. Data represented 5,708 individual records of residual feed intake to be predicted from the animal’s own genotype. Accuracy (stability of results) was measured as the median (interquartile range) of the Spearman correlation between observed and predicted data in a 10-fold cross-validation. The best prediction in terms of accuracy and stability was obtained with SVM and GB using 500 or more SNPs [0.28 (0.02) and 0.27 (0.04) for SVM and GB with 1,000 SNPs, respectively]. With larger subset sizes (1,000–1,500 SNPs), the filter method had no influence on prediction quality, which was similar to that attained with a random selection. With 50–250 SNPs, the FS method had a huge impact on prediction quality: it was very poor for tree-based methods combined with any learner, but good and similar to what was obtained with larger SNP subsets when spearcor or mrmr were implemented with or without embedded methods. Those filters also led to very stable results, suggesting their potential use for designing low-density SNP chips for genome-based evaluation of feed efficiency.
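The pipeline below is a minimal sketch in the spirit of the study, not its exact code: a univariate filter pre-selects a fixed number of SNPs, an SVM regressor is fitted, and accuracy and stability are summarised as the median and interquartile range of the Spearman correlation over 10-fold cross-validation.

```python
# Illustrative filter-then-learner pipeline (not the study's exact code):
# pre-select SNPs with a univariate filter, fit an SVM regressor, and report
# the median and interquartile range of the Spearman correlation in 10-fold CV.
import numpy as np
from scipy.stats import spearmanr
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import KFold
from sklearn.svm import SVR

def cv_spearman(X, y, n_snps=1000, n_folds=10):
    scores = []
    for train, test in KFold(n_folds, shuffle=True, random_state=0).split(X):
        selector = SelectKBest(f_regression, k=n_snps).fit(X[train], y[train])
        model = SVR().fit(selector.transform(X[train]), y[train])
        pred = model.predict(selector.transform(X[test]))
        scores.append(spearmanr(y[test], pred).correlation)
    q1, med, q3 = np.percentile(scores, [25, 50, 75])
    return med, q3 - q1    # median accuracy and its interquartile range (stability)
```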

    Risk estimation and risk prediction using machine-learning methods

After an association between genetic variants and a phenotype has been established, further study goals comprise the classification of patients according to disease risk or the estimation of disease probability. To accomplish this, different statistical methods are required, and specifically machine-learning approaches may offer advantages over classical techniques. In this paper, we describe methods for the construction and evaluation of classification and probability estimation rules. We review the use of machine-learning approaches in this context and explain some of the machine-learning algorithms in detail. Finally, we illustrate the methodology through application to a genome-wide association analysis on rheumatoid arthritis. Electronic supplementary material: The online version of this article (doi:10.1007/s00439-012-1194-y) contains supplementary material, which is available to authorized users.
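As a minimal illustration of the distinction the review draws (not an example taken from the paper), the snippet below evaluates a classification rule by its discrimination (AUC) and the corresponding probability estimates by their calibration (Brier score) on synthetic data.

```python
# Minimal illustration (not from the paper) of evaluating a classification
# rule (discrimination, AUC) versus a probability estimation rule
# (calibration, Brier score) on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
risk = model.predict_proba(X_te)[:, 1]          # estimated disease probability

print("AUC:", roc_auc_score(y_te, risk))        # quality of the classification rule
print("Brier:", brier_score_loss(y_te, risk))   # quality of the probability estimates
```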

    Application of a spatially-weighted Relief algorithm for ranking genetic predictors of disease

Background: Identification of genetic variants that are associated with disease is an important goal in elucidating the genetic causes of diseases. The genetic patterns that are associated with common diseases are complex and may involve multiple interacting genetic variants. The Relief family of algorithms is a powerful tool for efficiently identifying genetic variants that are associated with disease, even if the variants have nonlinear interactions without significant main effects. Many variations of Relief have been developed over the past two decades and several of them have been applied to single nucleotide polymorphism (SNP) data. Results: We developed a new spatially weighted variation of Relief called Sigmoid Weighted ReliefF Star (SWRF*), and applied it to synthetic SNP data. When compared to ReliefF and SURF*, two algorithms that have been applied to SNP data for identifying interactions, SWRF* had significantly greater power. Furthermore, we developed a framework called the Modular Relief Framework (MoRF) that can be used to develop novel variations of the Relief algorithm, and we used MoRF to develop the SWRF* algorithm. Conclusions: MoRF allows easy development of new Relief algorithms by specifying different interchangeable functions for the component terms. Using MoRF, we developed a new Relief algorithm called SWRF* that had greater ability to identify interacting genetic variants in synthetic data compared to existing Relief algorithms.
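The toy scorer below conveys the flavour of a spatially weighted Relief variant: each neighbour's contribution to the SNP weights is scaled by a sigmoid of its distance to the target instance. The abstract does not specify the exact SWRF* weighting, so the sigmoid form and neighbourhood size used here are illustrative assumptions.

```python
# Toy ReliefF-style scorer in which each neighbour's contribution is scaled
# by a sigmoid of its distance to the target instance. The precise SWRF*
# weighting is not given in the abstract, so this sigmoid form is an
# illustrative assumption. X holds numerically encoded SNPs, y the labels.
import numpy as np

def sigmoid_weighted_relief(X, y, k=10):
    n, p = X.shape
    weights = np.zeros(p)
    for i in range(n):
        dists = np.abs(X - X[i]).sum(axis=1).astype(float)   # distance to all instances
        dists[i] = np.inf                                     # exclude the instance itself
        order = np.argsort(dists)
        hits = [j for j in order if y[j] == y[i]][:k]         # nearest same-class neighbours
        misses = [j for j in order if y[j] != y[i]][:k]       # nearest other-class neighbours
        scale = np.median(dists[np.isfinite(dists)])
        for j in hits + misses:
            w = 1.0 / (1.0 + np.exp(dists[j] - scale))        # closer neighbours weigh more
            diff = (X[j] != X[i]).astype(float)               # per-SNP disagreement
            weights += (w * diff) if y[j] != y[i] else -(w * diff)
    return weights / n                                        # higher score = more relevant SNP
```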

    Genotype imputation as a cost-saving genomic strategy for South African Sanga cattle: A review

The South African beef cattle population is heterogeneous and consists of a variety of breeds, production systems and breeding goals. Indigenous cattle breeds are uniquely adapted to their native surroundings, necessitating conservation of these breeds as usable genetic resources to sustain efficient production of beef. Current projections indicate positive growth in human population size, with parallel growth in nutritional demand, in the midst of intensifying environmental conditions. Sanga cattle, therefore, are invaluable assets to the South African beef industry. Modern genomic methodologies allow for an extensive insight into the genome architecture of local breeds. The evolution of these methodologies has also provided opportunities to incorporate deoxyribonucleic acid (DNA) information into breed improvement programs in the form of genomic selection (GS). Certain challenges, such as the high cost of generating adequate numbers of dense genotypic profiles and the introduction of ascertainment bias when non-commercial breeds are genotyped with commercial single nucleotide polymorphism (SNP) panels, have caused a lag in progress on the genomics front in South Africa. Genotype imputation is a statistical method that infers unavailable or missing genotypic data based on shared haplotypes within a population, using a population- or breed-representative reference sample. Genotypes are generated in silico, providing an animal with genotypic information for SNP markers that were not genotyped, based on predictive model-based algorithms. The validation of this method for indigenous breeds will enable the development of cost-effective low-density bead chips, allowing more animals to be genotyped, and imputation to high-density information. The improvement in SNP densities, at lower cost, will allow enhanced power in genome-wide association studies (GWAS) and genomic estimated breeding value (GEBV)-based selection for these breeds. To fully reap the benefits of this methodology, however, will require the setting up of accurate and reliable frameworks that are optimized for its application in Sanga breeds. This review paper aims, first, to identify the challenges that have been impeding genomic applications for Sanga cattle and second, to outline the advantages that a method such as genotype imputation might provide.

Keywords: breed improvement, developing countries, indigenous breeds, genomic
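To convey the idea behind genotype imputation in its simplest form (real imputation software relies on haplotype-based hidden Markov models), the toy sketch below fills an animal's missing genotypes with the values carried by the reference sample that matches it best at the markers that were genotyped.

```python
# Toy illustration of the idea behind genotype imputation (real tools use
# haplotype HMMs; this nearest-reference sketch only conveys the concept):
# a missing genotype is filled with the value carried by the reference
# sample that matches the target best at the markers that were genotyped.
import numpy as np

def impute_missing(target, reference):
    """target: 1-D float array of genotypes with np.nan at untyped SNPs.
    reference: 2-D array (reference samples x SNPs) with no missing values."""
    observed = ~np.isnan(target)
    # similarity = number of matching genotypes at the observed SNPs
    matches = (reference[:, observed] == target[observed]).sum(axis=1)
    best = reference[np.argmax(matches)]
    imputed = target.copy()
    imputed[~observed] = best[~observed]     # borrow genotypes from the best match
    return imputed
```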