
    Effective selection of informative SNPs and classification on the HapMap genotype data

Background: Since single nucleotide polymorphisms (SNPs) are the genetic variations that distinguish any two unrelated individuals, SNPs can be used to identify the source population of an individual. Efficient population identification with the HapMap genotype data requires as few informative SNPs as possible out of the original 4 million SNPs. Recently, Park et al. (2006) adopted the nearest shrunken centroid method to classify the three populations, i.e., Utah residents with ancestry from Northern and Western Europe (CEU), Yoruba in Ibadan, Nigeria, West Africa (YRI), and Han Chinese in Beijing together with Japanese in Tokyo (CHB+JPT); from the 100,736 SNPs obtained, the top 82 SNPs could completely classify the three populations.

Results: In this paper, we propose to first rank each feature (SNP) using a ranking measure, i.e., a modified t-test or the F-statistic. From the ranking list, we then form feature subsets by sequentially choosing different numbers of top-ranked features (e.g., 1, 2, 3, ..., 100), train and test each subset with a classifier, e.g., the support vector machine (SVM), and thereby find the subset with the highest classification accuracy. Compared with the classification method of Park et al., we obtain a better result, i.e., good classification of the three populations using on average 64 SNPs.

Conclusion: Experimental results show that both the modified t-test and the F-statistic are very effective at ranking SNPs by their classification capability. Combined with the SVM classifier, a desirable feature subset (of minimum size and maximum informativeness) can be found quickly in a greedy manner after ranking all SNPs. Our method is able to identify a very small number of important SNPs that can determine the populations of individuals.
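As an illustration of the rank-then-grow strategy described above, the sketch below (not the authors' code) ranks SNPs with scikit-learn's F-statistic, then evaluates nested top-k subsets with a linear SVM and keeps the smallest subset reaching the best cross-validated accuracy; the genotype matrix X and population labels y are assumed inputs.

```python
# Sketch of the rank-then-grow selection idea (not the authors' code):
# rank SNPs by an F-statistic, then evaluate nested top-k subsets with an
# SVM and keep the smallest subset with the best cross-validated accuracy.
# X (samples x SNPs, numerically encoded genotypes) and y (population
# labels) are assumed inputs.
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def select_top_snps(X, y, max_features=100):
    f_scores, _ = f_classif(X, y)          # per-SNP F-statistic
    ranking = np.argsort(f_scores)[::-1]   # best-ranked SNPs first
    best_k, best_acc = 0, -np.inf
    for k in range(1, max_features + 1):
        subset = ranking[:k]
        acc = cross_val_score(SVC(kernel="linear"), X[:, subset], y, cv=5).mean()
        if acc > best_acc:                 # keep the smallest subset with top accuracy
            best_k, best_acc = k, acc
    return ranking[:best_k], best_acc
```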

    Utilising flow aggregation to classify benign imitating attacks

Cyber-attacks continue to grow, both in terms of volume and sophistication. This is aided by an increase in available computational power, expanding attack surfaces, and advancements in the human understanding of how to make attacks undetectable. Unsurprisingly, machine learning is utilised to defend against these attacks. In many applications, the choice of features is more important than the choice of model. A range of studies have, with varying degrees of success, attempted to discriminate between benign traffic and well-known cyber-attacks. The features used in these studies are broadly similar and have demonstrated their effectiveness in situations where cyber-attacks do not imitate benign behaviour. To overcome this barrier, in this manuscript, we introduce new features based on a higher level of abstraction of network traffic. Specifically, we perform flow aggregation by grouping flows with similarities. This additional level of feature abstraction benefits from cumulative information, thus qualifying the models to classify cyber-attacks that mimic benign traffic. The performance of the new features is evaluated using the benchmark CICIDS2017 dataset, and the results demonstrate their validity and effectiveness. This novel proposal will improve the detection accuracy of cyber-attacks and also build towards a new direction of feature extraction for complex attacks.
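The following sketch illustrates the kind of flow aggregation described above; the grouping key (source IP and destination port) and the derived group-level features are assumptions chosen for illustration, not necessarily the features introduced in the paper.

```python
# Illustrative flow-aggregation sketch (the grouping key and feature names
# are assumptions): flows sharing a key are pooled so that each flow gains
# cumulative, group-level features in addition to its own statistics.
import pandas as pd

def add_aggregated_features(flows: pd.DataFrame) -> pd.DataFrame:
    """flows is assumed to have columns: src_ip, dst_port, duration, bytes, packets."""
    key = ["src_ip", "dst_port"]                     # hypothetical aggregation key
    agg = flows.groupby(key).agg(
        group_flow_count=("bytes", "size"),          # how many flows share the key
        group_total_bytes=("bytes", "sum"),
        group_mean_duration=("duration", "mean"),
        group_mean_packets=("packets", "mean"),
    ).reset_index()
    return flows.merge(agg, on=key, how="left")      # attach group features to each flow
```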

    A random forest approach to the detection of epistatic interactions in case-control studies

Background: Notwithstanding the key roles of epistatic interactions between multiple genetic variants in the pathogenesis of complex diseases, the detection of such interactions remains a great challenge in genome-wide association studies. Although some existing multi-locus approaches have shown success in small-scale case-control data, the "combination explosion" problem prohibits their application to genome-wide analysis. It is therefore indispensable to develop new methods that can reduce the search space for epistatic interactions from an astronomical number of all possible combinations of genetic variants to a manageable set of candidates.

Results: We studied case-control data from the viewpoint of binary classification. More precisely, we treated single nucleotide polymorphism (SNP) markers as categorical features and adopted the random forest to discriminate cases from controls. On the basis of the Gini importance given by the random forest, we designed a sliding window sequential forward feature selection (SWSFS) algorithm to select a small set of candidate SNPs that minimize the classification error, and then statistically tested up to three-way interactions among the candidates. We compared this approach with three existing methods on three simulated disease models and showed that our approach is comparable to, and sometimes more powerful than, the other methods. We applied our approach to a genome-wide case-control dataset for Age-related Macular Degeneration (AMD) and successfully identified two SNPs that have been reported to be associated with this disease.

Conclusion: Beyond existing purely statistical approaches, we demonstrated the feasibility of incorporating machine learning methods into genome-wide case-control studies. The Gini importance offers yet another measure of the associations between SNPs and complex diseases, thereby complementing existing statistical measures to facilitate the identification of epistatic interactions and the understanding of epistasis in the pathogenesis of complex diseases.
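A simplified sketch of the two ingredients described above follows: Gini-importance ranking from a random forest and a window-wise forward selection of candidate SNPs. The window size, stopping rule, and classifier settings are illustrative assumptions rather than the published SWSFS configuration.

```python
# Simplified sketch of ranking SNPs by random-forest Gini importance and
# then growing a candidate set window by window (window size and stopping
# rule here are illustrative assumptions, not the published SWSFS settings).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def swsfs_candidates(X, y, window=5, max_snps=50):
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
    ranking = np.argsort(rf.feature_importances_)[::-1]    # Gini-importance order
    selected, best_err = [], np.inf
    for start in range(0, max_snps, window):
        trial = list(ranking[: start + window])             # extend by one window of SNPs
        err = 1.0 - cross_val_score(
            RandomForestClassifier(n_estimators=200, random_state=0),
            X[:, trial], y, cv=5).mean()
        if err < best_err:                                  # keep the window if error drops
            selected, best_err = trial, err
    return selected, best_err
```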

    Feature Selection Stability and Accuracy of Prediction Models for Genomic Prediction of Residual Feed Intake in Pigs Using Machine Learning

Feature selection (FS, i.e., selection of a subset of predictor variables) is essential in high-dimensional datasets to prevent overfitting of prediction/classification models and reduce computation time and resources. In genomics, FS allows identifying relevant markers and designing low-density SNP chips to evaluate selection candidates. In this research, several univariate and multivariate FS algorithms combined with various parametric and non-parametric learners were applied to the prediction of feed efficiency in growing pigs from high-dimensional genomic data. The objective was to find the best combination of feature selector, SNP subset size, and learner leading to accurate and stable (i.e., less sensitive to changes in the training data) prediction models. Genomic best linear unbiased prediction (GBLUP) without SNP pre-selection was the benchmark. Three types of FS methods were implemented: (i) filter methods: univariate (univ.dtree, spearcor) or multivariate (cforest, mrmr), with random selection as benchmark; (ii) embedded methods: elastic net and least absolute shrinkage and selection operator (LASSO) regression; (iii) combination of filter and embedded methods. Ridge regression, support vector machine (SVM), and gradient boosting (GB) were applied after pre-selection performed with the filter methods. Data represented 5,708 individual records of residual feed intake to be predicted from the animal’s own genotype. Accuracy (stability of results) was measured as the median (interquartile range) of the Spearman correlation between observed and predicted data in a 10-fold cross-validation. The best prediction in terms of accuracy and stability was obtained with SVM and GB using 500 or more SNPs [0.28 (0.02) and 0.27 (0.04) for SVM and GB with 1,000 SNPs, respectively]. With larger subset sizes (1,000–1,500 SNPs), the filter method had no influence on prediction quality, which was similar to that attained with a random selection. With 50–250 SNPs, the FS method had a huge impact on prediction quality: it was very poor for tree-based methods combined with any learner, but good and similar to what was obtained with larger SNP subsets when spearcor or mrmr were implemented with or without embedded methods. Those filters also led to very stable results, suggesting their potential use for designing low-density SNP chips for genome-based evaluation of feed efficiency.
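The pipeline below is a minimal sketch in the spirit of the study, not its exact code: a univariate filter pre-selects a fixed number of SNPs, an SVM regressor is fitted, and accuracy and stability are summarised as the median and interquartile range of the Spearman correlation over 10-fold cross-validation.

```python
# Illustrative filter-then-learner pipeline (not the study's exact code):
# pre-select SNPs with a univariate filter, fit an SVM regressor, and report
# the median and interquartile range of the Spearman correlation in 10-fold CV.
import numpy as np
from scipy.stats import spearmanr
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import KFold
from sklearn.svm import SVR

def cv_spearman(X, y, n_snps=1000, n_folds=10):
    scores = []
    for train, test in KFold(n_folds, shuffle=True, random_state=0).split(X):
        selector = SelectKBest(f_regression, k=n_snps).fit(X[train], y[train])
        model = SVR().fit(selector.transform(X[train]), y[train])
        pred = model.predict(selector.transform(X[test]))
        scores.append(spearmanr(y[test], pred).correlation)
    q1, med, q3 = np.percentile(scores, [25, 50, 75])
    return med, q3 - q1    # median accuracy and its interquartile range (stability)
```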

    Risk estimation and risk prediction using machine-learning methods

After an association between genetic variants and a phenotype has been established, further study goals comprise the classification of patients according to disease risk or the estimation of disease probability. To accomplish this, different statistical methods are required, and specifically machine-learning approaches may offer advantages over classical techniques. In this paper, we describe methods for the construction and evaluation of classification and probability estimation rules. We review the use of machine-learning approaches in this context and explain some of the machine-learning algorithms in detail. Finally, we illustrate the methodology through application to a genome-wide association analysis on rheumatoid arthritis. Electronic supplementary material: The online version of this article (doi:10.1007/s00439-012-1194-y) contains supplementary material, which is available to authorized users.
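As a minimal illustration of the distinction the review draws (not an example taken from the paper), the snippet below evaluates a classification rule by its discrimination (AUC) and the corresponding probability estimates by their calibration (Brier score) on synthetic data.

```python
# Minimal illustration (not from the paper) of evaluating a classification
# rule (discrimination, AUC) versus a probability estimation rule
# (calibration, Brier score) on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
risk = model.predict_proba(X_te)[:, 1]          # estimated disease probability

print("AUC:", roc_auc_score(y_te, risk))        # quality of the classification rule
print("Brier:", brier_score_loss(y_te, risk))   # quality of the probability estimates
```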

    Application of a spatially-weighted Relief algorithm for ranking genetic predictors of disease

Background: Identification of genetic variants that are associated with disease is an important goal in elucidating the genetic causes of diseases. The genetic patterns that are associated with common diseases are complex and may involve multiple interacting genetic variants. The Relief family of algorithms is a powerful tool for efficiently identifying genetic variants that are associated with disease, even if the variants have nonlinear interactions without significant main effects. Many variations of Relief have been developed over the past two decades and several of them have been applied to single nucleotide polymorphism (SNP) data. Results: We developed a new spatially weighted variation of Relief called Sigmoid Weighted ReliefF Star (SWRF*), and applied it to synthetic SNP data. When compared to ReliefF and SURF*, two algorithms that have been applied to SNP data for identifying interactions, SWRF* had significantly greater power. Furthermore, we developed a framework called the Modular Relief Framework (MoRF) that can be used to develop novel variations of the Relief algorithm, and we used MoRF to develop the SWRF* algorithm. Conclusions: MoRF allows easy development of new Relief algorithms by specifying different interchangeable functions for the component terms. Using MoRF, we developed a new Relief algorithm called SWRF* that had greater ability to identify interacting genetic variants in synthetic data compared to existing Relief algorithms.
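The toy scorer below conveys the flavour of a spatially weighted Relief variant: each neighbour's contribution to the SNP weights is scaled by a sigmoid of its distance to the target instance. The abstract does not specify the exact SWRF* weighting, so the sigmoid form and neighbourhood size used here are illustrative assumptions.

```python
# Toy ReliefF-style scorer in which each neighbour's contribution is scaled
# by a sigmoid of its distance to the target instance. The precise SWRF*
# weighting is not given in the abstract, so this sigmoid form is an
# illustrative assumption. X holds numerically encoded SNPs, y the labels.
import numpy as np

def sigmoid_weighted_relief(X, y, k=10):
    n, p = X.shape
    weights = np.zeros(p)
    for i in range(n):
        dists = np.abs(X - X[i]).sum(axis=1).astype(float)   # distance to all instances
        dists[i] = np.inf                                     # exclude the instance itself
        order = np.argsort(dists)
        hits = [j for j in order if y[j] == y[i]][:k]         # nearest same-class neighbours
        misses = [j for j in order if y[j] != y[i]][:k]       # nearest other-class neighbours
        scale = np.median(dists[np.isfinite(dists)])
        for j in hits + misses:
            w = 1.0 / (1.0 + np.exp(dists[j] - scale))        # closer neighbours weigh more
            diff = (X[j] != X[i]).astype(float)               # per-SNP disagreement
            weights += (w * diff) if y[j] != y[i] else -(w * diff)
    return weights / n                                        # higher score = more relevant SNP
```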

    Genotype imputation as a cost-saving genomic strategy for South African Sanga cattle: A review

The South African beef cattle population is heterogeneous and consists of a variety of breeds, production systems and breeding goals. Indigenous cattle breeds are uniquely adapted to their native surroundings, necessitating conservation of these breeds as usable genetic resources to sustain efficient production of beef. Current projections indicate positive growth in human population size, with parallel growth in nutritional demand, in the midst of intensifying environmental conditions. Sanga cattle, therefore, are invaluable assets to the South African beef industry. Modern genomic methodologies allow for an extensive insight into the genome architecture of local breeds. The evolution of these methodologies has also provided opportunities to incorporate deoxyribonucleic acid (DNA) information into breed improvement programs in the form of genomic selection (GS). Certain challenges, such as the high cost of generating adequate numbers of dense genotypic profiles and the introduction of ascertainment bias when non-commercial breeds are genotyped with commercial single nucleotide polymorphism (SNP) panels, have caused a lag in progress on the genomics front in South Africa. Genotype imputation is a statistical method that infers unavailable or missing genotypic data based on shared haplotypes within a population, using a population- or breed-representative reference sample. Genotypes are generated in silico, providing an animal with genotypic information for SNP markers that were not genotyped, based on predictive model-based algorithms. The validation of this method for indigenous breeds will enable the development of cost-effective low-density bead chips, allowing more animals to be genotyped, and imputation to high-density information. The improvement in SNP densities, at lower cost, will allow enhanced power in genome-wide association studies (GWAS) and genomic estimated breeding value (GEBV)-based selection for these breeds. To fully reap the benefits of this methodology, however, will require the setting up of accurate and reliable frameworks that are optimized for its application in Sanga breeds. This review paper aims, first, to identify the challenges that have been impeding genomic applications for Sanga cattle and second, to outline the advantages that a method such as genotype imputation might provide.

Keywords: breed improvement, developing countries, indigenous breeds, genomic
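To convey the idea behind genotype imputation in its simplest form (real imputation software relies on haplotype-based hidden Markov models), the toy sketch below fills an animal's missing genotypes with the values carried by the reference sample that matches it best at the markers that were genotyped.

```python
# Toy illustration of the idea behind genotype imputation (real tools use
# haplotype HMMs; this nearest-reference sketch only conveys the concept):
# a missing genotype is filled with the value carried by the reference
# sample that matches the target best at the markers that were genotyped.
import numpy as np

def impute_missing(target, reference):
    """target: 1-D float array of genotypes with np.nan at untyped SNPs.
    reference: 2-D array (reference samples x SNPs) with no missing values."""
    observed = ~np.isnan(target)
    # similarity = number of matching genotypes at the observed SNPs
    matches = (reference[:, observed] == target[observed]).sum(axis=1)
    best = reference[np.argmax(matches)]
    imputed = target.copy()
    imputed[~observed] = best[~observed]     # borrow genotypes from the best match
    return imputed
```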