    Evaluation of Feature Selection Techniques for Breast Cancer Risk Prediction

    This study evaluates several feature ranking techniques together with machine-learning classifiers to identify factors relevant to the probability of developing breast cancer and to improve the performance of breast cancer risk prediction models in a healthy population. The dataset, with 919 cases and 946 controls, comes from the MCC-Spain study and includes only environmental and genetic features. Breast cancer is a major public health problem. Our aim is to analyze which factors in the cancer risk prediction model are the most important for breast cancer prediction. Likewise, quantifying the stability of feature selection methods becomes essential before trying to gain insight into the data. This paper assesses several feature selection algorithms in terms of performance for a set of predictive models. Furthermore, their robustness is quantified to analyze both the similarity between the feature selection rankings and their own stability. The ranking provided by the SVM-RFE approach leads to the best performance in terms of the area under the ROC curve (AUC). The top-47 ranked features obtained with this approach, fed to a Logistic Regression classifier, achieve an AUC of 0.616, an improvement of 5.8% over the full feature set. Furthermore, the SVM-RFE ranking technique turned out to be highly stable (as did Random Forest), whereas Relief and the wrapper approaches are quite unstable. This study demonstrates that the stability and the performance of a model should be studied together: Random Forest and SVM-RFE turned out to be the most stable algorithms, but in terms of model performance SVM-RFE outperforms Random Forest.

    The study was partially funded by the “Accion Transversal del Cancer”, approved by the Spanish Ministry Council on 11 October 2007; by the Instituto de Salud Carlos III-FEDER (PI08/1770, PI08/0533, PI08/1359, PS09/00773, PS09/01286, PS09/01903, PS09/02078, PS09/01662, PI11/01403, PI11/01889, PI11/00226, PI11/01810, PI11/02213, PI12/00488, PI12/00265, PI12/01270, PI12/00715, PI12/00150); by the Fundación Marqués de Valdecilla (API 10/09); by the ICGC International Cancer Genome Consortium CLL; by the Junta de Castilla y León (LE22A10-2); by the Consejería de Salud of the Junta de Andalucía (PI-0571); by the Conselleria de Sanitat of the Generalitat Valenciana (AP 061/10); by Recercaixa (2010ACUP 00310); by the Regional Government of the Basque Country; by the European Commission grant FOOD-CT-2006-036224-HIWATE; by the Spanish Association Against Cancer (AECC) Scientific Foundation; and by the Catalan Government DURSI grant 2009SGR1489. Samples: biological samples were stored at the Parc de Salut MAR Biobank (MARBiobanc, Barcelona), which is supported by Instituto de Salud Carlos III FEDER (RD09/0076/00036), and at the Public Health Laboratory of Gipuzkoa and the Basque Biobank. Sample collection was also supported by the Xarxa de Bancs de Tumors de Catalunya (XBTC), sponsored by the Pla Director d’Oncologia de Catalunya. Biological samples were also stored at the “Biobanco La Fe”, which is supported by Instituto de Salud Carlos III (RD09/0076/00021), and at FISABIO biobanking, which is supported by Instituto de Salud Carlos III (RD09/0076/00058).
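
    A minimal sketch of the kind of pipeline this abstract describes (SVM-RFE ranking, then a Logistic Regression evaluated with AUC on the top-47 features), assuming a scikit-learn environment and a synthetic stand-in for the MCC-Spain feature matrix; the estimator settings and cross-validation scheme are illustrative assumptions rather than the paper's exact protocol.

        # Sketch: SVM-RFE ranking of features, then a Logistic Regression on the
        # top-47 features, scored by cross-validated AUC. Data are synthetic.
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.feature_selection import RFE
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import LinearSVC

        # Stand-in for the 919 cases + 946 controls with environmental/genetic features.
        X, y = make_classification(n_samples=1865, n_features=100, n_informative=15, random_state=0)

        # SVM-RFE: recursively drop the features with the smallest |w| of a linear SVM.
        rfe = RFE(estimator=LinearSVC(C=1.0, dual=False, max_iter=5000),
                  n_features_to_select=47, step=1)
        rfe.fit(StandardScaler().fit_transform(X), y)
        top47 = np.where(rfe.support_)[0]

        # Feed the selected features to a Logistic Regression and estimate the AUC.
        clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        auc = cross_val_score(clf, X[:, top47], y, cv=10, scoring="roc_auc").mean()
        print(f"AUC with the top-47 SVM-RFE features: {auc:.3f}")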

    Feature Selection Stability and Accuracy of Prediction Models for Genomic Prediction of Residual Feed Intake in Pigs Using Machine Learning

    Feature selection (FS, i.e., selection of a subset of predictor variables) is essential in high-dimensional datasets to prevent overfitting of prediction/classification models and to reduce computation time and resources. In genomics, FS allows identifying relevant markers and designing low-density SNP chips to evaluate selection candidates. In this research, several univariate and multivariate FS algorithms combined with various parametric and non-parametric learners were applied to the prediction of feed efficiency in growing pigs from high-dimensional genomic data. The objective was to find the best combination of feature selector, SNP subset size, and learner leading to accurate and stable (i.e., less sensitive to changes in the training data) prediction models. Genomic best linear unbiased prediction (GBLUP) without SNP pre-selection was the benchmark. Three types of FS methods were implemented: (i) filter methods, univariate (univ.dtree, spearcor) or multivariate (cforest, mrmr), with random selection as benchmark; (ii) embedded methods, elastic net and least absolute shrinkage and selection operator (LASSO) regression; and (iii) combinations of filter and embedded methods. Ridge regression, support vector machine (SVM), and gradient boosting (GB) were applied after pre-selection performed with the filter methods. Data represented 5,708 individual records of residual feed intake to be predicted from the animal’s own genotype. Accuracy (stability of results) was measured as the median (interquartile range) of the Spearman correlation between observed and predicted data in a 10-fold cross-validation. The best prediction in terms of accuracy and stability was obtained with SVM and GB using 500 or more SNPs [0.28 (0.02) and 0.27 (0.04) for SVM and GB with 1,000 SNPs, respectively]. With larger subset sizes (1,000–1,500 SNPs), the filter method had no influence on prediction quality, which was similar to that attained with a random selection. With 50–250 SNPs, the FS method had a huge impact on prediction quality: it was very poor for tree-based methods combined with any learner, but good and similar to what was obtained with larger SNP subsets when spearcor or mrmr were implemented, with or without embedded methods. Those filters also led to very stable results, suggesting their potential use for designing low-density SNP chips for genome-based evaluation of feed efficiency.
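
    A rough sketch of the evaluation scheme summarised above, assuming scikit-learn/SciPy and synthetic genotype data in place of the 5,708 real records: a univariate Spearman-correlation filter pre-selects SNPs, SVM and gradient boosting regressors are fitted, and accuracy/stability are reported as the median and interquartile range of the fold-wise Spearman correlation in a 10-fold cross-validation; subset size and hyper-parameters are placeholders.

        # Sketch: Spearman-correlation filter (a stand-in for "spearcor") + SVM / GB
        # regressors, summarised by the median and IQR of fold-wise Spearman
        # correlations between observed and predicted phenotypes.
        import numpy as np
        from scipy.stats import spearmanr, iqr
        from sklearn.ensemble import GradientBoostingRegressor
        from sklearn.model_selection import KFold
        from sklearn.svm import SVR

        rng = np.random.default_rng(0)
        X = rng.integers(0, 3, size=(1000, 2000)).astype(float)      # SNP genotypes coded 0/1/2
        y = X[:, :50] @ rng.normal(size=50) + rng.normal(size=1000)  # synthetic residual feed intake

        def spearcor_filter(X_tr, y_tr, k):
            """Keep the k SNPs most correlated (in absolute Spearman rho) with the phenotype."""
            rho = np.array([abs(spearmanr(X_tr[:, j], y_tr)[0]) for j in range(X_tr.shape[1])])
            return np.argsort(rho)[-k:]

        for name, learner in [("SVM", SVR(kernel="rbf", C=1.0)),
                              ("GB", GradientBoostingRegressor(n_estimators=300, random_state=0))]:
            corrs = []
            for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
                keep = spearcor_filter(X[train], y[train], k=500)
                pred = learner.fit(X[train][:, keep], y[train]).predict(X[test][:, keep])
                corrs.append(spearmanr(y[test], pred)[0])
            print(f"{name}: median rho = {np.median(corrs):.2f}, IQR = {iqr(corrs):.2f}")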

    Ensembles for feature selection: A review and future trends

    © 2019. This manuscript version is made available under the CC-BY-NC-ND 4.0 license (https://creativecommons.org/licenses/by-nc-nd/4.0/). This version of the article, Bolón-Canedo, V. and Alonso-Betanzos, A. (2019) ‘Ensembles for Feature Selection: A Review and Future Trends’, has been accepted for publication in Information Fusion, 52, pp. 1–12. The Version of Record is available online at https://doi.org/10.1016/j.inffus.2018.11.008.

    [Abstract]: Ensemble learning is a prolific field in Machine Learning, since it is based on the assumption that combining the output of multiple models is better than using a single model, and it usually provides good results. It has commonly been employed for classification, but it can also be used to improve other tasks such as feature selection. Feature selection consists of selecting the features relevant to a problem and discarding those that are irrelevant or redundant, with the main goal of improving classification accuracy. In this work, we provide the reader with the basic concepts necessary to build an ensemble for feature selection, as well as reviewing up-to-date advances and commenting on the future trends that are still to be faced.

    This research has been financially supported in part by the Spanish Ministerio de Economía y Competitividad (research project TIN 2015-65069-C2-1-R), by the Xunta de Galicia (research projects GRC2014/035 and the Centro Singular de Investigación de Galicia, accreditation 2016–2019, Ref. ED431G/01) and by the European Union (FEDER/ERDF).
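
    As a toy illustration of the ensemble idea reviewed in this paper, the sketch below combines three off-the-shelf rankers (ANOVA F-score, mutual information, and Random Forest importance) on the same data and aggregates their rankings by mean rank; the choice of base selectors, aggregation rule, and cut-off is an assumption made for illustration, not a recommendation taken from the article.

        # Sketch of a heterogeneous feature-selection ensemble: several rankers score
        # the same data, rankings are aggregated by mean rank, and the top features kept.
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.feature_selection import f_classif, mutual_info_classif

        X, y = make_classification(n_samples=500, n_features=40, n_informative=8, random_state=1)

        # Each base selector yields one relevance score per feature (higher = more relevant).
        scores = {
            "anova_F": f_classif(X, y)[0],
            "mutual_info": mutual_info_classif(X, y, random_state=1),
            "rf_importance": RandomForestClassifier(n_estimators=200, random_state=1)
                             .fit(X, y).feature_importances_,
        }

        # Turn each score vector into a ranking (0 = best) and aggregate by mean rank.
        ranks = np.vstack([np.argsort(np.argsort(-s)) for s in scores.values()])
        top10 = np.argsort(ranks.mean(axis=0))[:10]
        print("Ensemble top-10 features:", sorted(top10.tolist()))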

    A Machine Learning Approach for Lamb Meat Quality Assessment Using FTIR Spectra

    The food industry requires automatic methods to establish the authenticity of food products. In this work, we address the problem of certifying suckling lamb meat with respect to the rearing system. We evaluate the performance of neural network classifiers as well as different dimensionality reduction techniques, with the aim of categorizing lamb fat by means of spectroscopy and analysing the features with the most discrimination power. Assessing the stability of feature ranking algorithms also becomes particularly important. We assess six feature selection techniques: χ², Information Gain, Gain Ratio, Relief, and two embedded techniques based on the decision rule 1R and the Support Vector Machine (SVM). Additionally, we compare them with common approaches in the chemometrics field such as the Partial Least Squares (PLS) model and Principal Component Analysis (PCA) regression. Experimental results with a fat sample dataset collected from carcasses of suckling lambs show that performing feature selection contributes to classification performance, increasing accuracy from 89.70% with the full feature set to 91.80% and 93.89% with the SVM approach and PCA, respectively. Moreover, the neural classifiers yield a significant increase in accuracy with respect to the PLS model (85.60% accuracy). It is noteworthy that, unlike PCA or PLS, the feature selection techniques that select relevant wavelengths allow the user to identify the regions of the spectrum with the most discriminant power, which makes the understanding of this process easier for veterinary experts. The robustness of the feature selection methods is assessed via a visual approach.
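
    The stability question raised here can also be quantified numerically; the sketch below (an illustrative assumption, not the visual approach used in the paper) estimates the robustness of a mutual-information ranking as the average pairwise Jaccard similarity of the top-k feature subsets selected on bootstrap resamples of a synthetic spectra-like dataset.

        # Sketch: stability of a feature ranking measured as the mean pairwise Jaccard
        # similarity of top-k subsets selected on bootstrap resamples of the data.
        from itertools import combinations
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.feature_selection import mutual_info_classif

        X, y = make_classification(n_samples=300, n_features=200, n_informative=10, random_state=2)
        rng = np.random.default_rng(2)
        k, n_resamples = 20, 10

        subsets = []
        for _ in range(n_resamples):
            idx = rng.choice(len(y), size=len(y), replace=True)       # bootstrap resample
            scores = mutual_info_classif(X[idx], y[idx], random_state=2)
            subsets.append(set(np.argsort(scores)[-k:]))               # top-k features

        # 1.0 means identical subsets across resamples, 0.0 means disjoint subsets.
        jaccard = [len(a & b) / len(a | b) for a, b in combinations(subsets, 2)]
        print(f"Mean pairwise Jaccard stability: {np.mean(jaccard):.2f}")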