82 research outputs found
Learning a logistic regression with the help of unknown features at prediction stage
The use of features that are available at training time, but not at prediction time, as additional information for training models is known as the learning using privileged information paradigm. In this paper, the handling of privileged features is addressed from the perspective of logistic regression, a classifier commonly used in the clinical setting. Two new proposals, LOGIT+ and LRPROB+, learned under the influence of privileged features while preserving the interpretability of conventional logistic regression, are introduced. Experimental results on several datasets show that our proposals improve on the performance of traditional logistic regression learned without privileged information.
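One common route to exploiting privileged information can be sketched in a few lines: a "teacher" logistic regression trained with the privileged features produces soft labels, and a "student" restricted to prediction-time features is trained on them. This is a minimal, distillation-style illustration of the general idea, not the paper's LOGIT+ or LRPROB+ estimators; all data and parameter values are invented for the example.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, epochs=500, l2=1e-3):
    """Gradient-descent logistic regression; y may contain soft labels in [0, 1]."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b

def predict_proba(X, w, b):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

rng = np.random.default_rng(0)
n = 400
X_reg = rng.normal(size=(n, 3))    # features available at prediction time
X_priv = rng.normal(size=(n, 2))   # privileged features, seen only during training
logits = X_reg @ np.array([1.0, -1.0, 0.5]) + X_priv @ np.array([2.0, -2.0])
y = (logits + rng.normal(scale=0.5, size=n) > 0).astype(float)

# Teacher: trained with regular plus privileged features.
w_t, b_t = fit_logistic(np.hstack([X_reg, X_priv]), y)
soft = predict_proba(np.hstack([X_reg, X_priv]), w_t, b_t)

# Student: restricted to regular features, trained on a mix of hard and soft labels,
# so the privileged information is transferred through the teacher's probabilities.
lam = 0.5
w_s, b_s = fit_logistic(X_reg, lam * y + (1 - lam) * soft)

acc = float(np.mean((predict_proba(X_reg, w_s, b_s) > 0.5) == (y > 0.5)))
```

The student remains an ordinary, interpretable logistic regression over the prediction-time features; only its training targets carry the privileged signal.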
Learning a Battery of COVID-19 Mortality Prediction Models by Multi-objective Optimization
The COVID-19 pandemic is continuously evolving with drastically changing epidemiological situations which are approached with different decisions: from the reduction of fatalities to even the selection of patients with the highest probability of survival in critical clinical situations. Motivated by this, a battery of mortality prediction models with different performances has been developed to assist physicians and hospital managers. Logistic regression, one of the most popular classifiers within the clinical field, has been chosen as the basis for the generation of our models. Whilst a standard logistic regression only learns a single model focusing on improving accuracy, we propose to extend the possibilities of logistic regression by focusing on sensitivity and specificity. Hence, the log-likelihood function, used to calculate the coefficients in the logistic model, is split into two objective functions: one representing the survivors and the other for the deceased class. A multi-objective optimization process is undertaken on both functions in order to find the Pareto set, composed of models not improved by another model in both objective functions simultaneously. The individual optimization of either sensitivity (deceased patients) or specificity (survivors) criteria may be conflicting objectives because the improvement of one can imply the worsening of the other. Nonetheless, this conflict guarantees the output of a battery of diverse prediction models. Furthermore, a specific methodology for the evaluation of the Pareto models is proposed. As a result, a battery of COVID-19 mortality prediction models is obtained to assist physicians in decision-making for specific epidemiological situations.
This research is supported by the Basque Government (IT1504-22, Elkartek) through the BERC 2022-2025 program and the BMTF project, and by the Ministry of Science, Innovation and Universities: BCAM Severo Ochoa accreditation SEV-2017-0718 and PID2019-104966GB-I00. Furthermore, the work is also supported by the AXA Research Fund project "Early prognosis of COVID-19 infections via machine learning".
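The trade-off between the two per-class objectives can be illustrated by scalarizing them with a weight, sweeping that weight to obtain a family of logistic models, and keeping the nondominated ones. This is only a hedged sketch of the idea (the paper uses a proper multi-objective optimizer to find the Pareto set); the data, class weighting and all settings here are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_weighted(X, y, alpha, lr=0.2, epochs=800):
    """Gradient ascent on alpha * LL(deceased class) + (1 - alpha) * LL(survivor class)."""
    w, b = np.zeros(X.shape[1]), 0.0
    # Per-class weights, normalized by class size (an illustrative choice).
    sw = np.where(y == 1, alpha / y.sum(), (1 - alpha) / (1 - y).sum())
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        w += lr * (X.T @ (sw * (y - p)))
        b += lr * np.sum(sw * (y - p))
    return w, b

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 4))
y = (X @ np.array([1.5, -1.0, 0.5, 0.0]) + rng.normal(size=n) > 0.8).astype(float)

models = []
for alpha in np.linspace(0.1, 0.9, 9):
    w, b = fit_weighted(X, y, alpha)
    pred = sigmoid(X @ w + b) > 0.5
    sens = pred[y == 1].mean()        # accuracy on the "deceased" class
    spec = (~pred)[y == 0].mean()     # accuracy on the "survivor" class
    models.append((float(sens), float(spec)))

def dominates(a, b):
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

# The Pareto set: models not improved in both criteria by any other model.
pareto = [m for m in models if not any(dominates(o, m) for o in models)]
```

Each surviving model is one point on the sensitivity/specificity front, which is what gives physicians a battery of models to choose from rather than a single accuracy-optimal one.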
Heuristic Search over a Ranking for Feature Selection
In this work, we suggest a new feature selection technique that applies the wrapper approach to find a well-suited feature set for distinguishing experimental classes in high-dimensional data sets. Our method is based on the notions of relevance and redundancy, in the sense that a ranked feature is chosen only if additional information is gained by adding it. This heuristic leads to considerably better accuracy, compared with the full feature set and other representative feature selection algorithms, on twelve well-known data sets, coupled with a notable dimensionality reduction.
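A minimal sketch of the ranking-plus-wrapper heuristic: rank all features by a relevance score, then walk down the ranking and keep a feature only if it improves cross-validated accuracy. The nearest-centroid classifier and the synthetic data here are illustrative stand-ins, not the classifier or data sets evaluated in the paper.

```python
import numpy as np

def cv_accuracy(X, y, idx, k=5):
    """k-fold CV accuracy of a nearest-centroid classifier on the feature subset idx."""
    n = len(y)
    folds = np.array_split(np.arange(n), k)
    accs = []
    for f in folds:
        mask = np.ones(n, bool)
        mask[f] = False
        Xtr, ytr, Xte = X[mask][:, idx], y[mask], X[f][:, idx]
        c0, c1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
        pred = np.linalg.norm(Xte - c1, axis=1) < np.linalg.norm(Xte - c0, axis=1)
        accs.append(np.mean(pred == y[f].astype(bool)))
    return float(np.mean(accs))

rng = np.random.default_rng(2)
n, d = 200, 50                     # many features, only two informative
X = rng.normal(size=(n, d))
y = (X[:, 0] - X[:, 1] + 0.3 * rng.normal(size=n) > 0).astype(int)

# 1) Rank features by absolute correlation with the class label.
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(d)])
ranking = np.argsort(corr)[::-1]

# 2) Walk the ranking; keep a feature only if the wrapper score improves.
selected = [int(ranking[0])]
best = cv_accuracy(X, y, selected)
for j in ranking[1:]:
    score = cv_accuracy(X, y, selected + [int(j)])
    if score > best:
        selected, best = selected + [int(j)], score
```

Because redundant or irrelevant features cannot improve the wrapper score, the selected subset typically stays far smaller than the original dimensionality.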
HIV/antiretroviral therapy-related lipodystrophy syndrome (HALS) is associated with higher RBP4 and lower omentin in plasma
Very little information is available on the involvement of newly characterized adipokines in human immunodeficiency virus (HIV)/antiretroviral therapy (ART)-associated lipodystrophy syndrome (HALS). Our aim was to determine whether apelin, apelin receptor, omentin, RBP4, vaspin and visfatin genetic variants and plasma levels are associated with HALS. We performed a cross-sectional multicentre study that involved 558 HIV type 1-infected patients treated with a stable highly active ART regimen, 240 of whom had overt HALS and 318 of whom did not. Epidemiologic and clinical variables were determined. Polymorphisms in the apelin, omentin, RBP4, vaspin and visfatin genes were assessed by genotyping. Plasma apelin, apelin receptor, omentin, RBP4, vaspin and visfatin levels were determined by enzyme-linked immunosorbent assay in 163 patients (81 with HALS and 82 without HALS) from whom stored plasma samples were available. Student's t test, one-way ANOVA, chi-square test, Pearson and Spearman correlations and linear regression analysis were used for statistical analyses. There were no associations between the different polymorphisms assessed and the HALS phenotype. Circulating RBP4 was significantly higher (p < 0.001) and plasma omentin was significantly lower (p < 0.001) in patients with HALS compared with those without HALS; differences in plasma levels of the remaining adipokines were nonsignificant between groups. Circulating RBP4 concentration was independently predicted by the presence of HALS. Apelin and apelin receptor levels were independently predicted by body mass index. Visfatin concentration was independently predicted by the presence of acquired immunodeficiency syndrome. HALS is associated with higher RBP4 and lower omentin in plasma. These two adipokines, particularly RBP4, may be a link between HIV/ART and fat redistribution syndromes.
A review of estimation of distribution algorithms in bioinformatics
Evolutionary search algorithms have become an essential asset in the algorithmic toolbox for solving high-dimensional optimization problems across a broad range of bioinformatics applications. Genetic algorithms, the most well-known and representative evolutionary search technique, have been the subject of the major part of such applications. Estimation of distribution algorithms (EDAs) offer a novel evolutionary paradigm that constitutes a natural and attractive alternative to genetic algorithms. They make use of a probabilistic model, learnt from the promising solutions, to guide the search process. In this paper, we set out a basic taxonomy of EDA techniques, underlining the nature and complexity of the probabilistic model of each EDA variant. We review a set of innovative works that make use of EDA techniques to solve challenging bioinformatics problems, emphasizing the EDA paradigm's potential for further research in this domain.
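The core EDA loop (sample from a probabilistic model, select the promising solutions, re-estimate the model from them) can be sketched with UMDA, the simplest univariate variant, on the OneMax toy problem. All parameter values are illustrative.

```python
import numpy as np

def umda_onemax(n_bits=30, pop=100, top=30, gens=50, seed=3):
    """Univariate Marginal Distribution Algorithm maximizing the number of 1-bits."""
    rng = np.random.default_rng(seed)
    p = np.full(n_bits, 0.5)                   # probabilistic model: one Bernoulli per bit
    for _ in range(gens):
        X = (rng.random((pop, n_bits)) < p).astype(int)   # sample a population
        fitness = X.sum(axis=1)
        elite = X[np.argsort(fitness)[-top:]]  # select the promising solutions
        p = elite.mean(axis=0)                 # re-estimate the model from them
        p = np.clip(p, 0.05, 0.95)             # keep some diversity in the model
    best = (rng.random((pop, n_bits)) < p).astype(int).sum(axis=1).max()
    return int(best), p

best, p = umda_onemax()
```

More sophisticated EDA variants differ mainly in the model: bivariate and Bayesian-network EDAs capture dependencies between variables that this univariate product of Bernoullis ignores.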
Regularized logistic regression and multi-objective variable selection for classifying MEG data
This paper addresses the question of maximizing classifier accuracy for classifying task-related mental activity from magnetoencephalography (MEG) data. We propose the use of different sources of information and introduce an automatic channel selection procedure. To determine an informative set of channels, our approach combines a variety of machine learning algorithms: feature subset selection methods, classifiers based on regularized logistic regression, information fusion, and multi-objective optimization based on probabilistic modeling of the search space. The experimental results show that our proposal improves classification accuracy compared with approaches whose classifiers use only one type of MEG information or for which the set of channels is fixed a priori.
Classification and biomarker identification using gene network modules and support vector machines
<p>Abstract</p> <p>Background</p> <p>Classification using microarray datasets is usually based on a small number of samples for which tens of thousands of gene expression measurements have been obtained. The selection of the genes most significant to the classification problem is a challenging issue in high dimension data analysis and interpretation. A previous study with SVM-RCE (Recursive Cluster Elimination), suggested that classification based on groups of correlated genes sometimes exhibits better performance than classification using single genes. Large databases of gene interaction networks provide an important resource for the analysis of genetic phenomena and for classification studies using interacting genes.</p> <p>We now demonstrate that an algorithm which integrates network information with recursive feature elimination based on SVM exhibits good performance and improves the biological interpretability of the results. We refer to the method as SVM with Recursive Network Elimination (SVM-RNE)</p> <p>Results</p> <p>Initially, one thousand genes selected by t-test from a training set are filtered so that only genes that map to a gene network database remain. The Gene Expression Network Analysis Tool (GXNA) is applied to the remaining genes to form <it>n </it>clusters of genes that are highly connected in the network. Linear SVM is used to classify the samples using these clusters, and a weight is assigned to each cluster based on its importance to the classification. The least informative clusters are removed while retaining the remainder for the next classification step. 
This process is repeated until an optimal classification is obtained.</p> <p>Conclusion</p> <p>More than 90% accuracy can be obtained in classification of selected microarray datasets by integrating the interaction network information with the gene expression information from the microarrays.</p> <p>The Matlab version of SVM-RNE can be downloaded from <url>http://web.macam.ac.il/~myousef</url></p>
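The recursive elimination loop described above can be sketched as follows: score each gene cluster by the weight its features receive in a linear classifier, then repeatedly drop the weakest cluster. For self-containment this sketch substitutes a logistic-regression surrogate for the linear SVM, and the clusters and data are synthetic, so it illustrates the loop rather than SVM-RNE itself.

```python
import numpy as np

def fit_linear(X, y, lr=0.1, epochs=300):
    """Logistic-regression stand-in for the linear SVM used in SVM-RNE."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * (X.T @ (p - y) / len(y))
    return w

rng = np.random.default_rng(4)
n = 150
# Three synthetic "gene clusters" of 4 features each; only cluster 0 carries signal.
clusters = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7], 2: [8, 9, 10, 11]}
X = rng.normal(size=(n, 12))
y = (X[:, :4].sum(axis=1) + rng.normal(scale=0.5, size=n) > 0).astype(float)

active = dict(clusters)
while len(active) > 1:
    cols = [j for c in active.values() for j in c]
    w = fit_linear(X[:, cols], y)
    # Weight each cluster by the summed squared coefficient of its features.
    scores, offset = {}, 0
    for cid, c in active.items():
        scores[cid] = float(np.sum(w[offset:offset + len(c)] ** 2))
        offset += len(c)
    worst = min(scores, key=scores.get)
    del active[worst]          # eliminate the least informative cluster
```

Scoring whole clusters rather than single genes is what makes the surviving groups biologically interpretable: they are connected subnetworks, not isolated markers.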
Elastic SCAD as a novel penalization method for SVM classification tasks in high-dimensional data
<p>Abstract</p> <p>Background</p> <p>Classification and variable selection play an important role in knowledge discovery in high-dimensional data. Although Support Vector Machine (SVM) algorithms are among the most powerful classification and prediction methods with a wide range of scientific applications, the SVM does not include automatic feature selection and therefore a number of feature selection procedures have been developed. Regularisation approaches extend SVM to a feature selection method in a flexible way using penalty functions like LASSO, SCAD and Elastic Net.</p> <p>We propose a novel penalty function for SVM classification tasks, Elastic SCAD, a combination of SCAD and ridge penalties which overcomes the limitations of each penalty alone.</p> <p>Since SVM models are extremely sensitive to the choice of tuning parameters, we adopted an interval search algorithm, which in comparison to a fixed grid search finds rapidly and more precisely a global optimal solution.</p> <p>Results</p> <p>Feature selection methods with combined penalties (Elastic Net and Elastic SCAD SVMs) are more robust to a change of the model complexity than methods using single penalties. Our simulation study showed that Elastic SCAD SVM outperformed LASSO (<it>L</it><sub>1</sub>) and SCAD SVMs. Moreover, Elastic SCAD SVM provided sparser classifiers in terms of median number of features selected than Elastic Net SVM and often better predicted than Elastic Net in terms of misclassification error.</p> <p>Finally, we applied the penalization methods described above on four publicly available breast cancer data sets. Elastic SCAD SVM was the only method providing robust classifiers in sparse and non-sparse situations.</p> <p>Conclusions</p> <p>The proposed Elastic SCAD SVM algorithm provides the advantages of the SCAD penalty and at the same time avoids sparsity limitations for non-sparse data. 
We were the first to demonstrate that the integration of the interval search algorithm and penalized SVM classification techniques provides fast solutions for the optimization of tuning parameters.</p> <p>The penalized SVM classification algorithms as well as fixed grid and interval search for finding appropriate tuning parameters were implemented in our freely available R package 'penalizedSVM'.</p> <p>We conclude that the Elastic SCAD SVM is a flexible and robust tool for classification and feature selection tasks for high-dimensional data such as microarray data sets.</p>
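The penalty itself is easy to state. Below is a sketch of the SCAD penalty of Fan and Li (2001), with the conventional a = 3.7, and its Elastic SCAD combination with a ridge term; the regularization parameters chosen are illustrative, not the tuned values from the paper.

```python
import numpy as np

def scad(beta, lam=1.0, a=3.7):
    """SCAD penalty (Fan & Li, 2001), applied elementwise to coefficients."""
    b = np.abs(beta)
    small = lam * b                                          # |b| <= lam: LASSO-like
    mid = (2 * a * lam * b - b**2 - lam**2) / (2 * (a - 1))  # lam < |b| <= a*lam
    large = lam**2 * (a + 1) / 2                             # |b| > a*lam: constant
    return np.where(b <= lam, small, np.where(b <= a * lam, mid, large))

def elastic_scad(beta, lam1=1.0, lam2=0.5, a=3.7):
    """Elastic SCAD: SCAD plus a ridge term on the coefficients."""
    return scad(beta, lam1, a) + lam2 * np.asarray(beta) ** 2

beta = np.array([0.0, 0.5, 2.0, 10.0])
pen = elastic_scad(beta)
```

Because the SCAD branch flattens out for large coefficients, big effects are not over-shrunk as they are under LASSO, while the added ridge term keeps the penalty strictly convex near zero and stabilizes it for non-sparse, correlated data.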
Bioaccumulation of total mercury in the earthworm Eisenia andrei
Earthworms are a major part of the total biomass of soil fauna and play a vital role in soil maintenance. They process large amounts of plant and soil material and can accumulate many pollutants that may be present in the soil. Earthworms have been explored as bioaccumulators for many heavy metal species such as Pb, Cu and Zn, but limited information is available on mercury uptake and bioaccumulation in earthworms, and very few reports address the factors that influence the kinetics of Hg uptake by earthworms. It is known, however, that the uptake of Hg is strongly influenced by the presence of organic matter; hence, the influence of ligands is a major factor contributing to the kinetics of mercury uptake in biosystems. In this work we have focused on the uptake of mercury by earthworms (Eisenia andrei) in the presence of humic acid (HA) under varying physical conditions of pH and temperature, in order to assess the role of humic acid in the bioaccumulation of mercury by earthworms from soils. The study was conducted over a 5-day uptake period, and all earthworm samples were analysed by direct mercury analysis. Mercury distribution profiles as a function of time, bioaccumulation factors (BAFs), first-order rate constants and body burden constants for mercury uptake under selected conditions of temperature and pH, as well as via the dermal and gut routes, were evaluated in one comprehensive approach. The results showed that the uptake of Hg was influenced by pH, temperature and the presence of HA. Uptake of Hg²⁺ was improved at low pH and temperature when the earthworms in soil were in contact with a saturating aqueous phase. The total amount of Hg²⁺ uptake decreased from 75 to 48% as a function of pH. For earthworms in dry soil, the uptake was strongly influenced by the presence of the ligand. Calculated BAF values ranged from 0.1 to 0.8. Mercury uptake typically followed first-order kinetics, with rate constants determined as 0.2 to 1 h⁻¹.
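The first-order kinetics reported above correspond to a tissue concentration that approaches steady state exponentially, C(t) = C_ss·(1 − e^(−kt)), with the BAF linking the steady-state tissue level to the soil concentration. The sketch below uses invented concentrations; only the k and BAF ranges come from the abstract.

```python
import numpy as np

def uptake(t_hours, c_ss, k):
    """First-order uptake: C(t) = C_ss * (1 - exp(-k * t)), with k in h^-1."""
    return c_ss * (1.0 - np.exp(-k * t_hours))

# Illustrative numbers; only the k and BAF ranges are taken from the study.
c_soil = 10.0              # Hg concentration in soil (arbitrary units)
baf = 0.5                  # bioaccumulation factor, within the reported 0.1-0.8
c_ss = baf * c_soil        # steady-state tissue concentration implied by the BAF
k = 0.2                    # rate constant, at the low end of the reported 0.2-1 h^-1
t = np.linspace(0.0, 120.0, 25)   # the 5-day uptake period, in hours
c = uptake(t, c_ss, k)
```

Fitting this curve to the measured time profiles is what yields the rate constants, and the plateau divided by the soil concentration recovers the BAF.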
Computational classifiers for predicting the short-term course of Multiple sclerosis
The aim of this study was to assess the diagnostic accuracy (sensitivity and specificity) of clinical, imaging and motor evoked potential (MEP) variables for predicting the short-term prognosis of multiple sclerosis (MS). METHODS: We obtained clinical data, MRI and MEP from a prospective cohort of 51 patients and 20 matched controls followed for two years. The clinical end-points recorded were: 1) the expanded disability status scale (EDSS), 2) disability progression, and 3) new relapses. We constructed computational classifiers (Bayesian classifiers, random decision trees, simple logistic regression and neural networks) and calculated their accuracy by means of a 10-fold cross-validation method. We also validated our findings in a second cohort of 96 MS patients from another center. RESULTS: We found that disability at baseline, grey matter volume and MEP were the variables that correlated best with the clinical end-points, although their diagnostic accuracy was low. However, classifiers combining the most informative variables, namely baseline disability (EDSS), MRI lesion load and central motor conduction time (CMCT), were much more accurate in predicting future disability. Using the most informative variables (especially EDSS and CMCT), we developed a neural network (NNet) that attained good performance for predicting the EDSS change. The predictive ability of the neural network was validated in the independent cohort, obtaining similar accuracy (80%) for predicting the change in the EDSS two years later. CONCLUSIONS: The usefulness of clinical variables for predicting the course of MS on an individual basis is limited, despite being associated with the disease course. By training a NNet with the most informative variables, we achieved good accuracy for predicting short-term disability.
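The 10-fold cross-validation used to score the classifiers can be sketched generically: split the data into ten folds, train on nine, test on the held-out fold, and average the fold accuracies. The majority-class stand-in classifier and the random data below are illustrative only; the study's actual classifiers and cohort variables are of course richer.

```python
import numpy as np

def cross_val_accuracy(X, y, fit, predict, k=10, seed=5):
    """Estimate classifier accuracy by k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    folds = np.array_split(order, k)
    accs = []
    for f in folds:
        train = np.setdiff1d(order, f)        # all indices outside the held-out fold
        model = fit(X[train], y[train])
        accs.append(np.mean(predict(model, X[f]) == y[f]))
    return float(np.mean(accs))

# Toy stand-in classifier: predict the majority class of the training fold.
fit = lambda X, y: int(round(float(np.mean(y))))
predict = lambda m, X: np.full(len(X), m)

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
y = (rng.random(100) < 0.7).astype(int)
acc = cross_val_accuracy(X, y, fit, predict)
```

Because every sample is held out exactly once, the averaged score is a far less optimistic estimate of out-of-sample accuracy than training-set accuracy, which is why it is the standard choice for small clinical cohorts like this one.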