Large-scale Nonlinear Variable Selection via Kernel Random Features
We propose a new method for input variable selection in nonlinear regression.
The method is embedded into a kernel regression machine that can model general
nonlinear functions, not being a priori limited to additive models. This is the
first kernel-based variable selection method applicable to large datasets. It
sidesteps the typical poor scaling properties of kernel methods by mapping the
inputs into a relatively low-dimensional space of random features. The
algorithm discovers the variables relevant to the regression task while learning the prediction model, by learning the appropriate nonlinear random feature maps. We demonstrate the outstanding performance of our method
on a set of large-scale synthetic and real datasets. Comment: Final version for proceedings of ECML/PKDD 201
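As an illustration of the general idea only (not the authors' exact algorithm), the sketch below learns per-input relevance weights inside a random Fourier feature map while refitting a ridge model on those features; the weighting scheme, the Gaussian frequencies, and the alternating updates are all assumptions made for this example.

```python
# Hypothetical sketch: variable selection via learnable weights in random Fourier features.
# Not the paper's algorithm; the relevance weights `w`, frequencies, and updates are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d, D = 500, 10, 300          # samples, input dimensions, random features
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) * X[:, 1]   # only the first two inputs are relevant

Omega = rng.normal(size=(D, d))          # random frequencies
b = rng.uniform(0, 2 * np.pi, size=D)    # random phases
w = np.ones(d)                           # per-input relevance weights (to be learned)
lam, lr = 1e-3, 0.05

def features(X, w):
    # z(x) = sqrt(2/D) * cos(Omega (w * x) + b)
    return np.sqrt(2.0 / D) * np.cos((X * w) @ Omega.T + b)

for _ in range(200):
    Z = features(X, w)
    # ridge regression in feature space, closed form given the current weights w
    theta = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)
    resid = Z @ theta - y
    # gradient of the squared loss with respect to the relevance weights w
    S = -np.sqrt(2.0 / D) * np.sin((X * w) @ Omega.T + b)
    grad = np.array([
        (2.0 / n) * np.sum(resid * ((S * (Omega[:, k] * theta)).sum(axis=1) * X[:, k]))
        for k in range(d)
    ])
    w -= lr * grad

print("learned relevance weights:", np.round(np.abs(w), 2))  # larger |w_k| suggests a relevant input
```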
Analyzing sensory data using non-linear preference learning with feature subset selection
15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004. The quality of food can be assessed from different points of view. In this paper, we deal with those aspects that can be appreciated through sensory impressions. When we aim to induce a function that maps object descriptions into ratings, we must consider that consumers' ratings are just a way to express their preferences about the products presented in the same testing session. Therefore, we propose to learn from consumers' preference judgments instead of using a regression-based approach. This requires the use of special-purpose kernels and feature subset selection methods. We illustrate the benefits of our approach on two families of real-world databases.
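A minimal sketch of the learning-from-preferences idea, assuming the standard pairwise-difference construction; the RBF kernel and the synthetic data below are placeholders, not the special-purpose kernels used in the paper.

```python
# Sketch: turn preference judgments ("i preferred over j") into a classification
# problem on description differences. Data and kernel choice are illustrative only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))                 # product descriptions
utility = X[:, 0] - 0.5 * X[:, 1] ** 2        # hidden consumer utility (synthetic)

# Build preference pairs: sample i preferred over sample j within a session.
pairs = [(i, j) for i, j in rng.integers(0, 200, size=(500, 2)) if utility[i] != utility[j]]
diffs = np.array([X[i] - X[j] for i, j in pairs])
labels = np.array([1 if utility[i] > utility[j] else -1 for i, j in pairs])

# A classifier on description differences acts as a simple preference (ranking) model.
model = SVC(kernel="rbf", C=1.0).fit(diffs, labels)
print("pairwise preference accuracy:", model.score(diffs, labels))
```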
Evolving training sets for improved transfer learning in brain computer interfaces
A new proof-of-concept method for optimising the performance of Brain Computer Interfaces (BCI) while minimising the quantity of required training data is introduced. This is achieved by using an evolutionary approach to rearrange the distribution of training instances prior to the construction of an Ensemble Learning Generic Information (ELGI) model. The training data from a population was optimised to emphasise the generality of the models derived from it, prior to re-combination with participant-specific data via the ELGI approach and training of classifiers. Evidence is given to support the adoption of this approach under the more difficult BCI conditions: smaller training sets and those suffering from temporal drift. This paper serves as a case study that lays the groundwork for further exploration of the approach.
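A hedged sketch of the underlying idea, evolving which population-level training instances to keep before adding participant-specific data; the bit-mask evolutionary loop and logistic-regression classifier below are stand-ins, not the ELGI pipeline itself.

```python
# Sketch: evolve a binary mask over population training instances so that models
# built on the kept instances plus participant-specific data transfer better.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X_pop, y_pop = rng.normal(size=(300, 8)), rng.integers(0, 2, 300)   # other participants
X_own, y_own = rng.normal(size=(40, 8)), rng.integers(0, 2, 40)     # target participant

def fitness(mask):
    # Train on the selected population data plus the participant's own data;
    # score on the participant's data as a crude proxy for transfer quality.
    keep = mask.astype(bool)
    X = np.vstack([X_pop[keep], X_own])
    y = np.concatenate([y_pop[keep], y_own])
    return LogisticRegression(max_iter=200).fit(X, y).score(X_own, y_own)

pop = rng.integers(0, 2, size=(20, 300))            # 20 candidate masks
for _ in range(15):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]         # keep the better half
    children = parents[rng.integers(0, 10, 20)].copy()
    flip = rng.random(children.shape) < 0.02        # mutation
    children[flip] ^= 1
    pop = children

print("best transfer score:", max(fitness(m) for m in pop))
```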
ABCD Neurocognitive Prediction Challenge 2019: Predicting individual fluid intelligence scores from structural MRI using probabilistic segmentation and kernel ridge regression
We applied several regression and deep learning methods to predict fluid
intelligence scores from T1-weighted MRI scans as part of the ABCD
Neurocognitive Prediction Challenge (ABCD-NP-Challenge) 2019. We used voxel
intensities and probabilistic tissue-type labels derived from these as features
to train the models. The best predictive performance (lowest mean-squared
error) came from Kernel Ridge Regression (KRR), which produced a
mean-squared error of 69.7204 on the validation set and 92.1298 on the test
set. This placed our group in the fifth position on the validation leader board
and first place on the final (test) leader board. Comment: Winning entry in the ABCD Neurocognitive Prediction Challenge at MICCAI 2019. 7 pages plus references, 3 figures, 1 table
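A minimal kernel ridge regression baseline of the kind described above can be set up with scikit-learn; the random feature matrix below is only a placeholder for the voxel intensities and probabilistic tissue-type labels, and the hyperparameters are illustrative.

```python
# Sketch: kernel ridge regression with cross-validated mean-squared error.
# Features and targets are synthetic stand-ins, not ABCD challenge data.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 1000))     # stand-in for per-subject image features
y = rng.normal(size=400)             # stand-in for fluid intelligence scores

krr = KernelRidge(kernel="rbf", alpha=1.0, gamma=1e-3)
mse = -cross_val_score(krr, X, y, cv=5, scoring="neg_mean_squared_error").mean()
print(f"cross-validated MSE: {mse:.3f}")
```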
Feature selection for chemical sensor arrays using mutual information
We address the problem of feature selection for classifying a diverse set of chemicals using an array of metal oxide sensors. Our aim is to evaluate a filter approach to feature selection with reference to previous work, which used a wrapper approach on the same data set, and established best features and upper bounds on classification performance. We selected feature sets that exhibit the maximal mutual information with the identity of the chemicals. The selected features closely match those found to perform well in the previous study using a wrapper approach to conduct an exhaustive search of all permitted feature combinations. By comparing the classification performance of support vector machines (using features selected by mutual information) with the performance observed in the previous study, we found that while our approach does not always give the maximum possible classification performance, it always selects features that achieve classification performance approaching the optimum obtained by exhaustive search. We performed further classification using the selected feature set with some common classifiers and found that, for the selected features, Bayesian Networks gave the best performance. Finally, we compared the observed classification performances with the performance of classifiers using randomly selected features. We found that the selected features consistently outperformed randomly selected features for all tested classifiers. The mutual information filter approach is therefore a computationally efficient method for selecting near-optimal features for chemical sensor arrays.
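The filter approach can be sketched as follows: rank features by their mutual information with the chemical identity, keep the top few, and classify with an SVM. The synthetic data, the number of retained features, and the classifier settings below are placeholders, not those of the study.

```python
# Sketch: mutual-information filter feature selection followed by SVM classification.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 32))      # stand-in for metal oxide sensor features
y = rng.integers(0, 5, 300)         # stand-in for chemical identities

mi = mutual_info_classif(X, y, random_state=0)
top = np.argsort(mi)[-8:]           # keep the 8 most informative features
acc = cross_val_score(SVC(), X[:, top], y, cv=5).mean()
print("selected features:", sorted(top.tolist()), "accuracy:", round(acc, 3))
```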
Benchopt: Reproducible, efficient and collaborative optimization benchmarks
Numerical validation is at the core of machine learning research, as it allows one to assess the actual impact of new methods and to confirm the agreement
between theory and practice. Yet, the rapid development of the field poses
several challenges: researchers are confronted with a profusion of methods to
compare, limited transparency and consensus on best practices, as well as
tedious re-implementation work. As a result, validation is often very partial,
which can lead to wrong conclusions that slow down the progress of research. We
propose Benchopt, a collaborative framework to automate, reproduce and publish
optimization benchmarks in machine learning across programming languages and
hardware architectures. Benchopt simplifies benchmarking for the community by
providing an off-the-shelf tool for running, sharing and extending experiments.
To demonstrate its broad usability, we showcase benchmarks on three standard
learning tasks: ℓ2-regularized logistic regression, Lasso, and ResNet18
training for image classification. These benchmarks highlight key practical
findings that give a more nuanced view of the state-of-the-art for these
problems, showing that for practical evaluation, the devil is in the details.
We hope that Benchopt will foster collaborative work in the community, hence improving the reproducibility of research findings. Comment: Accepted in proceedings of NeurIPS 22; Benchopt library documentation is available at https://benchopt.github.io
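To make concrete the kind of comparison such a benchmark automates, the sketch below evaluates two Lasso solvers against a common objective. This is not the Benchopt API, only a hand-rolled illustration; the problem sizes and regularisation level are arbitrary.

```python
# Sketch: compare two solvers for the Lasso objective 0.5*||y - Xw||^2 + lam*||w||_1.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X, y, lam = rng.normal(size=(200, 50)), rng.normal(size=200), 0.1

def objective(w):
    return 0.5 * np.sum((y - X @ w) ** 2) + lam * np.sum(np.abs(w))

# Solver 1: proximal gradient (ISTA) with step 1/L, L = largest eigenvalue of X^T X.
w, step = np.zeros(50), 1.0 / np.linalg.norm(X, 2) ** 2
for _ in range(500):
    z = w - step * X.T @ (X @ w - y)
    w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-thresholding

# Solver 2: scikit-learn's coordinate descent (its alpha equals lam / n_samples).
w_cd = Lasso(alpha=lam / len(y), fit_intercept=False).fit(X, y).coef_

print("ISTA objective:", round(objective(w), 4), " CD objective:", round(objective(w_cd), 4))
```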
Enhanced protein fold recognition through a novel data integration approach
Background: Protein fold recognition is a key step in protein three-dimensional (3D) structure discovery. There are multiple fold discriminatory data sources which use physicochemical and structural properties as well as further data sources derived from local sequence alignments. This raises the issue of finding the most efficient method for combining these different informative data sources and exploring their relative significance for protein fold classification. Kernel methods have been extensively used for biological data analysis. They can incorporate separate fold discriminatory features into kernel matrices which encode the similarity between samples in their respective data sources.
Results: In this paper we consider the problem of integrating multiple data sources using a kernel-based approach. We propose a novel information-theoretic approach based on a Kullback-Leibler (KL) divergence between the output kernel matrix and the input kernel matrix so as to integrate heterogeneous data sources. One of the most appealing properties of this approach is that it can easily cope with multi-class classification and multi-task learning by an appropriate choice of the output kernel matrix. Based on the position of the output and input kernel matrices in the KL-divergence objective, there are two formulations which we respectively refer to as MKLdiv-dc and MKLdiv-conv. We propose to efficiently solve MKLdiv-dc by a difference of convex (DC) programming method and MKLdiv-conv by a projected gradient descent algorithm. The effectiveness of the proposed approaches is evaluated on a benchmark dataset for protein fold recognition and a yeast protein function prediction problem.
Conclusion: Our proposed methods MKLdiv-dc and MKLdiv-conv are able to achieve state-of-the-art performance on the SCOP PDB-40D benchmark dataset for protein fold prediction and provide useful insights into the relative significance of informative data sources. In particular, MKLdiv-dc further improves the fold discrimination accuracy to 75.19%, which is a more than 5% improvement over competitive Bayesian probabilistic and SVM margin-based kernel learning methods. Furthermore, we report a competitive performance on the yeast protein function prediction problem.
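The abstract states the objective only verbally; one hedged way to write a KL-divergence matching between an output kernel and a weighted combination of input kernels, treating both (after regularisation) as covariances of zero-mean Gaussians, is the following. The exact MKLdiv formulation may differ in its details.

```latex
% Illustrative form only; K_y is the output kernel, K_1..K_m the input kernels,
% \lambda the nonnegative combination weights, \sigma a small regulariser.
\min_{\lambda \ge 0,\ \sum_p \lambda_p = 1}
D_{\mathrm{KL}}\!\Big(\mathcal{N}\big(0,\,K_y + \sigma I\big)\,\Big\|\,
\mathcal{N}\big(0,\,\Sigma_\lambda\big)\Big)
= \tfrac{1}{2}\Big[\operatorname{tr}\!\big(\Sigma_\lambda^{-1}(K_y+\sigma I)\big)
+ \log\det \Sigma_\lambda - \log\det(K_y+\sigma I) - n\Big],
\qquad \Sigma_\lambda = \sum_{p=1}^{m}\lambda_p K_p + \sigma I .
```

Swapping which of the two matrices sits in each position of the divergence yields the second formulation, consistent with the MKLdiv-dc / MKLdiv-conv distinction described above.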
Statistical Estimation of Correlated Genome Associations to a Quantitative Trait Network
Many complex disease syndromes, such as asthma, consist of a large number of highly related, rather than independent, clinical or molecular phenotypes. This raises a new technical challenge in identifying genetic variations associated simultaneously with correlated traits. In this study, we propose a new statistical framework called graph-guided fused lasso (GFlasso) to directly and effectively incorporate the correlation structure of multiple quantitative traits such as clinical metrics and gene expressions in association analysis. Our approach represents the correlation information among the quantitative traits explicitly as a quantitative trait network (QTN) and then leverages this network to encode structured regularization functions in a multivariate regression model over the genotypes and traits. As a result, genetic markers that jointly influence subgroups of highly correlated traits can be detected with high sensitivity and specificity. While most traditional methods examine each phenotype independently and combine the results afterwards, our approach analyzes all of the traits jointly in a single statistical framework. This allows our method to borrow information across correlated phenotypes to discover the genetic markers that perturb a subset of the correlated traits synergistically. Using simulated datasets based on the HapMap consortium and an asthma dataset, we compared the performance of our method with single-marker analysis and with regression-based methods that do not use any of the relational information in the traits. We found that our method showed increased power in detecting causal variants affecting correlated traits. Our results showed that, when correlation patterns among traits in a QTN are considered explicitly and directly during a structured multivariate genome association analysis using our proposed methods, the power of detecting true causal SNPs with possibly pleiotropic effects increased significantly without compromising performance on non-pleiotropic SNPs.
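Written out, a graph-guided fused lasso of this kind augments a multivariate lasso with a fusion penalty along the edges of the quantitative trait network; the notation below (edge weights f(r_ml), sign of the trait correlation) is a sketch of the usual presentation, not necessarily the paper's exact objective.

```latex
% Sketch of a graph-guided fused lasso objective. B = (beta_{jk}) are SNP-by-trait
% coefficients, E the QTN edge set, r_{ml} the correlation between traits m and l.
\hat{B} \;=\; \arg\min_{B}\;
\sum_{k=1}^{K}\big\lVert \mathbf{y}_k - X\boldsymbol{\beta}_k \big\rVert_2^{2}
\;+\; \lambda \sum_{k=1}^{K}\sum_{j=1}^{J}\lvert \beta_{jk}\rvert
\;+\; \gamma \sum_{(m,l)\in E} f(r_{ml}) \sum_{j=1}^{J}
\big\lvert \beta_{jm} - \operatorname{sign}(r_{ml})\,\beta_{jl}\big\rvert .
```

The lasso term encourages sparse SNP effects, while the fusion term pulls the coefficients of a SNP towards equal (or opposite-signed) values across traits that are connected, and strongly correlated, in the network.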
Morphological segmentation analysis and texture-based support vector machines classification on mice liver fibrosis microscopic images
Background: To reduce the intensity of the work of doctors, pre-classification work needs to be carried out. In this paper, a novel liver microscopic image classification analysis method is proposed. Objective: For quantitative analysis, segmentation is carried out to extract the quantitative information of special organisms in the image for further diagnosis, lesion localization, learning and treating anatomical abnormalities, and computer-guided surgery. Methods: In the current work, entropy-based features of microscopic fibrotic mice liver images were analyzed using fuzzy c-means clustering, k-means, and watershed algorithms based on distance transformations and gradient. A morphological segmentation based on a local threshold was deployed to determine the fibrosis areas of the images. Results: The proposed method achieved highly effective segmentation of the fibrosis regions in mice liver microscopy images in terms of running time, Dice ratio, and precision. The image classification experiments were conducted using the Gray Level Co-occurrence Matrix (GLCM). The best classification model was built on GLCM characteristics and achieved the highest classification accuracy using a Support Vector Machine (SVM). A model trained on 11 features was found to be only as accurate as one trained on 8 GLCM features. Conclusion: The research illustrates that the proposed method is a feasible new approach for microscopy mice liver image segmentation and classification using intelligent image analysis techniques. The average computational time of the proposed approach was only 2.335 seconds, and it outperformed other segmentation algorithms with a Dice ratio of 0.8125 and a precision of 0.5253.
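A sketch of the texture-classification half of such a pipeline, pairing GLCM statistics with an SVM via scikit-image and scikit-learn; the random image patches, GLCM parameters, and the particular texture properties below are placeholders rather than the paper's settings.

```python
# Sketch: GLCM texture features plus an SVM classifier for microscopy patches.
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.svm import SVC

def glcm_features(img_u8):
    # Co-occurrence matrix at distance 1 for four orientations, then summary statistics.
    glcm = graycomatrix(img_u8, distances=[1],
                        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                        levels=256, symmetric=True, normed=True)
    props = ["contrast", "dissimilarity", "homogeneity", "energy", "correlation", "ASM"]
    return np.concatenate([graycoprops(glcm, p).ravel() for p in props])

rng = np.random.default_rng(6)
images = rng.integers(0, 256, size=(60, 64, 64), dtype=np.uint8)   # stand-in patches
labels = rng.integers(0, 2, 60)                                    # fibrosis vs. normal

X = np.array([glcm_features(im) for im in images])
clf = SVC(kernel="rbf").fit(X, labels)
print("training accuracy:", clf.score(X, labels))
```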