Development of Biomarkers Based on Diet-Dependent Metabolic Serotypes: Practical Issues in Development of Expert System-Based Classification Models in Metabolomic Studies
This is the publisher's official version, also available electronically from: http://online.liebertpub.com/doi/pdfplus/10.1089/omi.2004.8.197
Dietary restriction (DR)-induced changes in the serum metabolome may be biomarkers for
physiological status (e.g., relative risk of developing age-related diseases such as cancer).
Megavariate analysis (unsupervised hierarchical cluster analysis [HCA]; principal components
analysis [PCA]) of serum metabolites reproducibly distinguishes DR from ad libitum fed
rats. Component-based approaches (i.e., PCA) consistently perform as well as or better than
distance-based metrics (i.e., HCA). We therefore tested the following: (A) Do identified subsets
of serum metabolites contain sufficient information to construct mathematical models
of class membership (i.e., expert systems)? (B) Do component-based metrics out-perform
distance-based metrics? Testing was conducted using KNN (k-nearest neighbors, supervised
HCA) and SIMCA (soft independent modeling of class analogy, supervised PCA). Models
were built with single cohorts, combined cohorts or mixed samples from previously studied
cohorts as training sets. Both algorithms over-fit models based on single cohort training sets.
KNN models had >85% accuracy within training/test sets, but were unstable (i.e., values of
k could not be accurately set in advance). SIMCA models had 100% accuracy within all
training sets, 89% accuracy in test sets, did not appear to over-fit mixed cohort training sets,
and did not require post-hoc modeling adjustments. These data indicate that (i) previously
defined metabolites are robust enough to construct classification models (expert systems)
with SIMCA that can predict unknowns by dietary category; (ii) component-based analyses
outperformed distance-based metrics; (iii) use of over-fitting controls is essential; and (iv)
subtle inter-cohort variability may be a critical issue for high data density biomarker studies
that lack state markers.
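
The instability of KNN with respect to k, noted in the abstract above, can be illustrated with a minimal sketch. The two-feature profiles, the class labels "AL" and "DR", and the function name `knn_predict` are all hypothetical and are not taken from the study; this is plain majority-vote KNN with Euclidean distance, not the authors' pipeline.

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Rank training indices by Euclidean distance to the query.
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-metabolite profiles (arbitrary units), not data from the study.
train = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0),   # "AL": ad libitum fed
         (2.0, 2.1), (2.2, 1.9), (1.4, 1.3)]   # "DR": dietary restriction
labels = ["AL", "AL", "AL", "DR", "DR", "DR"]
query = (1.35, 1.25)

# The predicted class depends on k: the single nearest neighbour is a DR
# sample that sits close to the AL cluster, but larger k outvotes it.
predictions = {k: knn_predict(train, labels, query, k) for k in (1, 3, 5)}
print(predictions)
```

In this toy case the label flips between k = 1 and k = 3, which is one way a model can look accurate for a chosen k yet be hard to configure in advance.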
Learning From Labeled And Unlabeled Data: An Empirical Study Across Techniques And Domains
There has been increased interest in devising learning techniques that
combine unlabeled data with labeled data, i.e., semi-supervised learning.
However, to the best of our knowledge, no study has been performed across
various techniques and different types and amounts of labeled and unlabeled
data. Moreover, most of the published work on semi-supervised learning
techniques assumes that the labeled and unlabeled data come from the same
distribution. It is possible for the labeling process to be associated with a
selection bias such that the distributions of data points in the labeled and
unlabeled sets are different. Not correcting for such bias can result in biased
function approximation with potentially poor performance. In this paper, we
present an empirical study of various semi-supervised learning techniques on a
variety of datasets. We attempt to answer various questions such as the effect
of independence or relevance amongst features, the effect of the size of the
labeled and unlabeled sets and the effect of noise. We also investigate the
impact of sample-selection bias on the semi-supervised learning techniques
under study and implement a bivariate probit technique particularly designed to
correct for such bias.
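
As an illustration of how unlabeled data can be combined with labeled data, the sketch below implements self-training with a nearest-centroid classifier. Self-training is one common semi-supervised technique; the abstract does not say which techniques were compared, so its use here is an assumption. The data, the `margin` threshold, and the function names are hypothetical, and the bivariate probit bias correction is not covered.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def self_train(labeled, unlabeled, rounds=5, margin=1.0):
    """Nearest-centroid self-training for two classes (0 and 1).

    labeled: list of (point, label); unlabeled: list of points.
    Returns the enlarged labeled set and the points left unlabeled.
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        c0 = centroid([p for p, y in labeled if y == 0])
        c1 = centroid([p for p, y in labeled if y == 1])
        confident, rest = [], []
        for p in pool:
            d0, d1 = sq_dist(p, c0), sq_dist(p, c1)
            # Accept a pseudo-label only when one centroid is clearly closer.
            if abs(d0 - d1) >= margin:
                confident.append((p, 0 if d0 < d1 else 1))
            else:
                rest.append(p)
        if not confident:
            break  # nothing new to learn from; stop early
        labeled.extend(confident)
        pool = rest
    return labeled, pool

# Hypothetical data: two labeled points and five unlabeled ones.
seed_labels = [((0.0, 0.0), 0), ((4.0, 4.0), 1)]
unlabeled = [(0.5, 0.5), (1.0, 0.0), (3.5, 4.0), (4.0, 3.0), (2.0, 2.1)]

final_labeled, leftover = self_train(seed_labels, unlabeled)
print(len(final_labeled), leftover)
```

The point near the class boundary stays unlabeled because it never clears the confidence margin. If the labeled seed points were drawn with a selection bias, the initial centroids would be shifted and the pseudo-labels would inherit that bias, which is the failure mode the abstract highlights.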
Use of supervised machine learning for GNSS signal spoofing detection with validation on real-world meaconing and spoofing data : part I
The vulnerability of the Global Navigation Satellite System (GNSS) open service signals to spoofing and meaconing poses a risk to the users of safety-of-life applications. The risk is that manipulated GNSS data are used to generate a position-velocity-timing solution without the user's system being aware, so that hazardously misleading information is presented and signal integrity deteriorates without an alarm being triggered. Among the many spoofing detection and mitigation techniques proposed for different stages of the signal processing, we present a method for the cross-correlation monitoring of multiple, statistically significant GNSS observables and measurements that serve as input for the supervised machine learning detection of potentially spoofed or meaconed GNSS signals. The results of two experiments are presented, in which laboratory-generated spoofing signals were used for training and internal verification, while two different real-world spoofing and meaconing datasets were used to validate the supervised machine learning algorithms for the detection of GNSS spoofing and meaconing.
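
One way to picture cross-correlation monitoring of observables as classifier input is to compute pairwise correlation coefficients over a time window and use them as a feature vector. The sketch below assumes Pearson correlation and uses hypothetical observable names (`cn0`, `doppler`, `agc`) with made-up values; the abstract does not specify which observables or which correlation measure the authors used.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for a constant series; 0 is a convention chosen here
    return cov / math.sqrt(vx * vy)

def correlation_features(window):
    """Map a dict of observable series to their pairwise correlations."""
    names = sorted(window)
    return {(a, b): pearson(window[a], window[b]) for a, b in combinations(names, 2)}

# Hypothetical observables over one short time window (made-up values).
window = {
    "cn0": [40.0, 41.0, 42.0, 43.0],
    "doppler": [100.0, 102.0, 104.0, 106.0],
    "agc": [5.0, 4.0, 3.0, 2.0],
}
features = correlation_features(window)
print(features)
```

Feature vectors of this kind, labeled as clean or spoofed, could then be passed to any supervised classifier.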
Knowledge-based gene expression classification via matrix factorization
Motivation: Modern machine learning methods based on matrix decomposition techniques, like independent component analysis (ICA) or non-negative matrix factorization (NMF), provide new and efficient analysis tools which are currently explored to analyze gene expression profiles. These exploratory feature extraction techniques yield expression modes (ICA) or metagenes (NMF). These extracted features are considered indicative of underlying regulatory processes. They can as well be applied to the classification of gene expression datasets by grouping samples into different categories for diagnostic purposes or group genes into functional categories for further investigation of related metabolic pathways and regulatory networks.
Results: In this study we focus on unsupervised matrix factorization techniques and apply ICA and sparse NMF to microarray datasets. The latter monitor the gene expression levels of human peripheral blood cells during differentiation from monocytes to macrophages. We show that these tools are able to identify relevant signatures in the deduced component matrices and extract informative sets of marker genes from these gene expression profiles. The methods rely on the joint discriminative power of a set of marker genes rather than on single marker genes. With these sets of marker genes, corroborated by leave-one-out or random forest cross-validation, the datasets could easily be classified into related diagnostic categories. The latter correspond to either monocytes versus macrophages or healthy vs Niemann Pick C disease patients.
Siemens AG, Munich; DFG (Graduate College 638); DAAD (PPP Luso - Alemã and PPP Hispano - Alemanas)
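
To make the idea of metagenes concrete, the following sketch factorizes a small non-negative matrix V into W and H with the classic multiplicative update rules for the squared-error objective. It is plain NMF without the sparsity constraint the abstract mentions, and the toy matrix, its "samples by genes" interpretation, and all function names are assumptions for illustration only.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_error(V, W, H):
    """Squared Frobenius norm of V - WH."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def nmf(V, rank, iters=200, seed=0, eps=1e-9):
    """Factorize non-negative V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H

# Toy "samples x genes" matrix with two obvious expression blocks.
V = [[5.0, 5.0, 0.0, 0.0],
     [4.0, 4.0, 0.0, 0.0],
     [0.0, 0.0, 3.0, 3.0],
     [0.0, 0.0, 2.0, 2.0]]

W0, H0 = nmf(V, rank=2, iters=0)   # random initialization only
W, H = nmf(V, rank=2, iters=200)   # same seed, after the updates
err_before = frob_error(V, W0, H0)
err_after = frob_error(V, W, H)
print(err_before, err_after)
```

Each row of H plays the role of a metagene (a weighted set of genes), and each row of W gives a sample's loading on those metagenes; samples could then be grouped by their dominant loading.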
Dimensionality reduction of clustered data sets
We present a novel probabilistic latent variable model to perform linear dimensionality reduction on data sets which contain clusters. We prove that the maximum likelihood solution of the model is an unsupervised generalisation of linear discriminant analysis. This provides a completely new approach to one of the most established and widely used classification algorithms. The performance of the model is then demonstrated on a number of real and artificial data sets.