Rank discriminants for predicting phenotypes from RNA expression
Statistical methods for analyzing large-scale biomolecular data are
commonplace in computational biology. A notable example is phenotype prediction
from gene expression data, for instance, detecting human cancers,
differentiating subtypes and predicting clinical outcomes. Still, clinical
applications remain scarce. One reason is that the complexity of the decision
rules that emerge from standard statistical learning impedes biological
understanding, in particular, any mechanistic interpretation. Here we explore
decision rules for binary classification utilizing only the ordering of
expression among several genes; the basic building blocks are then two-gene
expression comparisons. The simplest example, just one comparison, is the TSP
classifier, which has appeared in a variety of cancer-related discovery
studies. Decision rules based on multiple comparisons can better accommodate
class heterogeneity, thereby increasing accuracy, and might provide a link
to biological mechanism. We consider a general framework ("rank-in-context")
for designing discriminant functions, including a data-driven selection of the
number and identity of the genes in the support ("context"). We then specialize
to two examples: voting among several pairs and comparing the median expression
in two groups of genes. Comprehensive experiments assess accuracy relative to
other, more complex, methods, and reinforce earlier observations that simple
classifiers are competitive.

Comment: Published at http://dx.doi.org/10.1214/14-AOAS738 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
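
To make the rank-based building blocks concrete, here is a minimal Python sketch of a majority vote over two-gene comparisons (a single pair reduces to the TSP classifier) and of the median-comparison rule over two gene groups. The gene indices, pairs, and expression values are hypothetical placeholders; the paper's data-driven selection of the context is not reproduced here.

    import numpy as np

    def pair_vote(x, i, j):
        # Two-gene comparison: vote 1 if gene i is expressed above gene j.
        return 1 if x[i] > x[j] else 0

    def ktsp_predict(x, pairs):
        # Majority vote over several two-gene comparisons; one pair
        # reduces to the TSP classifier. Ties go to class 0.
        votes = sum(pair_vote(x, i, j) for i, j in pairs)
        return int(votes > len(pairs) / 2)

    def median_rule_predict(x, group_a, group_b):
        # Compare median expression in two groups of genes (the "context").
        return int(np.median(x[group_a]) > np.median(x[group_b]))

    # Hypothetical 6-gene expression profile with pre-selected pairs/groups;
    # in practice these would be chosen from training data.
    x = np.array([5.2, 1.1, 3.3, 4.0, 0.7, 2.5])
    print(ktsp_predict(x, [(0, 1), (3, 4), (2, 5)]))        # -> 1
    print(median_rule_predict(x, [0, 2, 3], [1, 4, 5]))     # -> 1

Because the rules depend only on the ordering of expression values within a sample, they are invariant to monotone normalization, which is part of what makes them attractive for cross-study use.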
Characteristics-Informed Neural Networks for Forward and Inverse Hyperbolic Problems
We propose characteristics-informed neural networks (CINN), a simple and
efficient machine learning approach for solving forward and inverse problems
involving hyperbolic PDEs. Like physics-informed neural networks (PINN), CINN
is a meshless machine learning solver with universal approximation
capabilities. Unlike PINN, which enforces a PDE softly via a multi-part loss
function, CINN encodes the characteristics of the PDE in a general-purpose deep
neural network trained with the usual MSE data-fitting regression loss and
standard deep learning optimization methods. This leads to faster training and
can avoid well-known pathologies of gradient descent optimization of multi-part
PINN loss functions. If the characteristic ODEs can be solved exactly, which is
true in important cases, the output of a CINN is an exact solution of the PDE,
even at initialization, preventing the occurrence of non-physical outputs.
Otherwise, the ODEs must be solved approximately, but the CINN is still trained
only using a data-fitting loss function. The performance of CINN is assessed
empirically in forward and inverse linear hyperbolic problems. These
preliminary results indicate that CINN is able to improve on the accuracy of
the baseline PINN, while being nearly twice as fast to train and avoiding
non-physical solutions. Future extensions to hyperbolic PDE systems and
nonlinear PDEs are also briefly discussed.
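
A minimal numpy sketch of the core idea follows, for constant-coefficient advection u_t + c*u_x = 0, whose characteristics are the lines x - c*t = const. A model that sees only the characteristic variable s = x - c*t satisfies the PDE exactly, even at initialization, and is fit with a plain MSE/least-squares data loss. The random tanh features stand in for the deep network of the paper; the equation, wave speed, and initial condition are illustrative choices, not the paper's benchmarks.

    import numpy as np

    c = 1.5                                # known wave speed
    rng = np.random.default_rng(0)

    # Random-feature surrogate network: u(x, t) = sum_k w_k * tanh(a_k*s + b_k),
    # where s = x - c*t. Any function of s alone solves u_t + c*u_x = 0.
    K = 64
    a = rng.normal(size=K)
    b = rng.normal(size=K)

    def features(x, t):
        s = x - c * t                      # characteristic variable
        return np.tanh(np.outer(s, a) + b)

    # Fit the initial condition u(x, 0) = exp(-x^2) by ordinary least squares,
    # i.e. a pure data-fitting loss -- no PDE residual term is needed.
    x0 = np.linspace(-3, 3, 200)
    u0 = np.exp(-x0 ** 2)
    w, *_ = np.linalg.lstsq(features(x0, 0.0), u0, rcond=None)

    def u(x, t):
        return features(x, t) @ w

    # The fitted solution transports the profile along characteristics:
    # u(x, t) ~= u0(x - c*t).
    xq = np.linspace(-1.5, 1.5, 5)
    print(u(xq, 1.0))                      # compare with the exact values below
    print(np.exp(-(xq - c) ** 2))

Note how the PDE constraint is built into the architecture rather than enforced softly through a residual penalty; this is what removes the multi-part loss balancing that complicates PINN training.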
Classification and Error Estimation for Discrete Data
Discrete classification is common in Genomic Signal Processing applications, in particular in the classification of discretized gene expression data, in discrete gene expression prediction, and in the inference of Boolean genomic regulatory networks. Once a discrete classifier is obtained from sample data, its performance must be evaluated through its classification error. In practice, error estimation methods must be employed to obtain reliable estimates of the classification error based on the available data. In Genomics, both classifier design and error estimation are complicated by the prevalence of small-sample data sets. This paper presents a broad review of the methodology of classification and error estimation for discrete data, in the context of Genomics, focusing on performance in small-sample scenarios as well as asymptotic behavior.
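
The following sketch illustrates the setting: a discrete histogram classifier (majority class per observed feature pattern) evaluated with two standard error estimators, resubstitution and leave-one-out. The ternary-quantized data and the labeling rule are invented for illustration and are not taken from the paper.

    import numpy as np
    from collections import Counter

    def train_histogram(X, y):
        # Majority-vote table: for each discrete feature pattern, count labels.
        table = {}
        for x, label in zip(map(tuple, X), y):
            table.setdefault(x, Counter())[label] += 1
        return table

    def predict(table, x, default=0):
        counts = table.get(tuple(x))
        return counts.most_common(1)[0][0] if counts else default

    def resubstitution_error(X, y):
        # Error on the training data itself: optimistically biased.
        table = train_histogram(X, y)
        return np.mean([predict(table, x) != label for x, label in zip(X, y)])

    def loo_error(X, y):
        # Leave-one-out: nearly unbiased, but higher variance in small samples.
        errs = []
        for i in range(len(y)):
            mask = np.arange(len(y)) != i
            table = train_histogram(X[mask], y[mask])
            errs.append(predict(table, X[i]) != y[i])
        return np.mean(errs)

    # Hypothetical ternary-quantized expression data (levels -1, 0, 1).
    rng = np.random.default_rng(1)
    X = rng.integers(-1, 2, size=(30, 3))
    y = (X[:, 0] + rng.integers(0, 2, size=30) > 0).astype(int)  # noisy rule
    print(resubstitution_error(X, y))   # typically low (optimistic)
    print(loo_error(X, y))              # typically higher

The gap between the two estimates is exactly the small-sample phenomenon the review studies: with few samples per pattern, resubstitution rewards memorization while leave-one-out exposes it.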
Is Bagging Effective in the Classification of Small-Sample Genomic and Proteomic Data?
There has been considerable interest recently in the application of bagging in the classification of both gene-expression data and protein-abundance mass spectrometry data. The approach is often justified by the improvement it produces in the performance of unstable, overfitting classification rules in small-sample situations. However, the question of real practical interest is whether the ensemble scheme improves the performance of those classifiers enough to beat single stable, nonoverfitting classifiers on small-sample genomic and proteomic data sets. To investigate that question, we conducted a detailed empirical study using publicly available data sets from published genomic and proteomic studies. We observed that, under t-test and RELIEF filter-based feature selection, bagging generally does a good job of improving the performance of unstable, overfitting classifiers, such as CART decision trees and neural networks, but that the improvement was not sufficient to beat the performance of single stable, nonoverfitting classifiers, such as diagonal and plain linear discriminant analysis, or 3-nearest neighbors. Furthermore, as expected, bagging did not significantly improve the performance of those stable classifiers themselves. Representative experimental results are presented and discussed in this work.
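
A minimal scikit-learn sketch of the kind of comparison described above: a filter-based feature selector followed by either a bagged unstable classifier (CART) or a single stable one (LDA). The synthetic data, the use of f_classif (the two-class ANOVA F-test, equivalent to a t-test filter here), and all parameter choices are illustrative assumptions, not the study's data sets or protocol.

    from sklearn.datasets import make_classification
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.ensemble import BaggingClassifier
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in for a small-sample, high-dimensional expression study:
    # 40 samples, 500 features, few informative genes.
    X, y = make_classification(n_samples=40, n_features=500, n_informative=10,
                               n_redundant=0, random_state=0)

    models = {
        "bagged CART": BaggingClassifier(DecisionTreeClassifier(),
                                         n_estimators=50, random_state=0),
        "single CART": DecisionTreeClassifier(random_state=0),
        "LDA (stable)": LinearDiscriminantAnalysis(),
    }

    for name, clf in models.items():
        # Feature selection sits inside the pipeline so each CV fold
        # re-selects genes, avoiding selection bias in the error estimate.
        pipe = make_pipeline(SelectKBest(f_classif, k=10), clf)
        acc = cross_val_score(pipe, X, y, cv=5).mean()
        print(f"{name}: {acc:.2f}")

Placing the selector inside the cross-validation loop matters in small-sample studies: selecting features on the full data before splitting would leak information and inflate every method's apparent accuracy.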