State-of-the-art data normalization methods improve NMR-based metabolomic analysis
Extracting biomedical information from large metabolomic datasets by multivariate data analysis is of considerable complexity. Common challenges include, among others, screening for differentially produced metabolites, estimation of fold changes, and sample classification. Prior to these analysis steps, it is important to minimize contributions from unwanted biases and experimental variance. This is the goal of data preprocessing. In this work, different data normalization methods were compared systematically, employing two different datasets generated by means of nuclear magnetic resonance (NMR) spectroscopy. To this end, two different types of normalization methods were used: one aiming to remove unwanted sample-to-sample variation, the other adjusting the variance of the different metabolites by variable scaling and variance stabilization methods. The impact of all tested methods on sample classification was evaluated on urinary NMR fingerprints obtained from healthy volunteers and patients suffering from autosomal dominant polycystic kidney disease (ADPKD). Performance in terms of screening for differentially produced metabolites was investigated on a dataset following a Latin-square design, where varied amounts of eight different metabolites were spiked into a human urine matrix while keeping the total spike-in amount constant. In addition, specific tests were conducted to systematically investigate the influence of the different preprocessing methods on the structure of the analyzed data. In conclusion, preprocessing methods originally developed for DNA microarray analysis, in particular Quantile and Cubic-Spline Normalization, performed best in reducing bias, accurately detecting fold changes, and classifying samples.
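As a rough illustration (not the authors' code), quantile normalization, one of the microarray-derived methods reported here to perform best, can be sketched in a few lines of NumPy; the matrix layout (spectral bins by samples) and all names are illustrative assumptions.

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize an intensity matrix (spectral bins x samples).

    Each sample (column) is mapped onto a common reference distribution:
    the mean of the sorted columns. Names here are illustrative only.
    """
    order = np.argsort(X, axis=0)          # per-sample ranks
    sorted_cols = np.sort(X, axis=0)
    reference = sorted_cols.mean(axis=1)   # mean "spectrum" across samples
    X_norm = np.empty_like(X, dtype=float)
    for j in range(X.shape[1]):
        X_norm[order[:, j], j] = reference # assign reference values by rank
    return X_norm

# Toy example: 5 spectral bins, 3 urine samples at different dilutions.
X = np.array([[5.0, 10.0, 2.5],
              [3.0,  6.0, 1.5],
              [8.0, 16.0, 4.0],
              [1.0,  2.0, 0.5],
              [4.0,  8.0, 2.0]])
print(quantile_normalize(X))
```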
Multi-domain stain normalization for digital pathology: A cycle-consistent adversarial network for whole slide images
The variation in histologic staining between different medical centers is one
of the most profound challenges in the field of computer-aided diagnosis. The
appearance disparity of pathological whole slide images causes algorithms to
become less reliable, which in turn impedes the widespread applicability of
downstream tasks like cancer diagnosis. Furthermore, different stainings
introduce biases during training that, in the case of domain shifts, negatively
affect test performance. Therefore, in this paper we propose MultiStain-CycleGAN, a
multi-domain approach to stain normalization based on CycleGAN. Our
modifications to CycleGAN allow us to normalize images of different origins
without retraining or using different models. We perform an extensive
evaluation of our method using various metrics and compare it to commonly used
methods that are multi-domain capable. First, we evaluate how well our method
fools a domain classifier that tries to assign a medical center to an image.
Then, we test our normalization on the tumor classification performance of a
downstream classifier. Furthermore, we evaluate the image quality of the
normalized images using the structural similarity index and the ability to
reduce the domain shift using the Fréchet inception distance. We show that
our method proves to be multi-domain capable, provides the highest image
quality among the compared methods, and can most reliably fool the domain
classifier while keeping the tumor classifier performance high. By reducing the
domain influence, biases in the data can be removed on the one hand and the
origin of the whole slide image can be disguised on the other, thus enhancing
patient data privacy.
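A minimal sketch of the image-quality part of such an evaluation, not the authors' pipeline: the structural similarity index between an original and a stain-normalized patch via scikit-image (the `channel_axis` argument assumes scikit-image 0.19 or newer; the patch data here are random placeholders).

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

# Placeholder data standing in for an original H&E patch and its
# stain-normalized counterpart (in practice, load the real patches).
rng = np.random.default_rng(0)
original = rng.random((256, 256, 3))
normalized = np.clip(original + rng.normal(0, 0.02, original.shape), 0, 1)

# SSIM close to 1 means normalization preserved the tissue structure.
score = ssim(original, normalized, channel_axis=-1, data_range=1.0)
print(f"SSIM between original and normalized patch: {score:.3f}")
```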
A Framework for Evaluating Land Use and Land Cover Classification Using Convolutional Neural Networks
Analyzing land use and land cover (LULC) using remote sensing (RS) imagery is essential
for many environmental and social applications. The increase in availability of RS data has led to the
development of new techniques for digital pattern classification. Very recently, deep learning (DL)
models have emerged as a powerful solution to approach many machine learning (ML) problems.
In particular, convolutional neural networks (CNNs) are currently the state of the art for many image
classification tasks. While there exist several promising proposals on the application of CNNs to
LULC classification, the validation framework proposed for the comparison of different methods
could be improved with the use of a standard validation procedure for ML based on cross-validation
and its subsequent statistical analysis. In this paper, we propose a general CNN, with a fixed
architecture and parametrization, to achieve high accuracy on LULC classification over RS data
from different sources such as radar and hyperspectral. We also present a methodology to perform
a rigorous experimental comparison between our proposed DL method and other ML algorithms
such as support vector machines, random forests, and k-nearest-neighbors. The analysis carried out
demonstrates that the CNN outperforms the rest of the techniques, achieving a high level of performance
for all the datasets studied, regardless of their different characteristics.
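A minimal sketch of the kind of cross-validated comparison with a paired statistical test that the abstract advocates, using scikit-learn and SciPy on synthetic feature vectors; the CNN itself is omitted, and the dataset, fold count, and Wilcoxon test are illustrative assumptions rather than the paper's exact protocol.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for per-patch LULC feature vectors and class labels.
X, y = make_classification(n_samples=600, n_features=30, n_informative=15,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

models = {
    "SVM": SVC(kernel="rbf", C=10.0),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=0),
    "kNN": KNeighborsClassifier(n_neighbors=5),
}

# Same folds for every model, so the per-fold scores are paired.
scores = {name: cross_val_score(m, X, y, cv=5) for name, m in models.items()}
for name, s in scores.items():
    print(f"{name}: {s.mean():.3f} +/- {s.std():.3f}")

# Paired Wilcoxon signed-rank test on per-fold accuracies (illustrative).
stat, p = wilcoxon(scores["SVM"], scores["RandomForest"])
print(f"SVM vs RandomForest: p = {p:.3f}")
```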
Inference and Evaluation of the Multinomial Mixture Model for Text Clustering
In this article, we investigate the use of a probabilistic model for
unsupervised clustering in text collections. Unsupervised clustering has become
a basic module for many intelligent text processing applications, such as
information retrieval, text classification or information extraction. The model
considered in this contribution consists of a mixture of multinomial
distributions over the word counts, each component corresponding to a different
theme. We present and contrast various estimation procedures, which apply both
in supervised and unsupervised contexts. In supervised learning, this work
suggests a criterion for evaluating the posterior odds of new documents which
is more statistically sound than the "naive Bayes" approach. In an unsupervised
context, we propose measures to set up a systematic evaluation framework and
start by examining the Expectation-Maximization (EM) algorithm as the basic
tool for inference. We discuss the importance of initialization and the
influence of other features such as the smoothing strategy or the size of the
vocabulary, thereby illustrating the difficulties incurred by the high
dimensionality of the parameter space. We also propose a heuristic algorithm
based on iterative EM with vocabulary reduction to solve this problem. Using
the fact that the latent variables can be analytically integrated out, we
finally show that the Gibbs sampling algorithm is tractable and compares
favorably to the basic expectation-maximization approach.
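A compact sketch, not the authors' implementation, of EM for a mixture of multinomials over word counts; the additive smoothing constant and the toy corpus are illustrative assumptions (the per-document multinomial coefficient is omitted since it cancels in the responsibilities).

```python
import numpy as np

def multinomial_mixture_em(X, K, n_iter=50, alpha=1e-2, seed=0):
    """EM for a mixture of multinomials over word counts.

    X: (n_docs, vocab) count matrix; K: number of themes;
    alpha: additive smoothing on the word probabilities (illustrative).
    """
    rng = np.random.default_rng(seed)
    n, V = X.shape
    pi = np.full(K, 1.0 / K)                     # mixing weights
    theta = rng.dirichlet(np.ones(V), size=K)    # (K, V) word probabilities
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability.
        log_r = np.log(pi) + X @ np.log(theta).T   # (n, K)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update mixing weights and smoothed word distributions.
        pi = r.mean(axis=0)
        counts = r.T @ X + alpha                   # (K, V)
        theta = counts / counts.sum(axis=1, keepdims=True)
    return pi, theta, r

# Toy corpus: 6 documents over a 4-word vocabulary with two obvious themes.
X = np.array([[5, 4, 0, 1], [6, 3, 1, 0], [4, 5, 0, 0],
              [0, 1, 5, 6], [1, 0, 4, 5], [0, 0, 6, 4]])
pi, theta, resp = multinomial_mixture_em(X, K=2)
print(np.round(resp, 2))
```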
Statistically Motivated Second Order Pooling
Second-order pooling, a.k.a. bilinear pooling, has proven effective for deep
learning based visual recognition. However, the resulting second-order networks
yield a final representation that is orders of magnitude larger than that of
standard, first-order ones, making them memory-intensive and cumbersome to
deploy. Here, we introduce a general, parametric compression strategy that can
produce more compact representations than existing compression techniques, yet
outperform both compressed and uncompressed second-order models. Our approach
is motivated by a statistical analysis of the network's activations, relying on
operations that lead to a Gaussian-distributed final representation, as
inherently used by first-order deep networks. As evidenced by our experiments,
this lets us outperform the state-of-the-art first-order and second-order
models on several benchmark recognition datasets.
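A minimal NumPy sketch of plain second-order (bilinear) pooling, illustrating the size blow-up relative to first-order pooling that motivates the compression; the paper's parametric compression strategy itself is not reproduced here, and the feature-map shape is an assumed example.

```python
import numpy as np

def second_order_pool(feature_map):
    """Plain second-order (bilinear) pooling of a conv feature map.

    feature_map: (C, H, W) activations for one image. Returns the C*C
    averaged outer-product matrix, flattened, showing why the
    representation is orders of magnitude larger than first-order
    (global average) pooling.
    """
    C, H, W = feature_map.shape
    Z = feature_map.reshape(C, H * W)    # one C-dim vector per spatial location
    G = Z @ Z.T / (H * W)                # (C, C) second-order statistics
    return G.reshape(-1)

rng = np.random.default_rng(0)
fmap = rng.standard_normal((512, 7, 7))  # typical late-layer size (assumed)
first_order = fmap.mean(axis=(1, 2))     # 512-dim representation
second_order = second_order_pool(fmap)   # 262,144-dim representation
print(first_order.shape, second_order.shape)
```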
Best Practices in Convolutional Networks for Forward-Looking Sonar Image Recognition
Convolutional Neural Networks (CNN) have revolutionized perception for color
images, and their application to sonar images has also obtained good results.
But in general CNNs are difficult to train without a large dataset, need manual
tuning of a considerable number of hyperparameters, and require many careful
decisions by a designer. In this work, we evaluate three common decisions that
need to be made by a CNN designer, namely the performance of transfer learning,
the effect of object/image size, and the effect of training set size. We
evaluate three CNN models, namely one based on LeNet, and two based on the Fire
module from SqueezeNet. Our findings are: Transfer learning with an SVM works
very well, even when the train and transfer sets have no classes in common, and
high classification performance can be obtained even when the target dataset is
small. The ADAM optimizer combined with Batch Normalization can yield a
high-accuracy CNN classifier, even with small image sizes (16 pixels). At least
50 samples per class are required to obtain good test accuracy, and using
Dropout with a small dataset helps improve performance, but Batch Normalization
is better when a large dataset is available.
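A rough sketch of the transfer-learning-plus-SVM recipe described above, using an ImageNet-pretrained ResNet-18 from torchvision (0.13 or newer) as a stand-in feature extractor for the paper's LeNet/SqueezeNet-style sonar-trained CNNs; all data here are random placeholders.

```python
import numpy as np
import torch
from sklearn.svm import SVC
from torchvision.models import resnet18, ResNet18_Weights

# Stand-in feature extractor with its classification head removed.
backbone = resnet18(weights=ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # expose the 512-dim pooled features
backbone.eval()

def extract_features(images):
    """images: float tensor (N, 3, 224, 224), already preprocessed."""
    with torch.no_grad():
        return backbone(images).numpy()

# Placeholder target dataset: random tensors standing in for a small set
# of forward-looking sonar image crops and their class labels.
images = torch.rand(40, 3, 224, 224)
labels = np.repeat(np.arange(4), 10)

# Train an SVM on frozen CNN features instead of fine-tuning the network.
features = extract_features(images)
svm = SVC(kernel="linear", C=1.0).fit(features, labels)
print("train accuracy:", svm.score(features, labels))
```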