CMA – a comprehensive Bioconductor package for supervised classification with high dimensional data
For the last eight years, microarray-based class prediction has been a major topic in statistics, bioinformatics and biomedicine research. Traditional methods often yield unsatisfactory results or may even be inapplicable in the p > n setting where the number of predictors by far exceeds the number of observations, hence the term “ill-posed-problem”. Careful model selection and evaluation satisfying accepted good-practice standards is a very complex task for inexperienced users with limited statistical background or for statisticians without experience in this area. The multiplicity of available methods for class prediction based on high-dimensional data is an additional practical challenge for inexperienced researchers. In this article, we introduce a new Bioconductor package called CMA (standing for “Classification for MicroArrays”) for automatically performing variable selection, parameter tuning, classifier construction, and unbiased evaluation of the constructed classifiers using a large number of usual methods. Without much time and effort, users are provided with an overview of the unbiased accuracy of most top-performing classifiers. Furthermore, the standardized evaluation framework underlying CMA can also be beneficial in statistical research for comparison purposes, for instance if a new classifier has to be compared to existing approaches. CMA is a user-friendly comprehensive package for classifier construction and evaluation implementing most usual approaches. It is freely available from the Bioconductor website at http://bioconductor.org/packages/2.3/bioc/html/CMA.html
Stability and aggregation of ranked gene lists
Ranked gene lists are highly unstable in the sense that similar measures of differential gene expression may yield very different rankings, and that a small change of the data set usually affects the obtained gene list considerably. Stability issues have long been under-considered in the literature, but they have become a hot topic in the last few years, perhaps as a consequence of the increasing skepticism about the reproducibility and clinical applicability of molecular research findings. In this article, we review existing approaches for the assessment of stability of ranked gene lists and the related problem of aggregation, give some practical recommendations, and warn against potential misuse of these methods. This overview is illustrated through an application to a recent leukemia data set using the freely available Bioconductor package GeneSelector.
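The two notions named in the abstract above, stability and aggregation of ranked lists, can be illustrated with a minimal sketch. It uses a plain top-k overlap score and a mean-rank aggregation as generic stand-ins; these are not necessarily the methods reviewed in the article or implemented in GeneSelector, and the function names and gene labels are hypothetical.

```python
def topk_overlap(ranking_a, ranking_b, k):
    """Fraction of the top-k genes of ranking_a that are also in the top-k of ranking_b."""
    if k <= 0 or k > min(len(ranking_a), len(ranking_b)):
        raise ValueError("k must be between 1 and the length of the shorter ranking")
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k


def aggregate_by_mean_rank(rankings):
    """Combine several rankings of the same genes by their mean position (1 = best)."""
    genes = rankings[0]
    totals = {gene: 0 for gene in genes}
    for ranking in rankings:
        if set(ranking) != set(genes):
            raise ValueError("all rankings must contain the same genes")
        for position, gene in enumerate(ranking, start=1):
            totals[gene] += position
    # Ties in mean rank are broken alphabetically to keep the result deterministic.
    return sorted(genes, key=lambda gene: (totals[gene] / len(rankings), gene))


# Hypothetical rankings, e.g. from three different test statistics or resamples.
r1 = ["g1", "g2", "g3", "g4"]
r2 = ["g2", "g1", "g4", "g3"]
r3 = ["g1", "g3", "g2", "g4"]

overlap = topk_overlap(r1, r3, k=2)            # only g1 is shared in the top 2
consensus = aggregate_by_mean_rank([r1, r2, r3])
```

An overlap close to 1 across perturbed versions of the data would indicate a stable list; low values are the instability the abstract warns about.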
Evaluating Microarray-based Classifiers: An Overview
For the last eight years, microarray-based class prediction has been the subject of numerous publications in medicine, bioinformatics and statistics journals. However, in many articles, the assessment of classification accuracy is carried out using suboptimal procedures and receives little attention. In this paper, we carefully review various statistical aspects of classifier evaluation and validation from a practical point of view. The main topics addressed are accuracy measures, error rate estimation procedures, variable selection, choice of classifiers and validation strategy.
Restoring the full velocity field in the gaseous disk of the spiral galaxy NGC 157
We analyse the line-of-sight velocity field of ionized gas in the spiral
galaxy NGC 157 which has been obtained in the H\alpha emission at the 6m
telescope of SAO RAS. The existence of systematic deviations of the observed
gas velocities from pure circular motion is shown. A detailed investigation of
these deviations is undertaken by applying a Fourier analysis of the azimuthal
distributions of the line-of-sight velocities at different distances from the
galactic center. As a result of the analysis, all the main parameters of the
wave spiral pattern are determined: the corotation radius, the amplitudes and
phases of the gas velocity perturbations at different radii, and the velocity
of circular rotation of the disk corrected for the velocity perturbations due
to spiral arms. At a high confidence level, the presence of the two giant
anticyclones in the reference frame rotating with the spiral pattern is shown;
their sizes and the localization of their centers are consistent with the
results of the analytic theory and of numerical simulations. Besides the
anticyclones, the existence of cyclones in residual velocity fields of spiral
galaxies is predicted. In the reference frame rotating with the spiral pattern
these cyclones have to reveal themselves in galaxies where a radial gradient of
azimuthal residual velocity is steeper than that of the rotation velocity
(abridged). Comment: 23 pages including 25 eps-figures. Accepted for publication in A&
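The Fourier analysis of azimuthal velocity distributions mentioned in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes line-of-sight velocities sampled at evenly spaced azimuths around a single ring, and the function name `fourier_harmonics` and the synthetic ring values are hypothetical.

```python
import math


def fourier_harmonics(velocities, max_order=3):
    """Fit v(phi) ~ a0 + sum_m [a_m cos(m phi) + b_m sin(m phi)].

    Assumes the samples are evenly spaced in azimuth over a full ring and that
    there are more than 2 * max_order of them. Returns (a0, [(a_1, b_1), ...]).
    """
    n = len(velocities)
    if n <= 2 * max_order:
        raise ValueError("need more than 2 * max_order samples")
    phis = [2.0 * math.pi * k / n for k in range(n)]
    a0 = sum(velocities) / n
    coeffs = []
    for m in range(1, max_order + 1):
        a_m = 2.0 / n * sum(v * math.cos(m * p) for v, p in zip(velocities, phis))
        b_m = 2.0 / n * sum(v * math.sin(m * p) for v, p in zip(velocities, phis))
        coeffs.append((a_m, b_m))
    return a0, coeffs


# Synthetic ring (made-up numbers): a constant offset, a cos(phi) term of the
# kind produced by projected circular rotation, and a small second harmonic
# standing in for a perturbation.
n_samples = 72
ring = [
    1600.0
    + 150.0 * math.cos(2.0 * math.pi * k / n_samples)
    + 10.0 * math.sin(2.0 * 2.0 * math.pi * k / n_samples)
    for k in range(n_samples)
]
a0, coeffs = fourier_harmonics(ring)
```

Repeating the fit ring by ring gives harmonic amplitudes and phases as functions of radius; deviations from pure circular motion appear as non-zero terms beyond the constant and the first cosine harmonic.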
The Compact Group of Galaxies HCG 31 is in an early phase of merging
We have obtained high spectral resolution (R = 45900) Fabry-Perot velocity
maps of the Hickson Compact Group HCG 31 in order to revisit the important
problem of the merger nature of the central object A+C and to derive the
internal kinematics of the candidate tidal dwarf galaxies in this group. Our
main findings are: (1) double kinematic components are present throughout the
main body of A+C, which strongly suggests that this complex is an ongoing
merger (2) regions and E, to the east and south of complex A+C, present
rotation patterns with velocity amplitudes of and they
counterrotate with respect to A+C, (3) region F, which was previously thought
to be the best example of a tidal dwarf galaxy in HCG 31, presents no rotation
and negligible internal velocity dispersion, as is also the case for region
. HCG 31 presents an undergoing merger in its center (A+C) and it is likely
that it has suffered additional perturbations due to interactions with the
nearby galaxies B, G and Q. Comment: 5 pages + figures - Accepted to ApJ Lette
An AUC-based Permutation Variable Importance Measure for Random Forests
The random forest (RF) method is a commonly used tool for classification with high dimensional data as well as for ranking candidate predictors based on the so-called random forest variable importance measures (VIMs). However, the classification performance of RF is known to be suboptimal in the case of strongly unbalanced data, i.e. data where response class sizes differ considerably. Suggestions were made to obtain better classification performance based either on sampling procedures or on cost sensitivity analyses. However, to our knowledge, the performance of the VIMs has not yet been examined in the case of unbalanced response classes. In this paper we explore the performance of the permutation VIM for unbalanced data settings and introduce an alternative permutation VIM based on the area under the curve (AUC) that is expected to be more robust towards class imbalance. We investigated the performance of the standard permutation VIM and of our novel AUC-based permutation VIM for different class imbalance levels using simulated data and real data. The results suggest that, with increasing class imbalance, the standard permutation VIM loses its ability to discriminate between predictors that are associated with the response and those that are not. It is outperformed by our new AUC-based permutation VIM for unbalanced data settings, while the performance of both VIMs is very similar in the case of balanced classes. The new AUC-based VIM is implemented in the R package party for the unbiased RF variant based on conditional inference trees. The code implementing our study is available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html
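The idea of an AUC-based permutation importance, as described in the abstract above, can be sketched in a few lines. This is a simplified illustration and not the implementation in the R package party: a fixed scoring function stands in for the forest, whereas a random forest version would compute the drop in AUC per tree on that tree's out-of-bag observations and average over trees. All names and the simulated data are hypothetical.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one observation of each class")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_permutation_importance(score, X, y, column, n_repeats=20, seed=0):
    """Mean drop in AUC after randomly permuting one predictor column."""
    rng = random.Random(seed)
    baseline = auc(y, [score(row) for row in X])
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        X_perm = [row[:column] + [v] + row[column + 1:] for row, v in zip(X, shuffled)]
        drops.append(baseline - auc(y, [score(row) for row in X_perm]))
    return sum(drops) / n_repeats


# Simulated unbalanced data (assumption): 10 cases and 90 controls, one
# informative predictor (column 0) and one pure-noise predictor (column 1).
data_rng = random.Random(1)
y = [1] * 10 + [0] * 90
X = [[data_rng.gauss(2.0 * label, 1.0), data_rng.gauss(0.0, 1.0)] for label in y]


def score(row):
    """Stand-in for a fitted model: it only uses the informative predictor."""
    return row[0]


importance_signal = auc_permutation_importance(score, X, y, column=0)
importance_noise = auc_permutation_importance(score, X, y, column=1)
```

Because AUC depends only on how cases are ranked relative to controls, it does not reward a model for simply predicting the majority class, which is the intuition behind its expected robustness to class imbalance.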
Giant Molecular Clouds in M33 - I. BIMA All Disk Survey
We present the first interferometric CO(J=1->0) map of the entire H-alpha
disk of M33. The 13" diameter synthesized beam corresponds to a linear
resolution of 50 pc, sufficient to distinguish individual giant molecular
clouds (GMCs). From these data we generated a catalog of 148 GMCs with an
expectation that no more than 15 of the sources are spurious. The catalog is
complete down to GMC masses of 1.5 X 10^5 M_sun and contains a total mass of
2.3 X 10^7 M_sun. Single dish observations of CO in selected fields imply that
our survey detects ~50% of the CO flux, hence that the total molecular mass of
M33 is 4.5 X 10^7 M_sun, approximately 2% of the HI mass. The GMCs in our
catalog are confined largely to the central region (R < 4 kpc). They show a
remarkable spatial and kinematic correlation with overdense HI filaments; the
geometry suggests that the formation of GMCs follows that of the filaments. The
GMCs exhibit a mass spectrum dN/dM ~ M^(-2.6 +/- 0.3), considerably steeper
than that found in the Milky Way and in the LMC. Combined with the total mass,
this steep function implies that the GMCs in M33 form with a characteristic
mass of 7 X 10^4 M_sun. More than 2/3 of the GMCs have associated HII regions,
implying that the GMCs have a short quiescent period. Our results suggest the
rapid assembly of molecular clouds from atomic gas, with prompt onset of
massive star formation. Comment: 19 pages, Accepted for Publication in the Astrophysical Journal
Supplemen
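The mass estimate in the abstract above follows from a one-line correction for the missing flux. As a worked check, using only the figures quoted in the abstract (the symbols $M_{\mathrm{cat}}$, $f_{\mathrm{CO}}$ and $M_{\mathrm{mol}}$ are introduced here for illustration):

```latex
% Catalog mass corrected for the fraction of CO flux recovered by the survey
\[
  M_{\mathrm{mol}} \approx \frac{M_{\mathrm{cat}}}{f_{\mathrm{CO}}}
  = \frac{2.3 \times 10^{7}\, M_\odot}{0.5}
  \approx 4.6 \times 10^{7}\, M_\odot ,
\]
% which agrees with the quoted 4.5 x 10^7 M_sun given that f_CO is only "~50%".
% The quoted ratio of about 2% then implies an HI mass of roughly
\[
  M_{\mathrm{HI}} \approx \frac{4.5 \times 10^{7}\, M_\odot}{0.02}
  \approx 2 \times 10^{9}\, M_\odot .
\]
```

The HI mass is not stated in the abstract; the second line is only what the quoted percentage implies.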
Stepwise classification of cancer samples using clinical and molecular data
Background: Combining clinical and molecular data types may improve the prediction accuracy of a classifier. However, there is currently a shortage of effective and efficient statistical and bioinformatic tools for true integrative data analysis. Existing integrative classifiers have two main disadvantages. First, coarse combination may cause subtle contributions of one data type to be overshadowed by more obvious contributions of the other. Second, the need to measure both data types for all patients may be both impractical and (cost) inefficient. Results: We introduce a novel classification method, a stepwise classifier, which takes advantage of the distinct classification power of clinical data and high-dimensional molecular data. We apply classification algorithms to the two data types independently, starting with the traditional clinical risk factors. We only turn to the relatively expensive molecular data when the uncertainty of the prediction from clinical data exceeds a predefined limit. Experimental results show that our approach is adaptive: the proportion of samples that needs to be re-classified using molecular data depends on how much we expect the predictive accuracy to increase when re-classifying those samples. Conclusions: Our method renders a more cost-efficient classifier that is at least as good as, and sometimes better than, one based on clinical or molecular data alone. Hence our approach is not just a classifier that minimizes a particular loss function. Instead, it aims to be cost-efficient by avoiding molecular tests for a potentially large subgroup of individuals; moreover, for these individuals a test result would be quickly available, which may lead to reduced waiting times (for diagnosis) and hence lower the patients' distress. Stepwise classification is implemented in the R package stepwiseCM and is available at the Bioconductor website.
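The stepwise decision rule described in the abstract above can be sketched as follows. This is a hedged illustration and not the stepwiseCM implementation: the uncertainty measure (closeness of the clinical class probability to 0.5), the default limit of 0.3, and all names are assumptions made for the example.

```python
def stepwise_classify(clinical_prob, molecular_prob, uncertainty_limit=0.3):
    """Classify from clinical data; consult molecular data only when uncertain.

    clinical_prob: estimated probability of class 1 from the clinical classifier.
    molecular_prob: zero-argument callable returning the molecular classifier's
        probability of class 1; it is called only when needed, mimicking an
        assay that is ordered on demand.
    Returns (predicted_class, used_molecular).
    """
    # Assumed uncertainty measure: 0 means certain, 0.5 means a coin flip.
    uncertainty = 0.5 - abs(clinical_prob - 0.5)
    if uncertainty <= uncertainty_limit:
        return int(clinical_prob >= 0.5), False
    return int(molecular_prob() >= 0.5), True


# Hypothetical patients: (clinical probability, molecular probability).
patients = [(0.95, 0.90), (0.55, 0.20), (0.10, 0.70)]
results = [stepwise_classify(c, lambda m=m: m) for c, m in patients]
fraction_tested = sum(used for _, used in results) / len(results)
```

In this toy run only the second patient, whose clinical probability is close to 0.5, would be sent for molecular testing; raising or lowering `uncertainty_limit` trades assay cost against reliance on the clinical classifier.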
Clusters of Extragalactic Ultra Compact HII Regions
We report on the detection of optically thick free-free radio sources in the
galaxies M33, NGC 253, and NGC 6946 using data in the literature. We interpret
these sources as being young, embedded star birth regions, which are likely to
be clusters of ultracompact HII regions. All 35 of the sources presented in
this article have positive radio spectral indices alpha>0 suggesting an
optically thick thermal bremsstrahlung emission arising in the HII region
surrounding hot stars. Energy requirements indicate a range of several to
>500 O7V star equivalents powering each HII region. Assuming a Salpeter IMF,
this corresponds to integrated stellar masses of 0.1--60,000 Msun. For roughly
half of the sources in our sample, there is no obvious optical counterpart,
giving further support for their deeply embedded nature. Their luminosities and
radio spectral energy distributions are consistent with HII regions having
electron densities from 1500 cm^-3 to 15000 cm^-3 and radii of 1 - 7 pc. We
suggest that the less luminous of these sources are extragalactic ultracompact
HII region complexes, those of intermediate luminosity are similar to W49 in
the Galaxy, while the brightest will be counterparts to 30 Doradus. These
objects constitute the lower mass range of extragalactic "ultradense HII
regions" which we argue are the youngest stages of massive star cluster
formation yet observed. This sample is beginning to fill in the continuum of
objects between small associations of ultracompact HII regions and the massive
extragalactic clusters that may evolve into globular clusters. Comment: 37 pages, uses AASTeX; scheduled to appear in ApJ v. 559 October
2001. Full postscript version available from
http://www.astro.wisc.edu/~chip/Papers/Johnson_Kobulnicky_etal_ApJ559.ps.g