305 research outputs found

    CMA – a comprehensive Bioconductor package for supervised classification with high dimensional data

    For the last eight years, microarray-based class prediction has been a major topic in statistics, bioinformatics and biomedicine research. Traditional methods often yield unsatisfactory results or may even be inapplicable in the p > n setting, where the number of predictors by far exceeds the number of observations, hence the term “ill-posed problem”. Careful model selection and evaluation satisfying accepted good-practice standards is a very complex task for inexperienced users with limited statistical background or for statisticians without experience in this area. The multiplicity of available methods for class prediction based on high-dimensional data is an additional practical challenge for inexperienced researchers. In this article, we introduce a new Bioconductor package called CMA (standing for “Classification for MicroArrays”) for automatically performing variable selection, parameter tuning, classifier construction, and unbiased evaluation of the constructed classifiers using a large number of standard methods. With little time and effort, users obtain an overview of the unbiased accuracy of most top-performing classifiers. Furthermore, the standardized evaluation framework underlying CMA can also be beneficial in statistical research for comparison purposes, for instance if a new classifier has to be compared to existing approaches. CMA is a user-friendly, comprehensive package for classifier construction and evaluation implementing most standard approaches. It is freely available from the Bioconductor website at http://bioconductor.org/packages/2.3/bioc/html/CMA.html

    Stability and aggregation of ranked gene lists

    Ranked gene lists are highly unstable in the sense that similar measures of differential gene expression may yield very different rankings, and that a small change in the data set usually affects the obtained gene list considerably. Stability issues have long been under-considered in the literature, but they have become a hot topic in the last few years, perhaps as a consequence of the increasing skepticism about the reproducibility and clinical applicability of molecular research findings. In this article, we review existing approaches for the assessment of stability of ranked gene lists and the related problem of aggregation, give some practical recommendations, and warn against potential misuse of these methods. This overview is illustrated through an application to a recent leukemia data set using the freely available Bioconductor package GeneSelector.

    Evaluating Microarray-based Classifiers: An Overview

    For the last eight years, microarray-based class prediction has been the subject of numerous publications in medicine, bioinformatics and statistics journals. However, in many articles, the assessment of classification accuracy is carried out using suboptimal procedures and receives little attention. In this paper, we carefully review various statistical aspects of classifier evaluation and validation from a practical point of view. The main topics addressed are accuracy measures, error rate estimation procedures, variable selection, choice of classifiers and validation strategy.

    Restoring the full velocity field in the gaseous disk of the spiral galaxy NGC 157

    We analyse the line-of-sight velocity field of ionized gas in the spiral galaxy NGC 157, which has been obtained in the H\alpha emission at the 6m telescope of SAO RAS. The existence of systematic deviations of the observed gas velocities from pure circular motion is shown. A detailed investigation of these deviations is undertaken by applying a Fourier analysis of the azimuthal distributions of the line-of-sight velocities at different distances from the galactic center. As a result of the analysis, all the main parameters of the wave spiral pattern are determined: the corotation radius, the amplitudes and phases of the gas velocity perturbations at different radii, and the velocity of circular rotation of the disk corrected for the velocity perturbations due to spiral arms. At a high confidence level, the presence of two giant anticyclones in the reference frame rotating with the spiral pattern is shown; their sizes and the localization of their centers are consistent with the results of the analytic theory and of numerical simulations. Besides the anticyclones, the existence of cyclones in residual velocity fields of spiral galaxies is predicted. In the reference frame rotating with the spiral pattern, these cyclones have to reveal themselves in galaxies where the radial gradient of the azimuthal residual velocity is steeper than that of the rotation velocity (abridged). Comment: 23 pages including 25 eps-figures. Accepted for publication in A&
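
The Fourier analysis of azimuthal velocity distributions mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not state how the harmonics are fitted, so the least-squares fit below is an assumption, the function name `fit_azimuthal_harmonics` is hypothetical, and the velocities are synthetic values chosen for illustration, not NGC 157 data.

```python
import numpy as np

def fit_azimuthal_harmonics(theta, v_los, max_order=3):
    """Least-squares fit of v(theta) = a0 + sum_m [a_m cos(m theta) + b_m sin(m theta)].

    Returns the coefficients ordered as [a0, a1, b1, a2, b2, ...].
    """
    theta = np.asarray(theta, dtype=float)
    v_los = np.asarray(v_los, dtype=float)
    columns = [np.ones_like(theta)]
    for m in range(1, max_order + 1):
        columns.append(np.cos(m * theta))
        columns.append(np.sin(m * theta))
    design = np.column_stack(columns)
    coeffs, *_ = np.linalg.lstsq(design, v_los, rcond=None)
    return coeffs

# Synthetic ring at one radius (values are made up for illustration):
# a constant offset, a rotation term (m = 1), and a small m = 2 perturbation.
theta = np.linspace(0.0, 2.0 * np.pi, 72, endpoint=False)
v_los = (1000.0 + 150.0 * np.cos(theta)
         + 12.0 * np.cos(2 * theta) - 8.0 * np.sin(2 * theta))

coeffs = fit_azimuthal_harmonics(theta, v_los, max_order=3)
```

Repeating such a fit ring by ring would give harmonic amplitudes and phases as a function of radius, which is the kind of quantity the abstract lists among its results.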

    The Compact Group of Galaxies HCG 31 is in an early phase of merging

    We have obtained high spectral resolution (R = 45900) Fabry-Perot velocity maps of the Hickson Compact Group HCG 31 in order to revisit the important problem of the merger nature of the central object A+C and to derive the internal kinematics of the candidate tidal dwarf galaxies in this group. Our main findings are: (1) double kinematic components are present throughout the main body of A+C, which strongly suggests that this complex is an ongoing merger, (2) regions A2 and E, to the east and south of complex A+C, present rotation patterns with velocity amplitudes of ~25 km s^{-1} and they counterrotate with respect to A+C, (3) region F, which was previously thought to be the best example of a tidal dwarf galaxy in HCG 31, presents no rotation and negligible internal velocity dispersion, as is also the case for region A1. HCG 31 hosts an ongoing merger in its center (A+C) and it is likely that it has suffered additional perturbations due to interactions with the nearby galaxies B, G and Q. Comment: 5 pages + figures - Accepted to ApJ Lette

    An AUC-based Permutation Variable Importance Measure for Random Forests

    The random forest (RF) method is a commonly used tool for classification with high dimensional data as well as for ranking candidate predictors based on the so-called random forest variable importance measures (VIMs). However, the classification performance of RF is known to be suboptimal in the case of strongly unbalanced data, i.e. data where response class sizes differ considerably. Suggestions have been made to obtain better classification performance based either on sampling procedures or on cost sensitivity analyses. However, to our knowledge the performance of the VIMs has not yet been examined in the case of unbalanced response classes. In this paper we explore the performance of the permutation VIM for unbalanced data settings and introduce an alternative permutation VIM based on the area under the curve (AUC) that is expected to be more robust towards class imbalance. We investigated the performance of the standard permutation VIM and of our novel AUC-based permutation VIM for different class imbalance levels using simulated data and real data. The results suggest that, with increasing class imbalance, the standard permutation VIM loses its ability to discriminate between predictors associated with the response and predictors not associated with it. It is outperformed by our new AUC-based permutation VIM for unbalanced data settings, while the performance of both VIMs is very similar in the case of balanced classes. The new AUC-based VIM is implemented in the R package party for the unbiased RF variant based on conditional inference trees. The codes implementing our study are available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html
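
The idea of a permutation importance measured on the AUC scale can be sketched as follows. This is a simplified, model-agnostic illustration and not the implementation in the R package party. It permutes one predictor at a time and records the drop in AUC of an arbitrary scoring function, whereas the abstract describes a measure defined for random forests. The names `auc`, `auc_permutation_importance` and `score_fn`, and the simulated data, are assumptions made for this example.

```python
import numpy as np

def auc(y_true, scores):
    """AUC via the Mann-Whitney U statistic, with average ranks for ties."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    for value in np.unique(scores):
        tied = scores == value
        ranks[tied] = ranks[tied].mean()
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def auc_permutation_importance(score_fn, X, y, n_repeats=20, seed=0):
    """Mean drop in AUC after permuting each column of X in turn."""
    rng = np.random.default_rng(seed)
    baseline = auc(y, score_fn(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - auc(y, score_fn(X_perm)))
        importances[j] = np.mean(drops)
    return importances

# Simulated unbalanced data (about 10% positives); only column 0 carries signal.
rng = np.random.default_rng(1)
y = (rng.random(500) < 0.1).astype(int)
X = rng.normal(size=(500, 3))
X[:, 0] += 2.0 * y

# Hypothetical stand-in for a fitted model: it scores samples by column 0 only.
score_fn = lambda data: data[:, 0]
importances = auc_permutation_importance(score_fn, X, y)
```

Because the AUC depends only on how positives are ranked relative to negatives, it does not reward a model for simply predicting the majority class, which is the intuition behind the robustness to imbalance that the abstract expects.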

    Giant Molecular Clouds in M33 - I. BIMA All Disk Survey

    We present the first interferometric CO(J=1->0) map of the entire H-alpha disk of M33. The 13" diameter synthesized beam corresponds to a linear resolution of 50 pc, sufficient to distinguish individual giant molecular clouds (GMCs). From these data we generated a catalog of 148 GMCs with an expectation that no more than 15 of the sources are spurious. The catalog is complete down to GMC masses of 1.5 X 10^5 M_sun and contains a total mass of 2.3 X 10^7 M_sun. Single dish observations of CO in selected fields imply that our survey detects ~50% of the CO flux, hence that the total molecular mass of M33 is 4.5 X 10^7 M_sun, approximately 2% of the HI mass. The GMCs in our catalog are confined largely to the central region (R < 4 kpc). They show a remarkable spatial and kinematic correlation with overdense HI filaments; the geometry suggests that the formation of GMCs follows that of the filaments. The GMCs exhibit a mass spectrum dN/dM ~ M^(-2.6 +/- 0.3), considerably steeper than that found in the Milky Way and in the LMC. Combined with the total mass, this steep function implies that the GMCs in M33 form with a characteristic mass of 7 X 10^4 M_sun. More than 2/3 of the GMCs have associated HII regions, implying that the GMCs have a short quiescent period. Our results suggest the rapid assembly of molecular clouds from atomic gas, with prompt onset of massive star formation. Comment: 19 pages, Accepted for Publication in the Astrophysical Journal Supplemen
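
The mass budget quoted in this abstract can be checked with a short calculation. The symbols M_cat (catalog mass) and f_CO (recovered fraction of the CO flux) are labels introduced here, and the HI mass in the second line is only what the quoted "approximately 2%" implies, not a figure stated in the abstract.

```latex
M_{\mathrm{mol}} \approx \frac{M_{\mathrm{cat}}}{f_{\mathrm{CO}}}
  = \frac{2.3 \times 10^{7}\, M_\odot}{0.5}
  = 4.6 \times 10^{7}\, M_\odot ,
\qquad
M_{\mathrm{HI}} \approx \frac{4.5 \times 10^{7}\, M_\odot}{0.02}
  \approx 2 \times 10^{9}\, M_\odot .
```

Taking the recovered fraction as exactly 50% gives 4.6 X 10^7 M_sun, which agrees with the quoted 4.5 X 10^7 M_sun given that the fraction is itself approximate.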

    Stepwise classification of cancer samples using clinical and molecular data

    Background: Combining clinical and molecular data types may potentially improve prediction accuracy of a classifier. However, currently there is a shortage of effective and efficient statistical and bioinformatic tools for true integrative data analysis. Existing integrative classifiers have two main disadvantages: First, coarse combination may lead to subtle contributions of one data type being overshadowed by more obvious contributions of the other. Second, the need to measure both data types for all patients may be both impractical and (cost) inefficient. Results: We introduce a novel classification method, a stepwise classifier, which takes advantage of the distinct classification power of clinical data and high-dimensional molecular data. We apply classification algorithms to two data types independently, starting with the traditional clinical risk factors. We only turn to relatively expensive molecular data when the uncertainty of the prediction result from clinical data exceeds a predefined limit. Experimental results show that our approach is adaptive: the proportion of samples that needs to be re-classified using molecular data depends on how much we expect the predictive accuracy to increase when re-classifying those samples. Conclusions: Our method renders a more cost-efficient classifier that is at least as good as, and sometimes better than, one based on clinical or molecular data alone. Hence our approach is not just a classifier that minimizes a particular loss function. Instead, it aims to be cost-efficient by avoiding molecular tests for a potentially large subgroup of individuals; moreover, for these individuals a test result would be quickly available, which may lead to reduced waiting times (for diagnosis) and hence lower the patients' distress. Stepwise classification is implemented in the R package stepwiseCM and is available at the Bioconductor website.
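
The stepwise decision rule described in this abstract can be sketched in a few lines. This is not the stepwiseCM implementation: the abstract does not specify the uncertainty measure, so the one below (distance of the clinical class probability from 0.5) is an assumption, as are the function name `stepwise_predict`, the `max_uncertainty` parameter, and the example probabilities.

```python
import numpy as np

def stepwise_predict(p_clinical, p_molecular, max_uncertainty=0.3):
    """Use the clinical prediction unless it is too uncertain.

    p_clinical, p_molecular: predicted probabilities of class 1 per sample.
    Uncertainty is scaled so that p = 0.5 gives 1 and p in {0, 1} gives 0.
    Returns the predicted labels and a mask of samples sent to the
    molecular classifier.
    """
    p_clinical = np.asarray(p_clinical, dtype=float)
    p_molecular = np.asarray(p_molecular, dtype=float)
    uncertainty = 1.0 - 2.0 * np.abs(p_clinical - 0.5)
    needs_molecular = uncertainty > max_uncertainty
    p_final = np.where(needs_molecular, p_molecular, p_clinical)
    labels = (p_final >= 0.5).astype(int)
    return labels, needs_molecular

# Made-up probabilities for four samples.
p_clinical = [0.95, 0.55, 0.10, 0.40]
p_molecular = [0.20, 0.90, 0.80, 0.05]
labels, needs_molecular = stepwise_predict(p_clinical, p_molecular)
```

For simplicity the sketch takes molecular probabilities for every sample. In the setting the abstract describes, molecular data would only need to be measured for the samples flagged in `needs_molecular`, which is where the cost saving comes from.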

    Clusters of Extragalactic Ultra Compact HII Regions

    We report on the detection of optically thick free-free radio sources in the galaxies M33, NGC 253, and NGC 6946 using data in the literature. We interpret these sources as being young, embedded star birth regions, which are likely to be clusters of ultracompact HII regions. All 35 of the sources presented in this article have positive radio spectral indices alpha>0, suggesting optically thick thermal bremsstrahlung emission arising in the HII region surrounding hot stars. Energy requirements indicate a range of several to >500 O7V star equivalents powering each HII region. Assuming a Salpeter IMF, this corresponds to integrated stellar masses of 0.1--60,000 Msun. For roughly half of the sources in our sample, there is no obvious optical counterpart, giving further support for their deeply embedded nature. Their luminosities and radio spectral energy distributions are consistent with HII regions having electron densities from 1500 cm^-3 to 15000 cm^-3 and radii of 1 - 7 pc. We suggest that the less luminous of these sources are extragalactic ultracompact HII region complexes, those of intermediate luminosity are similar to W49 in the Galaxy, while the brightest will be counterparts to 30 Doradus. These objects constitute the lower mass range of extragalactic "ultradense HII regions", which we argue are the youngest stages of massive star cluster formation yet observed. This sample is beginning to fill in the continuum of objects between small associations of ultracompact HII regions and the massive extragalactic clusters that may evolve into globular clusters. Comment: 37 pages, uses AASTeX; scheduled to appear in ApJ v. 559 October 2001. Full postscript version available from http://www.astro.wisc.edu/~chip/Papers/Johnson_Kobulnicky_etal_ApJ559.ps.g