269 research outputs found

    Evaluating Microarray-based Classifiers: An Overview

    For the last eight years, microarray-based class prediction has been the subject of numerous publications in medicine, bioinformatics and statistics journals. However, in many articles, the assessment of classification accuracy is carried out using suboptimal procedures and receives little attention. In this paper, we carefully review various statistical aspects of classifier evaluation and validation from a practical point of view. The main topics addressed are accuracy measures, error rate estimation procedures, variable selection, choice of classifiers and validation strategy.

    An AUC-based Permutation Variable Importance Measure for Random Forests

    The random forest (RF) method is a commonly used tool for classification with high-dimensional data as well as for ranking candidate predictors based on the so-called random forest variable importance measures (VIMs). However, the classification performance of RF is known to be suboptimal in the case of strongly unbalanced data, i.e. data where response class sizes differ considerably. Suggestions were made to obtain better classification performance based either on sampling procedures or on cost sensitivity analyses. However, to our knowledge, the performance of the VIMs has not yet been examined in the case of unbalanced response classes. In this paper we explore the performance of the permutation VIM for unbalanced data settings and introduce an alternative permutation VIM based on the area under the curve (AUC) that is expected to be more robust towards class imbalance. We investigated the performance of the standard permutation VIM and of our novel AUC-based permutation VIM for different class imbalance levels using simulated data and real data. The results suggest that the standard permutation VIM loses its ability to discriminate between associated predictors and predictors not associated with the response for increasing class imbalance. It is outperformed by our new AUC-based permutation VIM for unbalanced data settings, while the performance of both VIMs is very similar in the case of balanced classes. The new AUC-based VIM is implemented in the R package party for the unbiased RF variant based on conditional inference trees. The code implementing our study is available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html
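
    The idea of an AUC-based permutation importance can be illustrated with a minimal, model-agnostic sketch. This is not the implementation in the R package party, which works per tree on out-of-bag observations. Here a hypothetical scoring function `score_fn` stands in for the fitted forest, and the function names, toy data and parameters are assumptions made for illustration only.

```python
import random

def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney statistic (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one observation per class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_permutation_importance(score_fn, X, y, feature, n_repeats=20, seed=0):
    """Mean drop in AUC after randomly shuffling one feature column.

    score_fn is a hypothetical stand-in for a fitted classifier: it maps
    one row (a list of feature values) to a score for class 1.
    """
    rng = random.Random(seed)
    baseline = auc(y, [score_fn(row) for row in X])
    drops = []
    for _ in range(n_repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)  # break the link between this feature and y
        X_perm = [row[:feature] + [v] + row[feature + 1:]
                  for row, v in zip(X, column)]
        drops.append(baseline - auc(y, [score_fn(r) for r in X_perm]))
    return sum(drops) / n_repeats

# Toy unbalanced data (assumed): 90 negatives, 10 positives.
# Feature 0 separates the classes, feature 1 is pure noise.
_rng = random.Random(1)
y = [0] * 90 + [1] * 10
X = [[10 * label + _rng.random(), _rng.random()] for label in y]

def score_fn(row):
    # Hypothetical "model" that only looks at feature 0.
    return row[0]

importance_informative = auc_permutation_importance(score_fn, X, y, feature=0)
importance_noise = auc_permutation_importance(score_fn, X, y, feature=1)
```

    Because the AUC compares every positive with every negative observation, it does not depend on the class proportions in the way the plain error rate does, which is the motivation the abstract gives for the AUC-based variant.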

    The Normal Fetal Heart Rate Study: Analysis Plan

    Recording of fetal heart rate via CTG monitoring has been routinely performed as an important part of antenatal and subpartum care for several decades. The current guidelines of the FIGO (ref 1) recommend a normal range of the fetal heart rate from 110 to 150 bpm. However, there is no agreement in the medical community on whether this is the correct range (ref 2). We aim to address this question by computerized analysis (ref 3) of a high quality database (HQDb, ref 4) of about one billion electronically registered fetal heart rate measurements from about 10,000 pregnancies in three medical centres over seven years. In the present paper, we lay out a detailed analysis plan for this evidence-based project in the vein of the validation policy of the Sylvia Lawry Centre for Multiple Sclerosis Research (ref 5), with a split of the database into an exploratory part and a part reserved for validation. We will perform the analysis and the validation after publication of this plan in order to reduce the probability of publishing false positive research findings (ref 6-7).
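
    A split into an exploratory and a validation part can be made reproducible by assigning each pregnancy deterministically. The sketch below is only one possible way to do this and is not taken from the analysis plan: the function name, the salt, the identifier format and the 50% validation fraction are all assumptions.

```python
import hashlib

def assign_part(pregnancy_id, validation_fraction=0.5, salt="hypothetical-salt"):
    """Deterministically assign one pregnancy to 'exploratory' or 'validation'.

    Hashing the identifier keeps every measurement of a pregnancy in the
    same part and gives the same answer on every run.
    """
    digest = hashlib.sha256(f"{salt}:{pregnancy_id}".encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2 ** 64  # pseudo-uniform in [0, 1)
    return "validation" if u < validation_fraction else "exploratory"

# Hypothetical identifiers, for illustration only.
parts = [assign_part(f"P{i:05d}") for i in range(1000)]
n_validation = parts.count("validation")
```

    Splitting at the level of pregnancies rather than single measurements avoids having data from the same pregnancy in both parts.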

    Extended ionized gas emission and kinematics of the compact group galaxies in HCG 16: Signatures of mergers

    We report on kinematic observations of the Hα emission line from four late-type galaxies of Hickson Compact Group 16 (H16a, b, c and d) obtained with a scanning Fabry-Perot interferometer and samplings of 16 km/s and 1". The velocity fields show kinematic peculiarities for three of the four galaxies: H16b, c and d. Misalignments between the kinematic and photometric axes of the gas and stellar components (H16b, c, d), double gas systems (H16c) and severe warping of the kinematic major axis (H16b and c) were some of the peculiarities detected. We conclude that major merger events have taken place in at least two of the group's galaxies, H16c and d, based on their significant kinematic peculiarities, their double nuclei and high infrared luminosities. Their Hα gas content is strongly spatially concentrated: H16d contains a peculiar bar-like structure confined to the inner ~1 h^-1 kpc region. These observations are in agreement with predictions of simulations, namely that the gas flows towards the galaxy nucleus during mergers, forms bars and fuels the central activity. Galaxy H16b, an Sb galaxy, also presents some kinematic evidence for past accretion events. Its gas content, however, is very sparse, limiting our ability to find other kinematic merging indicators, if they are present. We find that isolated mergers, i.e., they show an amorphous morphology and no signs of tidal tails. Tidal arms and tails formed during the mergers may have been stripped by the group potential (Barnes & Hernquist 1992) or alternatively they may never have been formed. Our observations suggest that HCG 16 may be a young compact group in formation through the merging of close-by objects in a dense environment. Comment: Accepted for publication in ApJ. 35 pages, 13 figures.

    Testing the additional predictive value of high-dimensional molecular data

    While high-dimensional molecular data such as microarray gene expression data have been used for disease outcome prediction or diagnosis purposes for about ten years in biomedical research, the question of the additional predictive value of such data given that classical predictors are already available has long been under-considered in the bioinformatics literature. We suggest an intuitive permutation-based testing procedure for assessing the additional predictive value of high-dimensional molecular data. Our method combines two well-known statistical tools: logistic regression and boosting regression. We give clear advice for the choice of the only method parameter (the number of boosting iterations). In simulations, our novel approach is found to have very good power in different settings, e.g. few strong predictors or many weak predictors. For illustrative purposes, it is applied to two publicly available cancer data sets. Our simple and computationally efficient approach can be used to globally assess the additional predictive power of a large number of candidate predictors given that a few clinical covariates or a known prognostic index are already available.

    The kinematics of the warm gas in Hickson compact group of galaxies HCG 90

    We present kinematic observations of Hα emission for two early-type galaxies and one disk system, members of the Hickson compact group 90 (HCG 90), obtained with a scanning Fabry-Perot interferometer and samplings of 16 km s^-1 and 1". Mapping of the gas kinematics was possible to ~2 r_eff for the disk galaxy N7174 and to ~1.3 r_eff and ~1.7 r_eff for the early-type galaxies N7176 and N7173 respectively. Evidence for ongoing interaction was found in the properties of the warm gas of the three galaxies, some of which do not have stellar counterparts. We suggest the following evolutionary scenario for the system. H90d is the warm gas reservoir of the group, in the process of fueling H90b with gas. H90c and d have experienced past interaction with gas exchange. The gas acquired by H90c has already settled and relaxed, but the effects of the interaction can still be seen in the morphology of the two galaxies and their stellar kinematics. This process will possibly result in a major merger. Comment: 32 pages, 10 figures. Accepted for publication in Astronomical Journal.

    Combined direct-sun ultraviolet and infrared spectroscopies at Popocatépetl volcano (Mexico)

    Volcanic plume composition is strongly influenced by both changes in magmatic systems and plume-atmosphere interactions. Understanding the degassing mechanisms controlling the type of volcanic activity implies deciphering the contributions of magmatic gases reaching the surface and their subsequent chemical transformations in contact with the atmosphere. Remote sensing techniques based on direct solar absorption spectroscopy provide valuable information about most of the emitted magmatic gases but also on gas species formed and converted within the plumes. In this study, we explore the procedures, performances and benefits of combining two direct solar absorption techniques, high resolution Fourier Transform Infrared Spectroscopy (FTIR) and Ultraviolet Differential Optical Absorption Spectroscopy (UV-DOAS), to observe the composition changes in Popocatépetl's plume with high temporal resolution. The SO2 vertical columns obtained from three instruments (DOAS, high resolution FTIR and Pandora) were found to be similar (median difference <12%) after their intercalibration. We combined them to determine with high temporal resolution the different hydrogen halide and halogen species to sulfur ratios (HF/SO2, BrO/SO2, HCl/SO2, SiF4/SO2, detection limit of HBr/SO2) and HCl/BrO in Popocatépetl's plume over a 2.5-year period (2017 to mid-2019). BrO/SO2, BrO/HCl, and HCl/SO2 ratios were found in the range of (0.63 ± 0.06 to 1.14 ± 0.20) × 10^-4, (2.6 ± 0.5 to 6.9 ± 2.6) × 10^-4, and 0.08 ± 0.01 to 0.21 ± 0.01 respectively, while the SiF4/SO2 and HF/SO2 ratios were found to be fairly constant at (1.56 ± 0.25) × 10^-3 and 0.049 ± 0.001. We especially focused on the full growth/destruction cycle of the most voluminous lava dome of the period, which took place between February and April 2019. A decrease of the HCl/SO2 ratio was observed with the decrease of the extrusive activity. Furthermore, the short-term variability of BrO/SO2 is measured for the first time at Popocatépetl volcano together with HCl/SO2, revealing different behaviors with respect to the volcanic activity. More generally, providing such temporally resolved and near-real-time time series of both primary and secondary volcanic gaseous species is critical for the management of volcanic emergencies, as well as for the understanding of the volcanic degassing processes and their impact on the atmospheric chemistry.

    Extended HI Rotation Curve and Mass Distribution of M31

    New HI observations of Messier 31 (M31) obtained with the Effelsberg and Green Bank 100-m telescopes make it possible to measure the rotation curve of that galaxy out to ~35 kpc. Between 20 and 35 kpc, the rotation curve is nearly flat at a velocity of ~226 km/s. A model of the mass distribution shows that at the last observed velocity point, the minimum dark-to-luminous mass ratio is ~0.5 for a total mass of 3.4 × 10^11 Msol at R < 35 kpc. This can be compared to the estimated MW mass of 4.9 × 10^11 Msol for R < 50 kpc. Comment: 4 pages, 2 figures, accepted for publication in ApJ Letters.
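
    As an order-of-magnitude check on the quoted mass, the flat rotation velocity can be converted into an enclosed mass with the spherical approximation M(<R) = v^2 R / G. This is a sketch under that assumption only: the abstract's 3.4 × 10^11 Msol comes from the authors' mass model, so the two numbers need not agree exactly. The function and constant names are illustrative.

```python
# Gravitational constant in units of kpc * (km/s)^2 / Msol.
G_KPC_KMS2_PER_MSOL = 4.30091e-6

def enclosed_mass_spherical(v_kms, r_kpc):
    """Mass inside radius r_kpc for circular speed v_kms, assuming spherical symmetry."""
    return v_kms ** 2 * r_kpc / G_KPC_KMS2_PER_MSOL

# Values from the abstract: ~226 km/s at the last measured point, ~35 kpc.
m31_mass = enclosed_mass_spherical(226.0, 35.0)
print(f"M(<35 kpc) ~ {m31_mass:.2e} Msol")  # roughly 4.2e11 Msol
```

    The spherical estimate is of the same order as the modelled value, which is all this simple formula can be expected to show.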

    Superbubble evolution including the star-forming clouds: Is it possible to reconcile LMC observations with model predictions?

    Here we present a possible solution to the apparent discrepancy between the observed properties of LMC bubbles and the standard, constant density bubble model. A two-dimensional model of a wind-driven bubble expanding from a flattened giant molecular cloud is examined. We conclude that the expansion velocities derived from spherically symmetric models are not always applicable to elongated young bubbles seen almost face-on due to the LMC orientation. In addition, an observational test to differentiate between spherical and elongated bubbles seen face-on is discussed. Comment: 25 pages, 7 figures, accepted to ApJ (September, 1999 issue)

    Bias in random forest variable importance measures: Illustrations, sources and a solution

    BACKGROUND: Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories. RESULTS: Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on the one hand, and effects induced by bootstrap sampling with replacement on the other hand. CONCLUSION: We propose to employ an alternative implementation of random forests that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. Therefore the suggested method can be applied straightforwardly by scientists in bioinformatics research.