
    Variable Selection and Parameter Tuning in High-Dimensional Prediction

    In the context of classification using high-dimensional data such as microarray gene expression data, it is often useful to perform preliminary variable selection. For example, the k-nearest-neighbors classification procedure yields a much higher accuracy when applied to variables with high discriminatory power. Typical (univariate) variable selection methods for binary classification are, e.g., the two-sample t-statistic or the Mann-Whitney test. In small sample settings, the classification error rate is often estimated using cross-validation (CV) or related approaches. The variable selection procedure then has to be applied anew for each considered training set, i.e. successively for each CV iteration. Performing variable selection based on the whole sample before the CV procedure would yield a downwardly biased error rate estimate. CV may also be used to tune parameters involved in a classification method. For instance, the penalty parameter in penalized regression or the cost in support vector machines are most often selected using CV. This type of CV is usually denoted as "internal CV" in contrast to the "external CV" performed to estimate the error rate, while the term "nested CV" refers to the whole procedure embedding two CV loops. While variable selection and parameter tuning have been widely investigated in the context of high-dimensional classification, it is still unclear how they should be combined if a classification method involves both variable selection and parameter tuning. For example, the k-nearest-neighbors method usually requires variable selection and involves a tuning parameter: the number k of neighbors. It is well known that variable selection should be repeated for each external CV iteration. But should we also repeat variable selection for each internal CV iteration, or rather perform tuning based on a fixed subset of variables? While the first variant seems more natural, it implies a huge computational expense and its benefit in terms of error rate remains unknown. In this paper, we assess both variants quantitatively using real microarray data sets. We focus on two representative examples: k-nearest-neighbors (with k as tuning parameter) and Partial Least Squares dimension reduction followed by linear discriminant analysis (with the number of components as tuning parameter). We conclude that the more natural but computationally expensive variant with repeated variable selection does not necessarily lead to better accuracy and point out the potential pitfalls of both variants.
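
    As a rough illustration of the two variants discussed above, the sketch below contrasts them on synthetic data standing in for a microarray matrix. It is a minimal sketch, not the paper's code: scikit-learn, the fold counts, the number of selected variables (100) and the candidate values of k are all illustrative assumptions.

        # Minimal sketch, assuming scikit-learn; synthetic data stand in for microarray data.
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.feature_selection import SelectKBest, f_classif
        from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.pipeline import Pipeline

        X, y = make_classification(n_samples=60, n_features=2000, n_informative=20,
                                   random_state=0)
        outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # external CV
        inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)   # internal CV

        # Variant 1: variable selection sits inside the pipeline, so it is repeated
        # in every internal CV split (the "natural" but computationally expensive variant).
        pipe = Pipeline([("select", SelectKBest(f_classif, k=100)),
                         ("knn", KNeighborsClassifier())])
        tuned_v1 = GridSearchCV(pipe, {"knn__n_neighbors": [1, 3, 5, 7]}, cv=inner)
        acc_v1 = cross_val_score(tuned_v1, X, y, cv=outer).mean()

        # Variant 2: variables are selected once per external training set, and k is
        # then tuned by internal CV on that fixed subset.
        scores_v2 = []
        for train, test in outer.split(X, y):
            sel = SelectKBest(f_classif, k=100).fit(X[train], y[train])
            Xtr, Xte = sel.transform(X[train]), sel.transform(X[test])
            tuned_v2 = GridSearchCV(KNeighborsClassifier(),
                                    {"n_neighbors": [1, 3, 5, 7]}, cv=inner).fit(Xtr, y[train])
            scores_v2.append(tuned_v2.score(Xte, y[test]))
        acc_v2 = float(np.mean(scores_v2))

        print(f"variant 1 (repeated selection): {acc_v1:.3f}")
        print(f"variant 2 (fixed subset):       {acc_v2:.3f}")

    In both variants the variable selection is repeated for each external CV iteration, so the error rate estimate stays unbiased in that respect; the variants differ only in how the internal tuning loop treats the variable subset.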

    Correcting the optimally selected resampling-based error rate: A smooth analytical alternative to nested cross-validation

    High-dimensional binary classification tasks, e.g. the classification of microarray samples into normal and cancer tissues, usually involve a tuning parameter adjusting the complexity of the applied method to the examined data set. By reporting the performance of the best tuning parameter value only, over-optimistic prediction errors are published. The contribution of this paper is two-fold. Firstly, we develop a new method for tuning bias correction which can be motivated by decision theoretic considerations. The method is based on the decomposition of the unconditional error rate involving the tuning procedure. Our corrected error estimator can be written as a weighted mean of the errors obtained using the different tuning parameter values. It can be interpreted as a smooth version of nested cross-validation (NCV), which is the standard approach for avoiding tuning bias. In contrast to NCV, the weighting scheme of our method guarantees intuitive bounds for the corrected error. Secondly, we suggest using bias correction methods also to address the bias resulting from the optimal choice of the classification method among several competitors. This method selection bias is particularly relevant to prediction problems in high-dimensional data. In the absence of standards, it is common practice to try several methods successively, which can lead to an optimistic bias similar to the tuning bias. We demonstrate the performance of our method for addressing both types of bias on microarray data sets and compare it to existing methods. This study confirms that our approach yields estimates competitive with NCV at a much lower computational cost.
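
    In symbols, the corrected estimator described above is a convex combination of the tuning-parameter-specific error estimates. The notation below is ours, a hedged reading of the abstract rather than the paper's own formulas; the decision-theoretic definition of the weights w_k is what the paper actually derives.

        \hat{e}_{corr} = \sum_{k=1}^{K} w_k \, \hat{e}(\lambda_k),
        \qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1

    Because the weights are non-negative and sum to one, the corrected error automatically lies between \min_k \hat{e}(\lambda_k) and \max_k \hat{e}(\lambda_k), which is the "intuitive bounds" property that plain NCV does not guarantee.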

    The Impact of Music Therapy on Language Acquisition in Children with Nonverbal Autism

    Through an experimental method, the researcher investigated whether children with autism spectrum disorder (ASD) are more likely to develop verbal communication skills after consistent exposure to songs with lyrics. Six children with nonverbal ASD were exposed to the same song with lyrics, with the goal of increased vocalization and language acquisition. Subjects were pulled to participate in the experiment over nine sessions. The researcher played the song for the participants, recording the responses from each trial and categorizing them as full words, verbal approximations, or miscellaneous verbalizations. The findings of the study suggest that there is a relationship between music and language acquisition in children with nonverbal autism spectrum disorder.

    Fluid Inclusion Analyses of Bedded Halite from the Neoproterozoic Browne Formation of Central Australia to Determine Parent Water Origin and pH

    The depositional origin of the Neoproterozoic Browne Formation of Australia is unclear. The Browne Formation consists of red siliciclastic mudstone and siltstone (red beds), bedded and displacive halite and gypsum, and dolomitized stromatolites. Environmental interpretations of the Browne Formation range from a marginal marine lagoon to a playa lake deposit (Grey and Blake, 1999; Hill et al., 2000; Haines et al., 2004; Spear, 2013). Primary fluid inclusions in Browne Formation bedded halite are the oldest known surface water remnants (Spear et al., 2014). Prior major-ion ratio analysis of these inclusions indicates that the Browne Formation formed from waters with unusually low sulfate concentrations. These waters differed significantly from modern seawater (Spear et al., 2014). This study tested the non-marine interpretation of the Browne Formation via geochemical analyses. A non-marine interpretation is supported by the sedimentological characteristics it shares with geochemically distinct evaporites and red beds from acid-saline continental settings (pH < 1; Benison and Goldstein, 2002). This thesis is the first examination of primary fluid inclusions from bedded halite of the Neoproterozoic Browne Formation through detailed fluid inclusion petrography, freezing-melting microthermometry, and laser Raman spectroscopy. This thesis found that the Browne Formation inclusions are geochemically distinct. Freezing-melting attempts to determine major-ion composition and salinity were unsuccessful; this rare outcome indicates high inclusion salinity and possibly low pH. Efforts to determine pH through laser Raman spectroscopy were unsuccessful and do not indicate low pH (< 1). Furthermore, no spectra of solutes or solids indicative of pH were detected. However, Raman analysis did detect anhydrite, disordered graphite, iron oxides, and several unidentified solids. Inclusion solids include exceptionally well-preserved suspect microbial life, including prokaryotes, spherules similar to Dunaliella algae, and possible algal mats. This study found no conclusive evidence of marine or non-marine Browne Formation parent waters. There is no diagnostic evidence of acid parent waters in the Browne Formation. Due to its unique freezing-melting and laser Raman characteristics, the Browne Formation may be a geologically unique evaporite deposit.

    Energy in the Corn Belt: Is Maize Production Sustainable?

    Technological and scientific innovation has transformed agricultural production. Corn production methods changed from a sustainable, nutrient-recycling production system to one reliant on imported fossil energy inputs. Located in the Western Corn Belt, Union County, South Dakota was chosen as the study area. Changes in production methods are represented by four technological epochs: 1) The Draft Horse Epoch, 1890-1920; 2) The Tractor Epoch, 1920-1950; 3) The Fertilizer Epoch, 1950-1980; and 4) The Biotechnology and Precision Agriculture Epoch, 1980-2010. The energy budget method was used to measure the energy sustainability of corn production. The findings show that the volume of corn grain yield credited to fossil fuel and inorganic fertilizer energy inputs represents the magnitude of the corn crop that is neither sustainable nor renewable.
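
    As a purely illustrative sketch of the kind of energy-budget arithmetic the study's method rests on, the snippet below computes an output-to-input energy ratio and the share of output energy matched by fossil-derived inputs. All input categories and numbers are hypothetical placeholders, not values from the study.

        # Illustrative energy-budget sketch; every number below is a made-up placeholder.
        inputs_mj_per_ha = {
            "diesel": 4000.0,                # fossil fuel used in field operations
            "nitrogen_fertilizer": 9000.0,   # energy embodied in inorganic fertilizer
            "seed_and_chemicals": 2000.0,
            "other": 1500.0,
        }
        corn_yield_kg_per_ha = 10000.0
        grain_energy_mj_per_kg = 15.0        # assumed gross energy content of corn grain

        output_mj = corn_yield_kg_per_ha * grain_energy_mj_per_kg
        input_mj = sum(inputs_mj_per_ha.values())
        print(f"energy ratio (output / fossil-derived input): {output_mj / input_mj:.2f}")
        print(f"share of grain energy matched by fossil inputs: {input_mj / output_mj:.1%}")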

    Alphabetizing the Nation: Medieval British Origins in Thomas Elyot's Dictionary

    Reading Thomas Elyot's Dictionary, this essay examines the legacy of medieval chronicle and fable for the early modern period. Elyot's influential work, here considered in its 1542 edition as Bibliotheca Eliotae, contains entries for both “Albion” and “Britannia,” topics which plunged the work straight into the problematic inheritance of Galfridian history, recently discredited at Henry VIII's court by the Italian humanist Polydore Vergil. Elyot presents, only to dismiss, medieval legendary origins for Albion and Britain, using what he calls similitudo to find alternative explanations. His dictionary thereby transforms misleading medieval fables into something more “fitting” for England in the early days of the Reformation. Yet similitude remains problematic for Elyot; replacing the medieval Brutus legend with a story that privileges the humanist reconstruction of the illegible fragments of the past, Elyot does not avoid uncomfortable reminiscences of the senseless destruction of past cultural objects.

    Your Date

    It doesn't matter who he is or what you're doing; he's your date, and he's thinking about YOU! Here are the opinions of some campus leaders on that ever-present topic of dating. The ideas of your date tonight may parallel these, so here's a chance to gain some insight.

    The geology and geochemistry of the Lumwana basement-hosted copper-cobalt (uranium) deposits, NW Zambia

    The Lumwana Cu±Co deposits, Malundwe and Chimiwungo, are examples of pre-Katangan mineralized basement located in the Domes Region of the Lufilian Arc, an arcuate, north-verging Neoproterozoic fold belt which hosts the Zambian and Congolese deposits that make up the Central African Copperbelt. The Lumwana deposits are situated within the Mwombezhi Dome, a Mesoproterozoic basement inlier consisting of highly sheared amphibolite-grade schist to gneiss units that host the Cu±Co mineralization. Kinematic indicators such as S-C fabrics and pressure shadows on porphyroblasts suggest a top-to-the-north shear sense. Peak metamorphism of 750 ± 25°C and 13 ± 1 kbar, indicated by whiteschist assemblages, occurred during the Lufilian Orogeny at ~530 Ma, with burial depths of ~50 km. A major décollement separates the high-pressure mineral assemblages of the basement from the lower-pressure mineral assemblages of the overlying Katangan Supergroup. The age range and lithologies of the pre-Katangan basement of the Domes Region are similar to those of the pre-Katangan basement of the Kafue Anticline, which underlies the Neoproterozoic Zambian Copperbelt deposits situated 220 km to the SW. The origin of the protolith to the mineralization is ambiguous at Lumwana, with transitional contacts from unmineralized quartz-feldspar±phlogopite basement gneiss to Cu±Co mineralized quartz-phlogopite-muscovite-kyanite-sulphide Ore Schist. The transitional contacts and structural controls on mineralization have led to the hypothesis that these deposits represent metasomatically altered, mineralized and sheared basement, rather than mineralized Neoproterozoic sediments subjected to amphibolite-grade metamorphism. This hypothesis is supported by petrographic analysis, stable isotope analysis (δ34S), whole-rock geochemistry, and electron microprobe analysis of ore and host rock assemblages. The transitional contacts observed at Lumwana are due to an alteration event associated with mineralization that removed feldspar from ore horizons, resulting in depleted Na and Ca and relatively higher Al components. Sulphides are deformed by the S1 fabric and overprinted by kyanite, which formed at peak metamorphism. This indicates that copper was introduced to the basement either syn- or pre-peak metamorphism. Post-S1 metamorphism with associated quartz-muscovite alteration has remobilized sulphides into low-strain zones and pressure shadows around porphyroblasts. δ34S values of sulphides range from +2.3 to +18.5‰, falling within the range of -17 to +23‰ observed in the Copperbelt. The mechanism of ore formation at Lumwana was dominated by thermochemical sulphate reduction (TSR), indicated by the relatively heavy δ34S values and the absence of the light bacteriogenic δ34S values observed in the Copperbelt. Electron microprobe data for muscovite, phlogopite and chlorite show little variation between early and late mineral phases, indicating that metamorphic homogenization of silicate assemblages occurred. The Lumwana deposits are large mineralized shear zones within the pre-Katangan basement. Various styles of basement mineralization are also observed in the Kafue Anticline, and the structural controls on mineralization and lithological similarities to the Lumwana deposits suggest that pre-Katangan basement is a viable source for the Cu-Co budget of the Central African Copperbelt and that basement structures had a key role in its formation.

    Wrapper algorithms and their performance assessment on high-dimensional molecular data

    Prediction problems on high-dimensional molecular data, e.g. the classification of microarray samples into normal and cancer tissues, are complex and ill-posed since the number of variables usually exceeds the number of observations by orders of magnitude. Recent research in the area has propagated a variety of new statistical models in order to handle these new biological datasets. In practice, however, these models are always applied in combination with preprocessing and variable selection methods as well as model selection, which is mostly performed by cross-validation. Varma and Simon (2006) have used the term ‘wrapper algorithm’ for this integration of preprocessing and model selection into the construction of statistical models. Additionally, they have proposed the method of nested cross-validation (NCV) as a way of estimating the prediction error of such algorithms, which has by now evolved into the gold standard. In the first part, this thesis provides further theoretical and empirical justification for the usage of NCV in the context of wrapper algorithms. Moreover, a computationally less intensive alternative to NCV is proposed which can be motivated in a decision-theoretic framework. The new method can be interpreted as a smoothed variant of NCV and, in contrast to NCV, guarantees intuitive bounds for the estimation of the prediction error. The second part focuses on the ranking of wrapper algorithms. Cross-study validation is proposed as an alternative concept to the repetition of separate within-study validations if several similar prediction problems are available. The concept is demonstrated using six different wrapper algorithms for survival prediction on censored data on a selection of eight breast cancer datasets. Additionally, a parametric bootstrap approach for simulating realistic data from such related prediction problems is described and subsequently applied to illustrate the concept of cross-study validation for the ranking of wrapper algorithms. Eventually, the last part approaches computational aspects of the analyses and simulations performed in the thesis. The preprocessing before the analysis as well as the evaluation of the prediction models requires the usage of large computing resources. Parallel computing approaches are illustrated on cluster, cloud and high-performance computing resources using the R programming language. Usage of heterogeneous hardware and processing of large datasets are covered, as well as the implementation of the R package survHD for the analysis and evaluation of high-dimensional wrapper algorithms for survival prediction from censored data.
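
    A minimal sketch of the leave-one-study-out idea behind the cross-study validation described above: each learner is fitted on one study and evaluated on every other study, and the pairwise scores are averaged to rank the learners. Synthetic classification data and scikit-learn learners stand in for the thesis' breast cancer studies and survival models; this is an assumption-laden illustration, not the survHD implementation.

        # Illustrative cross-study-validation sketch; studies and learners are placeholders.
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.linear_model import LogisticRegression

        # Four synthetic "studies" (real related studies would share biological signal;
        # these independent simulations merely stand in for them).
        studies = [make_classification(n_samples=80, n_features=500, n_informative=15,
                                       random_state=s) for s in range(4)]

        learners = {
            "ridge_logistic": lambda: LogisticRegression(penalty="l2", C=0.1, max_iter=2000),
            "random_forest": lambda: RandomForestClassifier(n_estimators=200, random_state=0),
        }

        # Leave-one-study-out: train on study i, validate on every study j != i.
        for name, make_model in learners.items():
            scores = []
            for i, (X_train, y_train) in enumerate(studies):
                model = make_model().fit(X_train, y_train)
                for j, (X_test, y_test) in enumerate(studies):
                    if j != i:
                        scores.append(model.score(X_test, y_test))
            print(f"{name}: mean cross-study accuracy = {np.mean(scores):.3f}")

    Ranking the learners by this averaged cross-study performance is the alternative to repeating separate within-study cross-validations on each data set.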