    Stratification bias in low signal microarray studies

    BACKGROUND: When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases. We analyse a form of bias, stratification bias, that can substantially affect analyses using sample-reuse validation techniques and lead to inaccurate results. This bias is due to imperfect stratification of samples in the training and test sets and the dependency between these stratification errors, i.e. the variations in class proportions in the training and test sets are negatively correlated. RESULTS: We show that when estimating the performance of classifiers on low signal datasets (i.e. those which are difficult to classify), which are typical of many prognostic microarray studies, commonly used performance measures can suffer from a substantial negative bias. For error rate this bias is only severe in quite restricted situations, but can be much larger and more frequent when using ranking measures such as the receiver operating characteristic (ROC) curve and area under the ROC (AUC). Substantial biases are shown in simulations and on the van 't Veer breast cancer dataset. The classification error rate can have large negative biases for balanced datasets, whereas the AUC shows substantial pessimistic biases even for imbalanced datasets. In simulation studies using 10-fold cross-validation, AUC values of less than 0.3 can be observed on random datasets rather than the expected 0.5. Further experiments on the van 't Veer breast cancer dataset show these biases exist in practice. CONCLUSION: Stratification bias can substantially affect several performance measures. In computing the AUC, the strategy of pooling the test samples from the various folds of cross-validation can lead to large biases; computing it as the average of per-fold estimates avoids this bias and is thus the recommended approach. 
As a more general solution applicable to other performance measures, we show that stratified repeated holdout and modified versions of k-fold and leave-one-out cross-validation (balanced, stratified cross-validation and balanced leave-one-out cross-validation) avoid the bias. Therefore, for model selection and evaluation of microarray and other small biological datasets, these methods should be used and unstratified versions avoided. In particular, the commonly used (unbalanced) leave-one-out cross-validation should not be used to estimate AUC for small datasets.
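The difference between pooling test samples and averaging per-fold estimates can be illustrated with a small simulation. The sketch below is not the authors' experimental setup: it assumes a deliberately uninformative scorer that assigns every test sample the fraction of positives in its training folds, and the function names (`auc`, `one_cv_run`, `simulate`) and default sizes are hypothetical. Because a fold with more positives in its test set leaves fewer positives for training, pooled scores end up negatively related to the labels.

```python
import random


def auc(scores, labels):
    """Mann-Whitney AUC: P(pos score > neg score) + 0.5 * P(tie)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when one class is absent
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_cv_run(labels, k, rng):
    """Unstratified k-fold CV with a no-signal scorer (an assumption).

    Every test sample is scored with the fraction of positives in the
    training folds, so scores carry no real information about labels.
    Returns (pooled AUC, mean of the per-fold AUCs that are defined).
    """
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    pooled_scores, pooled_labels, fold_aucs = [], [], []
    for test in folds:
        held_out = set(test)
        train = [labels[i] for i in idx if i not in held_out]
        score = sum(train) / len(train)
        fold_scores = [score] * len(test)
        fold_labels = [labels[i] for i in test]
        pooled_scores += fold_scores
        pooled_labels += fold_labels
        fold_auc = auc(fold_scores, fold_labels)
        if fold_auc is not None:
            fold_aucs.append(fold_auc)
    per_fold = sum(fold_aucs) / len(fold_aucs) if fold_aucs else None
    return auc(pooled_scores, pooled_labels), per_fold


def simulate(n_pos=20, n_neg=20, k=10, runs=200, seed=0):
    """Average pooled and per-fold AUC over repeated random splits."""
    rng = random.Random(seed)
    labels = [1] * n_pos + [0] * n_neg
    pooled, per_fold = [], []
    for _ in range(runs):
        p, f = one_cv_run(labels, k, rng)
        pooled.append(p)
        if f is not None:
            per_fold.append(f)
    return sum(pooled) / len(pooled), sum(per_fold) / len(per_fold)


if __name__ == "__main__":
    mean_pooled, mean_per_fold = simulate()
    print(f"pooled AUC: {mean_pooled:.3f}, per-fold AUC: {mean_per_fold:.3f}")
```

In this toy setting the per-fold AUC is exactly 0.5, since all scores within a fold are tied, while the pooled AUC falls below 0.5 whenever two folds differ in class proportions. The size of the effect depends on the assumed scorer and should not be read as a reproduction of the reported figures.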

    Data assimilation of in situ soil moisture measurements in hydrological models: third annual doctoral progress report, work plan and achievements

    Efficient water utilization and optimal water supply/distribution to increase food and fodder productivity are of utmost importance in confronting worldwide water scarcity, climate change, growing populations and increasing water demands. In this respect, irrigation efficiency, which is influenced by the type of irrigation and irrigation scheduling, is an essential issue for achieving higher productivity. To improve irrigation strategies in precision agriculture, soil water status can be more accurately described using a combination of advanced monitoring and modeling. Our study focuses on the combination of high resolution hydrological data with hydrological models that predict water flow and solute (pollutants and salts) transport and water redistribution in agricultural soils under irrigation. Field plots of a potato farmer in a sandy region in Belgium were instrumented to continuously monitor soil moisture and water potential before, during and after irrigation in dry summer periods. The aim is to optimize the irrigation process by assimilating online sensor field data into process based models. This research is part of Activity 305 ‘Precision agriculture and remote sensing’ of the VITO GWO and is also part of the strategic cooperation with UGent within the platform ‘Managing Natural Resources’. Over the past 2 years, we applied a combination of in-situ monitoring and numerical modeling (Hydrus 1D) to estimate water content fluctuations in a heterogeneous sandy grassland soil under irrigation with a water table fluctuating between 80 and 155 cm. Over the last year, more sampling and analyses were carried out to further characterize the hydraulic properties over the entire field. Modeling results for the field clearly demonstrated the profound effect of the position of the groundwater level (GWL) and, to a lesser extent, the effect of spatially variable soil hydraulic properties (Ks, n and α) on the estimated water content in the sandy two-layered soil under grass. 
Our results show that the currently applied uniform water distribution using sprinkler irrigation appears inefficient: at locations with shallow groundwater, the amount of water applied will be excessive relative to plant requirements, while at locations with a deeper GWL, requirements will not be met. To derive the optimal parameter set best describing the measured soil moisture content, 37 optimization scenarios were conducted with two to six parameters using various parameter combinations for the two soil layers. The best performing parameter optimization scenario was a 2-parameter scenario with Ks optimized for each layer. The results showed a better identifiability of the parameters (fewer correlations among parameters) with equal performance as compared to three, four or six parameter optimization. Model predictions using the calibrated model (with data from 2012) for an independent data set of soil moisture data in the validation period (2013) showed satisfactory performance of the model in view of irrigation management purposes. Comparing the degree of water stress for different optimization scenarios of groundwater depth showed that grass was exposed to water stress in summer in 2013, but for a shorter period than in the 2012 growing season. The degree of water stress simulated with Hydrus 1D suggested increasing the irrigation amount in 2012 and 2013, applying it at least once or twice in the summer (June and July), and spreading irrigation over the growing season instead of applying a very large amount later in the season, as is the farmer's common practice. A second part of the study focused on finding a relation between measured soil hydraulic properties and apparent electrical conductivity ECa. Our measurements of hydraulic properties of the field clearly confirm that there is considerable spatial variability in the field and that this has an impact on the simulation of soil moisture content. 
Therefore this should be taken into account when upscaling soil hydraulic properties to the field scale in order to understand and model flow, solute and energy fluxes in the field and develop strategies for efficient irrigation. Upscaling soil hydraulic properties to the field scale can be done by linking them to apparent electrical conductivity (ECa), which can be measured efficiently and inexpensively so that a spatially dense dataset for describing within-field spatial soil variability can be generated. In this study, relations between the spatial variation of soil hydraulic properties and apparent soil electrical conductivity ECa measured with EM38 and DUALEM-21S sensors at two depths of exploration (DOE), 0-50 and 0-100 cm, were investigated. Two predictive modelling approaches, i.e. i) a simple regression and ii) applying Archie’s laws for saturated and unsaturated conditions in combination with MVG equations, were developed and compared in terms of how well they explained the observed values of hydraulic parameters. Results demonstrated the spatial variability and heterogeneity of ECa and soil hydraulic properties Ks, α and n. We derived a regression relationship between log Ks and ECa measured with DUALEM (r² ≥ 0.70) and with EM38 (r² > 0.46) sensors. The predicted results were tested against measured data and confirmed that the performance of the DUALEMp,100-Ks model is relatively better than that of the same sensor with lower DOE and of the EM38 sensor (RMSE = 1.31 cm h⁻¹, R² = 0.55). The relationships between MVG shape parameters and ECa datasets were generally poor (0.05 < R² < 0.26). In the second approach, we showed that the water retention curve can be translated to ECa-(h) and ECa-Se relations by combining the MVG equations and Archie’s law. Results also show that reformulating the MVG equations based on ECa-Se relationships can help to estimate unsaturated hydraulic conductivity at the field scale. 
In the third year, a second study site has been set up in a nearby field where potatoes are grown and has been instrumented with soil moisture sensors, tensiometers, groundwater level loggers and a weather station. Hydraulic properties for this field will be derived using the equations developed for the first study site, and the modeling approach developed for the first field will be tested here. Quasi-3D modelling of water flow at the field scale will also be conducted. In this modeling set-up, the field will be modeled as a collection of 1D-columns representing the different field conditions (combination of soil properties, GWL, root zone depth). Combining this model with crop based models such as LINGRA-N or Aquacrop gives a direct simulation of the impact of irrigation strategies on crop yield at the field scale.
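The abstract refers to the MVG (Mualem-van Genuchten) equations and their parameters Ks, α and n. As a rough sketch of what those relations compute, the functions below implement the standard van Genuchten retention curve and the Mualem conductivity function with m = 1 - 1/n. The function names, the pore-connectivity default l = 0.5, and the sand-like parameter values in the usage lines are illustrative assumptions and are not taken from the study. The coupling with Archie's law described in the abstract is not reproduced here.

```python
def effective_saturation(h, alpha, n):
    """van Genuchten effective saturation Se for pressure head h.

    h is in cm and negative under unsaturated conditions; alpha is in 1/cm.
    """
    if h >= 0:
        return 1.0  # saturated
    m = 1.0 - 1.0 / n
    return (1.0 + (alpha * abs(h)) ** n) ** (-m)


def water_content(h, theta_r, theta_s, alpha, n):
    """Volumetric water content from residual and saturated water content."""
    return theta_r + (theta_s - theta_r) * effective_saturation(h, alpha, n)


def hydraulic_conductivity(se, ks, n, l=0.5):
    """Mualem-van Genuchten unsaturated conductivity, same units as ks."""
    m = 1.0 - 1.0 / n
    return ks * se ** l * (1.0 - (1.0 - se ** (1.0 / m)) ** m) ** 2


if __name__ == "__main__":
    # Hypothetical sand-like parameters, for illustration only.
    alpha, n, theta_r, theta_s, ks = 0.145, 2.68, 0.045, 0.43, 29.7
    for h in (0.0, -10.0, -100.0):
        se = effective_saturation(h, alpha, n)
        theta = water_content(h, theta_r, theta_s, alpha, n)
        k = hydraulic_conductivity(se, ks, n)
        print(f"h={h:7.1f} cm  Se={se:.3f}  theta={theta:.3f}  K={k:.4g} cm/h")
```

Water content and conductivity both decrease monotonically as the pressure head becomes more negative, which is why the groundwater level has such a strong influence on the simulated water content in the root zone.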

    Mathematical modelling of polyamine metabolism in bloodstream-form Trypanosoma brucei: An application to drug target identification

    © 2013 Gu et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. This article has been made available through the Brunel Open Access Publishing Fund. We present the first computational kinetic model of polyamine metabolism in bloodstream-form Trypanosoma brucei, the causative agent of human African trypanosomiasis. We systematically extracted the polyamine pathway from the complete metabolic network while still maintaining the predictive capability of the pathway. The kinetic model is constructed on the basis of information gleaned from the experimental biology literature and defined as a set of ordinary differential equations. We applied Michaelis-Menten kinetics featuring regulatory factors to describe enzymatic activities that are well defined. Uncharacterised enzyme kinetics were approximated and justified with available physiological properties of the system. Optimisation-based dynamic simulations were performed to train the model with experimental data and inconsistent predictions prompted an iterative procedure of model refinement. Good agreement between simulation results and measured data reported in various experimental conditions shows that the model has good applicability in spite of there being gaps in the required data. With this kinetic model, the relative importance of the individual pathway enzymes was assessed. We observed that, at low-to-moderate levels of inhibition, enzymes catalysing reactions of de novo AdoMet (MAT) and ornithine production (OrnPt) have a more efficient inhibitory effect on total trypanothione content in comparison to other enzymes in the pathway. In our model, prozyme and TSHSyn (the production catalyst of total trypanothione) were also found to exhibit potent control on total trypanothione content but only when they were strongly inhibited. 
Different chemotherapeutic strategies against T. brucei were investigated using this model, and interruption of polyamine synthesis via joint inhibition of MAT or OrnPt together with other polyamine enzymes was identified as an optimal therapeutic strategy. The work was carried out under a PhD programme partly funded by Prof. Ray Welland, School of Computing Science, University of Glasgo
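The kinetic model in this abstract is described as a set of ordinary differential equations with Michaelis-Menten rate laws. As a minimal sketch of that kind of building block, the code below integrates a single hypothetical reaction S → P with forward Euler. It is not the published model: the competitive-inhibition form of the rate law, the function names and all parameter values are assumptions chosen only to show how inhibiting an enzyme lowers product formation.

```python
def mm_rate(s, vmax, km, inhibitor=0.0, ki=1.0):
    """Michaelis-Menten rate with competitive inhibition (assumed form).

    The inhibitor raises the apparent Km by a factor (1 + inhibitor / ki).
    """
    return vmax * s / (km * (1.0 + inhibitor / ki) + s)


def simulate_conversion(s0, vmax, km, inhibitor=0.0, ki=1.0, dt=0.01, steps=1000):
    """Integrate dS/dt = -v, dP/dt = +v with forward Euler.

    Returns the final (substrate, product) concentrations.
    """
    s, p = s0, 0.0
    for _ in range(steps):
        v = mm_rate(s, vmax, km, inhibitor, ki)
        s -= v * dt
        p += v * dt
    return s, p


if __name__ == "__main__":
    # Hypothetical parameter values, in arbitrary units.
    for inhibitor in (0.0, 2.0):
        s, p = simulate_conversion(s0=1.0, vmax=1.0, km=0.5, inhibitor=inhibitor)
        print(f"inhibitor={inhibitor}: substrate={s:.3f}, product={p:.3f}")
```

Forward Euler is used here only for brevity; a stiff multi-enzyme pathway would normally call for an adaptive ODE solver.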

    Making Software Cost Data Available for Meta-Analysis

    In this paper we consider the increasing need for meta-analysis within empirical software engineering. However, we also note that a necessary precondition to such forms of analysis is to have both the results in an appropriate format and sufficient contextual information to avoid misleading inferences. We consider the implications in the field of software project effort estimation and show that, for a sample of 12 seemingly similar published studies, the results are difficult to compare, let alone combine. This is due to different reporting conventions. We argue that a protocol is required and make some suggestions as to what it should contain.

    Multi-impurity adsorption model for modeling crystal purity and shape evolution during crystallization processes in impure media

    ACKNOWLEDGMENTS Financial support provided by the European Research Council Grant No. [280106-CrySys] is gratefully acknowledged. Peer reviewe