
    Bagging ensemble selection for regression

    Bagging ensemble selection (BES) is a relatively new ensemble learning strategy. The strategy can be seen as an ensemble of the ensemble selection from libraries of models (ES) strategy. Previous experimental results on binary classification problems have shown that, using random trees as base classifiers, BES-OOB (the most successful variant of BES) is competitive with (and in many cases, superior to) other ensemble learning strategies, for instance, the original ES algorithm, stacking with linear regression, random forests or boosting. Motivated by the promising results in classification, this paper examines the predictive performance of the BES-OOB strategy for regression problems. Our results show that the BES-OOB strategy outperforms Stochastic Gradient Boosting and Bagging when using regression trees as the base learners. Our results also suggest that the advantage of using a diverse model library becomes clear when the model library size is relatively large. We also present encouraging results indicating that the non-negative least squares algorithm is a viable approach for pruning an ensemble of ensembles.
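The abstract above does not spell out the BES-OOB procedure, so the following is only a minimal sketch under stated assumptions: each bag trains its model library on a bootstrap sample, the out-of-bag instances serve as the hillclimbing set for greedy forward selection with replacement, and the bags are averaged at prediction time. The k-nearest-neighbour regressors are a stand-in for the regression trees used in the paper, and all function names (`make_knn`, `ensemble_selection`, `bes_oob`) are hypothetical.

```python
import random
from statistics import mean


def make_knn(k):
    """Return a trainer for a k-nearest-neighbour regressor on 1-D inputs.

    Stand-in base learner (assumption): the paper uses regression trees.
    """
    def train(xs, ys):
        pairs = list(zip(xs, ys))

        def predict(x):
            nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
            return mean(y for _, y in nearest)

        return predict

    return train


def mse(preds, ys):
    """Mean squared error between predictions and targets."""
    return mean((p - y) ** 2 for p, y in zip(preds, ys))


def ensemble_selection(library_preds, ys, rounds):
    """Greedy forward selection with replacement.

    library_preds[i] holds model i's predictions on the hillclimbing set.
    Each round adds the model that minimises the MSE of the averaged
    ensemble; ties go to the lowest index. Returns the chosen indices.
    """
    chosen = []
    total = [0.0] * len(ys)  # running sum of the selected models' predictions
    for _ in range(rounds):
        best_idx, best_err = None, None
        for i, preds in enumerate(library_preds):
            cand = [(t + p) / (len(chosen) + 1) for t, p in zip(total, preds)]
            err = mse(cand, ys)
            if best_err is None or err < best_err:
                best_idx, best_err = i, err
        chosen.append(best_idx)
        total = [t + p for t, p in zip(total, library_preds[best_idx])]
    return chosen


def bes_oob(xs, ys, trainers, n_bags=10, rounds=5, seed=0):
    """Assumed BES-OOB loop: bootstrap, train library, select on out-of-bag data."""
    rng = random.Random(seed)
    n = len(xs)
    bags = []
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(idx)
        oob = [i for i in range(n) if i not in in_bag]
        if not oob:  # no out-of-bag instances to select on; skip this bag
            continue
        models = [t([xs[i] for i in idx], [ys[i] for i in idx]) for t in trainers]
        oob_preds = [[m(xs[i]) for i in oob] for m in models]
        chosen = ensemble_selection(oob_preds, [ys[i] for i in oob], rounds)
        bags.append([models[c] for c in chosen])

    def predict(x):
        # Average within each selected ensemble, then across bags.
        return mean(mean(m(x) for m in bag) for bag in bags)

    return predict


if __name__ == "__main__":
    rng = random.Random(1)
    xs = [i / 10 for i in range(40)]
    ys = [2 * x + rng.gauss(0, 0.1) for x in xs]  # synthetic data, illustration only
    model = bes_oob(xs, ys, [make_knn(k) for k in (1, 3, 5, 9)])
    print(model(2.0))
```

Selection with replacement lets a strong model be picked several times, which acts as an implicit weighting. The non-negative least squares pruning mentioned in the abstract is not covered by this sketch.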

    Visual Integration of Data and Model Space in Ensemble Learning

    Ensembles of classifier models typically deliver superior performance and can outperform single classifier models for a given dataset and classification task. However, the gain in performance comes with a loss of comprehensibility, making it hard to understand how each model affects the classification outputs and where the errors come from. We propose a tight visual integration of the data and the model space for exploring and combining classifier models. We introduce a workflow that builds upon the visual integration and enables the effective exploration of classification outputs and models. We then present a use case in which we start with an ensemble automatically selected by a standard ensemble selection algorithm, and show how we can manipulate models and alternative combinations. Comment: 8 pages, 7 pictures

    Experimental library screening demonstrates the successful application of computational protein design to large structural ensembles

    The stability, activity, and solubility of a protein sequence are determined by a delicate balance of molecular interactions in a variety of conformational states. Even so, most computational protein design methods model sequences in the context of a single native conformation. Simulations that model the native state as an ensemble have been mostly neglected due to the lack of sufficiently powerful optimization algorithms for multistate design. Here, we have applied our multistate design algorithm to study the potential utility of various forms of input structural data for design. To facilitate a more thorough analysis, we developed new methods for the design and high-throughput stability determination of combinatorial mutation libraries based on protein design calculations. The application of these methods to the core design of a small model system produced many variants with improved thermodynamic stability and showed that multistate design methods can be readily applied to large structural ensembles. We found that exhaustive screening of our designed libraries helped to clarify several sources of simulation error that would have otherwise been difficult to ascertain. Interestingly, the lack of correlation between our simulated and experimentally measured stability values shows clearly that a design procedure need not reproduce experimental data exactly to achieve success. This surprising result suggests potentially fruitful directions for the improvement of computational protein design technology.

    Sentiment Analysis using an ensemble of Feature Selection Algorithms

    Sentiment analysis, an active research area in text mining, is commonly used to determine the opinion of a person who has used a service or bought a product. It is the process of using computation to identify and categorize opinions expressed in a piece of text. Individuals post their opinions via reviews, tweets, comments or discussions, which form our unstructured information. Sentiment analysis gives an overall summary of reviews, which benefits clients, individuals or organizations in decision making. The main aim of this paper is to apply an ensemble approach to feature reduction methods from natural language processing and to perform the analysis based on the results. An ensemble approach is a process of combining two or more methodologies. The feature reduction methods used are Principal Component Analysis (PCA) for feature extraction and the Pearson chi-squared statistical test for feature selection. The main contribution of this paper is to test experimentally whether the combined use of careful feature selection and existing classification methodologies can yield better accuracy.
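As a rough illustration of the chi-squared feature selection step named in the abstract, the sketch below scores each term by the Pearson chi-squared statistic of a 2x2 table (term present or absent against positive or negative label) and keeps the top-scoring terms. It is an assumption-laden toy, not the paper's pipeline: the documents are made up, the function names are hypothetical, and the PCA half of the ensemble is left out.

```python
def chi2_term_score(docs, labels, term):
    """Pearson chi-squared statistic (no continuity correction) for one term.

    docs is a list of token sets, labels a list of 0/1 class labels.
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens
        if present and label:
            a += 1  # term present, positive class
        elif present:
            b += 1  # term present, negative class
        elif label:
            c += 1  # term absent, positive class
        else:
            d += 1  # term absent, negative class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # A zero margin means the term or the class is constant: no information.
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_top_terms(docs, labels, k):
    """Return the k terms with the highest score (ties broken alphabetically)."""
    vocab = sorted({t for doc in docs for t in doc})
    ranked = sorted(vocab, key=lambda t: (-chi2_term_score(docs, labels, t), t))
    return ranked[:k]


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    docs = [{"good", "movie"}, {"good", "plot"}, {"bad", "movie"}, {"bad", "plot"}]
    labels = [1, 1, 0, 0]
    print(select_top_terms(docs, labels, 2))
```

Terms that appear equally often in both classes, such as "movie" in the toy corpus, score zero and are dropped first.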

    Mutation supply and the repeatability of selection for antibiotic resistance

    Whether evolution can be predicted is a key question in evolutionary biology. Here we set out to better understand the repeatability of evolution. We explored experimentally the effect of mutation supply and the strength of selective pressure on the repeatability of selection from standing genetic variation. Different sizes of mutant libraries of an antibiotic resistance gene, TEM-1 β-lactamase in Escherichia coli, were subjected to different antibiotic concentrations. We determined whether populations went extinct or survived, and sequenced the TEM gene of the surviving populations. The distribution of mutations per allele in our mutant libraries, generated by error-prone PCR, followed a Poisson distribution. Extinction patterns could be explained by a simple stochastic model that assumed the sampling of beneficial mutations was key for survival. In most surviving populations, alleles containing at least one known large-effect beneficial mutation were present. These genotype data also support a model which only invokes sampling effects to describe the occurrence of alleles containing large-effect driver mutations. Hence, evolution is largely predictable given cursory knowledge of mutational fitness effects, the mutation rate and population size. There were no clear trends in the repeatability of selected mutants when we considered all mutations present. However, when only known large-effect mutations were considered, the outcome of selection is less repeatable for large libraries, in contrast to expectations. Furthermore, we show experimentally that alleles carrying multiple mutations selected from large libraries confer higher resistance levels relative to alleles with only a known large-effect mutation, suggesting that the scarcity of high-resistance alleles carrying multiple mutations may contribute to the decrease in repeatability at large library sizes. Comment: 31 pages, 9 figures
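One way to picture a sampling-only survival model of the kind the abstract mentions is the sketch below. It is not the authors' model. It assumes that mutations per allele are Poisson distributed with a given mean, that each mutation is independently beneficial with a fixed probability, and that a population survives exactly when its library contains at least one allele carrying a beneficial mutation. The parameter names and the numbers in the demo are hypothetical.

```python
import math


def p_allele_has_beneficial(mean_mutations, beneficial_fraction):
    """P(an allele carries at least one beneficial mutation).

    With Poisson(mean_mutations) mutations per allele, each beneficial with
    probability beneficial_fraction, the beneficial count is itself Poisson
    with mean mean_mutations * beneficial_fraction (Poisson thinning).
    """
    return -math.expm1(-mean_mutations * beneficial_fraction)


def p_library_survives(library_size, mean_mutations, beneficial_fraction):
    """P(at least one of library_size independent alleles is beneficial).

    Equals 1 - (1 - p)^N with p from p_allele_has_beneficial, which
    simplifies to 1 - exp(-N * mean_mutations * beneficial_fraction).
    """
    return -math.expm1(-library_size * mean_mutations * beneficial_fraction)


if __name__ == "__main__":
    # Hypothetical parameter values, for illustration only.
    for size in (10, 100, 1000, 10000):
        print(size, p_library_survives(size, 2.0, 0.001))
```

Under these assumptions the survival probability depends only on the product of library size, mean mutations per allele and beneficial fraction, which is one way to read "mutation supply".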

    RosettaBackrub--a web server for flexible backbone protein structure modeling and design.

    The RosettaBackrub server (http://kortemmelab.ucsf.edu/backrub) implements the Backrub method, derived from observations of alternative conformations in high-resolution protein crystal structures, for flexible backbone protein modeling. Backrub modeling is applied to three related applications using the Rosetta program for structure prediction and design: (I) modeling of structures of point mutations, (II) generating protein conformational ensembles and designing sequences consistent with these conformations and (III) predicting tolerated sequences at protein-protein interfaces. The three protocols have been validated on experimental data. Starting from a user-provided single input protein structure in PDB format, the server generates near-native conformational ensembles. The predicted conformations and sequences can be used for different applications, such as to guide mutagenesis experiments, for ensemble-docking approaches or to generate sequence libraries for protein design.