38 research outputs found
Are we meeting a deadline? classification goal achievement in time in the presence of imbalanced data
This paper addresses the problem of a finite set of entities which are required to achieve a goal within a predefined deadline. For example, a group of students is supposed to submit a homework by a specified cutoff. Further, we are interested in predicting which entities will achieve the goal within the deadline. The predictive models are built based only on the data from that population. The predictions are computed at various time instants by taking into account updated data about the entities. The first contribution of the paper is a formal description of the problem. The important characteristic of the proposed method for model building is the use of the properties of entities that have already achieved the goal. We call such an approach “Self-Learning”. Since typically only a few entities have achieved the goal at the beginning and their number gradually grows, the problem is inherently imbalanced. To mitigate the curse of imbalance, we improved the Self-Learning method by tackling information loss and by several sampling techniques. The original Self-Learning and the modifications have been evaluated in a case study for predicting submission of the first assessment in distance higher education courses. The results show that the proposed improvements outperform the two specified baseline models and the original Self-Learner, and also that the best results are achieved if domain-driven techniques are utilised to tackle the imbalance problem. We also showed that these improvements are statistically significant using the Wilcoxon signed-rank test.
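The abstract above mentions sampling techniques for the imbalance problem without naming them. As one generic illustration, the sketch below shows plain random oversampling, which duplicates minority-class rows until the classes are equally frequent. This is a swapped-in textbook technique, not the paper's method; the function name `random_oversample` and the toy snapshot of ten students are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the rarer classes until every
    class is as frequent as the most common one (generic technique,
    assumed here for illustration only)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw with replacement until the class reaches the target size.
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical early snapshot: only 2 of 10 students have submitted (label 1).
rows = [[day] for day in range(10)]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Oversampling only the training split, never the evaluation split, keeps the reported metrics representative of the real class ratio.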
Ouroboros: early identification of at-risk students without models based on legacy data
This paper focuses on the problem of identifying students who are at risk of failing their course. The presented method proposes a solution in the absence of data from previous courses, which is usually used for training machine learning models. This situation typically occurs in new courses. We present the concept of a "self-learner" that builds the machine learning models from the data generated during the current course. The approach utilises information about already submitted assessments, which introduces the problem of imbalanced data for training and testing the classification models.
There are three main contributions of this paper: (1) the concept of training the models for identifying at-risk students using data from the current course, (2) specifying the problem as a classification task, and (3) tackling the challenge of imbalanced data, which appears both in training and testing data.
The results compare the method with the traditional approach of learning the models from legacy course data and validate the proposed concept.
FireProt: web server for automated design of thermostable proteins
There is a continuous interest in increasing protein stability to enhance their usability in numerous biomedical and biotechnological applications. A number of in silico tools for the prediction of the effect of mutations on protein stability have been developed recently. However, only single-point mutations with a small effect on protein stability are typically predicted with the existing tools and have to be followed by laborious protein expression, purification, and characterization. Here, we present FireProt, a web server for the automated design of multiple-point thermostable mutant proteins that combines structural and evolutionary information in its calculation core. FireProt utilizes sixteen tools and three protein engineering strategies for making reliable protein designs. The server is complemented with an interactive, easy-to-use interface that allows users to directly analyze and optionally modify designed thermostable mutants. FireProt is freely available at http://loschmidt.chemi.muni.cz/fireprot.
SoluProt: prediction of soluble protein expression in Escherichia coli
Motivation: Poor protein solubility hinders the production of many therapeutic and industrially useful proteins. Experimental efforts to increase solubility are plagued by low success rates and often reduce biological activity. Computational prediction of protein expressibility and solubility in Escherichia coli using only sequence information could reduce the cost of experimental studies by enabling prioritization of highly soluble proteins. Results: A new tool for sequence-based prediction of soluble protein expression in E. coli, SoluProt, was created using the gradient boosting machine technique with the TargetTrack database as a training set. When evaluated against a balanced independent test set derived from the NESG database, SoluProt's accuracy of 58.5% and AUC of 0.62 exceeded those of a suite of alternative solubility prediction tools. There is also evidence that it could significantly increase the success rate of experimental protein studies.
EnzymeMiner: Exploration of sequence space of enzymes
EnzymeMiner: automated mining of soluble enzymes with diverse structures, catalytic properties and stabilities
Millions of protein sequences are being discovered at an incredible pace, representing an inexhaustible source of biocatalysts. Despite genomic databases growing exponentially, classical biochemical characterization techniques are time-demanding, cost-ineffective and low-throughput. Therefore, computational methods are being developed to explore the unmapped sequence space efficiently. Selection of putative enzymes for biochemical characterization based on rational and robust analysis of all available sequences remains an unsolved problem. To address this challenge, we have developed EnzymeMiner, a web server for automated screening and annotation of diverse family members that enables selection of hits for wet-lab experiments. EnzymeMiner prioritizes sequences that are more likely to preserve the catalytic activity and are heterologously expressible in a soluble form in Escherichia coli. The solubility prediction employs the in-house SoluProt predictor developed using machine learning. EnzymeMiner reduces the time devoted to data gathering, multi-step analysis, sequence prioritization and selection from days to hours. The successful use case for the haloalkane dehalogenase family is described in a comprehensive tutorial available on the EnzymeMiner web page.
System for functional annotation of single nucleotide polymorphisms
Single nucleotide polymorphisms are substitutions of one nucleotide in the DNA sequence that may or may not have phenotypic consequences. Here we describe a new system for ranking non-synonymous protein substitutions by their deleterious effects. The computational core of the proposed system is based on a rational combination of the results from a selected subset of publicly available tools. The weight coefficients for the individual tools are calculated on the basis of their confidence scores, and their reliabilities are assigned according to their performance measured on an extensive dataset. Validation on a dataset consisting of 5 000 substitutions shows that the overall accuracy of the system was improved by 6% in comparison to a simple majority vote.
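The abstract above contrasts a weighted combination of tool outputs with a simple majority vote but does not give the exact formula. The sketch below is one possible reading, assuming each tool's vote is scaled by the product of its per-prediction confidence and its measured reliability. The function names, the product weighting, and the three example tools are hypothetical and not taken from the paper.

```python
def majority_vote(predictions):
    """Unweighted baseline: every tool counts equally.
    Each prediction is (label, confidence, reliability), label in {+1, -1},
    where +1 means deleterious and -1 means neutral."""
    total = sum(label for label, _, _ in predictions)
    return "deleterious" if total > 0 else "neutral"


def weighted_vote(predictions):
    """Weighted combination: each vote is scaled by confidence * reliability
    (an assumed weighting, chosen for illustration)."""
    score = sum(label * conf * rel for label, conf, rel in predictions)
    verdict = "deleterious" if score > 0 else "neutral"
    return verdict, score


# Hypothetical outputs of three tools for one substitution:
# one confident, reliable tool says deleterious; two weaker tools say neutral.
tool_outputs = [
    (+1, 0.90, 0.80),
    (-1, 0.60, 0.50),
    (-1, 0.55, 0.50),
]

baseline = majority_vote(tool_outputs)
verdict, score = weighted_vote(tool_outputs)
print(baseline, verdict, round(score, 3))
```

In this toy case the two schemes disagree: the majority follows the two weaker tools, while the weighted score follows the single tool that is both confident and historically reliable.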
Distributed information system as a system of asynchronous concurrent processes
Nowadays, enterprise information systems are designed as distributed network systems, where existing information systems and new components are connected together via middleware. In most cases, the architectures of the systems can be described informally or semiformally by means of common design tools. But there are also critical applications that involve an information system and for which a formal architecture specification is necessary. This paper describes the design of a framework for distributed information systems with a mobile architecture and outlines its implementation. The framework provides automatic derivation of a formal specification from an implementation of the system, without an explicit formal description in the design phase of the project. The derived specification can be used for a quick formal proof of correctness after radical changes in the implementation phase, without maintaining a formal design.