11,898 research outputs found

    Data mining as a tool for environmental scientists

    Get PDF
    Over recent years a huge library of data mining algorithms has been developed to tackle a variety of problems in fields such as medical imaging and network traffic analysis. Many of these techniques are far more flexible than more classical modelling approaches and could be usefully applied to data-rich environmental problems. Certain techniques such as Artificial Neural Networks, Clustering, Case-Based Reasoning and more recently Bayesian Decision Networks have found application in environmental modelling while other methods, for example classification and association rule extraction, have not yet been taken up on any wide scale. We propose that these and other data mining techniques could be usefully applied to difficult problems in the field. This paper introduces several data mining concepts and briefly discusses their application to environmental modelling, where data may be sparse, incomplete, or heterogenous

    Machine learning and data mining frameworks for predicting drug response in cancer:An overview and a novel <i>in silico</i> screening process based on association rule mining

    Get PDF

    A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics

    Full text link
    The combination of multiple classifiers using ensemble methods is increasingly important for making progress in a variety of difficult prediction problems. We present a comparative analysis of several ensemble methods through two case studies in genomics, namely the prediction of genetic interactions and protein functions, to demonstrate their efficacy on real-world datasets and draw useful conclusions about their behavior. These methods include simple aggregation, meta-learning, cluster-based meta-learning, and ensemble selection using heterogeneous classifiers trained on resampled data to improve the diversity of their predictions. We present a detailed analysis of these methods across 4 genomics datasets and find the best of these methods offer statistically significant improvements over the state of the art in their respective domains. In addition, we establish a novel connection between ensemble selection and meta-learning, demonstrating how both of these disparate methods establish a balance between ensemble diversity and performance.Comment: 10 pages, 3 figures, 8 tables, to appear in Proceedings of the 2013 International Conference on Data Minin

    Machine learning for the prediction of protein-protein interactions

    Get PDF
    The prediction of protein-protein interactions (PPI) has recently emerged as an important problem in the fields of bioinformatics and systems biology, due to the fact that most essential cellular processes are mediated by these kinds of interactions. In this thesis we focussed in the prediction of co-complex interactions, where the objective is to identify and characterize protein pairs which are members of the same protein complex. Although high-throughput methods for the direct identification of PPI have been developed in the last years. It has been demonstrated that the data obtained by these methods is often incomplete and suffers from high false-positive and false-negative rates. In order to deal with this technology-driven problem, several machine learning techniques have been employed in the past to improve the accuracy and trustability of predicted protein interacting pairs, demonstrating that the combined use of direct and indirect biological insights can improve the quality of predictive PPI models. This task has been commonly viewed as a binary classification problem. However, the nature of the data creates two major problems. Firstly, the imbalanced class problem due to the number of positive examples (pairs of proteins which really interact) being much smaller than the number of negative ones. Secondly, the selection of negative examples is based on some unreliable assumptions which could introduce some bias in the classification results. The first part of this dissertation addresses these drawbacks by exploring the use of one-class classification (OCC) methods to deal with the task of prediction of PPI. OCC methods utilize examples of just one class to generate a predictive model which is consequently independent of the kind of negative examples selected; additionally these approaches are known to cope with imbalanced class problems. We designed and carried out a performance evaluation study of several OCC methods for this task. We also undertook a comparative performance evaluation with several conventional learning techniques. Furthermore, we pay attention to a new potential drawback which appears to affect the performance of PPI prediction. This is associated with the composition of the positive gold standard set, which contain a high proportion of examples associated with interactions of ribosomal proteins. We demonstrate that this situation indeed biases the classification task, resulting in an over-optimistic performance result. The prediction of non-ribosomal PPI is a much more difficult task. We investigate some strategies in order to improve the performance of this subtask, integrating new kinds of data as well as combining diverse classification models generated from different sets of data. In this thesis, we undertook a preliminary validation study of the new PPI predicted by using OCC methods. To achieve this, we focus in three main aspects: look for biological evidence in the literature that support the new predictions; the analysis of predicted PPI networks properties; and the identification of highly interconnected groups of proteins which can be associated with new protein complexes. Finally, this thesis explores a slightly different area, related to the prediction of PPI types. This is associated with the classification of PPI structures (complexes) contained in the Protein Data Bank (PDB) data base according to its function and binding affinity. Considering the relatively reduced number of crystalized protein complexes available, it is not possible at the moment to link these results with the ones obtained previously for the prediction of PPI complexes. However, this could be possible in the near future when more PPI structures will be available

    A Microscopic Simulation Laboratory for Evaluation of Off-street Parking Systems

    Get PDF
    The parking industry produces an enormous amount of data every day that, properly analyzed, will change the way the industry operates. The collected data form patterns that, in most cases, would allow parking operators and property owners to better understand how to maximize revenue and decrease operating expenses and support the decisions such as how to set specific parking policies (e.g. electrical charging only parking space) to achieve the sustainable and eco-friendly parking. However, there lacks an intelligent tool to assess the layout design and operational performance of parking lots to reduce the externalities and increase the revenue. To address this issue, this research presents a comprehensive agent-based framework for microscopic off-street parking system simulation. A rule-based parking simulation logic programming model is formulated. The proposed simulation model can effectively capture the behaviors of drivers and pedestrians as well as spatial and temporal interactions of traffic dynamics in the parking system. A methodology for data collection, processing, and extraction of user behaviors in the parking system is also developed. A Long-Short Term Memory (LSTM) neural network is used to predict the arrival and departure of the vehicles. The proposed simulator is implemented in Java and a Software as a Service (SaaS) graphic user interface is designed to analyze and visualize the simulation results. This study finds the active capacity of the parking system, which is defined as the largest number of actively moving vehicles in the parking system under the facility layout. In the system application of the real world testbed, the numerical tests show (a) the smart check-in device has marginal benefits in vehicle waiting time; (b) the flexible pricing policy may increase the average daily revenue if the elasticity of the price is not involved; (c) the number of electrical charging only spots has a negative impact on the performance of the parking facility; and (d) the rear-in only policy may increase the duration of parking maneuvers and reduce the efficiency during the arrival rush hour. Application of the developed simulation system using a real-world case demonstrates its capability of providing informative quantitative measures to support decisions in designing, maintaining, and operating smart parking facilities

    Data analytics 2016: proceedings of the fifth international conference on data analytics

    Get PDF
    corecore