Data mining as a tool for environmental scientists
Over recent years a huge library of data mining algorithms has been developed to tackle a variety of problems in fields such as medical imaging and network traffic analysis. Many of these techniques are far more flexible than classical modelling approaches and could be usefully applied to data-rich environmental problems. Certain techniques, such as Artificial Neural Networks, Clustering, Case-Based Reasoning and, more recently, Bayesian Decision Networks, have found application in environmental modelling, while other methods, for example classification and association rule extraction, have not yet been taken up on any wide scale. We propose that these and other data mining techniques could be usefully applied to difficult problems in the field. This paper introduces several data mining concepts and briefly discusses their application to environmental modelling, where data may be sparse, incomplete, or heterogeneous.
A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics
The combination of multiple classifiers using ensemble methods is
increasingly important for making progress in a variety of difficult prediction
problems. We present a comparative analysis of several ensemble methods through
two case studies in genomics, namely the prediction of genetic interactions and
protein functions, to demonstrate their efficacy on real-world datasets and
draw useful conclusions about their behavior. These methods include simple
aggregation, meta-learning, cluster-based meta-learning, and ensemble selection
using heterogeneous classifiers trained on resampled data to improve the
diversity of their predictions. We present a detailed analysis of these methods
across 4 genomics datasets and find the best of these methods offer
statistically significant improvements over the state of the art in their
respective domains. In addition, we establish a novel connection between
ensemble selection and meta-learning, demonstrating how both of these disparate
methods establish a balance between ensemble diversity and performance.
Comment: 10 pages, 3 figures, 8 tables, to appear in Proceedings of the 2013
International Conference on Data Mining
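As a rough illustration of the "simple aggregation" family named in the abstract above, the sketch below combines per-example scores from several base classifiers by averaging and by majority vote. It is not the authors' implementation: the function names, the 0.5 threshold, and the toy scores are all assumptions made for this example.

```python
from statistics import mean


def aggregate_mean(score_lists):
    """Average the scores that each base classifier assigns to each example.

    score_lists: one list of scores per classifier, all of the same length.
    """
    n_examples = len(score_lists[0])
    if any(len(scores) != n_examples for scores in score_lists):
        raise ValueError("every classifier must score the same examples")
    # zip(*...) regroups the scores so each tuple holds one example's scores.
    return [mean(per_example) for per_example in zip(*score_lists)]


def aggregate_majority(score_lists, threshold=0.5):
    """Predict 1 for an example when more than half the classifiers vote 1.

    A classifier votes 1 when its score is at or above `threshold`
    (hypothetical default; a real study would tune this).
    """
    n_classifiers = len(score_lists)
    predictions = []
    for per_example in zip(*score_lists):
        votes = sum(1 for score in per_example if score >= threshold)
        predictions.append(1 if 2 * votes > n_classifiers else 0)
    return predictions


# Toy scores from three hypothetical base classifiers on two examples.
toy_scores = [
    [0.9, 0.2],
    [0.7, 0.4],
    [0.2, 0.3],
]
mean_scores = aggregate_mean(toy_scores)
majority_labels = aggregate_majority(toy_scores)
```

Meta-learning and ensemble selection, the other methods the abstract lists, replace these fixed rules with a learned combiner or a chosen subset of classifiers; they are not shown here.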
Machine learning for the prediction of protein-protein interactions
The prediction of protein-protein interactions (PPI) has recently emerged as an important problem in bioinformatics and systems biology, because most essential cellular processes are mediated by these kinds of interactions. In this thesis we focus on the prediction of co-complex interactions, where the objective is to identify and characterize protein pairs that are members of the same protein complex.
Although high-throughput methods for the direct identification of PPI have been developed in recent years, it has been demonstrated that the data obtained by these methods are often incomplete and suffer from high false-positive and false-negative rates. To deal with this technology-driven problem, several machine learning techniques have been employed to improve the accuracy and reliability of predicted interacting protein pairs, demonstrating that the combined use of direct and indirect biological insights can improve the quality of predictive PPI models. This task has commonly been viewed as a binary classification problem. However, the nature of the data creates two major problems. First, the classes are imbalanced, because the number of positive examples (pairs of proteins which really interact) is much smaller than the number of negative ones. Second, the selection of negative examples is based on unreliable assumptions, which could bias the classification results.
The first part of this dissertation addresses these drawbacks by exploring the use of one-class classification (OCC) methods to deal with the task of prediction of PPI. OCC methods utilize examples of just one class to generate a predictive model which is consequently independent of the kind of negative examples selected; additionally these approaches are known to cope with imbalanced class problems. We designed and carried out a performance evaluation study of several OCC methods for this task. We also undertook a comparative performance evaluation with several conventional learning techniques.
Furthermore, we pay attention to a new potential drawback which appears to affect the performance of PPI prediction. This is associated with the composition of the positive gold standard set, which contains a high proportion of examples associated with interactions of ribosomal proteins. We demonstrate that this situation indeed biases the classification task, resulting in an over-optimistic performance result. The prediction of non-ribosomal PPI is a much more difficult task. We investigate strategies to improve performance on this subtask, integrating new kinds of data and combining diverse classification models generated from different sets of data.
In this thesis, we undertook a preliminary validation study of the new PPI predicted by OCC methods. To achieve this, we focus on three main aspects: the search for biological evidence in the literature that supports the new predictions; the analysis of the properties of the predicted PPI networks; and the identification of highly interconnected groups of proteins which can be associated with new protein complexes.
Finally, this thesis explores a slightly different area, related to the prediction of PPI types. This is associated with the classification of PPI structures (complexes) contained in the Protein Data Bank (PDB) database according to their function and binding affinity. Given the relatively small number of crystallized protein complexes available, it is not possible at the moment to link these results with the ones obtained previously for the prediction of PPI complexes. However, this could become possible in the near future, when more PPI structures are available.
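The one-class classification idea described in the abstract above (building a model from positive examples only, so that no negative set has to be chosen) can be sketched with a deliberately simple stand-in. The centroid-and-radius model below is an assumption chosen for brevity, not one of the OCC methods evaluated in the thesis; the class name, the quantile parameter, and the toy feature vectors are all hypothetical.

```python
import math


class CentroidOneClass:
    """Toy one-class model: accept points close to the positive centroid."""

    def __init__(self, quantile=0.95):
        if not 0.0 < quantile <= 1.0:
            raise ValueError("quantile must be in (0, 1]")
        self.quantile = quantile
        self.centroid = None
        self.radius = None

    def fit(self, positives):
        """Learn a centroid and an acceptance radius from positives only."""
        if not positives:
            raise ValueError("at least one positive example is required")
        n_dims = len(positives[0])
        self.centroid = [
            sum(p[d] for p in positives) / len(positives) for d in range(n_dims)
        ]
        distances = sorted(math.dist(p, self.centroid) for p in positives)
        # Radius = distance below which `quantile` of the training positives fall.
        index = max(0, math.ceil(self.quantile * len(distances)) - 1)
        self.radius = distances[index]
        return self

    def predict(self, x):
        """Return True when x lies within the learned radius of the centroid."""
        if self.centroid is None:
            raise RuntimeError("call fit() before predict()")
        return math.dist(x, self.centroid) <= self.radius


# Hypothetical 2-D feature vectors for known interacting protein pairs.
known_positives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
model = CentroidOneClass(quantile=1.0).fit(known_positives)
near_is_accepted = model.predict((0.5, 0.5))
far_is_accepted = model.predict((5.0, 5.0))
```

Because no negative examples enter `fit`, the model cannot inherit bias from how negatives were sampled, which is the property the abstract highlights.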
A Microscopic Simulation Laboratory for Evaluation of Off-street Parking Systems
The parking industry produces an enormous amount of data every day that, properly analyzed, will change the way the industry operates. The collected data form patterns that, in most cases, would allow parking operators and property owners to better understand how to maximize revenue and decrease operating expenses, and would support decisions such as how to set specific parking policies (e.g. electric-charging-only parking spaces) to achieve sustainable and eco-friendly parking.
However, an intelligent tool for assessing the layout design and operational performance of parking lots, so as to reduce externalities and increase revenue, is lacking. To address this issue, this research presents a comprehensive agent-based framework for microscopic off-street parking system simulation. A rule-based parking simulation logic programming model is formulated. The proposed simulation model can effectively capture the behaviors of drivers and pedestrians as well as the spatial and temporal interactions of traffic dynamics in the parking system. A methodology for data collection, processing, and extraction of user behaviors in the parking system is also developed. A Long Short-Term Memory (LSTM) neural network is used to predict the arrival and departure of vehicles. The proposed simulator is implemented in Java, and a Software as a Service (SaaS) graphical user interface is designed to analyze and visualize the simulation results. This study finds the active capacity of the parking system, which is defined as the largest number of actively moving vehicles in the parking system under the facility layout. In an application to a real-world testbed, the numerical tests show that (a) the smart check-in device has marginal benefits for vehicle waiting time; (b) the flexible pricing policy may increase the average daily revenue if price elasticity is not considered; (c) the number of electric-charging-only spots has a negative impact on the performance of the parking facility; and (d) the rear-in only policy may increase the duration of parking maneuvers and reduce efficiency during the arrival rush hour. Application of the developed simulation system to a real-world case demonstrates its capability to provide informative quantitative measures to support decisions in designing, maintaining, and operating smart parking facilities.