3,092 research outputs found

    Clustering driver’s destinations - using internal evaluation to adaptively set parameters

    With advanced navigation systems becoming ubiquitous in modern cars, the availability of detailed GPS data opens up new research areas in the fields of pattern analysis and data mining. By capturing the end-of-trip GPS point of each trip made by a driver, that driver’s meaningful destinations can be identified. Knowledge of these destinations can be used for route prediction, which in turn can be used to optimize motor control and decrease emissions. It can also be used for developing functions for autonomous vehicles. In this thesis, a way of extracting these meaningful destinations from GPS data using clustering algorithms has been developed and evaluated. The result is a clustering procedure consisting of two steps. First, a pre-clustering step divides the data into subsets corresponding to smaller spatial areas. Then, a refining clustering step is applied, for which the parameter of the algorithm is adapted to each subset. The parameter is set adaptively for each subset by testing a set of candidate values, evaluating the results internally with the Silhouette coefficient, and choosing the value that gives the best evaluation score. The best-performing configuration of our procedure, according to our external evaluation method, is on par with DBSCAN with a supervised choice of parameter setting. Further evaluation on data sets from different areas of the world is needed to draw strong conclusions about the developed procedure’s performance.
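
The adaptive parameter choice described in this abstract can be illustrated with a minimal sketch. It covers only the refining step on a single subset, and it makes several assumptions that are not in the abstract: a simple distance-threshold clustering stands in for the thesis's clustering algorithm, the parameter is a radius called `eps`, and degenerate results (one cluster, or only singletons) are scored -1.0. The function names are hypothetical.

```python
import math
from itertools import combinations

def threshold_clusters(points, eps):
    """Stand-in clustering: connected components of the 'distance <= eps' graph."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if math.dist(points[i], points[j]) <= eps:
            parent[find(i)] = find(j)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

def silhouette(points, labels):
    """Mean Silhouette coefficient; -1.0 (assumed convention) for degenerate results."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2 or len(clusters) == len(points):
        return -1.0
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)

def best_eps(points, candidates):
    """Return the candidate whose clustering scores the highest Silhouette value."""
    return max(candidates, key=lambda eps: silhouette(points, threshold_clusters(points, eps)))

# Toy end-of-trip points (made up for illustration): two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
chosen = best_eps(points, [0.5, 2.0, 20.0])
```

In this toy case the smallest radius leaves every point alone and the largest merges everything, so only the middle candidate yields a scorable clustering. Ties between candidates are resolved in favour of the first one listed.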

    Classifier Design to Improve Pattern Classification and Knowledge Discovery for Imbalanced Datasets

    Imbalanced dataset mining is a nontrivial issue. It has extensive applications in a variety of fields, such as scientific research, medical diagnosis, business, and multiple industries. Standard machine learning algorithms fail to produce satisfactory classifiers: they tend to over-fit the larger class and ignore the smaller class. Numerous algorithms have been developed to handle class imbalance, but only limited progress has been achieved in improving prediction accuracy for the smaller class. Moreover, real-world datasets may have hidden detrimental characteristics other than class imbalance. Those characteristics are usually dataset specific, and can defeat algorithms that are otherwise robust on other imbalanced datasets. Mining such datasets can only be improved by algorithms tailored to domain characteristics (Weiss, 2004); therefore, it is important and necessary to do exploratory data analysis before classifier design. On the other hand, unmet needs in knowledge discovery, such as lead optimization during drug discovery, demand novel algorithms. In this study, we have developed a framework for imbalanced dataset mining tailored to data characteristics and adapted to knowledge discovery in chemical datasets. First, we explored the dataset and visualized domain characteristics, and then we designed different classifiers accordingly: for class imbalance, active learning (AL), cost-sensitive learning (CSL) and re-sampling methods were designed; for class overlap, Class Boundary Cleaning (CBC) and Class Boundary Mining (CBM) were developed. CBM was also designed for lead optimization: ideally it would detect fine structural differences between different classes of compounds, and these differences could be options for lead optimization. The methods developed were applied to two datasets, hERG and CPDB. The results from the imbalanced hERG liability dataset showed that CBC, CBM and AL were effective in correcting class imbalance/overlap and improving the classifier's performance. Highly predictive models were built; discriminating patterns were discovered; and lead optimization options were proposed. The methodology developed and knowledge discovered will benefit drug discovery and improve hazard test prioritization, risk assessment, and governmental regulatory work on human health and environmental protection. Doctor of Philosoph

    Machine Learning Approaches for Improving Prediction Performance of Structure-Activity Relationship Models

    In silico bioactivity prediction studies are designed to complement in vivo and in vitro efforts to assess the activity and properties of small molecules. In silico methods such as Quantitative Structure-Activity/Property Relationship (QSAR) are used to correlate the structure of a molecule to its biological property in drug design and toxicological studies. In this body of work, I started with two in-depth reviews of the application of machine learning based approaches and feature reduction methods to QSAR, and then investigated solutions to three common challenges faced in machine learning based QSAR studies. First, to improve the prediction accuracy of learning from imbalanced data, Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms combined with bagging as an ensemble strategy were evaluated. Friedman’s aligned ranks test and the subsequent Bergmann-Hommel post hoc test showed that this method significantly outperformed other conventional methods. SMOTEENN with bagging became less effective when the imbalance ratio (IR) exceeded a certain threshold (e.g., >40). Second, the ability to separate the few active compounds from the vast number of inactive ones is of great importance in computational toxicology. Deep neural networks (DNN) and random forest (RF), representing deep and shallow learning algorithms, respectively, were chosen to carry out structure-activity relationship-based chemical toxicity prediction. Results suggest that DNN significantly outperformed RF (p < 0.001, ANOVA) by 22-27% for four metrics (precision, recall, F-measure, and AUPRC) and by 11% for another (AUROC). Lastly, current features used for QSAR based machine learning are often very sparse and limited by the logic and mathematical processes used to compute them. Transformer embedding features (TEF) were developed as new continuous vector descriptors/features using the latent space embedding from multi-head self-attention. The significance of TEF as new descriptors was evaluated by applying them to tasks such as predictive modeling, clustering, and similarity search. An accuracy of 84% on the Ames mutagenicity test indicates that these new features have a correlation to biological activity. Overall, the findings in this study can be applied to improve the performance of machine learning based Quantitative Structure-Activity/Property Relationship (QSAR) efforts for enhanced drug discovery and toxicology assessments.
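
As a rough illustration of the over-sampling idea named in this abstract, the sketch below generates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, which is the core step of SMOTE. It is not the thesis's pipeline: the ENN cleaning step and the bagging ensemble are omitted, the function names and the parameters `k` and `seed` are hypothetical, and the imbalance ratio is assumed to mean majority count divided by minority count.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority samples and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest neighbours of x among the other minority samples
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

def imbalance_ratio(n_majority, n_minority):
    """Assumed definition: number of majority samples per minority sample."""
    return n_majority / n_minority

# Toy minority class (made up for illustration) next to 120 majority samples.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote(minority, n_new=5)
ratio_before = imbalance_ratio(120, len(minority))
ratio_after = imbalance_ratio(120, len(minority) + len(new_points))
```

Because every synthetic point lies on a segment between two existing minority points, over-sampling lowers the imbalance ratio without leaving the region the minority class already occupies.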

    Can Threshold-Based Sensor Alerts be Analysed to Detect Faults in a District Heating Network?

    Older IoT “smart sensors” create system alerts from threshold rules on reading values. These simple thresholds are not very flexible to changes in the network. Due to the large number of false positives generated, these alerts are often ignored by network operators. Current state-of-the-art analytical models typically create alerts using raw sensor readings as the primary input. However, as greater numbers of sensors are deployed, the growth in the number of readings that must be processed becomes problematic. The number of analytic models deployed to each of these systems is also increasing as analysis is broadened. This study investigates whether alerts created using threshold rules can be used to predict network faults. By using threshold-based alerts instead of raw continuous readings, the amount of data that the analytic models need to process is greatly reduced. The study was done using alert data from a European city’s District Heating network. The alerts were generated by “smart sensors” that used threshold rules. Analytic models were tested to find the most accurate prediction of a network fault. Work order (maintenance) records were used as the target variable, indicating that a fault had occurred at the same time and location as an active alert. The target variable was highly imbalanced (96:4), with the minority class being cases in which a work order was required. The decision tree model developed used misclassification costs to achieve a reasonable accuracy with a trade-off between precision (0.63) and recall (0.56). The sparse nature of the alert data may be to blame for this result. The results show promise that this method could work well on datasets with better sensor coverage.
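
The role of misclassification costs and the precision/recall trade-off mentioned in this abstract can be sketched as follows. This is not the study's decision tree: it applies costs to a predicted fault probability, which is a simpler way to make a classifier cost sensitive. The cost values, function names, and toy labels are assumptions made up for illustration.

```python
def cost_sensitive_predict(p_fault, cost_fn=5.0, cost_fp=1.0):
    """Predict a fault (1) when the expected cost of missing it
    exceeds the expected cost of raising a false alarm."""
    return 1 if p_fault * cost_fn > (1.0 - p_fault) * cost_fp else 0

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (fault) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels (made up): 1 = work order required, 0 = no fault.
y_true = [1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
```

With a missed fault assumed to cost five times as much as a false alarm, the decision threshold on the fault probability drops from 1/2 to 1/6, which tends to raise recall at the expense of precision.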