3,125 research outputs found

    Emerging Chemical Patterns for Virtual Screening and Knowledge Discovery

    The adaptation and evaluation of contemporary data mining methods to chemical and biological problems is one of the major areas of research in chemoinformatics. Currently, large databases containing millions of small organic compounds are publicly available, and the need for advanced methods to analyze these data is increasing. Most methods used in chemoinformatics, e.g. quantitative structure-activity relationship (QSAR) modeling, decision trees and similarity searching, depend on the availability of large, high-quality training data sets. However, in biological settings, the availability of these training sets is rather limited. This is especially true for early stages of drug discovery projects, where typically only a few active molecules are available. The ability of chemoinformatic methods to generalize from small training sets and accurately predict compound properties such as activity, ADME or toxicity is thus crucially important. Additionally, biological data such as results from high-throughput screening (HTS) campaigns are heavily biased towards inactive compounds. This bias presents an additional challenge for the adaptation of data mining methods and distinguishes chemoinformatics data from the standard benchmark scenarios in the data mining community. Even if a highly accurate classifier were available, it would still be necessary to evaluate the predictions experimentally. These experiments are both costly and time-consuming, and the need to optimize resources has driven the development of integrated screening protocols that try to minimize experimental effort while still reaching high hit rates of active compounds. This integration, termed “sequential screening”, benefits from the complementary nature of experimental HTS and computational virtual screening (VS) methods. In this thesis, a current data mining framework based on class-specific nominal combinations of attributes (emerging patterns) is adapted to chemoinformatic problems and thoroughly evaluated. 
Combining emerging pattern methodology and the well-known notion of chemical descriptors, emerging chemical patterns (ECPs) are defined as class-specific descriptor value range combinations. Each pattern can be thought of as a region in chemical space that is dominated by compounds from one class only. Based on chemical patterns, several experiments are presented that evaluate the performance of pattern-based knowledge mining, property prediction, compound ranking and sequential screening. ECP-based classification is implemented and evaluated on four activity classes for the prediction of compound potency levels. Compared to decision trees and a Bayesian binary QSAR method, ECP-based classification produces high accuracy in positive and negative classes even on the basis of very small training sets, a result especially valuable for chemoinformatic problems. The simple nature of ECPs as class-specific descriptor value range combinations makes them easily interpretable. This is used in a knowledge mining experiment to relate ECPs to changes in the interaction network of protein-ligand complexes when the binding conformation is replaced by a computer-modeled conformation. ECPs capture well-known energetic differences between binding and energy-minimized conformations and additionally provide new insight into these differences in a class-level analysis. Finally, the integration of ECPs and HTS is evaluated in simulated lead optimization and sequential screening experiments. The high accuracy on very small training sets is exploited to design an iterative simulated lead optimization experiment based on experimental evaluation of randomly selected small training sets. In each iteration, all compounds predicted to be weakly active are removed, so that the remaining compound set is enriched with highly potent compounds. 
On this basis, a simulated sequential screening experiment shows that ECP-based ranking recovers 19% of available compounds while reducing the “experimental” effort to 0.2%. These findings illustrate the potential of sequential screening protocols and will hopefully increase the popularity of this relatively new methodology.
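
The definition of an ECP as a class-specific combination of descriptor value ranges can be illustrated with a minimal sketch. The descriptor names (`logP`, `hbd`), the toy compounds and the helper functions below are hypothetical and are not taken from the thesis; the sketch only shows how the support of one range combination might be compared between two classes.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: a compound is a dict of descriptor values,
# a pattern is a dict mapping descriptor names to inclusive (low, high) ranges.
Compound = Dict[str, float]
Pattern = Dict[str, Tuple[float, float]]


def matches(compound: Compound, pattern: Pattern) -> bool:
    """Return True if every descriptor value falls inside its range."""
    return all(low <= compound[name] <= high
               for name, (low, high) in pattern.items())


def support(compounds: List[Compound], pattern: Pattern) -> float:
    """Fraction of compounds in a class that match the pattern."""
    if not compounds:
        return 0.0
    return sum(matches(c, pattern) for c in compounds) / len(compounds)


# Toy data (invented for illustration only).
actives = [
    {"logP": 2.1, "hbd": 1.0},
    {"logP": 2.8, "hbd": 2.0},
    {"logP": 4.5, "hbd": 0.0},
]
inactives = [
    {"logP": 5.2, "hbd": 0.0},
    {"logP": 0.4, "hbd": 5.0},
]

pattern = {"logP": (2.0, 3.0), "hbd": (1.0, 2.0)}

active_support = support(actives, pattern)
inactive_support = support(inactives, pattern)

# The pattern covers a region of descriptor space occupied by one class only.
is_class_specific = active_support > 0.0 and inactive_support == 0.0
print(active_support, inactive_support, is_class_specific)
```

In this toy setting the pattern matches two of the three active compounds and none of the inactive ones, so it describes a region dominated by a single class.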

    Emergent intertransaction association rules for abnormality detection in intelligent environments

    This paper is concerned with identifying anomalous behaviour of people in smart environments. We propose the use of emergent transaction mining, with the extended frequent pattern tree as a basis. Our experiments on two data sets demonstrate that emergent intertransaction associations are able to detect abnormality present in real-world data and that both short- and long-term behavioural changes can be discovered. The use of intertransaction associations is shown to be advantageous in the detection of temporal association anomalies that are not readily detectable by traditional "market basket" intratransaction mining.

    Associative classification: an analytical survey. Part 2

    The paper continues the survey of associative classification in the context of big data processing. An extended overview and comparative analysis of the modern approaches, models and algorithms developed for associative classification form the main content of the paper. In conclusion, the paper outlines the main advantages and drawbacks of associative classification as a machine learning model and evaluates its prospects from a big data processing perspective.

    A framework for trend mining with application to medical data

    This thesis presents research work conducted in the field of knowledge discovery. It presents an integrated trend-mining framework and SOMA, an application of the trend-mining framework to diabetic retinopathy data. Trend mining is the process of identifying and analysing trends in the context of the variation of support of the association/classification rules that have been extracted from longitudinal datasets. The integrated framework covers all major processes from data preparation to the extraction of knowledge. At the pre-processing stage, data are cleaned, transformed if necessary, and sorted into time-stamped datasets using logic rules. At the next stage, the time-stamped datasets are passed through the main processing, in which the matrix algorithm, an ARM technique, is applied to identify frequent rules with acceptable confidence. Mathematical conditions are applied to classify the sequences of support values into trends. Afterwards, interestingness criteria are applied to obtain interesting knowledge, and a visualization technique is proposed that maps how objects move from one time stamp to the next. A validation and verification (external and internal validation) framework is described that aims to ensure that the results at the intermediate stages of the framework are correct and that the framework as a whole can yield results that demonstrate causality. To evaluate the thesis, SOMA was developed. The dataset is, in itself, also of interest, as it is very noisy (in common with other similar medical datasets) and does not feature a clear association between specific time stamps and subsets of the data. The Royal Liverpool University Hospital has been a major centre for retinopathy research since 1991. Retinopathy is a generic term used to describe damage to the retina of the eye, which can, in the long term, lead to visual loss. 
Diabetic retinopathy data are used to evaluate the framework and to determine whether SOMA can extract knowledge that is already known to the medics. The results show that those datasets can be used to extract knowledge that can show causality between patients’ characteristics, such as the age of the patient at diagnosis, type of diabetes and duration of diabetes, and diabetic retinopathy.
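
The step of classifying sequences of support values into trends can be sketched as follows. The abstract does not state the mathematical conditions used in the thesis, so the four labels, the tolerance parameter `tol` and the function name `classify_trend` are assumptions made purely for illustration.

```python
from typing import List


def classify_trend(supports: List[float], tol: float = 1e-9) -> str:
    """Label a sequence of per-time-stamp support values.

    The labels and conditions here are illustrative assumptions,
    not the conditions defined in the thesis.
    """
    # Differences between consecutive time stamps.
    diffs = [later - earlier for earlier, later in zip(supports, supports[1:])]
    if all(abs(d) <= tol for d in diffs):
        return "constant"
    if all(d >= -tol for d in diffs):
        return "increasing"
    if all(d <= tol for d in diffs):
        return "decreasing"
    return "fluctuating"


# Support of one hypothetical rule across four time-stamped datasets.
print(classify_trend([0.10, 0.15, 0.15, 0.30]))
print(classify_trend([0.40, 0.25, 0.10]))
print(classify_trend([0.20, 0.35, 0.15]))
```

A rule whose support never drops between consecutive time stamps is labelled as increasing here, even if it stays flat for some steps; a stricter definition would be equally plausible.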

    Improving the understanding of cancer in a descriptive way: An emerging pattern mining-based approach

    This paper presents an approach based on emerging pattern mining to analyse cancer through genomic data. Unlike existing approaches, which focus mainly on prediction, the proposal aims to improve the understanding of cancer descriptively, without requiring any prior knowledge or hypothesis to be validated. Additionally, it makes it possible to consider high-order relationships, so that not only essential genes related to the disease are considered, but also the combined effect of various secondary genes that can influence different pathways directly or indirectly related to the disease. The prime hypothesis is that splitting genomic cancer data into two subsets, that is, cases and controls, will allow us to determine which genes, and their expressions, are associated with different cancer types. The possibilities of the proposal are demonstrated by analyzing RNA-Seq data for six different types of cancer: breast, colon, lung, thyroid, prostate, and kidney. Some of the extracted insights were already described in the related literature as good cancer biomarkers, while others have not been described yet, mainly because existing techniques are biased by prior knowledge provided by biological databases.

    Attribute Oriented Induction High Level Emerging Pattern (AOI-HEP)

    Attribute-Oriented Induction of High-level Emerging Pattern (AOI-HEP) is a combination of Attribute-Oriented Induction (AOI) and Emerging Patterns (EP). AOI is a summarisation algorithm that compacts a given dataset into small conceptual descriptions, where each attribute has a defined concept hierarchy. This yields patterns that are easily readable and understandable. Emerging patterns are patterns discovered between two datasets and between two time periods such that patterns found in the first dataset have grown (or shrunk) in size, disappeared entirely, or newly emerged. Unlike EP mining algorithms, AOI-HEP does not rely on a border-based algorithm. It is therefore desirable to obtain summarised emerging patterns between two datasets, and we propose the High-level Emerging Pattern (HEP) algorithm for this purpose. The main purpose of combining AOI and EP is to use the respective strengths of AOI and EP to extract important high-level emerging patterns from data. The AOI characteristic rule algorithm is run twice, once on each of two input datasets, to create two rulesets, which are then processed with the HEP algorithm. First, the HEP algorithm takes the Cartesian product of the two rulesets and eliminates rule pairs by computing a similarity metric (a categorisation of attribute comparisons). Second, the rule pairs retained by the similarity metric are discriminated by computing a growth rate, the ratio of the supports of the rules from the two rulesets. The categorisation of attribute comparisons is based on the similarity hierarchy level. Attributes were found to subsume each other in three ways, giving Total Subsumption HEP (TSHEP), Subsumption Overlapping HEP (SOHEP) and Total Overlapping HEP (TOHEP) patterns. Meanwhile, from certain similarity hierarchy levels and values, we can mine frequent and similar patterns that create discriminant rules. 
We used four large real datasets from the UCI Machine Learning Repository and discovered valuable HEP patterns, including strong discriminant rules and frequent and similar patterns. Moreover, the experiments showed that most datasets contain SOHEP patterns but not TSHEP or TOHEP patterns, and that TOHEP patterns were the rarest. Since AOI-HEP can strongly discriminate high-level data, it can be applied to discriminate datasets, for example to distinguish good from bad customers in banking loan systems or among credit card applicants. Moreover, AOI-HEP can be applied to mine similar patterns, for instance similar customer loan patterns.
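
The growth rate used in the second HEP step, described above as a ratio of supports between rules from two rulesets, can be sketched in a few lines. The function name and the handling of zero supports follow the convention commonly used for emerging patterns and are assumptions here; the abstract does not say how AOI-HEP treats these edge cases.

```python
import math


def growth_rate(support_first: float, support_second: float) -> float:
    """Ratio of a rule's support in the second ruleset to the first.

    Zero-support handling follows a common emerging-pattern convention
    (assumed here, not confirmed by the abstract).
    """
    if support_first == 0.0:
        # Absent from both rulesets: no growth. Present only in the second:
        # unbounded growth.
        return 0.0 if support_second == 0.0 else math.inf
    return support_second / support_first


# A hypothetical rule covering 25% of the first dataset and 50% of the second.
print(growth_rate(0.25, 0.5))
# A hypothetical rule that appears only in the second dataset.
print(growth_rate(0.0, 0.1))
```

A growth rate well above 1 marks a rule that discriminates the second dataset from the first; an infinite value marks a rule that occurs in one dataset only.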