
    Being a binding site: Characterizing residue composition of binding sites on proteins

    The Protein Data Bank (PDB) today contains descriptions of more than 45,000 three-dimensional protein and nucleic-acid structures. Because the PDB began as a computer-readable depository of crystallographic data complementing printed articles, the proper interpretation of individual PDB files still frequently requires the detailed information found in the citing publication. This makes fully automatic processing of the whole PDB a very hard task. We first cleaned and restructured the PDB data, then analyzed the residue composition of the binding sites in the whole PDB for frequency and for hidden association rules. The main results of the paper are: (i) the cleaning and repairing algorithm; (ii) redundancy elimination from the data; (iii) the application of association rule mining to the cleaned, non-redundant data set. We have found numerous significant relations in the residue composition of the ligand binding sites on protein surfaces, summarized in two figures. Association rule mining, one of the classical data-mining methods for exploring implication rules, is capable of finding previously unknown residue-set preferences of bound ligands on protein surfaces. Since protein-ligand binding is a key step in enzymatic mechanisms and in drug discovery, the preferences uncovered in this study of more than 19,500 binding sites may help in identifying new binding protein-ligand pairs.

    Stock Selection System: Building Long/Short Portfolios Using Intraday Patterns

    We present the architecture of a complete intraday trading management system that uses a stock selection algorithm to build long/short portfolios. Our approach to stock selection is based on technologies for knowledge discovery in large databases; more specifically, we build techniques that discover hidden price patterns capable of predicting the direction of stock prices. The base of the whole system is a novel algorithm for mining patterns from time-series data, which expresses highly compute-intensive aggregation calculations as complex but efficient distributed SQL queries on the relational databases that store the time series. This algorithm follows a path similar to that of association rule mining algorithms (such as the Apriori and FP-Growth algorithms), but with major differences. First, it has one more step at initialization, which uses a rule-based generator algorithm that transforms relations and data structures into a binary string format. These patterns contain mixed information: small price patterns (3, 4 or 5 candlesticks) and trading signals/filters produced by technical indicators. Second, instead of generating association rules as output, it produces probabilities (varying between 60% and 95%) about the future stock price direction, such as “within the next 5 periods, the price will increase (or decrease) by more than 2%”. The stock selection system then uses all this information to find the stocks that appear to form the patterns with the highest accuracy in predicting prices. In each time period (the finest interval length being 10 minutes), the stock selector system, supported by the trading systems, examines the current portfolio and decides which stocks should be replaced by other, more profitable stocks. The system test, which produced interesting results, used a portfolio of 10 stocks with open long positions and 5 stocks with open short positions.

    FP-tree and COFI Based Approach for Mining of Multiple Level Association Rules in Large Databases

    In recent years, the discovery of association rules among itemsets in a large database has been described as an important database-mining problem. The problem has received considerable research attention, and several algorithms for mining frequent itemsets have been developed. Many algorithms have been proposed to discover rules at a single concept level. However, mining association rules at multiple concept levels may lead to the discovery of more specific and concrete knowledge from data, and the discovery of multiple-level association rules is very useful in many applications. In most studies of multiple-level association rule mining, the database is scanned repeatedly, which reduces the efficiency of the mining process. In this research paper, a new method for discovering multilevel association rules is proposed. It is based on the FP-tree structure and uses a co-occurrence frequent item (COFI) tree to find frequent items in a multilevel concept hierarchy. Comment: Pages IEEE format, International Journal of Computer Science and Information Security, IJCSIS, Vol. 7 No. 2, February 2010, USA. ISSN 1947 5500, http://sites.google.com/site/ijcsis
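To illustrate what "multiple concept levels" means here, the following is a minimal sketch that counts item support at two levels of a concept hierarchy. It is not the FP-tree/COFI method of the abstract: the `HIERARCHY` mapping, the item names, and the function `frequent_items_by_level` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical two-level concept hierarchy: leaf item -> parent category.
HIERARCHY = {
    "skim milk": "milk",
    "whole milk": "milk",
    "wheat bread": "bread",
    "white bread": "bread",
}


def frequent_items_by_level(transactions, min_support):
    """Return (frequent parent categories, frequent leaf items) with their supports.

    Support is the fraction of transactions that contain the item, or, at the
    parent level, at least one item belonging to the category.
    """
    n = len(transactions)
    leaf_counts = Counter()
    parent_counts = Counter()
    for items in transactions:
        # Use sets so each item or category counts once per transaction.
        leaf_counts.update(set(items))
        parent_counts.update({HIERARCHY[item] for item in items})

    def keep_frequent(counts):
        return {k: c / n for k, c in counts.items() if c / n >= min_support}

    return keep_frequent(parent_counts), keep_frequent(leaf_counts)


transactions = [
    ["skim milk", "wheat bread"],
    ["whole milk", "wheat bread"],
    ["skim milk"],
    ["white bread"],
]
parents, leaves = frequent_items_by_level(transactions, min_support=0.5)
```

In this toy data, both categories are frequent at the upper level, while only some of their leaf items reach the same threshold at the lower level, which is why multilevel methods often use a different minimum support per level.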

    New Approaches to Frequent and Incremental Frequent Pattern Mining

    Data Mining (DM) is a process for extracting interesting patterns from large volumes of data. It is one of the crucial steps in Knowledge Discovery in Databases (KDD). It involves various data mining methods that mainly fall into predictive and descriptive models. Descriptive models look for patterns, rules, relationships and associations within data. One of the descriptive methods is association rule analysis, which represents the co-occurrence of items or events. Association rules are commonly used in market basket analysis. An association rule has the form X → Y and states that X and Y co-occur with a given level of support and confidence. Association rule mining is a common technique for discovering interesting frequent patterns in large datasets acquired in various application domains. With petabytes of data arriving in data stores perhaps every day, many researchers have looked for efficient methods of analyzing these large datasets. Many algorithms have been proposed for searching for frequent patterns. The search space explodes combinatorially as the size of the source data increases. Simply using more powerful computers, or even supercomputers, to handle the ever-increasing size of large data sets is not sufficient. Hence, incremental algorithms have been developed and used to improve the efficiency of frequent pattern mining. One of the challenges of frequent itemset mining is the long running time of the algorithms. Two major causes of long running times are the number of database scans and the number of candidates generated. (The latter requires memory: the more candidates there are, the more memory is needed, and when the candidates do not fit in memory, page swapping occurs, which increases the running time of the algorithms.)
    In this dissertation we propose a new implementation of the Apriori algorithm, NCLAT (Near Candidate-less Apriori with Tidlists), which scans the database only once and creates candidates only for level one (1-itemsets), so the number of candidates equals the total number of unique items in the database. In addition, we show how the results depend on the choice of data structures (probabilistic or not), on whether the datasets are horizontal or vertical, on how counting is done, and on whether the algorithms run sequentially or in parallel. We implement and explore the incremental algorithm UWEP with sequential as well as parallel computation. We have also fixed a minor bug in UWEP and created a more efficient version, UWEP2, which reduces the number of candidates created and the number of database scans. We have run all of our tests against three datasets with different features and for different minimum support levels. We present test results for both the frequent and the incremental frequent itemset mining implementations and compare them with each other. While a lot of work has been done on frequent itemset mining on structured data, very little has been done on unstructured data. We have therefore created a new hybrid pattern search algorithm, Double-Hash, which performed better than the known pattern search algorithms in all of our test scenarios. Double-Hash can potentially be used for frequent itemset mining on unstructured data in the future. We will present our work and test results on this as well.
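The support and confidence of a rule X → Y, and the idea of a vertical (tidlist) data layout built in one database scan, can be sketched as follows. This is a minimal illustration under stated assumptions, not the NCLAT implementation from the dissertation: the function names `build_tidlists`, `support`, and `confidence` and the toy transactions are hypothetical.

```python
def build_tidlists(transactions):
    """One pass over the data: map each item to the set of ids of the
    transactions that contain it (the vertical, or tidlist, layout)."""
    tidlists = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists


def support(itemset, tidlists, n_transactions):
    """Fraction of transactions containing every item of a non-empty itemset,
    obtained by intersecting tidlists instead of rescanning the data."""
    tids = set.intersection(*(tidlists.get(item, set()) for item in itemset))
    return len(tids) / n_transactions


def confidence(x, y, tidlists, n_transactions):
    """Confidence of the rule X -> Y: support(X and Y together) / support(X)."""
    return support(x | y, tidlists, n_transactions) / support(x, tidlists, n_transactions)


# Toy market-basket data (assumed for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
tidlists = build_tidlists(transactions)
n = len(transactions)

rule_support = support({"bread", "milk"}, tidlists, n)          # 2 of 4 transactions
rule_confidence = confidence({"bread"}, {"milk"}, tidlists, n)  # 2 of the 3 bread transactions
```

Once the tidlists exist, the support of any larger itemset follows from set intersections alone, which is the general reason tidlist-based approaches can avoid repeated database scans.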