
    Text Classification Using Association Rules, Dependency Pruning and Hyperonymization

    We present new methods for pruning and enhancing itemsets for text classification via association rule mining. Pruning methods are based on dependency syntax, and enhancing methods are based on replacing words by their hyperonyms of various orders. We discuss the impact of these methods, compared to pruning based on the tf-idf rank of words. Comment: 16 pages, 2 figures, presented at DMNLP 201
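    The tf-idf baseline mentioned above can be illustrated with a minimal sketch. It assumes one common tf-idf variant (relative term frequency times the natural log of inverse document frequency) and keeps only the highest-ranked words of each document; the function name `prune_by_tfidf` and the parameter `keep` are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def prune_by_tfidf(docs, keep):
    """Keep the `keep` highest-scoring words of each document.

    docs: list of token lists. Returns one set of words per document.
    The scoring formula is an assumed, common tf-idf variant.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    pruned = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        }
        # Rank by descending score; break ties alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        pruned.append(set(ranked[:keep]))
    return pruned
```

    The pruned word sets would then serve as the transactions handed to an association rule miner; the dependency-based pruning and hyperonymization described in the abstract are not modelled here.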

    Associative pattern mining for supervised learning

    The Internet era has revolutionized computational sciences and automated data collection techniques, made large amounts of previously inaccessible data available and, consequently, broadened the scope of exploratory computing research. As a result, data mining, which is still an emerging field of research, has gained importance because of its ability to analyze and discover previously unknown, hidden, and useful knowledge from these large amounts of data. One aspect of data mining, known as frequent pattern mining, has recently gained importance due to its ability to find associative relationships among the parts of data, thereby aiding a type of supervised learning known as associative learning. The purpose of this dissertation is two-fold: to develop and demonstrate supervised associative learning in non-temporal data for multi-class classification, and to develop a new frequent pattern mining algorithm for time-varying (temporal) data which alleviates the current issues in analyzing this data for knowledge discovery. In order to use associative relationships for classification, we have to algorithmically learn their discriminatory power. While it is well known that multiple sets of features work better for classification, we claim that the isomorphic relationships among the features work even better and, therefore, can be used as higher-order features. To validate this claim, we exploit these relationships as input features for classification instead of using the underlying raw features. The next part of this dissertation focuses on building a new classifier using associative relationships as a basis for the multi-class classification problem. Most of the existing associative classifiers represent the instances from a class in a row-based format, wherein one row represents the features of one instance, and extract association rules from the entire dataset. The rules formed in this way are known as class constrained rules, as they have class labels on the right side of the rules. We argue that this class constrained representation schema lacks important information that is necessary for multi-class classification. Further, most existing works use either the intra-class or the inter-class importance of the association rules, and each of these approaches offers empirical benefits. We hypothesize that both intra-class and inter-class variations are important for fast and accurate multi-class classification. We also present a novel weighted association rule-based classification mechanism that uses frequent relationships among raw features from an instance as the basis for classifying the instance into one of the many classes. The relationships are weighted according to both their intra-class and inter-class importance. The final part of this dissertation concentrates on mining time-varying data. This problem is known as inter-transaction association rule mining in the data mining field. Most of the existing work transforms the time-varying data into a static format and then uses multiple scans over the new data to extract patterns. We present a unique index-based algorithmic framework for inter-transaction association rule mining. Our proposed technique requires only one scan of the original database. Further, the proposed technique can also provide the location information of each extracted pattern. We use mathematical induction to prove that the new representation scheme captures all underlying frequent relationships.
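    To make the notion of a class constrained rule concrete, the sketch below scores a rule of the form "feature set -> class label" on a row-based dataset using the standard support and confidence definitions. It does not reproduce the dissertation's intra-class and inter-class weighting; the function name `rule_stats` and the toy data layout are assumptions made for illustration.

```python
def rule_stats(rows, antecedent, label):
    """Support and confidence of the rule `antecedent -> label`.

    rows: list of (feature_set, class_label) pairs, one per instance.
    Uses the textbook definitions, not the dissertation's weighting.
    """
    antecedent = frozenset(antecedent)
    # Class labels of all rows whose features contain the antecedent.
    covered = [cls for feats, cls in rows if antecedent <= frozenset(feats)]
    if not covered:
        return 0.0, 0.0
    hits = sum(1 for cls in covered if cls == label)
    support = hits / len(rows)        # fraction of all rows matching the rule
    confidence = hits / len(covered)  # fraction of covered rows with the label
    return support, confidence
```

    For example, with four rows of which three contain the feature `red` and two of those are labelled `apple`, the rule `{red} -> apple` has support 0.5 and confidence 2/3.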

    Extended Apriori for association rule mining: Diminution based utility weightage measuring approach

    Association rule mining is an active area of knowledge discovery for which numerous procedures have been proposed. Recently, researchers have improved the quality of association rule mining for industry by incorporating significant components such as the value of items (utility) and the volume of items (weight) when bringing out association patterns. In this note, an efficient methodology based on weight and utility is put forward for mining important association rules. First, a traditional Apriori algorithm is used, which exploits the anti-monotone property: if a set of n items is frequent, then each of its subsets of n-1 items must also be frequent. From this, the scores of weightage (W-Gain), utility (U-Gain) and diminution (D-sum) are derived. Finally, we derive a subset of important association rules, through which the EUW-Score is generated. The experimental results demonstrate the effectiveness of the methodology in generating high-utility association rules that can be profitably used for business improvement.
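    The Apriori step and its anti-monotone pruning can be sketched as follows. This is a minimal illustration of classic Apriori on absolute support counts, not the extended method of the paper: the W-Gain, U-Gain, D-sum and EUW-Score computations are not reproduced, and the function name `apriori` and its parameters are chosen here for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        candidates = set()
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step (anti-monotone property): a k-itemset can only be
            # frequent if every one of its (k-1)-subsets is frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Count the surviving candidates against the transactions.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent
```

    In a utility-aware extension, the support counts returned here would be the starting point for the additional weight and utility scores.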

    CESI: Canonicalizing Open Knowledge Bases using Embeddings and Side Information

    Open Information Extraction (OpenIE) methods extract (noun phrase, relation phrase, noun phrase) triples from text, resulting in the construction of large Open Knowledge Bases (Open KBs). The noun phrases (NPs) and relation phrases in such Open KBs are not canonicalized, leading to the storage of redundant and ambiguous facts. Recent research has posed canonicalization of Open KBs as clustering over manually defined feature spaces. Manual feature engineering is expensive and often sub-optimal. In order to overcome this challenge, we propose Canonicalization using Embeddings and Side Information (CESI), a novel approach which performs canonicalization over learned embeddings of Open KBs. CESI extends recent advances in KB embedding by incorporating relevant NP and relation phrase side information in a principled manner. Through extensive experiments on multiple real-world datasets, we demonstrate CESI's effectiveness. Comment: Accepted at WWW 201

    Data Mining in Electronic Commerce

    Modern business is rushing toward e-commerce. If the transition is done properly, it enables better management, new services, lower transaction costs and better customer relations. Success depends on skilled information technologists, among whom are statisticians. This paper focuses on some of the contributions that statisticians are making to help change the business world, especially through the development and application of data mining methods. This is a very large area, and the topics we cover are chosen to avoid overlap with other papers in this special issue, as well as to respect the limitations of our expertise. Inevitably, electronic commerce has raised and is raising fresh research problems in a very wide range of statistical areas, and we try to emphasize those challenges. Comment: Published at http://dx.doi.org/10.1214/088342306000000204 in the Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    MINING TOP-K FREQUENT SEQUENTIAL PATTERN IN ITEM INTERVAL EXTENDED SEQUENCE DATABASE

    Frequent sequential pattern mining in item interval extended sequence databases (iSDB) has been an interesting task in recent years. Unlike classic frequent sequential pattern mining, pattern mining in an iSDB also considers the item interval between successive items; thus, it may extract more meaningful sequential patterns in real-life applications. Most previous algorithms for frequent sequential pattern mining in iSDBs need a minimum support threshold (minsup) to perform the mining. However, it is not easy for users to provide an appropriate threshold in practice: a minsup value that is too high leads to missing valuable patterns, while one that is too low may generate too many useless patterns. To address this problem, we propose an algorithm, TopKWFP (Top-k weighted frequent sequential pattern mining in item interval extended sequence database). Our algorithm does not require the user to provide a fixed minsup value; instead, the minsup value is raised dynamically during the mining process.
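    The idea of a dynamically raised minsup can be sketched with a bounded min-heap. This is a simplified illustration, not the TopKWFP algorithm itself: it ignores item intervals and weights, and assumes that candidate patterns arrive with their support already counted. The function name `top_k_patterns` is hypothetical.

```python
import heapq


def top_k_patterns(candidates, k):
    """Keep the k best-supported patterns from (pattern, support) pairs.

    Returns (patterns sorted by descending support, final minsup).
    """
    heap = []   # min-heap of (support, pattern); heap[0] is the weakest kept
    minsup = 0  # starts at zero, so nothing is rejected until k are found
    for pattern, support in candidates:
        if len(heap) == k and support <= minsup:
            continue  # cannot displace any of the current top k
        heapq.heappush(heap, (support, pattern))
        if len(heap) > k:
            heapq.heappop(heap)  # drop the weakest pattern
        if len(heap) == k:
            minsup = heap[0][0]  # raise the threshold to the k-th best support
    return sorted(heap, reverse=True), minsup
```

    In a full miner, the rising `minsup` would also be used to cut off the search below patterns whose support can no longer reach the threshold, which is where most of the savings come from.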