4,063 research outputs found

    DOMAIN ABSTRACTION OF HIGHLY CORRELATED PAIRS TO RECOMMEND IN THE LONG TAIL

    Abstract. Among the difficulties encountered by modern shopping recommenders is the long-tail shape of sold items, which is also related to cold-start issues. Various approaches, including content-based recommendations, attempt to overcome this problem, which has a serious impact on the accuracy of recommendations, especially when new products are continuously added to the catalogue. This paper investigates the use of an algorithm to search for highly correlated pairs between abstractions of items. The advantage of this approach is evaluated on real data, showing better results than an approach based only on the concrete pairs of items. Using rigorous protocols such as Given-n, experimental results show significant improvement in both the recommendation accuracy and the recommendation of products in the long tail. Keywords: Knowledge Discovery, Mining Correlated Pairs, Recommender Systems

    New probabilistic interest measures for association rules

    Mining association rules is an important technique for discovering meaningful patterns in transaction databases. Many different measures of interestingness have been proposed for association rules. However, these measures fail to take the probabilistic properties of the mined data into account. In this paper, we start by presenting a simple probabilistic framework for transaction data which can be used to simulate transaction data when no associations are present. We use such data and a real-world database from a grocery outlet to explore the behavior of confidence and lift, two popular interest measures used for rule mining. The results show that confidence is systematically influenced by the frequency of the items in the left-hand side of rules and that lift performs poorly at filtering random noise in transaction data. Based on the probabilistic framework we develop two new interest measures, hyper-lift and hyper-confidence, which can be used to filter or order mined association rules. The new measures show significantly better performance than lift for applications where spurious rules are problematic.
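
For readers unfamiliar with the two baseline measures named in this abstract, the following is a minimal sketch of how confidence and lift are conventionally computed for a rule X => Y. The toy transactions and the function names are assumptions made for illustration and do not come from the paper. Hyper-lift and hyper-confidence are not shown, because the abstract does not define them.

```python
from typing import FrozenSet, List

# Toy transaction data (hypothetical, for illustration only).
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread"}),
]


def support(transactions: List[FrozenSet[str]], itemset: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, lhs: FrozenSet[str], rhs: FrozenSet[str]) -> float:
    """Standard confidence: supp(X u Y) / supp(X)."""
    return support(transactions, lhs | rhs) / support(transactions, lhs)


def lift(transactions, lhs: FrozenSet[str], rhs: FrozenSet[str]) -> float:
    """Standard lift: conf(X => Y) / supp(Y); 1.0 means independence."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)


bread = frozenset({"bread"})
milk = frozenset({"milk"})
conf_bread_milk = confidence(TRANSACTIONS, bread, milk)
lift_bread_milk = lift(TRANSACTIONS, bread, milk)
print(f"confidence = {conf_bread_milk:.3f}, lift = {lift_bread_milk:.3f}")
```

In this toy data the lift of bread => milk is below 1, so the two items co-occur slightly less often than independence would predict, even though the confidence is 0.5.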

    Frequent Itemset Mining for Big Data

    Traditional data mining tools, developed to extract actionable knowledge from data, have proven inadequate to process the huge amount of data produced nowadays. Even the most popular algorithms related to Frequent Itemset Mining, an exploratory data analysis technique used to discover frequent item co-occurrences in a transactional dataset, are inefficient with larger and more complex data. As a consequence, many parallel algorithms have been developed, based on modern frameworks able to leverage distributed computation in commodity clusters of machines (e.g., Apache Hadoop, Apache Spark). However, frequent itemset mining parallelization is far from trivial. The search-space exploration, on which all the techniques are based, is not easily partitionable. Hence, distributed frequent itemset mining is a challenging problem and an interesting research topic. In this context, our main contributions consist of (i) an exhaustive theoretical and experimental analysis of the best-in-class approaches, whose outcomes and open issues motivated (ii) the development of a distributed high-dimensional frequent itemset miner. The dissertation also introduces (iii) a data mining framework which takes strong advantage of distributed frequent itemset mining for the extraction of a specific type of itemsets. The theoretical analysis highlights the challenges related to the distribution and the preliminary partitioning of the frequent itemset mining problem (i.e., the search-space exploration), describing the most widely adopted distribution strategies. The extensive experimental campaign, instead, compares the expectations related to the algorithmic choices against the actual performance of the algorithms. We ran more than 300 experiments in order to evaluate and discuss the performance of the algorithms with respect to different real-life use cases and data distributions.
    The outcome of the review is that no algorithm is universally superior and that performance is heavily skewed by the data distribution. Moreover, we were able to identify a concrete gap as regards frequent pattern extraction within high-dimensional use cases. For this reason, we have developed our own distributed high-dimensional frequent itemset miner based on Apache Hadoop. The algorithm splits the search-space exploration into independent sub-tasks. However, since the exploration strongly benefits from full knowledge of the problem, we introduced an interleaving synchronization phase. The result is a trade-off between the benefits of a centralized state and those related to the additional computational power due to parallelism. The experimental benchmarks, performed on real-life high-dimensional use cases, show the efficiency of the proposed approach in terms of execution time, load balancing and robustness to memory issues. Finally, the dissertation introduces a data mining framework in which distributed itemset mining is a fundamental component of the processing pipeline. The aim of the framework is the extraction of a new type of itemsets, called misleading generalized itemsets.

    Identifying collaborations among researchers: a pattern-based approach

    In recent years a huge number of publications and scientific reports have become available through digital libraries and online databases. Digital libraries commonly provide advanced search interfaces, through which researchers can find and explore the most related scientific studies. Even though the publications of a single author can be easily retrieved and explored, understanding how authors have collaborated with each other on specific research topics and to what extent their collaborations have been fruitful is, in general, a challenging task. This paper proposes a new pattern-based approach to analyzing the correlations among the authors of the most influential research studies. To this purpose, it analyzes publication data retrieved from digital libraries and online databases by means of an itemset-based data mining algorithm. It automatically extracts patterns representing the most relevant collaborations among authors on specific research topics. Patterns are evaluated and ranked according to the number of citations received by the corresponding publications. The proposed approach was validated in a real case study, i.e., the analysis of scientific literature on genomics. Specifically, we first analyzed scientific studies on genomics acquired from the OMIM database to discover correlations between authors and genes or genetic disorders. Then, the reliability of the discovered patterns was assessed using the PubMed search engine. The results show that, for the majority of the mined patterns, the most influential (top-ranked) studies retrieved by performing author-driven PubMed queries range over the same gene/genetic disorder indicated by the top-ranked pattern.

    Twitter data analysis by means of Strong Flipping Generalized Itemsets

    Twitter data has recently been considered to perform a large variety of advanced analyses. Analysis of Twitter data imposes new challenges because the data distribution is intrinsically sparse, due to the large number of messages posted every day using a wide vocabulary. Aimed at addressing this issue, generalized itemsets (sets of items at different abstraction levels) can be effectively mined and used to discover interesting multiple-level correlations among data supplied with taxonomies. Each generalized itemset is characterized by a correlation type (positive, negative, or null) according to the strength of the correlation among its items. This paper presents a novel data mining approach to supporting different and interesting targeted analyses, such as topic trend analysis and context-aware service profiling, by analyzing Twitter posts. We aim at discovering contrasting situations by means of generalized itemsets. Specifically, we focus on comparing itemsets discovered at different abstraction levels and we select large subsets of specific (descendant) itemsets that show correlation type changes with respect to their common ancestor. To this aim, a novel kind of pattern, namely the Strong Flipping Generalized Itemset (SFGI), is extracted from Twitter messages and contextual information supplied with taxonomy hierarchies. Each SFGI consists of a frequent generalized itemset X and the set of its descendants showing a correlation type change with respect to X. Experiments performed on both real and synthetic datasets demonstrate the effectiveness of the proposed approach in discovering interesting and hidden knowledge from Twitter data.

    Web-based Text Mining

    Text mining deals with the retrieval of specific information provided by customer search engines. With the massive amount of information that is available on the World Wide Web, text mining provides results in order of highest relevance to the keywords in the query. Text expresses a vast, rich range of information, but encodes this information in a form that is difficult to decipher automatically. For example, it is much more difficult to graphically display textual content than quantitative data. In this paper we describe a method for choosing a subset of the Web and an approach to creating a flexible search service that generates highly effective results for expert searches. Retrieval of information poses the problem of redundancy, since the same data is retrieved repeatedly. This paper presents an optimized solution for fast recovery of data and also finds methods for regenerating the queries from the queries posed.

    Grouping related attributes

    Grouping objects that are described by attributes, or clustering, is a central notion in data mining. On the other hand, similarity or relationships between the attributes themselves is equally important but relatively unexplored. Such groups of attributes are also known as directories, concept hierarchies or topics, depending on the underlying data domain. The similarities between the two problems of grouping objects and attributes might suggest that traditional clustering techniques are applicable. This thesis argues that traditional clustering techniques fail to adequately capture the solution we seek. It also explores domain-independent techniques for grouping attributes. The notion of similarity between attributes, and therefore clustering in categorical datasets, has not received adequate attention. This issue has seen renewed interest in the knowledge discovery community, spurred on by the requirements of personalization of information and online search technology. The problem is broken down into (a) quantification of this notion of similarity and (b) the subsequent formation of groups, retaining attributes similar enough in the same group based on metrics that we will attempt to derive. Both aspects of the problem are carefully studied. The thesis also analyzes existing domain-independent approaches to building distance measures, proposing and analyzing several such measures for quantifying similarity, thereby providing a foundation for future work in grouping relevant attributes. The theoretical results are supported by experiments carried out on a variety of datasets from the text-mining, web-mining, social networks and transaction analysis domains. The results indicate that traditional clustering solutions are inadequate within this problem framework. They also suggest a direction for the development of distance measures for the quantification of the concept of similarity between categorical attributes.

    A New Extraction Optimization Approach to Frequent 2 Item sets

    International Journal on Computational Science Applications (IJCSA), ISSN: 2200 – 0011, https://wireilla.com/ijcsa/index.html
    Abstract: In this paper, we propose a new optimization approach to the APRIORI reference algorithm (AGR 94) for 2-itemsets (sets of cardinality 2). The approach is based on two-item sets. We start by calculating the supports of the 1-itemsets (sets of cardinality 1), then we prune the infrequent 1-itemsets and keep only the frequent ones (i.e., those whose support is greater than or equal to a fixed minimum threshold). During the second iteration, we sort the frequent 1-itemsets in descending order of their respective supports and then we form the 2-itemsets. In this way, the association rules are discovered more quickly. Experimentally, the comparison of our algorithm OPTI2I with APRIORI, PASCAL, CLOSE and MAXMINER shows its efficiency on weakly correlated data. Our work has also led to a classical model of side-by-side classification of items that we have obtained by establishing a relationship between the different sets of 2-itemsets.
    Keywords: Optimization, Frequent Itemsets, Association Rules, Low-Correlation Data, Supports
    For more details: https://wireilla.com/papers/ijcsa/V9N2/9219ijcsa01.pd
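
The steps outlined in this abstract (count 1-itemset supports, prune infrequent items, sort the survivors by descending support, then form 2-itemsets) can be sketched as follows. This is only an illustrative reading of the abstract, not the authors' OPTI2I implementation: the function name `frequent_two_itemsets`, the use of absolute counts for support, the alphabetical tie-breaking rule and the toy data are all assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def frequent_two_itemsets(
    transactions: List[Iterable[str]], min_support: int
) -> Dict[Tuple[str, str], int]:
    """Return frequent 2-itemsets with their absolute support counts."""
    # Step 1: support (absolute count) of every 1-itemset.
    item_counts = Counter(item for t in transactions for item in set(t))

    # Step 2: prune the 1-itemsets below the minimum threshold.
    frequent = {i: c for i, c in item_counts.items() if c >= min_support}

    # Step 3: order frequent items by descending support
    # (ties broken alphabetically; this tie rule is an assumption).
    ordered = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: pos for pos, item in enumerate(ordered)}

    # Step 4: form 2-itemsets from frequent items only and count them.
    pair_counts: Counter = Counter()
    for t in transactions:
        kept = sorted((i for i in set(t) if i in rank), key=rank.__getitem__)
        pair_counts.update(combinations(kept, 2))

    return {pair: c for pair, c in pair_counts.items() if c >= min_support}


# Hypothetical toy data: item "d" is infrequent and gets pruned in step 2.
toy = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"], ["a", "b", "c"]]
result = frequent_two_itemsets(toy, min_support=2)
print(result)
```

Sorting each transaction by the global support rank gives every pair a single canonical orientation, so a pair is never counted under two different keys.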