267 research outputs found

    Graph-based Modelling of Concurrent Sequential Patterns

    Get PDF
    Structural relation patterns have been introduced recently to extend the search for complex patterns often hidden behind large sequences of data. This has motivated a novel approach to sequential patterns post-processing and a corresponding data mining method was proposed for Concurrent Sequential Patterns (ConSP). This article refines the approach in the context of ConSP modelling, where a companion graph-based model is devised as an extension of previous work. Two new modelling methods are presented here together with a construction algorithm, to complete the transformation of concurrent sequential patterns to a ConSP-Graph representation. Customer orders data is used to demonstrate the effectiveness of ConSP mining while synthetic sample data highlights the strength of the modelling technique, illuminating the theories developed

    OLEMAR: An Online Environment for Mining Association Rules in Multidimensional Data

    Get PDF
    Data warehouses and OLAP (online analytical processing) provide tools to explore and navigate through data cubes in order to extract interesting information under different perspectives and levels of granularity. Nevertheless, OLAP techniques do not allow the identification of relationships, groupings, or exceptions that could hold in a data cube. To that end, we propose to enrich OLAP techniques with data mining facilities to benefit from the capabilities they offer. In this chapter, we propose an online environment for mining association rules in data cubes. Our environment called OLEMAR (online environment for mining association rules), is designed to extract associations from multidimensional data. It allows the extraction of inter-dimensional association rules from data cubes according to a sum-based aggregate measure, a more general indicator than aggregate values provided by the traditional COUNT measure. In our approach, OLAP users are able to drive a mining process guided by a meta-rule, which meets their analysis objectives. In addition, the environment is based on a formalization, which exploits aggregate measures to revisit the definition of the support and the confidence of discovered rules. This formalization also helps evaluate the interestingness of association rules according to two additional quality measures: lift and loevinger. Furthermore, in order to focus on the discovered associations and validate them, we provide a visual representation based on the graphic semiology principles. Such a representation consists in a graphic encoding of frequent patterns and association rules in the same multidimensional space as the one associated with the mined data cube. We have developed our approach as a component in a general online analysis platform called Miningcubes according to an Apriori-like algorithm, which helps extract inter-dimensional association rules directly from materialized multidimensional structures of data. In order to illustrate the effectiveness and the efficiency of our proposal, we analyze a real-life case study about breast cancer data and conduct performance experimentation of the mining process

    On the Nature and Types of Anomalies: A Review

    Full text link
    Anomalies are occurrences in a dataset that are in some way unusual and do not fit the general patterns. The concept of the anomaly is generally ill-defined and perceived as vague and domain-dependent. Moreover, despite some 250 years of publications on the topic, no comprehensive and concrete overviews of the different types of anomalies have hitherto been published. By means of an extensive literature review this study therefore offers the first theoretically principled and domain-independent typology of data anomalies, and presents a full overview of anomaly types and subtypes. To concretely define the concept of the anomaly and its different manifestations, the typology employs five dimensions: data type, cardinality of relationship, anomaly level, data structure and data distribution. These fundamental and data-centric dimensions naturally yield 3 broad groups, 9 basic types and 61 subtypes of anomalies. The typology facilitates the evaluation of the functional capabilities of anomaly detection algorithms, contributes to explainable data science, and provides insights into relevant topics such as local versus global anomalies.Comment: 38 pages (30 pages content), 10 figures, 3 tables. Preprint; review comments will be appreciated. Improvements in version 2: Explicit mention of fifth anomaly dimension; Added section on explainable anomaly detection; Added section on variations on the anomaly concept; Various minor additions and improvement

    On Cognitive Preferences and the Plausibility of Rule-based Models

    Get PDF
    It is conventional wisdom in machine learning and data mining that logical models such as rule sets are more interpretable than other models, and that among such rule-based models, simpler models are more interpretable than more complex ones. In this position paper, we question this latter assumption by focusing on one particular aspect of interpretability, namely the plausibility of models. Roughly speaking, we equate the plausibility of a model with the likeliness that a user accepts it as an explanation for a prediction. In particular, we argue that, all other things being equal, longer explanations may be more convincing than shorter ones, and that the predominant bias for shorter models, which is typically necessary for learning powerful discriminative models, may not be suitable when it comes to user acceptance of the learned models. To that end, we first recapitulate evidence for and against this postulate, and then report the results of an evaluation in a crowd-sourcing study based on about 3.000 judgments. The results do not reveal a strong preference for simple rules, whereas we can observe a weak preference for longer rules in some domains. We then relate these results to well-known cognitive biases such as the conjunction fallacy, the representative heuristic, or the recogition heuristic, and investigate their relation to rule length and plausibility.Comment: V4: Another rewrite of section on interpretability to clarify focus on plausibility and relation to interpretability, comprehensibility, and justifiabilit

    Information Granulation for the Design of Granular Information Retrieval Systems

    Get PDF
    With the explosive growth of the amount of information stored on computer networks such as the Internet, it is increasingly more difficult for information seekers to retrieve relevant information. Traditional document ranking functions employed by Internet search engines can be enhanced to improve the effectiveness of information retrieval (IR). This paper illustrates the design and development of a granular IR system to facilitate domain specific search. In particular, a novel computational model is designed to rank documents according the searchers’ specific granularity requirements. The initial experiments confirm that our granular IR system outperforms a classical vector-based IR system. In addition, user-based evaluations also demonstrate that our granular IR system is effective when compared with a well-known Internet search engine. Our research work opens the door to the design and development of the next generation of Internet search engines to alleviate the problem of information overload

    Massively Parallel Single-Source SimRanks in o(logn)o(\log n) Rounds

    Full text link
    SimRank is one of the most fundamental measures that evaluate the structural similarity between two nodes in a graph and has been applied in a plethora of data management tasks. These tasks often involve single-source SimRank computation that evaluates the SimRank values between a source node ss and all other nodes. Due to its high computation complexity, single-source SimRank computation for large graphs is notoriously challenging, and hence recent studies resort to distributed processing. To our surprise, although SimRank has been widely adopted for two decades, theoretical aspects of distributed SimRanks with provable results have rarely been studied. In this paper, we conduct a theoretical study on single-source SimRank computation in the Massive Parallel Computation (MPC) model, which is the standard theoretical framework modeling distributed systems such as MapReduce, Hadoop, or Spark. Existing distributed SimRank algorithms enforce either Ω(logn)\Omega(\log n) communication round complexity or Ω(n)\Omega(n) machine space for a graph of nn nodes. We overcome this barrier. Particularly, given a graph of nn nodes, for any query node vv and constant error ϵ>3n\epsilon>\frac{3}{n}, we show that using O(log2logn)O(\log^2 \log n) rounds of communication among machines is almost enough to compute single-source SimRank values with at most ϵ\epsilon absolute errors, while each machine only needs a space sub-linear to nn. To the best of our knowledge, this is the first single-source SimRank algorithm in MPC that can overcome the Θ(logn)\Theta(\log n) round complexity barrier with provable result accuracy

    Extraction de connaissances d'adaptation par analyse de la base de cas

    Get PDF
    En raisonnement à partir de cas, l'adaptation d'un cas source pour résoudre un problème cible est une étape à la fois cruciale et difficile à réaliser. Une des raisons de cette difficulté tient au fait que les connaissances d'adaptation sont généralement dépendantes du domaine d'application. C'est ce qui motive la recherche sur l'acquisition de connaissances d?adaptation (ACA). Cet article propose une approche originale de l'ACA fondée sur des techniques d'extraction de connaissances dans des bases de données (ECBD). Nous présentons CABAMAKA, une application qui réalise l'ACA par analyse de la base de cas, en utilisant comme technique d'apprentissage l'extraction de motifs fermés fréquents. L'ensemble du processus d'extraction des connaissances est détaillé, puis nous examinons comment organiser les résultats obtenus de façon à faciliter la validation des connaissances extraites par l'analyste
    corecore