23 research outputs found

    Finding Associations and Computing Similarity via Biased Pair Sampling

    This version is ***superseded*** by a full version that can be found at http://www.itu.dk/people/pagh/papers/mining-jour.pdf, which contains stronger theoretical results and fixes a mistake in the reporting of experiments. Abstract: Sampling-based methods have previously been proposed for the problem of finding interesting associations in data, even for low-support items. While these methods do not guarantee precise results, they can be vastly more efficient than approaches that rely on exact counting. However, for many similarity measures no such methods have been known. In this paper we show how a wide variety of measures can be supported by a simple biased sampling method. The method also extends to finding high-confidence association rules. We demonstrate theoretically that our method is superior to exact methods when the threshold for "interesting similarity/confidence" is above the average pairwise similarity/confidence, and the average support is not too low. Our method is particularly good when transactions contain many items. We confirm in experiments on standard association mining benchmarks that this gives a significant speedup on real data sets (sometimes much larger than the theoretical guarantees). Reductions in computation time of over an order of magnitude, and significant savings in space, are observed. Comment: This is an extended version of a paper that appeared at the IEEE International Conference on Data Mining, 2009. The conference version is (c) 2009 IEEE.

    A Pseudo Nearest-Neighbor Approach for Missing Data Recovery on Gaussian Random Data Sets

    Missing data handling is an important preparation step for most data discrimination or mining tasks. Inappropriate treatment of missing data may cause large errors or false results. In this paper, we study the effect of a missing data recovery method, namely the pseudo-nearest neighbor substitution approach, on Gaussian distributed data sets that represent typical cases in data discrimination and data mining applications. The error rate of the proposed recovery method is evaluated by comparing the clustering results of the recovered data sets to the clustering results obtained on the originally complete data sets. The results are also compared with those obtained by applying two other missing data handling methods: constant default value substitution and ignoring the missing data (non-substitution). The experimental results provide valuable insight into improving the accuracy of data discrimination and knowledge discovery on large data sets containing missing values.

    Knowledge Discovery in Databases: An Information Retrieval Perspective

    The current trend of increasing capabilities in data generation and collection has resulted in an urgent need for data mining applications, also called knowledge discovery in databases. This paper identifies and examines the issues involved in extracting useful grains of knowledge from large amounts of data. It describes a framework to categorise data mining systems. The author also gives an overview of the issues pertaining to data preprocessing, as well as various information gathering methodologies and techniques. The paper covers some popular tools such as classification, clustering, and generalisation. A summary of the statistical and machine learning techniques currently in use is also provided.

    An evolutionary model to mine high expected utility patterns from uncertain databases

    In recent decades, the number of mobile and Internet of Things (IoT) devices has increased dramatically across many domains and applications, generating a massive amount of data. The collected data contain a large amount of interesting information (i.e., interestingness, weight, frequency, or uncertainty), yet most existing generic pattern-mining algorithms consider only a single objective and precise data when discovering the required information. Moreover, because the collected information is huge, meaningful and up-to-date information must be discovered within a limited time. In this paper, we treat both utility and uncertainty as the main objectives in order to efficiently mine interesting high expected utility patterns (HEUPs) in limited time, based on a multi-objective evolutionary framework. The designed model (called MOEA-HEUPM) can discover valuable HEUPs without predefined threshold values (i.e., minimum utility and minimum uncertainty) in an uncertain environment. Two encoding methodologies are also considered in the developed MOEA-HEUPM to show its effectiveness. Based on the developed MOEA-HEUPM model, the set of non-dominated HEUPs can be discovered in limited time for decision-making. Experiments are then conducted to show the effectiveness and efficiency of the designed MOEA-HEUPM model in terms of convergence, hypervolume, and number of discovered patterns compared with generic approaches.

    A BELIEF-DRIVEN DISCOVERY FRAMEWORK BASED ON DATA MONITORING AND TRIGGERING

    A new knowledge-discovery framework, called Data Monitoring and Discovery Triggering (DMDT), is defined, where the user specifies monitors that "watch" for significant changes to the data and changes to the user-defined system of beliefs. Once these changes are detected, knowledge discovery processes, in the form of data mining queries, are triggered. The proposed framework is the result of an observation, made in the authors' previous work, that changes to the user-defined beliefs indicate interesting patterns in the data. In this paper, we present an approach for finding these interesting patterns using data monitoring and belief-driven discovery techniques. Our approach is especially useful in applications where data changes rapidly with time, as in some On-Line Transaction Processing (OLTP) systems. The proposed approach integrates active databases, data mining queries, and subjective measures of interestingness based on user-defined systems of beliefs in a novel and synergetic way to yield a new type of data mining system. Information Systems Working Papers Series

    Visualization of Large Volumes of Data (Visualización de grandes volúmenes de datos)

    The availability of inexpensive storage, together with technological progress, has led to the creation of immense databases of business data, scientific data, and meteorological data, among other kinds. Given the dizzying growth in the amount of information in these databases, and even though people are used to querying them, it becomes practically impossible for a person to explore them in order to extract conclusions, trends, and patterns. In this setting, querying databases and subsequently exploring them are clearly key problems. Various visualization tools have been developed to help solve them. Among the first proposals for visualizing this kind of information are interactive methods based on browsing techniques, filters, and facilities for building dynamic queries that make it possible to learn from the data through multiple queries. The most ambitious and recent research proposals are those of visual data mining, which are tied to a new view of information in large databases. The aim is to search for new knowledge, or to deepen the understanding of existing knowledge, through a cooperative effort between human and computer. These proposals rely on clustering algorithms guided by interactive visualization techniques to discover behaviors and trends in the data. Informally, visualization is the transformation of data or information into images or pictures. Visualization engages the primary human sensory apparatus, vision, as well as all the processing power of the human mind. The result should be a simple and effective medium for communicating voluminous and complex information.
    In this context, the goal of our work is to outline criteria for obtaining effective visualization of large databases on low-cost equipment. In this paper we describe the research carried out on the trends and tools being used to gain insight into large volumes of data in Computer Science. The following section details the fundamental concepts involved in visualization, information visualization, and visual data mining, and how they relate through the graphical representation of data. A subsequent section presents some examples of database visualization, and the paper concludes with a description of the work to be done. Track: Computer Graphics. Visualization. Red de Universidades con Carreras en Informática (RedUNCI)

    Interactive visual exploration of association rules with rule-focusing methodology

    Because data mining algorithms can produce an enormous number of rules, knowledge post-processing is a difficult stage in an association rule discovery process. In order to find relevant knowledge for decision making, the user (a decision maker specialized in the data studied) needs to rummage through the rules. To assist him/her in this task, we here propose the rule-focusing methodology, an interactive methodology for the visual post-processing of association rules. It allows the user to explore large sets of rules freely by focusing his/her attention on limited subsets. This new approach relies on rule interestingness measures, on a visual representation, and on interactive navigation among the rules. We have implemented the rule-focusing methodology in a prototype system called ARVis. It exploits the user's focus to guide the generation of the rules by means of a specific constraint-based rule-mining algorithm.