15 research outputs found

    Mining Sequential Relations from Multidimensional Data Sequence for Prediction

    By analyzing historical data sequences and identifying relations between the occurrence of data items and certain types of business events, we can gain insight into future status and act proactively. This paper proposes a new approach to prediction on data sequences characterized by multiple dimensions. The proposed relation mining approach improves the existing sequential pattern mining algorithm by considering multidimensional data sequences and incorporating time constraints. We demonstrate that the multidimensional relations extracted by our approach enhance single-dimensional relations by showing significantly stronger prediction capability, despite the substantial work done in the latter area. In addition, a matching algorithm based on the obtained relations is proposed to make predictions. The effectiveness of the proposed methods is validated by experiments conducted on a mobile user context dataset.

    Time-weighted multi-touch attribution and channel relevance in the customer journey to online purchase

    We address statistical issues in attributing revenue to marketing channels and inferring the importance of individual channels in customer journeys towards an online purchase. We describe the relevant data structures and introduce an example. We suggest an asymmetric bathtub shape as appropriate for time-weighted revenue attribution to the customer journey, provide an algorithm, and illustrate the method. We suggest a modification to this method when independent information is available on the relative values of the channels. To infer channel importance, we employ sequential data analysis ideas and restrict attention to data that ends in a purchase. We propose metrics for source, intermediary, and destination channels based on two- and three-step transitions in fragments of the customer journey. We comment on the practicalities of formal hypothesis testing. We illustrate the ideas and computations using data from a major UK online retailer. Finally, we compare the revenue attributions suggested by the methods in this paper with several common attribution methods.
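    As a rough illustration of time-weighted attribution with a bathtub-shaped weight, the sketch below splits one purchase's revenue across the touches in a journey. The abstract does not give the paper's functional form, so the weight function (a sum of two exponentials), the parameter names `early_decay`, `late_decay`, and `late_boost`, and the toy journey are all assumptions made here for illustration.

```python
import math
from collections import defaultdict


def bathtub_weight(t, first_touch, purchase,
                   early_decay=3.0, late_decay=1.0, late_boost=2.0):
    """Hypothetical asymmetric bathtub: high at the first touch,
    low in the middle of the journey, highest close to the purchase."""
    early = math.exp(-(t - first_touch) / early_decay)
    late = late_boost * math.exp(-(purchase - t) / late_decay)
    return early + late


def attribute_revenue(touches, purchase, revenue):
    """Split `revenue` across channels in proportion to their weights.

    `touches` is a list of (channel, time) pairs, with times in days.
    """
    first_touch = min(t for _, t in touches)
    weights = [bathtub_weight(t, first_touch, purchase) for _, t in touches]
    total = sum(weights)
    shares = defaultdict(float)
    for (channel, _), weight in zip(touches, weights):
        shares[channel] += revenue * weight / total
    return dict(shares)


# Toy journey (invented data): three touches, purchase on day 10.
journey = [("search", 0.0), ("email", 5.0), ("affiliate", 10.0)]
shares = attribute_revenue(journey, purchase=10.0, revenue=100.0)
print(shares)
```

    With these assumed parameters the last touch receives the largest share, the first touch the next largest, and the middle touch the smallest, which is the qualitative behavior an asymmetric bathtub is meant to capture.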

    Frequent pattern mining for kernel trace data


    A polynomial time biclustering algorithm for finding approximate expression patterns in gene expression time series

    Background: The ability to monitor the change in expression patterns over time, and to observe the emergence of coherent temporal responses using gene expression time series obtained from microarray experiments, is critical to advance our understanding of complex biological processes. In this context, biclustering algorithms have been recognized as an important tool for the discovery of local expression patterns, which are crucial to unravel potential regulatory mechanisms. Although most formulations of the biclustering problem are NP-hard, when working with time series expression data the interesting biclusters can be restricted to those with contiguous columns. This restriction leads to a tractable problem and enables the design of efficient biclustering algorithms able to identify all maximal contiguous column coherent biclusters.
    Methods: In this work, we propose e-CCC-Biclustering, a biclustering algorithm that finds and reports all maximal contiguous column coherent biclusters with approximate expression patterns in time polynomial in the size of the time series gene expression matrix. This polynomial time complexity is achieved by manipulating a discretized version of the original matrix using efficient string processing techniques. We also propose extensions to deal with missing values, discover anticorrelated and scaled expression patterns, and different ways to compute the errors allowed in the expression patterns. We propose a scoring criterion combining the statistical significance of expression patterns with a similarity measure between overlapping biclusters.
    Results: We present results in real data showing the effectiveness of e-CCC-Biclustering and its relevance in the discovery of regulatory modules describing the transcriptomic expression patterns occurring in Saccharomyces cerevisiae in response to heat stress. In particular, the results show the advantage of considering approximate patterns when compared to state of the art methods that require exact matching of gene expression time series.
    Discussion: The identification of co-regulated genes, involved in specific biological processes, remains one of the main avenues open to researchers studying gene regulatory networks. The ability of the proposed methodology to efficiently identify sets of genes with similar expression patterns is shown to be instrumental in the discovery of relevant biological phenomena, leading to more convincing evidence of specific regulatory mechanisms.
    Availability: A prototype implementation of the algorithm coded in Java together with the dataset and examples used in the paper is available at http://kdbio.inesc-id.pt/software/e-ccc-biclustering.
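    To make the idea of a contiguous column coherent bicluster concrete, here is a brute-force sketch that discretizes each gene's time series and groups genes sharing an identical pattern over the same contiguous window. It is not e-CCC-Biclustering: it uses exact matching only, does not filter for maximality, and enumerates windows instead of using string processing techniques. The up/down/no-change discretization, the `threshold` parameter, and the toy matrix are assumptions made for illustration.

```python
from collections import defaultdict


def discretize(row, threshold=0.5):
    """Map each change between consecutive time points to a symbol:
    U (up), D (down), or N (no change). Assumed scheme, not the paper's."""
    symbols = []
    for prev, curr in zip(row, row[1:]):
        delta = curr - prev
        if delta > threshold:
            symbols.append("U")
        elif delta < -threshold:
            symbols.append("D")
        else:
            symbols.append("N")
    return "".join(symbols)


def contiguous_biclusters(matrix, min_genes=2, min_cols=2, threshold=0.5):
    """Group genes that share the same symbol pattern over the same
    contiguous window of discretized columns (exact matches only).

    Returns {(start, end, pattern): [gene indices]}, where start and end
    index the discretized columns, i.e. the transitions between time points.
    """
    patterns = [discretize(row, threshold) for row in matrix]
    n_cols = len(patterns[0]) if patterns else 0
    groups = defaultdict(list)
    for gene, pattern in enumerate(patterns):
        for start in range(n_cols):
            for end in range(start + min_cols, n_cols + 1):
                groups[(start, end, pattern[start:end])].append(gene)
    return {key: genes for key, genes in groups.items()
            if len(genes) >= min_genes}


# Toy expression matrix (invented data): 3 genes, 4 time points.
matrix = [
    [0.0, 1.0, 2.0, 2.0],
    [5.0, 6.0, 7.0, 5.0],
    [3.0, 2.0, 1.0, 1.0],
]
print(contiguous_biclusters(matrix))
```

    On this toy matrix, genes 0 and 1 both rise over the first two transitions, so they form the only reported group. Allowing a bounded number of symbol mismatches per gene is what separates the approximate patterns described in the abstract from this exact-match version.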

    Data analytics and visualization for enhanced highway construction cost indexes and as-built schedules

    A considerable amount of digital data is being collected by State Highway Agencies (SHAs) to aid project-planning activities, support various project-level decision-making processes, and effectively maintain and operate constructed highway assets. However, the highway construction industry has lagged significantly behind other industry sectors, such as health care and energy, in using this growing body of digital data to support business decisions. A significant lack of understanding of the linkage between the raw data collected and various decisions, of proper computational methodologies, and of effective guidance is considered a major barrier to the full utilization of the digital data. This study uses digital datasets that are now commonly available in SHAs to demonstrate how existing digital data can support and enhance decision-making processes through data analytics and visualization methods. This study will a) develop an advanced computational methodology to generate multidimensional highway construction cost indexes (HCCIs) using two new concepts of i) dynamic item basket and ii) multidimensional HCCI, b) develop an enhanced framework for collection and utilization of digital Daily Work Report (DWR) data, c) develop an automated methodology to generate as-built schedules using data collected from existing DWR systems, and d) analyze as-built schedules to develop a knowledge base of frequent precedence relationships of activities. The study achieves those objectives by utilizing three digital datasets: bid data, DWR data, and project characteristics data. Further, two standalone prototype systems, namely Dyna-Mu-HCCI and ABSS, are developed to automate the computational methodologies for multidimensional HCCI calculation and as-built schedule development, respectively. This study will help SHAs use currently unused datasets for informed budgeting and project control decisions. 
It demonstrates the importance of data analytics and visualization in obtaining more value from the investment made in collecting construction data. Overall, this study serves as a step in the transition from experience-driven to data-driven decision making in the construction industry.

    The evaluation of occupational accident with sequential pattern mining

    Accidents in manufacturing systems greatly affect productivity and efficiency, which are well-known performance indicators in practice. It is therefore very important to know the sequential patterns among accidents in order to avoid losses that decrease the performance of manufacturing systems. To reduce accidents, it is first necessary to determine the patterns that cause them. The associations among the causes of accidents are rarely investigated in the literature. To fill this gap, this study uses sequential pattern mining to reveal the patterns of causes among accidents in a manufacturing system. The most important contribution of this study is the discovery of sequential patterns formed by accident characteristics of the pre-accident, moment-of-accident, and post-accident stages, unlike traditional accident investigation methods. Additionally, knowing the patterns of causes among accidents can help decision makers prepare a more proactive safety program in real life. The CloFast algorithm is applied to examine accidents in manufacturing systems in detail. Accident records covering 2013 to 2019 are used to discover the sequential patterns. The results show that each accident has its own sequential accident patterns, and that it is possible to prevent accidents and reduce the resulting losses by considering sequential patterns in real life. Safety engineers and occupational safety specialists should take the sequential patterns among accidents into account to avoid similar accidents in the near future.

    Sequential pattern extraction process for the characterization of spatio-temporal phenomena

    The objective of this final-year project is to carry out a KDD-based sequential pattern extraction process, using the PrefixSpan sequential pattern mining algorithm to predict the behavior of phenomena represented by events that change over time and space. Such phenomena are called spatio-temporal phenomena: sets of events or facts perceptible to humans. They consist of a spatial component (the location where the phenomenon occurs), a temporal component (the moment or time interval in which it occurs), and an analysis component (the set of characteristics that describe its behavior). A great diversity of spatio-temporal phenomena can be observed in the world; this project, however, focuses on natural phenomena, taking river pollution in the United Kingdom as its test case. To study this phenomenon thoroughly, KDD (Knowledge Discovery in Databases) is used to extract knowledge by generating novel and useful patterns within complex systematic schemes. Data mining methods are also used to extract useful information from large data sets. Likewise, sequential patterns are used: frequent events that occur over time, which make it possible to discover correlations between events and reveal "before" and "after" relationships. In summary, this project presents a process for improving the study of the behavior of phenomena through the use of sequential patterns. 
It thus offers an additional way to improve the understanding of spatio-temporal phenomena and, in turn, to gain advance knowledge of their causal factors and of the consequences they may trigger, which would make it possible to issue early warnings of possible atypical events.
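    For readers unfamiliar with PrefixSpan, the following is a minimal sketch of its prefix-projection idea on sequences of single items. It is not the thesis's implementation: it ignores itemsets, time gaps, and spatial attributes, takes an absolute support count in a parameter named `min_support`, and the event names are invented for illustration.

```python
from collections import defaultdict


def prefixspan(sequences, min_support):
    """Return {pattern: support} for every frequent sequential pattern.

    Each sequence is a list of single items (a simplification), and
    `min_support` is an absolute number of supporting sequences.
    """
    results = {}

    def mine(prefix, projected):
        # `projected` holds (sequence index, start position) pairs:
        # the suffixes that remain after matching `prefix`.
        extensions = defaultdict(list)
        for seq_idx, start in projected:
            seen = set()
            seq = sequences[seq_idx]
            for pos in range(start, len(seq)):
                item = seq[pos]
                if item not in seen:
                    # Keep only the first occurrence: it leaves the
                    # longest suffix and counts the sequence once.
                    seen.add(item)
                    extensions[item].append((seq_idx, pos + 1))
        for item, new_projected in extensions.items():
            if len(new_projected) >= min_support:
                pattern = prefix + (item,)
                results[pattern] = len(new_projected)
                mine(pattern, new_projected)

    mine((), [(i, 0) for i in range(len(sequences))])
    return results


# Toy event sequences (invented data), one list per monitoring site.
sequences = [
    ["rain", "runoff", "nitrate_high"],
    ["rain", "nitrate_high"],
    ["runoff", "rain", "nitrate_high"],
]
patterns = prefixspan(sequences, min_support=2)
for pattern, support in sorted(patterns.items()):
    print(pattern, support)
```

    Each frequent pattern encodes a "before" and "after" relationship of the kind the abstract describes: here, `("rain", "nitrate_high")` is supported by all three toy sequences, while `("rain", "runoff")` appears in only one and is pruned.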