
    Statistical models for time sequences data mining

    In this paper, we present an adaptive modelling technique for studying the past behaviour of objects and predicting near-future events. Our approach is to slide windows of different sizes over a time sequence and build autoregressive (AR) models from the subsequences in each window. The models are representations of the past behaviour of the sequence objects. The AR coefficients can be used as features to index subsequences, facilitating queries for subsequences with similar behaviour; a clustering algorithm can group time sequences by their similarity in the feature space; and the AR models themselves can be used for prediction within different windows. Our experiments show that the adaptive model gives better predictions than non-adaptive models.
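    The windowed-AR scheme described above can be sketched in a few lines. The order p, window size, and the least-squares fitting procedure below are illustrative assumptions, not the paper's exact formulation.

```python
def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting for small systems."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c and M[c][c]:
                f = M[r][c] / M[c][c]
                M[r] = [M[r][k] - f * M[c][k] for k in range(n + 1)]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_ar(window, p=2):
    """Least-squares AR(p) fit: x[t] ~ a1*x[t-1] + ... + ap*x[t-p]."""
    X = [[window[t - j] for j in range(1, p + 1)] for t in range(p, len(window))]
    y = [window[t] for t in range(p, len(window))]
    # Solve the normal equations X^T X a = X^T y.
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

def predict_next(window, coeffs):
    """One-step-ahead forecast from the most recent values."""
    return sum(a * x for a, x in zip(coeffs, window[::-1]))

def window_features(series, size, p=2):
    """AR coefficients of each sliding window, usable as index features."""
    return [fit_ar(series[s:s + size], p) for s in range(len(series) - size + 1)]
```

    The feature vectors returned by `window_features` can then be fed to any standard indexing or clustering method, as the abstract suggests.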

    Explicit probabilistic models for databases and networks

    Recent work in data mining and related areas has highlighted the importance of the statistical assessment of data mining results. Crucial to this endeavour is the choice of a non-trivial null model for the data, against which the discovered patterns can be contrasted. The most influential null models proposed so far are defined in terms of invariants of the null distribution. Such null models can be used by computation-intensive randomization approaches to estimate the statistical significance of data mining results. Here, we introduce a methodology to construct non-trivial probabilistic models based on the maximum entropy (MaxEnt) principle. We show how MaxEnt models allow for the natural incorporation of prior information. Furthermore, they satisfy a number of desirable properties of previously introduced randomization approaches. Lastly, they also have the benefit that they can be represented explicitly. We argue that our approach can be used for a variety of data types; for concreteness, we demonstrate it in particular for databases and networks.
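    As a toy illustration of the MaxEnt principle (not the databases-and-networks models of the paper), consider the maximum-entropy distribution over a die's six faces subject to a prescribed mean. The solution is an exponential-family distribution p_i proportional to exp(lam * i); the sketch below finds lam by bisection, since the implied mean is monotone in lam.

```python
import math

def maxent_die(target_mean, lo=-10.0, hi=10.0, iters=200):
    """MaxEnt distribution over faces 1..6 with a fixed mean.
    The solution has the form p_i ∝ exp(lam * i); we find lam by
    bisection on the mean implied by lam."""
    faces = range(1, 7)

    def implied_mean(lam):
        w = [math.exp(lam * i) for i in faces]
        z = sum(w)
        return sum(i * wi for i, wi in zip(faces, w)) / z

    for _ in range(iters):
        mid = (lo + hi) / 2
        if implied_mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    w = [math.exp(lam * i) for i in faces]
    z = sum(w)
    return [wi / z for wi in w]
```

    With `target_mean=3.5` the result is the uniform distribution, the MaxEnt solution under no extra information; shifting the constraint tilts the distribution exponentially, which is the sense in which prior information is "naturally incorporated".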

    Time series forecasting on crime data in Amsterdam for a software company

    Internship report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics. In recent years, there has been much discussion of applying data mining technology in the fight against terrorism and crime. Sentient, a software company, has supported the police for years by applying data mining techniques in the DataDetective application (Sentient, 2017). The objectives of this internship were to experiment with various types of predictive models and to select the most efficient and promising solution. Initially, the literature in the fields of data mining, crime analysis and crime data mining was reviewed. Sentient provided 7 years of crime data, which was aggregated on a daily basis to create a univariate dataset; a daily aggregation by incident type was also performed to create a multivariate dataset. The prediction length for each solution was 7 days. The experiments were divided into two major categories: statistical models and neural network models. Neural networks outperformed the statistical models on the crime data. This paper provides an overview of the statistical and neural network models used; a comparative study of all models on the same dataset gives a clear picture of their performance on the available data and their generalization capability. The experiments showed that gated recurrent units (GRUs) produced better predictions than the other models. In conclusion, a gated recurrent unit implementation could benefit the police in predicting crime; hence, time series analysis using GRUs could be a prospective additional feature in DataDetective.
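    For readers unfamiliar with the model family that won here, the following is a minimal, untrained GRU cell with scalar input and scalar hidden state; all weight names are illustrative and the toy sizes are an assumption, not the internship's trained architecture. It uses the common convention h' = (1 - z) * n + z * h for the update gate z, reset gate r and candidate state n.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, W):
    """One GRU step with scalar input x and scalar hidden state h.
    W holds the weights/biases of the update gate z, reset gate r,
    and candidate state n (toy scalar sizes for illustration)."""
    z = sigmoid(W['wz'] * x + W['uz'] * h + W['bz'])   # update gate
    r = sigmoid(W['wr'] * x + W['ur'] * h + W['br'])   # reset gate
    n = math.tanh(W['wn'] * x + W['un'] * (r * h) + W['bn'])  # candidate
    return (1 - z) * n + z * h
```

    In a forecasting setting, the cell is run over the aggregated daily counts and the hidden state feeds an output layer that emits the next 7 days; the gating is what lets the model retain or discard history adaptively, which is plausibly why it outperformed the fixed-form statistical models.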

    Statistical and Machine Learning Models for Remote Sensing Data Mining - Recent Advancements

    This book is a reprint of the Special Issue entitled "Statistical and Machine Learning Models for Remote Sensing Data Mining - Recent Advancements" that was published in Remote Sensing, MDPI. It provides insights into both core technical challenges and some selected critical applications of satellite remote sensing image analytics.

    Flint International Statistics Conference Agenda

    Conference Agenda—Keynote Speakers, Invited & Contributed Talks, Posters, Field Trips. Tuesday, June 24; Wednesday, June 25; Thursday, June 26; Friday, June 27; Saturday, June 28. Select Sessions: Elart von Collani "Statistics as a general tool for all sciences." Francesca Greselin "Measuring inequality at the time of the Great Divergence." Ernest Fokoue "Recent Applications of Statistical Data Mining for Big Data Predictive Analysis." Vladimir Kaishev "Probability and statistics in actuarial applications." Galia Novikova "Data Mining for Software Development Quality Management." Leda Minkova "Stochastic Models and Statistical Applications." Krzysztof Podgorski "Non-Gaussian stochastic models: theory and applications." Kristina Sendova "Risk measures, probability measures and mortality."

    Log-based Evaluation of Label Splits for Process Models

    Process mining techniques aim to extract insights into processes from event logs. One of the challenges in process mining is identifying interesting and meaningful event labels that contribute to a better understanding of the process. Our application area is mining data from smart homes for the elderly, where the ultimate goal is to signal deviations from usual behavior and provide timely recommendations in order to extend the period of independent living. Extracting individual process models showing user behavior is an important instrument in achieving this goal. However, the interpretation of sensor data at an appropriate abstraction level is not straightforward. For example, a motion sensor in a bedroom can be triggered by tossing and turning in bed or by getting up. We try to derive the actual activity depending on the context (time, previous events, etc.). In this paper we introduce the notion of label refinements, which links more abstract event descriptions with their more refined counterparts. We present a statistical evaluation method to determine the usefulness of a label refinement for a given event log from a process perspective. Based on data from smart homes, we show how our statistical evaluation method for label refinements can be used in practice. Our method was able to select two label refinements out of a set of candidate label refinements that both had a positive effect on model precision. Comment: Paper accepted at the 20th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, to appear in Procedia Computer Science.
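    The bedroom-sensor example can be made concrete with a small sketch. Here a hypothetical refinement splits one sensor label by time of day, and a simple proxy for the paper's statistical evaluation (not its actual method) inspects whether the refined variants are followed by clearly different successor events; the label names, the split hour, and the event-log shape are all illustrative assumptions.

```python
from collections import Counter

def refine(events, label, split_hour=6):
    """Hypothetical refinement: split `label` into night/day variants
    based on the hour at which the event occurred."""
    out = []
    for hour, lab in events:
        if lab == label:
            lab = f"{label}@night" if hour < split_hour else f"{label}@day"
        out.append((hour, lab))
    return out

def next_event_profiles(events, base):
    """Distribution of the event following each refined variant of `base`.
    Distinct successor profiles suggest the refinement is useful."""
    profiles = {}
    for (_, lab), (_, nxt) in zip(events, events[1:]):
        if lab.startswith(base):
            profiles.setdefault(lab, Counter())[nxt] += 1
    return profiles
```

    If `motion_bed@night` is always followed by continued sleep while `motion_bed@day` is always followed by getting up, the refinement separates behaviors that a single label would conflate, which is the intuition behind evaluating refinements "from a process perspective".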

    Scalable k-Means Clustering via Lightweight Coresets

    Coresets are compact representations of data sets such that models trained on a coreset are provably competitive with models trained on the full data set. As such, they have been successfully used to scale up clustering models to massive data sets. While existing approaches generally only allow for multiplicative approximation errors, we propose a novel notion of lightweight coresets that allows for both multiplicative and additive errors. We provide a single algorithm to construct lightweight coresets for k-means clustering as well as soft and hard Bregman clustering. The algorithm is substantially faster than existing constructions, embarrassingly parallel, and the resulting coresets are smaller. We further show that the proposed approach naturally generalizes to statistical k-means clustering and that, compared to existing results, it can be used to compute smaller summaries for empirical risk minimization. In extensive experiments, we demonstrate that the proposed algorithm outperforms existing data summarization strategies in practice. Comment: To appear in the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD).
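    The construction itself is short enough to sketch: sample points with a probability that mixes a uniform term with a squared-distance-to-the-mean term, and reweight each sampled point by the inverse of its inclusion probability. The equal 1/2-1/2 mixing below follows a common presentation of lightweight coresets and is an assumption here, not a claim about the paper's exact constants.

```python
import random

def lightweight_coreset(points, m, seed=0):
    """Sample an m-point weighted coreset for k-means.
    q(x) mixes a uniform term with a squared-distance-to-mean term;
    each sampled point gets importance weight 1 / (m * q(x))."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    dist2 = [sum((p[j] - mean[j]) ** 2 for j in range(d)) for p in points]
    total = sum(dist2) or 1.0  # guard against all-identical points
    q = [0.5 / n + 0.5 * d2 / total for d2 in dist2]
    rng = random.Random(seed)
    idx = rng.choices(range(n), weights=q, k=m)
    return [(points[i], 1.0 / (m * q[i])) for i in idx]
```

    Because each point's sampling probability depends only on its own distance to the global mean, the construction needs just two passes over the data and parallelizes trivially, which matches the "embarrassingly parallel" claim in the abstract.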

    Using Biotic Interaction Networks for Prediction in Biodiversity and Emerging Diseases

    Networks offer a powerful tool for understanding and visualizing inter-species interactions within an ecosystem. Previously considered examples, such as trophic networks, are representations of experimentally observed direct interactions. However, species interactions are so rich and complex that it is not feasible to directly observe more than a small fraction of them. In this paper, using data mining techniques, we show how potential interactions can be inferred from geographic data rather than by direct observation. An important application area for such a methodology is that of emerging diseases, where little is often known about inter-species interactions, such as those between vectors and reservoirs. Here, we show how, using geographic data, biotic interaction networks that model statistical dependencies between species distributions can be used to infer and understand inter-species interactions. Furthermore, we show how such networks can be used to build prediction models, for example for predicting the most important reservoirs of a disease, or the degree of disease risk associated with a geographical area. We illustrate the general methodology by considering an important emerging disease: Leishmaniasis. This data mining approach allows for the use of geographic data to construct inferential biotic interaction networks, which can then be used to build prediction models with a wide range of applications in ecology, biodiversity and emerging diseases.
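    One simple way to model "statistical dependencies between species distributions" over geographic data (an illustrative choice, not necessarily the authors' exact method) is to grid the region into cells and connect two species when their co-occurrence across cells is higher than a hypergeometric null model would predict:

```python
from math import comb

def cooccurrence_pvalue(n_cells, n_a, n_b, n_both):
    """One-sided hypergeometric p-value for observing >= n_both cells
    shared by species A and B, given their individual cell counts."""
    return sum(comb(n_a, k) * comb(n_cells - n_a, n_b - k)
               for k in range(n_both, min(n_a, n_b) + 1)) / comb(n_cells, n_b)

def interaction_network(occ, n_cells, alpha=0.05):
    """Edges between species whose co-occurrence is unexpectedly high.
    `occ` maps species name -> set of occupied grid-cell ids."""
    species = sorted(occ)
    edges = []
    for i, a in enumerate(species):
        for b in species[i + 1:]:
            p = cooccurrence_pvalue(n_cells, len(occ[a]), len(occ[b]),
                                    len(occ[a] & occ[b]))
            if p < alpha:
                edges.append((a, b, p))
    return edges
```

    An edge between a putative vector and a putative reservoir then flags a candidate interaction worth field validation; node degrees and edge strengths in the resulting network feed naturally into the kind of reservoir-importance and area-risk predictions the abstract describes.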