Statistical models for time sequences data mining
In this paper, we present an adaptive modelling technique for studying the past behavior of objects and predicting near-future events. Our approach is to slide windows of different sizes over a time sequence and build autoregression (AR) models from the subsequences in each window. The models are representations of the past behavior of the sequence objects. We can use the AR coefficients as features to index subsequences, facilitating queries for subsequences with similar behavior; we can use a clustering algorithm to group time sequences by their similarity in this feature space; and we can use the AR models for prediction within different windows. Our experiments show that the adaptive model gives better predictions than non-adaptive models.
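The windowed-AR idea in the abstract can be sketched as follows. This is an illustrative reconstruction, not the paper's code: it fits an AR(1) coefficient by least squares in each sliding window (the paper uses AR models of unspecified order), uses the coefficients as behavioural features, and iterates the fitted model for prediction.

```python
# Sketch of the windowed-AR technique: slide a window over a time sequence,
# fit an AR(1) model in each window, and use the coefficient as a feature.
# Window size and AR order here are illustrative choices, not the paper's.

def ar1_coefficient(window):
    """Least-squares AR(1) coefficient phi for x[t] ~ phi * x[t-1]."""
    num = sum(window[t] * window[t - 1] for t in range(1, len(window)))
    den = sum(window[t - 1] ** 2 for t in range(1, len(window)))
    return num / den if den else 0.0

def ar_features(series, window_size):
    """One AR(1) coefficient per sliding window -> feature vector used
    for indexing and clustering subsequences with similar behavior."""
    return [ar1_coefficient(series[i:i + window_size])
            for i in range(len(series) - window_size + 1)]

def ar1_forecast(window, steps=1):
    """Predict future values by iterating x[t+1] = phi * x[t]."""
    phi = ar1_coefficient(window)
    preds, last = [], window[-1]
    for _ in range(steps):
        last = phi * last
        preds.append(last)
    return preds
```

For a geometric sequence x[t] = 0.9^t, every window yields a coefficient of 0.9, so all subsequences land at the same point in feature space, which is exactly what makes AR coefficients usable as similarity features.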
Explicit probabilistic models for databases and networks
Recent work in data mining and related areas has highlighted the importance
of the statistical assessment of data mining results. Crucial to this endeavour
is the choice of a non-trivial null model for the data, to which the found
patterns can be contrasted. The most influential null models proposed so far
are defined in terms of invariants of the null distribution. Such null models
can be used by computation-intensive randomization approaches in estimating the
statistical significance of data mining results.
Here, we introduce a methodology to construct non-trivial probabilistic
models based on the maximum entropy (MaxEnt) principle. We show how MaxEnt
models allow for the natural incorporation of prior information. Furthermore,
they satisfy a number of desirable properties of previously introduced
randomization approaches. Lastly, they also have the benefit that they can be
represented explicitly. We argue that our approach can be used for a variety of
data types. However, for concreteness, we have chosen to demonstrate it in
particular for databases and networks.
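One concrete instance of an explicitly representable MaxEnt model for binary databases is the distribution over 0/1 matrices whose expected row and column sums match the observed margins; it factorizes into independent Bernoulli cells with p[i][j] = sigmoid(lam[i] + mu[j]). The sketch below (an illustration of this general construction, not the paper's implementation) fits the dual parameters by plain gradient ascent on the moment-matching conditions.

```python
import math

# Illustrative MaxEnt null model for a binary database with row/column
# margin constraints: cells are independent Bernoulli with
# p[i][j] = sigmoid(lam[i] + mu[j]); lam, mu fitted by gradient ascent
# so that expected margins match observed margins.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_maxent(matrix, steps=5000, lr=0.1):
    n, m = len(matrix), len(matrix[0])
    row = [sum(r) for r in matrix]
    col = [sum(matrix[i][j] for i in range(n)) for j in range(m)]
    lam, mu = [0.0] * n, [0.0] * m
    for _ in range(steps):
        p = [[sigmoid(lam[i] + mu[j]) for j in range(m)] for i in range(n)]
        for i in range(n):
            lam[i] += lr * (row[i] - sum(p[i]))          # match row sums
        for j in range(m):
            mu[j] += lr * (col[j] - sum(p[i][j] for i in range(n)))
    return [[sigmoid(lam[i] + mu[j]) for j in range(m)] for i in range(n)]
```

Because the model is explicit, the probability of any found pattern under the null can be computed directly from the fitted cell probabilities instead of by repeated randomization.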
Time series forecasting on crime data in Amsterdam for a software company
Internship report presented as a partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics. In recent years, there has been much discussion of applying data mining technology in the fight against terrorism and crime. Sentient, a software company, has supported the police for years by applying data mining techniques in its DataDetective application (Sentient, 2017). The objectives of this internship were to experiment with various types of predictive models and to select the most efficient and promising solution.
Initially, an extensive literature review was conducted in the fields of data mining, crime analysis and crime data mining. Sentient provided seven years of crime data, which was aggregated on a daily basis to create a univariate dataset; a daily aggregation by incident type was also performed to create a multivariate dataset. The prediction length for each solution was 7 days. The experiments were divided into two major categories: statistical models and neural network models. The neural networks outperformed the statistical models on the crime data.
This report provides an overview of the statistical and neural network models used. A comparative study of all the models on the same dataset gives a clear picture of their performance on the available data and of their generalization capability. The experiments showed that gated recurrent units (GRUs) produced better predictions than the other models. In conclusion, a gated recurrent unit implementation could benefit the police in predicting crime; time series analysis using a GRU could thus be a prospective additional feature in DataDetective.
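The report's GRU and statistical models are not reproduced here, but its evaluation setup (forecast the next 7 daily counts, score against held-out days) can be sketched with a minimal baseline. The seasonal-naive forecaster below, which simply repeats last week's counts, is a hypothetical reference point any candidate model would have to beat, not one of the report's models.

```python
# Minimal sketch of the 7-day-ahead evaluation setup described in the
# report: produce a 7-value forecast and score it with mean absolute
# error. The seasonal-naive baseline repeats the most recent week.

HORIZON = 7  # the report's prediction length, in days

def seasonal_naive_forecast(history, horizon=HORIZON, season=7):
    """Repeat the most recent full season as the forecast."""
    return [history[-season + (h % season)] for h in range(horizon)]

def mae(actual, predicted):
    """Mean absolute error between observed and forecast daily counts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
```

On daily crime counts with a strong weekly rhythm this baseline is hard to beat, which is why a learned model such as a GRU is only worth deploying if it scores a lower MAE on the same held-out horizon.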
Statistical and Machine Learning Models for Remote Sensing Data Mining - Recent Advancements
This book is a reprint of the Special Issue entitled "Statistical and Machine Learning Models for Remote Sensing Data Mining - Recent Advancements" that was published in Remote Sensing, MDPI. It provides insights into both core technical challenges and selected critical applications of satellite remote sensing image analytics.
Flint International Statistics Conference Agenda
Conference Agenda—Keynote Speakers, Invited & Contributed Talks, Posters, Field Trips Tuesday, June 24 Wednesday, June 25 Thursday, June 26 Friday, June 27 Saturday, June 28
Select Sessions: Elart von Collani “Statistics as a general tool for all sciences.” Francesca Greselin “Measuring inequality at the time of the Great Divergence.” Ernest Fokoue “Recent Applications of Statistical Data Mining for Big Data Predictive Analysis.” Vladimir Kaishev “Probability and statistics in actuarial applications.” Galia Novikova “Data Mining for Software Development Quality Management.” Leda Minkova “Stochastic Models and Statistical Applications.” Krzysztof Podgorski “Non-Gaussian stochastic models: theory and applications.” Kristina Sendova “Risk measures, probability measures and mortality.”
Log-based Evaluation of Label Splits for Process Models
Process mining techniques aim to extract insights into processes from event
logs. One of the challenges in process mining is identifying interesting and
meaningful event labels that contribute to a better understanding of the
process. Our application area is mining data from smart homes for elderly,
where the ultimate goal is to signal deviations from usual behavior and provide
timely recommendations in order to extend the period of independent living.
Extracting individual process models showing user behavior is an important
instrument in achieving this goal. However, the interpretation of sensor data
at an appropriate abstraction level is not straightforward. For example, a
motion sensor in a bedroom can be triggered by tossing and turning in bed or by
getting up. We try to derive the actual activity depending on the context
(time, previous events, etc.). In this paper we introduce the notion of label
refinements, which links more abstract event descriptions with their more
refined counterparts. We present a statistical evaluation method to determine
the usefulness of a label refinement for a given event log from a process
perspective. Based on data from smart homes, we show how our statistical
evaluation method for label refinements can be used in practice. Our method was
able to select two label refinements out of a set of candidate label
refinements that both had a positive effect on model precision. Comment: Paper accepted at the 20th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, to appear in Procedia Computer Science.
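The bedroom-sensor example in the abstract can be made concrete with a small sketch. The cut-off hours and label names below are hypothetical illustrations, not taken from the paper; the paper's actual contribution is the statistical evaluation of whether such a refinement improves the mined model, which this sketch does not reproduce.

```python
# Illustrative label refinement: split the coarse "motion_bedroom" event
# label using the time-of-day context, so that nighttime triggers
# (tossing and turning) and daytime triggers (getting up) become
# distinct, more refined event labels. Hour thresholds are hypothetical.

def refine_label(label, hour):
    """Map a (label, hour-of-day) pair to a refined label."""
    if label != "motion_bedroom":
        return label
    if hour >= 22 or hour < 6:
        return "motion_bedroom_night"   # e.g. tossing and turning in bed
    return "motion_bedroom_day"         # e.g. getting up

def refine_log(events):
    """events: list of (label, hour) pairs -> refined label sequence."""
    return [refine_label(label, hour) for label, hour in events]
```

A process model mined from the refined log can then be compared against one mined from the original log, which is the kind of comparison the paper's evaluation method makes precise.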
Scalable k-Means Clustering via Lightweight Coresets
Coresets are compact representations of data sets such that models trained on
a coreset are provably competitive with models trained on the full data set. As
such, they have been successfully used to scale up clustering models to massive
data sets. While existing approaches generally only allow for multiplicative
approximation errors, we propose a novel notion of lightweight coresets that
allows for both multiplicative and additive errors. We provide a single
algorithm to construct lightweight coresets for k-means clustering as well as
soft and hard Bregman clustering. The algorithm is substantially faster than
existing constructions, embarrassingly parallel, and the resulting coresets are
smaller. We further show that the proposed approach naturally generalizes to
statistical k-means clustering and that, compared to existing results, it can
be used to compute smaller summaries for empirical risk minimization. In
extensive experiments, we demonstrate that the proposed algorithm outperforms
existing data summarization strategies in practice. Comment: To appear in the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD).
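The lightweight-coreset construction the abstract describes admits a very short sketch: sample points from a distribution that mixes a uniform term with a term proportional to squared distance from the data mean, and attach importance weights. The code below is a minimal illustration of that sampling scheme, not the authors' implementation, and omits the paper's approximation-guarantee machinery.

```python
import random

# Sketch of a lightweight coreset for k-means: sampling probability is
# half uniform, half proportional to squared distance from the data mean;
# each sampled point gets importance weight 1 / (m * q(x)).

def lightweight_coreset(points, m, rng=random):
    n, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    dists = [sum((p[k] - mean[k]) ** 2 for k in range(d)) for p in points]
    total = sum(dists)
    if total > 0:
        q = [0.5 / n + 0.5 * dist / total for dist in dists]
    else:
        q = [1.0 / n] * n        # all points identical: uniform sampling
    idx = rng.choices(range(n), weights=q, k=m)  # i.i.d. with replacement
    return [(points[i], 1.0 / (m * q[i])) for i in idx]
```

The single pass over the data (mean, then distances, then one sampling step) is what makes the construction embarrassingly parallel and faster than importance-sampling schemes that need a bicriteria solution first.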
Using Biotic Interaction Networks for Prediction in Biodiversity and Emerging Diseases
Networks offer a powerful tool for understanding and visualizing inter-species interactions within an ecology. Previously considered examples, such as trophic networks, are representations of experimentally observed direct interactions. However, species interactions are so rich and complex that it is not feasible to observe more than a small fraction of them directly. In this paper, using data mining techniques, we show how potential interactions can be inferred from geographic data rather than by direct observation. An important application area for such a methodology is that of emerging diseases, where little is often known about inter-species interactions, such as those between vectors and reservoirs. Here, we show how, using geographic data, biotic interaction networks that model statistical dependencies between species distributions can be used to infer and understand inter-species interactions. Furthermore, we show how such networks can be used to build prediction models, for example, for predicting the most important reservoirs of a disease or the degree of disease risk associated with a geographical area. We illustrate the general methodology by considering an important emerging disease: leishmaniasis. This data mining approach allows the use of geographic data to construct inferential biotic interaction networks, which can then be used to build prediction models with a wide range of applications in ecology, biodiversity and emerging diseases.
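One simple way to model a statistical dependency between species distributions from geographic data is to score co-occurrence across grid cells. The sketch below uses pointwise mutual information as the dependency score and a hypothetical threshold for adding network edges; it illustrates the general idea of inferring a biotic interaction network from occurrence data, not the paper's specific pipeline or score.

```python
import math

# Illustrative inference of a biotic interaction network from geographic
# occurrence data: species that co-occur across grid cells more often
# than independence predicts (positive pointwise mutual information)
# get an edge. Score and threshold are hypothetical choices.

def pmi(cells_a, cells_b, n_cells):
    """cells_a, cells_b: sets of grid-cell ids where each species occurs."""
    pa = len(cells_a) / n_cells
    pb = len(cells_b) / n_cells
    pab = len(cells_a & cells_b) / n_cells
    if pab == 0:
        return float("-inf")        # never co-occur: no positive evidence
    return math.log(pab / (pa * pb))

def interaction_network(species_cells, n_cells, threshold=0.0):
    """species_cells: dict name -> set of cells. Returns scored edges."""
    names = sorted(species_cells)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = pmi(species_cells[a], species_cells[b], n_cells)
            if score > threshold:
                edges.append((a, b, score))
    return edges
```

In the disease setting, edges between a vector species and candidate hosts can then feed a prediction model, for instance by ranking candidate reservoirs by the strength of their inferred dependency on the vector.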