On Finding Frequent Patterns in Event Sequences
Given a directed acyclic graph with labeled vertices, we consider the problem
of finding the most common label sequences ("traces") among all paths in the
graph (of some maximum length m). Since the number of paths can be huge, we
propose novel algorithms whose time complexity depends only on the size of the
graph, and on the frequency epsilon of the most frequent traces. In addition,
we apply techniques from streaming algorithms to achieve space usage that
depends only on epsilon, and not on the number of distinct traces. The abstract
problem considered models a variety of tasks concerning finding frequent
patterns in event sequences. Our motivation comes from working with a data set
of 2 million RFID readings from baggage trolleys at Copenhagen Airport. The
question of finding frequent passenger movement patterns is mapped to the above
problem. We report on experimental findings for this data set.
Comment: Appears in proceedings of ICDM '10: The 10th IEEE International
Conference on Data Mining. Publisher: IEEE
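To make the problem statement in the abstract above concrete, here is a minimal sketch and not the paper's algorithm. It enumerates every path of at most m vertices naively (exactly the cost the paper sets out to avoid) and feeds the resulting traces to a Misra-Gries summary, a standard streaming technique whose space depends only on the frequency threshold. The function names `traces` and `misra_gries`, and the choice of Misra-Gries itself, are assumptions made for illustration.

```python
def traces(graph, labels, m):
    """Yield the label sequence of every path with 1..m vertices.

    graph:  dict mapping a vertex to a list of successor vertices (a DAG).
    labels: dict mapping every vertex to its label.
    Naive enumeration, for illustration only.
    """
    def extend(v, trace):
        yield trace
        if len(trace) < m:
            for w in graph.get(v, ()):
                yield from extend(w, trace + (labels[w],))

    for v in labels:
        yield from extend(v, (labels[v],))


def misra_gries(stream, k):
    """Misra-Gries summary with at most k - 1 counters.

    Any item occurring more than n/k times in a stream of length n is
    guaranteed to remain in the result; counts are lower bounds.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

With a frequency threshold epsilon, choosing k of roughly 1/epsilon keeps the number of counters independent of how many distinct traces exist, which is the property the abstract attributes to its streaming techniques.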
Knowledge Discovery from Satellite Images for Drought Monitoring in Food Insecure Areas
Owing to climatic change and uncertain weather conditions, drought has become a recurrent phenomenon. It manifests as erratic and uncertain rainfall distribution in rainfall-dependent farming areas. Drought monitoring has so far relied on conventional methods that depend on the availability of meteorological data. The objectives of this research were to: 1) identify the critical factors for efficiently implementing geo-spatial information for drought monitoring, 2) develop a new approach for extracting knowledge from satellite imagery for real-time drought monitoring in food insecure areas, and 3) validate and calibrate the new approach for national and regional applications. For this research, satellite data from MSG and NOAA AVHRR were used. The preliminary results confirmed that real-time MSG satellite data can be used for monitoring drought in food insecure areas. The output of this research helps decision makers take appropriate and timely action, using advanced satellite technology, to save millions of lives in drought-affected areas.
Motif Discovery in Physiological Datasets: A Methodology for Inferring Predictive Elements
In this article, we propose a methodology for identifying predictive physiological patterns in the absence of prior knowledge. We use the principle of conservation to identify activity that consistently precedes an outcome in patients, and describe a two-stage process that allows us to efficiently search for such patterns in large datasets. This involves first transforming continuous physiological signals from patients into symbolic sequences, and then searching for patterns in these reduced representations that are strongly associated with an outcome.
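The two-stage process described above can be illustrated with a minimal sketch. It assumes fixed amplitude breakpoints for the symbolization step and plain k-gram counting for the search step; the article's actual symbolization scheme and its association scoring are not specified here, and the names `symbolize` and `kgram_counts` are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def symbolize(signal, breakpoints, alphabet="abc"):
    """Stage 1: map each sample to a letter by the interval it falls in.

    breakpoints must be sorted and have len(alphabet) - 1 entries.
    """
    return "".join(alphabet[bisect_right(breakpoints, x)] for x in signal)


def kgram_counts(sequence, k):
    """Stage 2 (simplified): count every length-k substring of the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
```

In practice the counts from sequences preceding an outcome would be compared against counts from control sequences; that comparison is left out of this sketch.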
Our strategy of identifying conserved activity that is unlikely to have occurred purely by chance in symbolic data is analogous to the discovery of regulatory motifs in genomic datasets. We build upon existing work in this area, generalizing the notion of a regulatory motif and enhancing current techniques to operate robustly on non-genomic data. We also address two significant considerations associated with motif discovery in general: computational efficiency and robustness in the presence of degeneracy and noise. To deal with these issues, we introduce the concept of active regions and new subset-based techniques such as a two-layer Gibbs sampling algorithm. These extensions allow for a framework for information inference, where precursors are identified as approximately conserved activity of arbitrary complexity preceding multiple occurrences of an event.
We evaluated our solution on a population of patients who experienced sudden cardiac death and attempted to discover electrocardiographic activity that may be associated with the endpoint of death. To assess the predictive patterns discovered, we compared likelihood scores for motifs in the sudden death population against control populations of normal individuals and those with non-fatal supraventricular arrhythmias. Our results suggest that predictive motif discovery may be able to identify clinically relevant information even in the absence of significant prior knowledge.
CIMIT: Center for Integration of Medicine and Innovative Technology
Harvard University--MIT Division of Health Sciences and Technology
Discovering Sequential Association Rules with Constraints and Time Lags in Multiple Sequences
We present MOWCATL, an efficient method for mining frequent sequential association rules from multiple sequential data sets with a time lag between the occurrence of an antecedent sequence and the corresponding consequent sequence. This approach finds patterns in one or more sequences that precede the occurrence of patterns in other sequences, with respect to user-specified constraints. In addition to the traditional frequency and support constraints in sequential data mining, this approach uses separate antecedent and consequent inclusion constraints.
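The idea of a time-lagged rule can be illustrated with a small sketch. This is not MOWCATL itself: it only measures, for a single antecedent event and a single consequent event, how often the consequent follows within a lag window. The function name and the `min_lag` and `max_lag` parameters are assumptions made for illustration.

```python
def lagged_rule_confidence(antecedent_times, consequent_times, min_lag, max_lag):
    """Fraction of antecedent occurrences that are followed by at least one
    consequent occurrence between min_lag and max_lag time units later."""
    if not antecedent_times:
        return 0.0
    hits = sum(
        any(min_lag <= c - a <= max_lag for c in consequent_times)
        for a in antecedent_times
    )
    return hits / len(antecedent_times)
```

For antecedent times [1, 5, 10], consequent times [3, 12, 20] and a lag window of 1 to 3, two of the three antecedent occurrences are followed by a consequent, giving a confidence of 2/3.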
Features Extraction from Time Series
Time series can be found in various domains such as medicine, engineering, and finance. Generally speaking, a time series is a sequence of data that represents recorded values of a phenomenon over time. This thesis studies time series mining, including transformation and distance measures, anomaly detection, clustering and remaining useful life estimation.
For the first mining task (transformation and distance measures), we introduce a novel calculation of the distance between symbols in order to increase the accuracy of distance measurement between transformed (symbolic) series. We integrate this method into symbolic aggregate approximation and its extensions; the experimental results suggest that the proposed method is promising.
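For readers unfamiliar with symbolic aggregate approximation, the following is a minimal sketch of the standard transformation only: z-normalization, piecewise aggregate approximation, and discretization with equiprobable Gaussian breakpoints. The thesis's novel symbol distance is not described in this abstract and is not reproduced. The function name `sax` is hypothetical, and the sketch assumes a non-constant series whose length is divisible by the number of segments.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """Convert a numeric series into a SAX word of n_segments letters."""
    # 1. z-normalize (assumes the series is not constant).
    mu, sigma = mean(series), pstdev(series)
    z = [(x - mu) / sigma for x in series]

    # 2. Piecewise aggregate approximation: mean of each equal-length segment
    #    (assumes len(series) is divisible by n_segments).
    seg = len(z) // n_segments
    paa = [mean(z[i * seg:(i + 1) * seg]) for i in range(n_segments)]

    # 3. Breakpoints that split the standard normal into equiprobable regions.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]

    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)
```

A steadily rising series such as [1, 1, 2, 2, 3, 3, 4, 4] with four segments maps to the word "abcd".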
For the second mining task (anomaly detection), we propose a distance measure and an anomaly detection calculation to improve detection accuracy. These methods, together with previously published anomaly detection methods, are applied to real ECG data selected from the MIT-BIH database. The experimental results show that our proposed methods outperform the other methods.
For the third mining task (clustering), we present an automatic clustering method, called AT-means, which carries out clustering automatically for a given time series dataset: from the calculation of the global average time series to the setting of initial centres and the determination of the number of clusters. The performance of the proposed method was tested on 10 benchmark time series datasets obtained from the UCR database. For comparison, the K-means method under three different conditions is also applied to the same datasets. The experimental results show that the proposed method outperforms the compared K-means approaches.
For the fourth mining task (remaining useful life estimation), all the original data are transformed into a low-dimensional space through principal component analysis. We then propose a novel multidimensional time series distance measure, called multivariate time series warping distance (MTWD), for remaining useful life estimation. The whole process is tested on the CMAPSS (Commercial Modular Aero Propulsion System Simulation) datasets and its performance is compared with that of two existing methods. The experimental results show that the estimated remaining useful life (RUL) values are closer to the real RUL values than those produced by the comparison methods.
Our work contributes to time series mining by introducing novel approaches to distance measurement, anomaly detection, clustering and RUL estimation. We furthermore apply our proposed methods and related methods to benchmark datasets. The experimental results show that our methods are better than previously published methods in terms of accuracy and efficiency.
Un modèle hybride pour le support à l'apprentissage dans les domaines procéduraux et mal définis (A Hybrid Model for Supporting Learning in Procedural and Ill-Defined Domains)
To build intelligent tutoring systems capable of offering highly personalized assistance, a popular solution is to represent the relevant cognitive processes of learners using a cognitive model. However, these so-called cognitive tutors are applicable only to simple, well-defined domains, and they do not cover aspects related to spatial cognition. Moreover, knowledge acquisition for these systems is an arduous and time-consuming task. To address this problem, this thesis proposes a hybrid model that combines cognitive modeling with a novel data mining approach for automatically extracting domain knowledge from problem-solving traces recorded during use of the system. The data mining approach does not offer the fine granularity of cognitive modeling, but it makes it possible to extract partial problem spaces for ill-defined domains where cognitive modeling is not applicable. A hybrid model makes it possible to benefit from the advantages of both cognitive modeling and the data mining approach. Algorithms are presented for exploiting this knowledge, and the model has been applied in an ill-defined domain: learning to manipulate the Canadarm2 robotic arm.
AUTHOR KEYWORDS: Intelligent tutoring systems, spatial cognition, robotics, data mining