303 research outputs found
Towards Efficient Sequential Pattern Mining in Temporal Uncertain Databases
Uncertain sequence databases are widely used to model data with inaccurate or imprecise timestamps in many real-world applications. In this paper, we use uniform distributions to model uncertain timestamps and adopt possible-world semantics to interpret temporal uncertain databases. We design an incremental approach to manage temporal uncertainty efficiently, which is integrated into the classic pattern-growth SPM algorithm to mine uncertain sequential patterns. Extensive experiments show that our algorithm performs well in both efficiency and scalability.
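The core of this model can be sketched concretely. Under possible-world semantics, each assignment of point timestamps drawn from the events' uniform distributions is one possible world, and the probability that one event precedes another is the fraction of worlds where it does. The following is a minimal Monte Carlo sketch of that idea (the paper's incremental approach computes such probabilities exactly; the function name and intervals here are illustrative):

```python
import random

def prob_before(a, b, n=100_000, seed=42):
    """Estimate P(tA < tB) where tA ~ U[a[0], a[1]] and tB ~ U[b[0], b[1]].

    Each sampled pair of timestamps is one possible world; we count the
    fraction of worlds in which the first event precedes the second.
    """
    rng = random.Random(seed)
    hits = sum(rng.uniform(*a) < rng.uniform(*b) for _ in range(n))
    return hits / n

p = prob_before((0.0, 1.0), (0.5, 1.5))
```

For these two overlapping intervals the exact answer is 0.875, which the estimate approaches as `n` grows; for disjoint intervals the probability is trivially 0 or 1.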
HIGH-PERFORMANCE COMPLEX EVENT PROCESSING FOR DECISION ANALYTICS
Complex Event Processing (CEP) systems are becoming increasingly popular in domains for decision analytics such as financial services, transportation, cluster monitoring, supply chain management, business process management, and health care. These systems collect or create high volumes of event streams, and often require such streams to be processed in real time. To this end, CEP queries are applied for filtering, correlation, aggregation, and transformation to derive high-level, actionable information. Tasks for CEP systems fall into two categories: passive monitoring and proactive monitoring. In passive monitoring, users know their exact needs and express them in CEP queries, which CEP engines then evaluate against incoming data events; in proactive monitoring, users cannot tell exactly what they are looking for and must work with the CEP engine to figure out the query. My thesis makes contributions in both categories.
For passive monitoring, my first contribution is to apply CEP queries over streams with imprecise timestamps, which was infeasible before this work. Existing CEP systems assumed that the occurrence time of each event is known precisely. However, I observe that event occurrence times are often unknown or imprecise due to lossy raw data, granularity mismatches, or clock synchronization. I therefore propose a temporal model that assigns a time interval to each event to represent all of its possible occurrence times. Under this uncertain temporal model, I further propose two evaluation frameworks: a point-based framework, which converts events with time intervals into events with point timestamps before pattern matching, and an event-based framework, which matches patterns directly over events with time intervals. I also propose optimizations within these frameworks. My new approach achieves high efficiency for a wide range of workloads tested using both real traces and synthetic datasets. While existing systems cannot process streams of this type, my algorithm achieves throughput of up to tens of thousands of events per second in a MapReduce case study. This contribution makes CEP techniques applicable to more application scenarios.
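The event-based framework's key move, carrying intervals through matching instead of collapsing them to points, can be sketched as follows. This is a hypothetical, simplified version: the `Event` type, the existential "may precede" check, and the naive sequence matcher are illustrative stand-ins for the thesis's actual operators and optimizations.

```python
from dataclasses import dataclass

@dataclass
class Event:
    etype: str
    lo: float   # earliest possible occurrence time
    hi: float   # latest possible occurrence time

def may_precede(x: Event, y: Event) -> bool:
    """True iff some assignment of point timestamps within the two
    intervals puts x strictly before y (an existential check)."""
    return x.lo < y.hi

def match_seq(events, pattern):
    """Return matches of SEQ(pattern) over interval events, chaining
    the 'may precede' test instead of comparing point timestamps."""
    matches = []
    def extend(prefix, rest, start_idx):
        if not rest:
            matches.append(prefix)
            return
        for i in range(start_idx, len(events)):
            e = events[i]
            if e.etype == rest[0] and (not prefix or may_precede(prefix[-1], e)):
                extend(prefix + [e], rest[1:], i + 1)
    extend([], list(pattern), 0)
    return matches
```

A point-based framework would instead pick representative point timestamps up front and run a conventional matcher, trading some matches (and their confidences) for simplicity.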
My second contribution to passive monitoring is to identify expensive queries in CEP, analyze their runtime complexity, and propose effective optimizations that improve their performance significantly. These expensive queries involve Kleene closure patterns, flexible event selection strategies, and events with imprecise timestamps. I analyze the runtime complexity of each language component and identify two performance bottlenecks: Kleene closure under the most flexible event selection strategy, and confidence computation in the presence of imprecise timestamps. For the first bottleneck, I break query evaluation into two parts: pattern matching, which can be shared by many matches, and result construction. Optimizations for the shared pattern matching cut cost from exponential to polynomial time, and even close to linear. To address the second bottleneck, I design a dynamic programming algorithm to improve performance. Microbenchmark results show that state-of-the-art systems suffer poor performance, while my system provides 2 to 10 orders of magnitude of improvement. A thorough case study on Hadoop cluster monitoring further demonstrates the efficiency and effectiveness of my proposed techniques: the throughput is over 1 million events per second.
The last problem solved in this thesis concerns proactive monitoring: explaining anomalies in CEP-based monitoring. CEP queries are widely used for monitoring purposes. When users observe abnormal status in the monitoring results, they annotate the abnormal period and a reference period. The system then generates explanations by analyzing stream events, and these explanations can be encoded into CEP queries for future monitoring of similar anomalies. An entropy-based distance function is designed to select features for explanation. The new distance function reduces the number of features needed to find the ground truth by up to 99.2% compared to state-of-the-art distance functions for time series. A cluster-based auto-labeling algorithm is also designed to leverage unlabeled data to filter out noisy features. Compared with alternative techniques, the generated results improve explanation quality by up to 800%, reduce the number of features by 93.8% for conciseness, and match the prediction quality of other techniques. The implementation is also efficient: with 2,000 concurrent monitoring queries, triggered explanation analysis returns explanations within a minute and affects performance only slightly, delaying event processing by less than 1 second.
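The intuition behind an entropy-style distance can be illustrated with a small sketch: score a numeric feature by how cleanly a single threshold separates its values in the abnormal period from those in the reference period; lower conditional entropy means cleaner separation and a better explanation candidate. The thesis's actual distance function differs in its details; this only demonstrates the idea.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def split_distance(abnormal, reference):
    """Best (lowest) weighted conditional entropy over all single-threshold
    splits of a feature's values, labeled by which period they came from.
    0.0 means the feature perfectly separates the two periods."""
    pts = sorted([(v, "abn") for v in abnormal] + [(v, "ref") for v in reference])
    labels = [lab for _, lab in pts]
    best = entropy(labels)  # baseline: no split at all
    for i in range(1, len(labels)):
        left, right = labels[:i], labels[i:]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        best = min(best, h)
    return best
```

A feature whose abnormal-period values all exceed its reference-period values scores 0.0 and would be ranked first as an explanation; heavily interleaved features score close to the unsplit entropy and are pruned.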
SemImput: Bridging Semantic Imputation with Deep Learning for Complex Human Activity Recognition
The recognition of activities of daily living (ADL) in smart environments is a well-known and important research area, which presents the real-time state of humans in pervasive computing. The process of recognizing human activities generally involves deploying a set of obtrusive and unobtrusive sensors, pre-processing the raw data, and building classification models using machine learning (ML) algorithms. Integrating data from multiple sensors is a challenging task due to the dynamic nature of data sources, and is further complicated by semantic and syntactic differences among them. These differences become even more complex if the generated data is imperfect, which ultimately has a direct impact on its usefulness in yielding an accurate classifier. In this study, we propose a semantic imputation framework to improve the quality of sensor data using ontology-based semantic similarity learning. This is achieved by identifying semantic correlations among sensor events through SPARQL queries and by performing time-series longitudinal imputation. Furthermore, we applied a deep learning (DL) based artificial neural network (ANN) to public datasets to demonstrate the applicability and validity of the proposed approach. The results showed higher accuracy with semantically imputed datasets using the ANN. We also present a detailed comparative analysis against the state of the art from the literature. Our semantically imputed datasets improved classification accuracy, reaching up to 95.78%, demonstrating the effectiveness and robustness of the learned models.
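The longitudinal half of the imputation can be sketched as filling a missing sensor reading from the most recent earlier day that has a value in the same time slot. This is a deliberately minimal sketch of time-series longitudinal imputation; the paper's semantic version additionally uses ontology-derived correlations (via SPARQL) to choose which donor events are valid.

```python
def longitudinal_impute(days):
    """Fill gaps (None) in daily sensor traces: for each missing slot,
    copy the value from the nearest earlier day with data in that slot.
    days: list of equal-length rows, one per day, one entry per time slot.
    """
    filled = [row[:] for row in days]          # don't mutate the input
    for d, row in enumerate(filled):
        for t, v in enumerate(row):
            if v is None:
                for prev in range(d - 1, -1, -1):
                    if filled[prev][t] is not None:
                        filled[d][t] = filled[prev][t]
                        break
    return filled
```

Slots that are missing on every prior day remain `None`, which is where the semantic similarity component would step in to find a correlated sensor instead.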
Sequential pattern mining with uncertain data
In recent years, a number of emerging applications, such as sensor monitoring systems, RFID networks, and location-based services, have led to the proliferation of uncertain data. However, traditional data mining algorithms are usually inapplicable to uncertain data because of its probabilistic nature. Uncertainty has to be handled carefully; otherwise, it might significantly degrade the quality of the underlying data mining applications.
We therefore extend traditional data mining algorithms into uncertain versions so that they can still produce accurate results. In particular, we use the motivating example of sequential pattern mining to illustrate how to incorporate uncertain information into the data mining process. We use possible-world semantics to interpret two typical types of uncertainty: tuple-level existential uncertainty and attribute-level temporal uncertainty. In an uncertain database, whether a pattern is frequent is itself probabilistic; we therefore define the concept of probabilistic frequent sequential patterns, and design various algorithms to mine them efficiently in uncertain databases. We also implement our algorithms on distributed computing platforms, such as MapReduce and Spark, so that they can be applied to large-scale databases.
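The notion of a probabilistic frequent pattern can be made concrete with a small sketch. If sequence i contains a pattern independently with probability p_i, the pattern's support is a Poisson-binomial random variable, and a standard dynamic program gives the probability that the support reaches a minimum threshold; the pattern is probabilistically frequent when that probability exceeds a user-chosen confidence. This is the textbook DP, not necessarily the thesis's exact algorithm:

```python
def frequentness_probability(occ_probs, minsup):
    """P(support >= minsup) when sequence i contains the pattern with
    probability occ_probs[i], independently (Poisson-binomial DP)."""
    # dp[k] = P(exactly k of the sequences seen so far contain the pattern)
    dp = [1.0]
    for p in occ_probs:
        ndp = [0.0] * (len(dp) + 1)
        for k, q in enumerate(dp):
            ndp[k] += q * (1 - p)      # this sequence lacks the pattern
            ndp[k + 1] += q * p        # this sequence contains it
        dp = ndp
    return sum(dp[minsup:])
```

For example, with two sequences each containing the pattern with probability 0.5, the pattern reaches support 1 with probability 0.75 and support 2 with probability 0.25. The DP is O(n * minsup) if rows are truncated, which is what makes mining at scale feasible.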
Our work also includes uncertainty computation in supervised machine learning algorithms. We develop an artificial neural network to classify numeric uncertain data, and a Naive Bayesian classifier for classifying categorical uncertain data streams. We also propose a discretization algorithm to pre-process numerical uncertain data, since many classifiers work with categorical data only. Experimental results on both synthetic and real-world uncertain datasets demonstrate that our methods are effective and efficient.
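One natural way to discretize a numeric uncertain value, sketched below under the assumption that the value is uniform over an interval, is to assign it fractional membership in each bin equal to the probability mass the interval places there. The bin edges and uniform model here are illustrative; the thesis's discretization algorithm may choose edges differently.

```python
def discretize_uncertain(lo, hi, edges):
    """Map an uncertain numeric value modeled as U[lo, hi] to fractional
    bin memberships over bins [edges[i], edges[i+1])."""
    width = hi - lo
    masses = []
    for a, b in zip(edges, edges[1:]):
        overlap = max(0.0, min(hi, b) - max(lo, a))
        masses.append(overlap / width if width > 0 else float(a <= lo < b))
    return masses
```

A downstream categorical classifier (such as the Naive Bayes variant mentioned above) can then consume these fractional memberships as expected counts instead of hard bin assignments.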
Handling Real-World Context Awareness, Uncertainty and Vagueness in Real-Time Human Activity Tracking and Recognition with a Fuzzy Ontology-Based Hybrid Method
Human activity recognition is a key task in ambient intelligence applications to achieve proper ambient assisted living. There has been remarkable progress in this domain, but some challenges still remain to obtain robust methods. Our goal in this work is to provide a system that allows the modeling and recognition of a set of complex activities in real-life scenarios involving interaction with the environment. The proposed framework is a hybrid model that comprises two main modules: a low-level sub-activity recognizer, based on data-driven methods, and a high-level activity recognizer, implemented with a fuzzy ontology to include the semantic interpretation of actions performed by users. The fuzzy ontology is fed by the sub-activities recognized by the low-level data-driven component and provides fuzzy ontological reasoning to recognize both the activities and their influence on the environment with semantics. An additional benefit of the approach is the ability to handle vagueness and uncertainty in the knowledge-based module, which substantially outperforms the treatment of incomplete and/or imprecise data with respect to classic crisp ontologies. We validate these advantages on the public CAD-120 dataset (Cornell Activity Dataset), achieving an accuracy of 90.1% and 91.07% for low-level and high-level activities, respectively. This entails an improvement over fully data-driven or ontology-based approaches.
This work was funded by TUCS (Turku Centre for Computer Science), the Finnish Cultural Foundation, the Nokia Foundation, the Google Anita Borg Scholarship, CEI BioTIC Project CEI2013-P-3, the Contrato-Programa of the Faculty of Education, Economy and Technology of Ceuta, and Project TIN2012-30939 from the National I+D Research Program (Spain). We also thank Fernando Bobillo for his support with the FuzzyOWL and FuzzyDL tools.
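The coupling between the two modules can be illustrated with a toy fuzzy rule: the degree to which a high-level activity holds is the combination, here via the Gödel (minimum) t-norm, of the membership degrees of the sub-activities it requires. The sub-activity names below are invented for illustration, and fuzzy-ontology reasoners such as FuzzyDL support richer axioms and other t-norms.

```python
def fuzzy_activity_degree(subactivity_degrees, required):
    """Degree of a high-level activity under the min t-norm, given fuzzy
    membership degrees for recognized sub-activities. A sub-activity the
    low-level recognizer never reported contributes degree 0.0."""
    return min(subactivity_degrees.get(s, 0.0) for s in required)

# Hypothetical output of the low-level, data-driven recognizer:
degrees = {"reach_food": 0.9, "move_to_microwave": 0.7, "open_door": 0.8}
```

Under this rule, a "heating food" activity requiring all three sub-activities would hold to degree 0.7, bounded by its weakest supporting observation; a crisp ontology would instead force each sub-activity to a hard yes/no first.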
Machine Learning Methods for Activity Detection in Wearable Sensor Data Streams
Wearable wireless sensors have the potential for transformative impact on the fields of health and behavioral science. Recent advances in wearable sensor technology have made it possible to simultaneously collect multiple streams of physiological and context data from individuals in natural environments; however, extracting reliable high-level inferences from these raw data streams remains a key data analysis challenge. In this dissertation, we address three challenges that arise when trying to perform activity detection from wearable sensor streams. First, we address the challenge of learning from small amounts of noisy data by proposing a class of conditional random field models for activity detection. We apply this model class to three different activity detection problems, improving performance in all three compared with standard independent and structured models. Second, we address the challenge of inferring activities from long input sequences by evaluating strategies for pruning the inference dynamic programs used in structured prediction models. We apply these strategies to the proposed structured activity detection models, resulting in inference speedups ranging from 66x to 257x with little to no decrease in predictive performance. Finally, we address the challenge of learning from imprecise annotations by proposing a weak supervision framework for learning discrete-time detection models from imprecise continuous-time observations. We apply this framework to both independent and structured models and demonstrate improved performance over weak supervision baselines.
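The pruning idea behind the reported speedups can be sketched with beam pruning applied to a Viterbi-style dynamic program: at each time step, keep only the top-scoring states and extend the recurrence from those alone. This is a generic illustration under simple emission/transition scores, not the dissertation's exact CRF models or pruning criteria.

```python
def viterbi_beam(log_emit, log_trans, beam=2):
    """Viterbi decoding with beam pruning.
    log_emit: T x S per-step state scores; log_trans: S x S transition scores.
    Only the `beam` best states from each step survive into the next."""
    S = len(log_emit[0])
    scores = {s: log_emit[0][s] for s in range(S)}
    backptrs = []
    for t in range(1, len(log_emit)):
        # prune: keep only the top-`beam` states from the previous step
        kept = dict(sorted(scores.items(), key=lambda kv: -kv[1])[:beam])
        new, bp = {}, {}
        for s in range(S):
            prev, val = max(((p, v + log_trans[p][s]) for p, v in kept.items()),
                            key=lambda pv: pv[1])
            new[s] = val + log_emit[t][s]
            bp[s] = prev
        scores = new
        backptrs.append(bp)
    # backtrace from the best surviving final state
    path = [max(scores, key=scores.get)]
    for bp in reversed(backptrs):
        path.append(bp[path[-1]])
    return path[::-1]
```

Pruning makes each step cost O(beam * S) instead of O(S^2), which is where large speedups on long sequences come from; the risk, evaluated empirically in the dissertation, is that the true best path is discarded at some step.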