    Mining Weighted Frequent Closed Episodes over Multiple Sequences

    Frequent episode discovery is introduced to mine useful and interesting temporal patterns from sequential data. Existing episode mining methods have mainly focused on mining a single long sequence of events with time constraints. However, there can be multiple sequences of different importance, since the persons or entities associated with each sequence can be of different importance. Aiming to mine episodes in multiple sequences of different importance, we first define a new kind of episode, the weighted frequent closed episode, which takes sequence importance, episode distribution, and occurrence frequency into account together. Secondly, to facilitate the mining of such new episodes, we present a new concept, the maximal duration serial episode, which cuts a whole sequence into multiple maximal episodes under duration constraints, and we discuss its properties for episode shrinking. Finally, based on these theoretical properties, we propose a two-phase approach to efficiently mine the new episodes. In Phase I, we adopt a level-wise episode shrinking framework to discover candidate frequent closed episodes with the same prefixes, and in Phase II, we match candidates with different prefixes to find the frequent closed episodes. Experiments on simulated and real datasets demonstrate that the proposed episode mining strategy has good mining effectiveness and efficiency.
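
    As a rough illustration of the core quantity, the sketch below computes a weighted support for a serial episode over multiple weighted sequences. This is a simplified toy, not the paper's two-phase shrinking algorithm; the function names, the (weight, sequence) input format, and the single-duration-constraint occurrence test are all assumptions made for the example.

```python
# A toy weighted-support computation for a serial episode over multiple
# weighted sequences. Hypothetical names and input format; the paper's
# two-phase episode-shrinking algorithm is far more involved.

def occurs_within(episode, sequence, max_duration):
    """Return True if `episode` (a tuple of event types) occurs in order
    in `sequence` (a list of (timestamp, event) pairs), with all events
    falling within `max_duration` of the first matched event."""
    for start, (t0, e0) in enumerate(sequence):
        if e0 != episode[0]:
            continue
        idx = 1  # next episode position to match
        for t, e in sequence[start + 1:]:
            if idx == len(episode) or t - t0 > max_duration:
                break
            if e == episode[idx]:
                idx += 1
        if idx == len(episode):
            return True
    return False

def weighted_support(episode, weighted_sequences, max_duration):
    """weighted_sequences: list of (weight, sequence) pairs. Returns the
    weight-normalised fraction of sequences containing the episode."""
    total = sum(w for w, _ in weighted_sequences)
    hits = sum(w for w, s in weighted_sequences
               if occurs_within(episode, s, max_duration))
    return hits / total if total else 0.0

# A sequence from an important entity counts twice as much here:
seqs = [(2.0, [(0, "A"), (3, "B"), (5, "C")]),
        (1.0, [(0, "B"), (4, "A")])]
print(weighted_support(("A", "B"), seqs, max_duration=5))  # 2/3
```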

    The Minimum Description Length Principle for Pattern Mining: A Survey

    This is about the Minimum Description Length (MDL) principle applied to pattern mining. The length of this description is kept to the minimum. Mining patterns is a core task in data analysis and, beyond issues of efficient enumeration, the selection of patterns constitutes a major challenge. The MDL principle, a model selection method grounded in information theory, has been applied to pattern mining with the aim of obtaining compact, high-quality sets of patterns. After giving an outline of relevant concepts from information theory and coding, as well as of work on the theory behind the MDL and similar principles, we review MDL-based methods for mining various types of data and patterns. Finally, we open a discussion on some issues regarding these methods, and highlight currently active related data analysis problems.
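
    To make the two-part score concrete, the sketch below computes a toy description length L(M) + L(D|M) for a candidate pattern set over a transaction database, in the spirit of code-table approaches such as Krimp. The greedy cover, the unit-cost model encoding, and all names are assumptions of this illustration, not any specific published encoding.

```python
# A toy two-part MDL score L(M) + L(D|M) for a pattern set over a
# transaction database, in the spirit of code-table methods such as Krimp.
# The greedy cover and the unit-cost model encoding are simplifications.
import math

def description_length(patterns, transactions):
    """Cover each transaction greedily with `patterns` (longest first),
    falling back on singletons, and return the total bits needed."""
    singletons = {frozenset([i]) for t in transactions for i in t}
    code_table = [frozenset(p) for p in patterns] + list(singletons)
    usage = {c: 0 for c in code_table}
    for t in transactions:
        left = set(t)
        for c in sorted(code_table, key=len, reverse=True):
            if c and c <= left:
                usage[c] += 1
                left -= c
    total = sum(usage.values())
    if total == 0:
        return 0.0
    # L(D|M): Shannon code lengths derived from usage frequencies
    data_bits = sum(u * -math.log2(u / total) for u in usage.values() if u)
    # L(M): crude model cost of one unit per item per code-table entry
    model_bits = float(sum(len(c) for c in code_table))
    return model_bits + data_bits

# Lower is better: a good pattern set compresses the data it describes.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}]
print(description_length([{"a", "b"}], db) < description_length([], db))  # True
```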

    A Geoconservation perspective on the trace fossil record associated with the end-Ordovician mass extinction and glaciation in the Welsh Basin

    In this thesis I have illustrated the value of our geological heritage and geodiversity by focussing on a particular detailed aspect of the geological and palaeontological record, i.e. the trace fossil record associated with the end-Ordovician (Hirnantian) global glaciation and extinction episode. The major new elements of this work are:
    • a significantly improved understanding of the nature of the soft-sediment deformation, and in particular the role of "debrites" as basal landslide decollements in the Lower Palaeozoic Llangrannog rock succession of West Wales;
    • a much more detailed description of the trace fossil ichnocoenose present in the Llangrannog succession than has previously been published;
    • an improved understanding of the nature of the ecological perturbation associated with the Hirnantian (Late Ordovician) glacial "ice-house", and the apparent role of an opportunistic soft-bodied fauna in filling ecological niches vacated as a consequence of the associated extinction;
    • considerable thought given to the question of how to value abiotic nature, arguing that the methods of conservation valuation associated with "Geosystem services" and in particular "Natural Capital" hold considerable potential for the Geoconservation community to engage with the public and with policy makers;
    • as a direct result of this research, two formal proposals put forward for new RIGS sites, together with a new geological SSSI.

    Harnessing rare category trinity for complex data

    In the era of big data, we are inundated with the sheer volume of data being collected from various domains. In contrast, it is often the rare occurrences that are crucially important to many high-impact domains with diverse data types. For example, in online transaction platforms, the percentage of fraudulent transactions might be small, but the resultant financial loss could be significant; in social networks, a novel topic is often neglected by the majority of users at the initial stage, but it could burst into an emerging trend afterward; in the Sloan Digital Sky Survey, the vast majority of sky images (e.g., known stars, comets, nebulae, etc.) are of no interest to the astronomers, while only 0.001% of the sky images lead to novel scientific discoveries; in worldwide pandemics (e.g., SARS, MERS, COVID-19), the primary cases might be limited, but the consequences could be catastrophic (e.g., mass mortality and economic recession). Therefore, studying such complex rare categories has profound significance and longstanding impact in many aspects of modern society, from preventing financial fraud to uncovering hot topics and trends, from supporting scientific research to forecasting pandemics and natural disasters. In this thesis, we propose a generic learning mechanism with trinity modules for complex rare category analysis: (M1) Rare Category Characterization - characterizing the rare patterns with a compact representation; (M2) Rare Category Explanation - interpreting the prediction results and providing relevant clues for the end-users; (M3) Rare Category Generation - producing synthetic rare category examples that resemble the real ones. The key philosophy of our mechanism lies in "all for one and one for all" - each module makes unique contributions to the whole mechanism and thus receives support from its companions. In particular, M1 serves as the de novo step to discover rare category patterns on complex data; M2 provides a proper lens for the end-users to examine the outputs and understand the learning process; and M3 synthesizes realistic rare category examples for data augmentation to further improve M1 and M2. To enrich the learning mechanism, we develop principled theorems and solutions to characterize, understand, and synthesize rare categories in complex scenarios, ranging from static to time-evolving rare categories, from attributed data to graph-structured data, from homogeneous to heterogeneous data, and from low-order to high-order connectivity patterns. It is worth mentioning that we have also launched one of the first visual analytic systems for dynamic rare category analysis, which integrates our developed techniques and enables users to investigate complex rare categories in practice.
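
    As a toy stand-in for module M3, the sketch below generates synthetic rare-category examples by interpolating between random pairs of real rare examples, in the spirit of SMOTE-style oversampling. All names are hypothetical, and this is an illustration only, not the generation models developed in the thesis (which also handle graph-structured and heterogeneous data).

```python
# A toy stand-in for module M3: synthesize rare-category examples by
# interpolating between random pairs of real rare examples (SMOTE-style).
# Illustrative only; the thesis develops far richer generation models.
import numpy as np

def synthesize_rare(rare_points, n_new, seed=0):
    """rare_points: (n, d) array of real rare examples, n >= 2.
    Returns an (n_new, d) array of synthetic examples lying on the
    segments connecting randomly chosen pairs of real examples."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i, j = rng.choice(len(rare_points), size=2, replace=False)
        lam = rng.random()  # random position along the segment
        out.append(rare_points[i] + lam * (rare_points[j] - rare_points[i]))
    return np.stack(out)

real = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
print(synthesize_rare(real, n_new=2))  # two points inside the convex hull
```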

    Towards a crowdsourced solution for the authoring bottleneck in interactive narratives

    Interactive Storytelling research has produced a wealth of technologies that can be employed to create personalised narrative experiences, in which the audience takes a participating rather than observing role. But so far this technology has not led to the production of large-scale playable interactive story experiences that realise the ambitions of the field. One main reason for this state of affairs is the difficulty of authoring interactive stories, a task that requires describing a huge number of story building blocks in a machine-friendly fashion. This is not only technically and conceptually more challenging than traditional narrative authoring but also a scalability problem. This thesis examines the authoring bottleneck through a case study and a literature survey and advocates a solution based on crowdsourcing. Prior work has already shown that combining a large number of example stories collected from crowd workers with a system that merges these contributions into a single interactive story can be an effective way to reduce the authorial burden. As a refinement of such an approach, this thesis introduces the novel concept of Crowd Task Adaptation. It argues that in order to maximise the usefulness of the collected stories, a system should dynamically and intelligently analyse the corpus of collected stories and, based on this analysis, modify the tasks handed out to crowd workers. Two authoring systems, ENIGMA and CROSCAT, which demonstrate two radically different approaches to the Crowd Task Adaptation paradigm, have been implemented and are described in this thesis. While ENIGMA adapts tasks through a real-time dialogue between crowd workers and the system, based on what has been learned from previously collected stories, CROSCAT modifies the backstory given to crowd workers in order to optimise the distribution of branching points in the tree structure that combines all collected stories. Two experimental studies of crowdsourced authoring are also presented. They lead to guidelines on how to employ crowdsourced authoring effectively; more importantly, the results of one of the studies demonstrate the effectiveness of the Crowd Task Adaptation approach.

    A geographic knowledge discovery approach to property valuation

    This thesis involves an investigation of how knowledge discovery can be applied in the area of Geographic Information Science. In particular, it explores its application to property valuation, in order to reveal how different spatial entities and their interactions affect the prices of properties. The approach is entirely data-driven and does not require previous knowledge of the area to which it is applied. To demonstrate the process, a prototype system has been designed and implemented. It employs association rule mining and associative classification algorithms to uncover existing inter-relationships and perform the valuation. Various algorithms that perform these tasks have been proposed in the literature. The algorithm developed in this work is based on the Apriori algorithm; it has, however, been extended with an implementation of a ‘Best Rule’ classification scheme based on the Classification Based on Associations (CBA) algorithm. For the modelling of geographic relationships, a graph-theoretic approach has been employed. Graphs have been widely used as modelling tools within the geography domain, primarily for the investigation of network-type systems. In the current context, the graph reflects topological and metric relationships between the spatial entities, depicting general spatial arrangements. An efficient graph search algorithm has been developed, based on the Dijkstra shortest path algorithm, that enables the investigation of relationships between spatial entities beyond first-degree connectivity. A case study with data from three central London boroughs has been performed to validate the methodology and algorithms, and to demonstrate their effectiveness for computer-aided property valuation. In addition, through the case study, the influence of location on the value of properties in those boroughs has been examined. The results are encouraging, as they demonstrate the effectiveness of the proposed methodology and algorithms, provided that the data is appropriately pre-processed and of high quality.
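
    For reference, the sketch below is a minimal Dijkstra shortest-path search over a weighted graph of spatial entities, the textbook algorithm the thesis builds on to explore relationships beyond first-degree connectivity. The adjacency-list format and all names are assumptions of the example, not the thesis's implementation.

```python
# A minimal Dijkstra shortest-path search over a weighted graph of spatial
# entities; the thesis extends this kind of search to investigate
# relationships beyond first-degree connectivity. Names are assumptions.
import heapq

def dijkstra(graph, source):
    """graph: {node: [(neighbour, non-negative edge weight), ...]}.
    Returns the shortest metric distance from source to each reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, already relaxed via a shorter path
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Toy spatial graph: a property, two amenities, and a station
roads = {"property": [("school", 0.4), ("park", 0.9)],
         "school": [("station", 0.6)],
         "park": [("station", 0.2)]}
print(dijkstra(roads, "property"))  # station reached via school, cost 1.0
```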

    Spatial Big Data Analytics: Classification Techniques for Earth Observation Imagery

    University of Minnesota Ph.D. dissertation, August 2016. Major: Computer Science. Advisor: Shashi Shekhar.
    Spatial Big Data (SBD), e.g., earth observation imagery, GPS trajectories, temporally detailed road networks, etc., refers to geo-referenced data whose volume, velocity, and variety exceed the capability of current spatial computing platforms. SBD has the potential to transform our society. Vehicle GPS trajectories together with engine measurement data provide a new way to recommend environmentally friendly routes. Satellite and airborne earth observation imagery plays a crucial role in hurricane tracking, crop yield prediction, and global water management. The potential value of earth observation data is so significant that the White House recently declared that full utilization of this data is one of the nation's highest priorities. However, SBD poses significant challenges to current big data analytics. In addition to its huge dataset size (NASA collects petabytes of earth images every year), SBD exhibits four unique properties related to the nature of spatial data that must be accounted for in any data analysis. First, SBD exhibits spatial autocorrelation effects; in other words, we cannot assume that nearby samples are statistically independent. Current analytics techniques that ignore spatial autocorrelation often perform poorly, e.g., with low prediction accuracy and salt-and-pepper noise (pixels mistakenly predicted as different from their neighbors). Second, spatial interactions are not isotropic and vary across directions. Third, spatial dependency exists at multiple spatial scales. Finally, spatial big data exhibits heterogeneity, i.e., identical feature values may correspond to distinct class labels in different regions, so learned predictive models may perform poorly in many local regions. My thesis investigates novel SBD analytics techniques to address some of these challenges. To date, I have mostly focused on the challenges of spatial autocorrelation and anisotropy, developing novel spatial classification models such as spatial decision trees for raster SBD (e.g., earth observation imagery). To scale up the proposed models, I developed efficient learning algorithms via computational pruning. The proposed techniques have been applied to real-world remote sensing imagery for wetland mapping. I also developed a spatial ensemble learning framework to address the challenge of spatial heterogeneity, particularly the class ambiguity issue in geographical classification, i.e., samples with the same feature values belonging to different classes in different spatial zones. Evaluations on three real-world remote sensing datasets confirmed that the proposed spatial ensemble learning framework outperforms current approaches such as bagging, boosting, and mixture of experts when class ambiguity exists.
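
    To make the notion of spatial autocorrelation concrete, the sketch below computes global Moran's I, the standard statistic for it, on a raster grid with rook (4-neighbour) contiguity. This is the generic textbook formula, not code from the dissertation, and it assumes a non-constant grid.

```python
# Global Moran's I on a raster grid with rook (4-neighbour) contiguity: a
# standard measure of the spatial autocorrelation discussed above. Generic
# textbook formula, not dissertation code; assumes a non-constant grid.
import numpy as np

def morans_i(grid):
    """grid: 2-D array of pixel values. Returns Moran's I, roughly in
    [-1, 1]; values near +1 mean strong positive spatial autocorrelation."""
    z = grid - grid.mean()  # deviations from the global mean
    num = wsum = 0.0
    rows, cols = grid.shape
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):  # visit each adjacent pair once
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    num += 2 * z[r, c] * z[rr, cc]  # symmetric weights w_ij = w_ji = 1
                    wsum += 2
    return (grid.size / wsum) * (num / (z ** 2).sum())

smooth = np.array([[1., 1., 5.], [1., 1., 5.], [1., 1., 5.]])
print(morans_i(smooth))  # positive: neighbouring pixels have similar values
```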

    Deliverable D1.1 State of the art and requirements analysis for hypervideo

    This deliverable presents a state-of-the-art and requirements analysis report for hypervideo, authored as part of WP1 of the LinkedTV project. Initially, we present some use-case scenarios (from the viewer's perspective) in the LinkedTV project and, through the analysis of the distinctive needs and demands of each scenario, we point out the technical requirements from a user-side perspective. Subsequently, we study methods for the automatic and semi-automatic decomposition of audiovisual content in order to effectively support the annotation process. Considering that multimedia content comprises different types of information, i.e., visual, textual, and audio, we report various methods for the analysis of these three streams. Finally, we present various annotation tools that could integrate the developed analysis results so as to effectively support users (video producers) in the semi-automatic linking of hypervideo content, and based on them we report on the initial progress in building the LinkedTV annotation tool. For each class of techniques discussed in the deliverable, we present evaluation results from applying one such method from the literature to a dataset well-suited to the needs of the LinkedTV project, and we indicate the future technical requirements that should be addressed in order to achieve higher levels of performance (e.g., in terms of accuracy and time-efficiency), as necessary.
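
    As one concrete example of temporal decomposition, the sketch below implements a simple histogram-difference shot-boundary detector, a standard baseline in this literature. The bin count, threshold, and frame format are assumptions of the example; the methods surveyed in the deliverable are considerably more sophisticated.

```python
# A baseline histogram-difference shot-boundary detector, one of the standard
# temporal decomposition techniques this kind of deliverable surveys. Bin
# count, threshold, and frame format are assumptions of this illustration.
import numpy as np

def shot_boundaries(frames, threshold=0.4):
    """frames: iterable of grayscale frames (2-D uint8 arrays).
    Returns frame indices where a new shot appears to start."""
    boundaries, prev_hist = [], None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=64, range=(0, 256))
        hist = hist / hist.sum()  # normalise to a probability distribution
        # A large L1 distance between consecutive histograms suggests a cut
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            boundaries.append(i)
        prev_hist = hist
    return boundaries

# Two constant dark frames, then a bright one: a cut is flagged at index 2
frames = [np.full((4, 4), 10, np.uint8)] * 2 + [np.full((4, 4), 200, np.uint8)]
print(shot_boundaries(frames))  # [2]
```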