
    HUPSMT: AN EFFICIENT ALGORITHM FOR MINING HIGH UTILITY-PROBABILITY SEQUENCES IN UNCERTAIN DATABASES WITH MULTIPLE MINIMUM UTILITY THRESHOLDS

    The problem of high utility sequence mining (HUSM) in quantitative sequence databases (QSDBs) is more general than that of frequent sequence mining in sequence databases. An important limitation of HUSM is that a single user-predefined minimum utility threshold is commonly used to decide whether a sequence is high utility. However, this is not convincing in many real-life applications, as sequences may have different importance. Another limitation of HUSM is that data in QSDBs are assumed to be precise, but in the real world, collected data such as sensor readings may be uncertain. Thus, this paper proposes a framework for mining high utility-probability sequences (HUPSs) in uncertain QSDBs (UQSDBs) with multiple minimum utility thresholds using a minimum utility. Two new width and depth pruning strategies are also introduced to eliminate low utility or low probability sequences and their extensions early, and to reduce the sets of candidate items for extensions during the mining process. Based on these strategies, a novel efficient algorithm named HUPSMT is designed for discovering HUPSs. Finally, an experimental study conducted on both real-life and synthetic UQSDBs shows the performance of HUPSMT in terms of time and memory consumption.

    Approximation to expected support of frequent itemsets in mining probabilistic sets of uncertain data

    Knowledge discovery and data mining generally discovers implicit, previously unknown, and useful knowledge from data. As one of the popular knowledge discovery and data mining tasks, frequent itemset mining, in particular, discovers knowledge in the form of sets of frequently co-occurring items, events, or objects. On the one hand, in many real-life applications, users mine frequent patterns from traditional databases of precise data, in which users know with certainty whether items are present in transactions. On the other hand, in many other real-life applications, users mine frequent itemsets from probabilistic sets of uncertain data, in which users are uncertain about the presence of items in transactions. Each item in these probabilistic sets of uncertain data is often associated with an existential probability expressing the likelihood of its presence in that transaction. To mine frequent itemsets from these probabilistic datasets, many existing algorithms capture a large amount of information to compute expected support. To reduce the amount of space required, other algorithms capture some but not all information when computing or approximating expected support. The tradeoff is that the resulting upper bounds to expected support may not be tight. In this paper, we examine several upper bounds and recommend to the user which ones consume less space while providing a good approximation to the expected support of frequent itemsets in mining probabilistic sets of uncertain data.
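
To make the space-versus-tightness tradeoff concrete, here is a minimal sketch in Python. It assumes the common model in which item occurrences are independent, so the expected support of an itemset is the sum, over transactions, of the product of its items' existential probabilities. The function names, the toy database, and the "top-two" cap are illustrative assumptions, not the specific bounds examined in the paper.

```python
from math import prod

def expected_support(itemset, transactions):
    """Exact expected support, assuming independent item occurrences.

    Each transaction is a dict mapping item -> existential probability;
    an item missing from a transaction has probability 0.
    """
    return sum(prod(t.get(item, 0.0) for item in itemset) for t in transactions)

def top_two_cap(transactions):
    """Illustrative upper bound for any itemset with at least two items.

    Only the two highest probabilities per transaction are used, so less
    information needs to be kept, but the bound may be loose.
    """
    total = 0.0
    for t in transactions:
        top = sorted(t.values(), reverse=True)[:2]
        if len(top) == 2:
            total += top[0] * top[1]
    return total

# Hypothetical toy database of three uncertain transactions.
db = [
    {"a": 0.9, "b": 0.8, "c": 0.5},
    {"a": 0.6, "c": 0.4},
    {"b": 0.7},
]

exact = expected_support({"a", "c"}, db)  # 0.9*0.5 + 0.6*0.4 + 0
bound = top_two_cap(db)                   # 0.9*0.8 + 0.6*0.4 + 0
```

The cap holds because every probability is at most 1, so the product over all items of an itemset cannot exceed the product of the two largest probabilities in the transaction. In this toy database the bound is not tight for the itemset {a, c}, which is the kind of looseness the abstract refers to.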

    BIG DATA MINING FOR INTERESTING PATTERNS WITH MAP REDUCE TECHNIQUE

    Many data mining algorithms search for interesting patterns in transactional databases of precise data. Frequent pattern mining is a technique for finding frequently occurring items. Most techniques find all interesting patterns in a collection of precise data, where the items occurring in each transaction are known with certainty to the system. In many real-time applications, however, users are interested in only a tiny portion of the large set of frequent patterns. The proposed user-constrained mining approach helps find the frequent patterns in which the user is interested. It does so efficiently by applying user constraints to collections of uncertain data: users specify their interests in the form of constraints, and the MapReduce model is used to find the uncertain frequent patterns that satisfy those user-specified constraints.

    Edge-based mining of frequent subgraphs from graph streams

    In the current era of Big data, high volumes of valuable data can be generated at a high velocity from a high variety of data sources in various real-life applications ranging from sensor networks to social networks, and from bioinformatics to chemical informatics. In addition, Big data are also available in business, education, engineering, finance, healthcare, scientific, telecommunication, and transportation domains. A collection of these data can be viewed as a big dynamic graph structure. Embedded in it is implicit, previously unknown, and potentially useful knowledge. Consequently, efficient knowledge discovery algorithms for mining frequent subgraphs from these dynamic streaming graph-structured data are in demand. On the one hand, some existing algorithms discover collections of frequently co-occurring edges, which may be disjoint. On the other hand, some other existing algorithms discover frequent subgraphs but require very large memory space. With high volumes of Big data, available memory space may be limited. To discover collections of frequently co-occurring connected edges, we present in this paper two efficient algorithms that require little memory space. Evaluation results show the efficiency of our edge-based algorithms in mining frequent subgraphs from graph streams.

    Similarity processing in multi-observation data

    Many real-world application domains such as sensor-monitoring systems for environmental research or medical diagnostic systems are dealing with data that is represented by multiple observations. In contrast to single-observation data, where each object is assigned to exactly one occurrence, multi-observation data is based on several occurrences that are subject to two key properties: temporal variability and uncertainty. When defining similarity between data objects, these properties play a significant role. In general, methods designed for single-observation data hardly apply for multi-observation data, as they are either not supported by the data models or do not provide sufficiently efficient or effective solutions. Prominent directions incorporating the key properties are the fields of time series, where data is created by temporally successive observations, and uncertain data, where observations are mutually exclusive. This thesis provides research contributions for similarity processing - similarity search and data mining - on time series and uncertain data. The first part of this thesis focuses on similarity processing in time series databases. A variety of similarity measures have recently been proposed that support similarity processing w.r.t. various aspects. In particular, this part deals with time series that consist of periodic occurrences of patterns. Examining an application scenario from the medical domain, a solution for activity recognition is presented. Finally, the extraction of feature vectors allows the application of spatial index structures, which support the acceleration of search and mining tasks resulting in a significant efficiency gain. As feature vectors are potentially of high dimensionality, this part introduces indexing approaches for the high-dimensional space for the full-dimensional case as well as for arbitrary subspaces. The second part of this thesis focuses on similarity processing in probabilistic databases. 
The presence of uncertainty is inherent in many applications dealing with data collected by sensing devices. Often, the collected information is noisy or incomplete due to measurement or transmission errors. Furthermore, data may be rendered uncertain due to privacy-preserving issues with the presence of confidential information. This creates a number of challenges in terms of effectively and efficiently querying and mining uncertain data. Existing work in this field either neglects the presence of dependencies or provides only approximate results while applying methods designed for certain data. Other approaches dealing with uncertain data are not able to provide efficient solutions. This part presents query processing approaches that outperform existing solutions of probabilistic similarity ranking. This part finally leads to the application of the introduced techniques to data mining tasks, such as the prominent problem of probabilistic frequent itemset mining. Many application areas, such as environmental research or medical diagnostics, use sensor monitoring systems. Such systems often need to be able to handle data that is represented by multiple observations. In contrast to data with only one observation (single-observation data), data from multiple observations (multi-observation data) is based on a multitude of observations that are subject to two key properties: temporal variability and data uncertainty. These properties play an important role in similarity search and data mining. Common solutions in these areas that were developed for single-observation data are generally not applicable to handling multiple observations per object. The reason is that these approaches are either incompatible with the data models or offer no solutions that meet current demands on solution quality or efficiency. 
Well-known research directions dealing with multi-observation data and its key properties are time series analysis and similarity search in probabilistic databases. While the former direction presupposes a temporal ordering of an object's observations, uncertain data objects are based on observations that are mutually dependent or mutually exclusive. This dissertation comprises current research contributions from the two areas mentioned, presenting methods for similarity search and for application in data mining. The first part of this work deals with similarity search and data mining in time series databases. In particular, time series consisting of periodically occurring patterns are considered. In the context of a medical application scenario, an approach to activity recognition is presented. By means of feature extraction, this approach allows efficient storage and analysis with the help of spatial index structures. For the case of high-dimensional feature vectors, this part presents two indexing methods for accelerating similarity queries. The first method takes all attributes of the feature vectors into account, while the second method allows the query to be projected onto a user-defined subspace of the vector space. The second part of this work addresses similarity search in the context of probabilistic databases. Data from sensor measurements often have properties that are subject to a certain degree of uncertainty. Due to measurement or transmission errors, measured values are often incomplete or noisy. In various scenarios, for example with personal or medically confidential data, noise can also be added to the data manually after the fact, so that an exact reconstruction of the original information is not possible. 
These circumstances pose several challenges for query techniques and data mining methods. Existing research in the field of uncertain databases often disregards various problems. Either the presence of dependencies is ignored, or only approximate solutions are offered that permit the application of methods for certain data. Other approaches compute exact solutions but do not return answers within an acceptable runtime. This part of the work presents efficient methods for answering similarity queries that return the results in descending order of relevance, that is, as a ranking of the results. Finally, the applied techniques are transferred to problems in probabilistic data mining, for example to solve the frequent itemset mining problem while taking the full uncertainty information into account.

    Spatiotemporal Big Data Analytics for Future Mobility

    University of Minnesota Ph.D. dissertation. May 2019. Major: Computer Science. Advisor: Shashi Shekhar. 1 computer file (PDF); xii, 161 pages. Recent years have witnessed the explosion of spatiotemporal big data (e.g., GPS trajectories, vehicle engine measurements, remote sensing imagery, and geotagged tweets), which have the potential to transform our societies. Terabytes of earth observation data are collected every day from thousands of places across the world. Modern vehicles are increasingly equipped with rich sensors that measure hundreds of engine variables (e.g., emissions, fuel consumption, speed, etc.) annotated with timestamps and location data for every second of the vehicle’s trip. According to reports by McKinsey and Cisco, leveraging such data is potentially worth hundreds of billions of dollars annually in fuel savings. Spatiotemporal big data are also enabling many modern technologies such as on-demand transportation (e.g., Uber, Lyft). Today, the on-demand economy attracts millions of consumers annually and over $50 billion in spending. Even more growth is expected with the emergence of self-driving cars. However, spatiotemporal big data are of a volume, velocity, variety, and veracity that exceed the capability of common spatiotemporal data analytic techniques. My thesis investigates spatiotemporal big data analytics that address the volume and velocity challenges of spatiotemporal big data in the context of novel applications in transportation and engine science, future mobility, and the on-demand economy. The thesis proposes scalable algorithms for mining “Non-compliant Window Co-occurrence Patterns”, which allow the discovery of correlations in spatiotemporal big data with a large number of variables. Novel upper bounds were introduced for a statistical interest measure of association to efficiently prune uninteresting candidate patterns. 
Case studies with real-world engine data demonstrated the ability of the proposed approaches to discover patterns that are of interest to engine scientists. To address the high velocity challenge, the thesis explored online optimization heuristics for matching supply and demand in an on-demand spatial service broker. The proposed algorithms maximize the matching size while also maintaining a balanced provider utilization, both to ensure robustness against variations in the supply-demand ratio and to keep providers from dropping out. The proposed algorithms were shown to outperform related work on multiple performance measures. In addition, the thesis proposed a scalable matching and scheduling algorithm for an on-demand pickup and delivery broker for moving consumers with multiple candidate delivery locations and time intervals. Extensive evaluation showed that the proposed approach yields significant computational savings without sacrificing solution quality.

    Predictive Maneuver Planning and Control of an Autonomous Vehicle in Multi-Vehicle Traffic with Observation Uncertainty

    Autonomous vehicle technology is a promising development for improving the safety, efficiency and environmental impact of on-road transportation systems. However, the task of guiding an autonomous vehicle by rapidly and systematically accommodating the plethora of changing constraints, e.g. of avoiding multiple stationary and moving obstacles, obeying traffic rules, signals and so on as well as the uncertain state observation due to sensor imperfections, remains a major challenge. This dissertation attempts to address this challenge via designing a robust and efficient predictive motion planning framework that can generate the appropriate vehicle maneuvers (selecting and tracking specific lanes, and related speed references) as well as the constituent motion trajectories while considering the differential vehicle kinematics of the controlled vehicle and other constraints of operating in public traffic. The main framework combines a finite state machine (FSM)-based maneuver decision module with a model predictive control (MPC)-based trajectory planner. Based on the prediction of the traffic environment, reference speeds are assigned to each lane in accordance with the detection of objects during measurement update. The lane selection decisions themselves are then incorporated within the MPC optimization. The on-line maneuver/motion planning effort for autonomous vehicles in public traffic is a non-convex problem due to the multiple collision avoidance constraints with overlapping areas, lane boundaries, and nonlinear vehicle-road dynamics constraints. This dissertation proposes and derives some remedies for these challenges within the planning framework to improve the feasibility and optimality of the solution. Specifically, it introduces vehicle grouping notions and derives conservative and smooth algebraic models to describe the overlapped space of several individual infeasible spaces and help prevent the optimization from falling into undesired local minima. 
Furthermore, in certain situations, a forced objective selection strategy is needed and adopted to help the optimization jump out of local minima. In addition, the dissertation considers stochastic uncertainties prevalent in dynamic and complex traffic and incorporates them within the predictive planning and control framework. To this end, Bayesian filters are implemented to estimate the uncertainties in object motions and then propagate them into the prediction horizon. Then, a pair-wise probabilistic collision condition is defined for objects with non-negligible geometrical shapes and sizes, and computationally efficient and conservative forms are derived to analytically approximate the involved multi-variate integrals. The probabilistic collision evaluation is then applied within a vehicle grouping algorithm to cluster the object vehicles that are close in position and speed, and eventually within the stochastic predictive maneuver planner framework to tighten the chance constraints given a deterministic confidence margin. It is argued that these steps make the planning problem tractable for real-time implementation on autonomously controlled vehicles.
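
The idea of tightening a chance constraint into a deterministic margin can be illustrated with a deliberately simplified sketch in Python. It assumes a one-dimensional gap to another vehicle with Gaussian uncertainty, which is an assumption made here for illustration and not the dissertation's multi-variate formulation. Under that assumption, requiring the probability of the gap falling below a safe distance to be at most `epsilon` is equivalent to requiring the mean gap to exceed the safe distance by a fixed multiple of the standard deviation. The names `required_mean_gap`, `collision_probability`, `d_safe`, `sigma`, and `epsilon` are hypothetical.

```python
from statistics import NormalDist

def required_mean_gap(d_safe, sigma, epsilon):
    """Deterministic equivalent of the chance constraint P(gap < d_safe) <= epsilon.

    Assumes gap ~ Normal(mean_gap, sigma^2). The constraint then holds
    exactly when mean_gap >= d_safe + z * sigma, with z the (1 - epsilon)
    quantile of the standard normal distribution.
    """
    z = NormalDist().inv_cdf(1.0 - epsilon)
    return d_safe + z * sigma

def collision_probability(mean_gap, d_safe, sigma):
    """P(gap < d_safe) under the same 1-D Gaussian assumption."""
    return NormalDist(mu=mean_gap, sigma=sigma).cdf(d_safe)

# Hypothetical numbers: 2 m safe distance, 0.5 m standard deviation, 5% risk.
margin_gap = required_mean_gap(d_safe=2.0, sigma=0.5, epsilon=0.05)
risk = collision_probability(margin_gap, d_safe=2.0, sigma=0.5)
```

A planner can then enforce the plain inequality on the predicted mean gap at each step of the horizon instead of evaluating a probability integral, which is what makes this kind of reformulation attractive for real-time use.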

    A Look Upstream: Market Restructuring, Risk, Procurement Contracts and Efficiency

    We study how market deregulation affects the upstream industry both theoretically and empirically. Our theory predicts that firms respond to increases in uncertainty due to deregulation by writing more rigid contracts with their suppliers. Using the restructuring of the U.S. electricity market as our case study, we find support for our theoretical predictions. Our findings imply a greater emphasis on efficiency at coal mines contracting with restructured plants. The evidence suggests a 17% improvement in productivity at these mines, relative to those contracting with regulated plants. We find, on the other hand, that transaction costs may have increased. We conclude that deregulation has significant impacts upstream from deregulated markets.