6,534 research outputs found

    Data mining as a tool for environmental scientists

    Over recent years a huge library of data mining algorithms has been developed to tackle a variety of problems in fields such as medical imaging and network traffic analysis. Many of these techniques are far more flexible than classical modelling approaches and could be usefully applied to data-rich environmental problems. Certain techniques, such as Artificial Neural Networks, Clustering, Case-Based Reasoning and, more recently, Bayesian Decision Networks, have found application in environmental modelling, while other methods, for example classification and association rule extraction, have not yet been taken up on any wide scale. We propose that these and other data mining techniques could be usefully applied to difficult problems in the field. This paper introduces several data mining concepts and briefly discusses their application to environmental modelling, where data may be sparse, incomplete, or heterogeneous.
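
    As a concrete illustration of one of the less widely adopted techniques named above, the minimal Python sketch below mines simple association rules from a handful of invented, discretised environmental observations. The item names, records and thresholds are assumptions for illustration only and are not taken from the paper.

        # Minimal association-rule sketch on invented, discretised environmental
        # observations; item names, data and thresholds are illustrative only.
        from itertools import combinations

        # Each "transaction" is one monitoring record, discretised into categorical items.
        records = [
            {"rain:high", "river:flood", "temp:low"},
            {"rain:high", "river:flood", "temp:mid"},
            {"rain:low", "river:normal", "temp:high"},
            {"rain:high", "river:normal", "temp:mid"},
            {"rain:low", "river:normal", "temp:low"},
        ]

        def support(itemset):
            """Fraction of records containing every item in the itemset."""
            return sum(itemset <= r for r in records) / len(records)

        # Enumerate rules {antecedent} -> {consequent} over frequent item pairs.
        min_support, min_confidence = 0.4, 0.8
        items = sorted(set().union(*records))
        for a, b in combinations(items, 2):
            pair = {a, b}
            if support(pair) < min_support:
                continue
            for lhs, rhs in ((a, b), (b, a)):
                conf = support(pair) / support({lhs})
                if conf >= min_confidence:
                    print(f"{{{lhs}}} -> {{{rhs}}}  support={support(pair):.2f}  confidence={conf:.2f}")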

    Behavioural pattern identification and prediction in intelligent environments

    This paper addresses the application of soft computing techniques to the prediction of an occupant's behaviour in an inhabited intelligent environment. The daily activities of elderly people with dementia who live in their own homes are studied. Occupancy sensors are used to extract the movement patterns of the occupant, and the occupancy data is then converted into temporal sequences of activities, which are in turn used to predict the occupant's behaviour. To build the prediction model, different dynamic recurrent neural networks are investigated, as recurrent neural networks have shown a strong ability to capture the temporal relationships of input patterns. The experimental results show that the non-linear autoregressive network with exogenous inputs (NARX) model correctly captures the occupant's long-term behaviour patterns and outperforms the Elman network. The results presented here are validated using data generated from a simulator as well as from real environments.
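
    The Python sketch below illustrates the general NARX idea discussed above: the next activity is predicted from a window of previous activities (output feedback) plus an exogenous time-of-day input. It uses a simple scikit-learn feed-forward classifier over lagged inputs rather than the dynamic recurrent networks evaluated in the paper, and the synthetic routine, window length and model settings are assumptions.

        # NARX-style sketch: predict the next activity from lagged activities
        # (output feedback) plus an exogenous time-of-day input. Synthetic data;
        # the paper's recurrent networks (NARX, Elman) and dataset differ.
        import numpy as np
        from sklearn.neural_network import MLPClassifier

        rng = np.random.default_rng(0)
        # Synthetic hourly routine over a day (0=sleep, 1=kitchen, 2=lounge, 3=bathroom),
        # repeated for 30 days with a little random noise.
        routine = [0]*6 + [1]*2 + [2]*8 + [1]*2 + [2]*4 + [3]*1 + [0]*1
        sequence = [a if rng.random() > 0.05 else int(rng.integers(4)) for a in routine * 30]

        lag = 4
        X, y = [], []
        for t in range(lag, len(sequence)):
            past = sequence[t - lag:t]      # autoregressive part: previous activities
            hour = (t % 24) / 24.0          # exogenous input: hour of day
            X.append(past + [hour])
            y.append(sequence[t])

        split = int(0.8 * len(X))
        model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
        model.fit(X[:split], y[:split])
        print("held-out accuracy:", model.score(X[split:], y[split:]))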

    Doctor of Philosophy

    The explosion of structured Web data (e.g., online databases, Wikipedia infoboxes) creates many opportunities for integrating and querying these data that go far beyond the simple search capabilities provided by search engines. Although much work has been devoted to data integration in the database community, the Web brings new challenges: Web scale (the large and growing volume of data) and the heterogeneity of Web data. Because there is so much data, scalable techniques are needed that require little or no manual intervention and that are robust to noisy data. In this dissertation, we propose a new and effective approach for matching Web-form interfaces and for matching multilingual Wikipedia infoboxes. As a further step, we propose a general prudent schema-matching framework that matches a large number of schemas effectively. Our comprehensive experiments on Web-form interfaces and Wikipedia infoboxes show that it can enable on-the-fly, automatic integration of large collections of structured Web data. Another problem we address in this dissertation is schema discovery. While existing integration approaches assume that the relevant data sources and their schemas have been identified in advance, schemas are not always available for structured Web data. Approaches exist that exploit information in Wikipedia to discover entity types and their associated schemas. However, due to inconsistencies, sparseness, and noise from community contributions, these approaches are error prone and require substantial human intervention. Given the schema heterogeneity in Wikipedia infoboxes, we developed a new approach that uses the structured information available in infoboxes to cluster similar infoboxes and infer the schemata for entity types. Our approach is unsupervised and resilient to the unpredictable skew in the entity class distribution. Our experiments, using over one hundred thousand infoboxes extracted from Wikipedia, indicate that our approach is effective and produces accurate schemata for Wikipedia entities.
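
    A rough Python sketch of the infobox-clustering idea described above: each infobox is reduced to its set of attribute names, similar infoboxes are grouped by attribute overlap, and a schema is inferred from each group's frequent attributes. The sample infoboxes, Jaccard threshold and greedy grouping are illustrative assumptions; the dissertation's actual approach differs in its details.

        # Rough sketch: cluster infoboxes by attribute-name overlap and infer a
        # schema per cluster. Sample infoboxes, threshold and greedy grouping are
        # illustrative assumptions, not the dissertation's actual algorithm.
        from collections import Counter

        infoboxes = [
            {"name", "birth_date", "birth_place", "occupation"},   # person-like
            {"name", "birth_date", "death_date", "occupation"},    # person-like
            {"name", "population", "area_km2", "country"},         # settlement-like
            {"name", "population", "country", "mayor"},            # settlement-like
        ]

        def jaccard(a, b):
            return len(a & b) / len(a | b)

        # Greedy single-pass clustering: join the most similar cluster if close enough.
        clusters = []  # each cluster is a list of infobox attribute sets
        for box in infoboxes:
            best = max(clusters, key=lambda c: jaccard(box, set.union(*c)), default=None)
            if best is not None and jaccard(box, set.union(*best)) >= 0.4:
                best.append(box)
            else:
                clusters.append([box])

        # Infer a schema per cluster: attributes present in at least half of its members.
        for i, cluster in enumerate(clusters):
            counts = Counter(attr for box in cluster for attr in box)
            schema = sorted(a for a, n in counts.items() if n >= len(cluster) / 2)
            print(f"cluster {i}: inferred schema = {schema}")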

    Similarity processing in multi-observation data

    Many real-world application domains, such as sensor-monitoring systems for environmental research or medical diagnostic systems, deal with data that is represented by multiple observations. In contrast to single-observation data, where each object is assigned exactly one occurrence, multi-observation data is based on several occurrences that are subject to two key properties: temporal variability and uncertainty. These properties play a significant role when defining similarity between data objects. In general, methods designed for single-observation data hardly apply to multi-observation data, as they are either not supported by the data models or do not provide sufficiently efficient or effective solutions. Prominent directions that incorporate these key properties are the fields of time series, where data is created by temporally successive observations, and uncertain data, where observations are mutually exclusive. This thesis provides research contributions for similarity processing, that is, similarity search and data mining, on time series and uncertain data.

    The first part of this thesis focuses on similarity processing in time series databases. A variety of similarity measures have recently been proposed that support similarity processing with respect to various aspects. In particular, this part deals with time series that consist of periodic occurrences of patterns. Examining an application scenario from the medical domain, a solution for activity recognition is presented. The extraction of feature vectors then allows the application of spatial index structures, which accelerate search and mining tasks and yield a significant efficiency gain. As feature vectors are potentially of high dimensionality, this part introduces indexing approaches for the high-dimensional space, both for the full-dimensional case and for arbitrary subspaces.

    The second part of this thesis focuses on similarity processing in probabilistic databases. The presence of uncertainty is inherent in many applications dealing with data collected by sensing devices; often the collected information is noisy or incomplete due to measurement or transmission errors. Furthermore, data may be rendered uncertain for privacy-preserving reasons when confidential information is present. This creates a number of challenges in effectively and efficiently querying and mining uncertain data. Existing work in this field either neglects the presence of dependencies or provides only approximate results obtained by applying methods designed for certain data, while other approaches are unable to provide efficient solutions. This part presents query processing approaches that outperform existing solutions for probabilistic similarity ranking, and it concludes by applying the introduced techniques to data mining tasks, such as the prominent problem of probabilistic frequent itemset mining.
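
    To make the probabilistic frequent itemset setting more concrete, the Python sketch below computes the full support distribution of an itemset under a simple independent-transaction model and checks whether the itemset is frequent with sufficiently high probability. The containment probabilities and thresholds are invented, and the thesis's actual algorithms are considerably more involved.

        # Sketch of probabilistic frequent itemset mining under an independent-
        # transaction model: each uncertain transaction contains the itemset with
        # some probability, and the support follows a Poisson binomial distribution
        # computed by dynamic programming. Probabilities and thresholds are invented.

        def support_distribution(probs):
            """Return P(support = k) for k = 0..n given per-transaction probabilities."""
            dist = [1.0]                           # distribution over support counts so far
            for p in probs:
                new = [0.0] * (len(dist) + 1)
                for k, mass in enumerate(dist):
                    new[k] += mass * (1 - p)       # transaction does not contain the itemset
                    new[k + 1] += mass * p         # transaction does contain it
                dist = new
            return dist

        # Probability that each uncertain transaction contains the itemset {A, B}.
        containment = [0.9, 0.6, 0.8, 0.3, 0.7]
        dist = support_distribution(containment)

        minsup, tau = 3, 0.5
        p_frequent = sum(dist[minsup:])            # P(support >= minsup)
        verdict = "probabilistically frequent" if p_frequent >= tau else "not frequent"
        print(f"P(support >= {minsup}) = {p_frequent:.3f} -> {verdict}")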

    Enhancement of student performance prediction using modified K-nearest neighbor

    The traditional K-nearest neighbor (KNN) algorithm uses an exhaustive search over the complete training set to predict a single test sample, a procedure that becomes very slow for huge datasets. In addition, the class of a new sample is selected by simple majority voting, which does not reflect the varying significance of different samples (i.e. it ignores the similarities among samples) and can lead to misclassification when two classes tie for the majority. To address these issues, this work combines a moment descriptor with KNN to optimize sample selection, based on the observation that grouping the training samples before the search takes place can speed up the nearest-neighbor search and improve its predictive performance. The proposed method is called fast KNN (FKNN). The experimental results show that FKNN reduces the running time of the original KNN by 75.4% to 90.25% and improves classification accuracy by 20% to 36.3% on three student datasets used to automatically predict whether a student will pass or fail an exam.
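
    The Python sketch below illustrates the general idea behind such a fast KNN: training samples are pre-grouped by a cheap moment descriptor, and the distance search for a query is restricted to its group, with a fallback to a full search for sparse groups. The synthetic data, the specific first-order moment and the quantile binning are assumptions for illustration, not the paper's exact method.

        # Sketch of the fast-KNN idea: pre-group training samples by a cheap moment
        # descriptor, then restrict the distance search to the query's group. The
        # synthetic data, first-order moment and quantile bins are assumptions.
        import numpy as np

        rng = np.random.default_rng(1)
        X_train = rng.random((1000, 6))                      # invented student features
        y_train = (X_train.mean(axis=1) > 0.5).astype(int)   # invented pass/fail labels

        def descriptor(x):
            """A simple first-order moment (mean) of the feature vector."""
            return x.mean(axis=-1)

        # Pre-classify training samples into quantile bins of the descriptor.
        n_bins = 10
        edges = np.quantile(descriptor(X_train), np.linspace(0, 1, n_bins + 1))
        train_bins = np.clip(np.searchsorted(edges, descriptor(X_train)) - 1, 0, n_bins - 1)

        def fknn_predict(x, k=5):
            b = np.clip(np.searchsorted(edges, descriptor(x)) - 1, 0, n_bins - 1)
            candidates = np.where(train_bins == b)[0]        # search only this bin
            if len(candidates) < k:                          # fall back to a full search
                candidates = np.arange(len(X_train))
            d = np.linalg.norm(X_train[candidates] - x, axis=1)
            nearest = candidates[np.argsort(d)[:k]]
            votes = np.bincount(y_train[nearest], minlength=2)
            return int(votes.argmax())

        x_query = rng.random(6)
        print("predicted class (1 = pass):", fknn_predict(x_query))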
    • 

    corecore