22,771 research outputs found

    Geographic Data Mining and Knowledge Discovery

    Geographic data are information associated with a location on the surface of the Earth. They comprise spatial attributes (latitude, longitude, and altitude) and non-spatial attributes (facts related to a location). Traditionally, Physical Geography datasets were considered more valuable and therefore attracted most research interest, but with advances in remote sensing technologies and the widespread use of GPS-enabled cellphones and IoT (Internet of Things) devices, recent years have witnessed explosive growth in the amount of available Human Geography data. However, methods and tools capable of analyzing and modeling these datasets are very limited, because Human Geography data are inherently difficult to model due to their characteristics (non-stationarity, uneven distribution, etc.). Many algorithms have been invented in the past few years to address these challenges -- especially non-stationarity -- such as Geographically Weighted Regression (GWR), Multiscale GWR, and Geographical Random Forest. They have proven to be much more effective than general machine learning algorithms that are not specifically designed to deal with non-stationarity, yet they are far from perfect and leave considerable room for improvement. This dissertation proposes multiple algorithms for modeling non-stationary geographic data. The main contributions are: (1) a novel method to evaluate non-stationarity and its impact on regression models; (2) the Geographic R-Partition tree for modeling non-stationary data; (3) the IDW-RF algorithm, which uses the strengths of Random Forests to handle extremely unevenly distributed geographic datasets; (4) the LVRF algorithm, which models geographic data using a latent-variable-based method. Experiments show that these algorithms are very efficient and outperform other state-of-the-art algorithms in certain scenarios.
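
    Geographically Weighted Regression, cited above as a baseline for non-stationary modeling, fits a separate weighted least-squares model at each location, with weights that decay with distance from the regression point. The sketch below is a minimal illustrative implementation assuming a Gaussian kernel and a user-chosen bandwidth; it is not the dissertation's code, and the function and variable names are hypothetical.

        import numpy as np

        def gwr_coefficients(coords, X, y, target_coord, bandwidth):
            """Local coefficients of a geographically weighted regression.

            coords: (n, 2) array of observation locations (e.g., projected x/y)
            X:      (n, p) design matrix (include a column of ones for the intercept)
            y:      (n,) response vector
            bandwidth: Gaussian kernel bandwidth in the same units as coords
            """
            # Distance from every observation to the regression point
            d = np.linalg.norm(coords - np.asarray(target_coord), axis=1)
            # Gaussian kernel: nearby observations receive weights near 1
            w = np.exp(-0.5 * (d / bandwidth) ** 2)
            # Weighted least squares: beta = (X' W X)^{-1} X' W y
            XtW = X.T * w                      # broadcasts w across columns of X.T
            return np.linalg.solve(XtW @ X, XtW @ y)

        # Usage (hypothetical data): one coefficient vector per location shows how
        # the relationship between X and y drifts across space (non-stationarity).
        # betas = np.array([gwr_coefficients(coords, X, y, c, 5.0) for c in coords])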

    Systems Development, Data Mining, and Knowledge Discovery

    The primary role of the Technical Integration Office is to provide technical solutions and services to different branches at KSC (Kennedy Space Center) and to NASA program customers. The Technical Integration Office helps support KSC's operational needs by providing services such as digital connectivity, data center services, modeling and simulation tools, and communication video services. To learn the necessary technology and processes for my internship, I am working on two projects: learning C# (the C Sharp programming language) with SQL and developing requirements for a PX (Communication and Public Engagement) inventory management system. To learn how to program efficiently with C#, my mentor assigned me to complete a sports informatics application that would let users discover facts and rules about various sports. The sports informatics application comes with search capabilities, report-generation features, rule lists that users can modify, and diagrams for various sport strategies. To further build upon this project, I also developed a sports simulation game within the application. Once I begin more SQL-based projects, I will have the opportunity to learn how to manage databases and link SQL servers with C# programs. To develop requirements for the inventory management system, I have met with PX representatives and toured their storage facilities to see how they organize and store their items and equipment. I will also be meeting with representatives from the budget office to find out what information must be in a system budget report. The main components the system must have are customer request management, a search feature for items and equipment, report generation capabilities, and automated system warnings when item quantities reach or fall below administrator-specified threshold levels. I have drafted questions and "shall" statements that will ultimately become part of the inventory management system requirements document.

    Relational methodology for data mining and knowledge discovery

    Knowledge discovery and data mining methods have been successful in many domains. However, their ability to build or discover a domain theory remains unclear. This is largely because many fundamental KDD&DM methodological questions are still unexplored, such as (1) the nature of the information contained in input data relative to the domain theory, and (2) the nature of the knowledge that these methods discover. The goal of this paper is to clarify the methodological questions of KDD&DM methods. This is done by using the concept of Relational Data Mining (RDM), representative measurement theory, an ontology of a subject domain, a many-sorted empirical system (an algebraic structure in first-order logic), and an ontology of a KDD&DM method. The paper concludes with a review of our RDM approach and the 'Discovery' system built on this methodology, which can analyze any hypotheses represented in first-order logic and use any input by representing it in a many-sorted empirical system.
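
    To make the notions of a many-sorted empirical system and a first-order hypothesis concrete, the sketch below represents such a system as typed domains plus named relations and checks a universally quantified hypothesis by enumeration. The sorts, relation names, and data are hypothetical toy examples, not taken from the paper or from the 'Discovery' system.

        from itertools import product

        # A toy many-sorted empirical system: typed domains plus named relations.
        # Sorts, relations, and tuples are invented solely for illustration.
        domains = {
            "patient": ["p1", "p2", "p3"],
            "test":    ["t1", "t2"],
        }
        relations = {
            "took":     {("p1", "t1"), ("p2", "t1"), ("p3", "t2")},
            "positive": {("p1", "t1"), ("p3", "t2")},
        }

        def holds(hypothesis, sorts):
            """Check a universally quantified first-order hypothesis by enumeration."""
            return all(hypothesis(*values)
                       for values in product(*(domains[s] for s in sorts)))

        # Hypothesis: every positive result comes from a test the patient actually took.
        print(holds(lambda p, t: (p, t) not in relations["positive"]
                                 or (p, t) in relations["took"],
                    ["patient", "test"]))   # -> True for this toy system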

    Constraining Earth’s plate tectonic evolution through data mining and knowledge discovery

    Global reconstructions are reasonably well understood back to ~200 Ma. However, two first-order uncertainties remain unresolved in their development: firstly, the critical dependency on a self-consistent global reference frame; and secondly, the fundamental difficulty of objectively predicting the location and type of tectonic paleo-boundaries. In this thesis I present three new studies directly addressing these fundamental geoscientific questions. Through the joint evaluation of global seafloor hotspot track observations (for times younger than 80 Ma), first-order geodynamic estimates of global net lithospheric rotation (NLR), and parameter estimation for paleo-trench migration (TM) behaviours, the first chapter presents a suite of new geodynamically consistent, data-optimised global absolute reference frames spanning from 220 Ma to the present day. In the second chapter, using a paleomagnetic pole compilation updated to include age uncertainties, I identify the optimal apparent polar wander path (APWP) pole configuration for 16 major cratonic blocks, minimising both plate velocities and the velocity gradients characteristic of eccentric changes in predicted plate motions, producing a new global reference frame for the Phanerozoic consistent with physical geodynamic principles. In the final chapter of my thesis I identify paleo-tectonic environments on Earth through a machine learning approach using global geochemical data, deriving a set of first-order discriminatory tectonic environment models for mid-ocean ridge (MOR), subduction (ARC), and oceanic hotspot (OIB) environments. Key discriminatory geochemical attributes unique to each first-order tectonic environment were identified, enabling a data-rich identification of samples of unknown affinity. Applying these models to Neoproterozoic data, I identified 56 first-order tectonic paleo-boundaries associated with Rodinia supercontinent amalgamation and dispersal and evaluated them against published Neoproterozoic reconstructions.
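
    The machine-learning step in the final chapter amounts to a supervised classification of geochemical analyses into MOR, ARC, and OIB classes, followed by applying the trained models to samples of unknown affinity. The sketch below is a minimal illustrative pipeline assuming a table of geochemical attributes with known tectonic labels; the choice of a random forest classifier and the helper names are assumptions, not the thesis's actual implementation.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import classification_report

        def train_environment_model(features: np.ndarray, labels: np.ndarray):
            """Hypothetical: rows are rock samples, columns are geochemical
            attributes (e.g., major oxides, trace-element ratios); labels are
            the known tectonic environments "MOR", "ARC", or "OIB"."""
            X_train, X_test, y_train, y_test = train_test_split(
                features, labels, test_size=0.2, stratify=labels, random_state=0)
            model = RandomForestClassifier(n_estimators=500, random_state=0)
            model.fit(X_train, y_train)
            # Held-out performance per class
            print(classification_report(y_test, model.predict(X_test)))
            # Feature importances point to the most discriminatory attributes,
            # mirroring the "key discriminatory geochemical attributes" idea.
            return model

        # Samples of unknown affinity (e.g., Neoproterozoic analyses) can then be
        # assigned class probabilities: model.predict_proba(unknown_features)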

    Secondary Analysis of Electronic Health Records

    Health Informatics; Ethics; Data Mining and Knowledge Discovery; Statistics for Life Sciences, Medicine, Health Science

    Improve Data Mining and Knowledge Discovery Through the Use of MatLab

    Data mining is widely used to mine business, engineering, and scientific data. Data mining uses pattern-based queries, searches, or other analyses of one or more electronic databases/datasets in order to discover or locate a predictive pattern or anomaly indicative of system failure, criminal or terrorist activity, etc. There are various algorithms, techniques, and methods used to mine data, including neural networks, genetic algorithms, decision trees, the nearest neighbor method, rule induction, association analysis, slice and dice, segmentation, and clustering. These algorithms, techniques, and methods for detecting patterns in a dataset have been used in the development of numerous open source and commercially available products and technologies for data mining. Data mining is best realized when latent information in a large quantity of stored data is discovered. No one technique solves all data mining problems; the challenge is to select algorithms or methods appropriate to strengthen data/text mining and trending within given datasets. In recent years, throughout industry, academia, and government agencies, thousands of data systems have been designed and tailored to serve specific engineering and business needs. Many of these systems use databases with relational algebra and structured query language to categorize and retrieve data. In these systems, data analyses are limited and require prior explicit knowledge of metadata and database relations, lacking exploratory data mining and discovery of latent information. This presentation introduces MatLab(R) (MATrix LABoratory), an engineering and scientific data analysis tool, to perform data mining. MatLab was originally intended to perform purely numerical calculations (a glorified calculator). Now, in addition to having hundreds of mathematical functions, it is a programming language with hundreds of built-in standard functions and numerous available toolboxes. MatLab's ease of data processing and visualization and its enormous range of built-in functionalities and toolboxes make it suitable for numerical computations and simulations as well as for data mining. Engineers and scientists can take advantage of the readily available functions/toolboxes to gain wider insight in their respective data mining experiments.

    Novel Computationally Intelligent Machine Learning Algorithms for Data Mining and Knowledge Discovery

    This thesis addresses three major issues in data mining: feature subset selection in large-dimensionality domains, plausible reconstruction of incomplete data in cross-sectional applications, and forecasting of univariate time series. For the automated selection of an optimal subset of features in real time, we present an improved hybrid algorithm, SAGA. SAGA combines Simulated Annealing's ability to avoid becoming trapped in local minima with the very high convergence rate of the Genetic Algorithm crossover operator, the strong local search ability of greedy algorithms, and the high computational efficiency of generalized regression neural networks (GRNNs). For imputing missing values and forecasting univariate time series, we propose a homogeneous neural network ensemble. The proposed ensemble consists of a committee of GRNNs trained on different subsets of features generated by SAGA, with the predictions of the base classifiers combined by a fusion rule. This approach makes it possible to discover all important interrelations between the values of the target variable and the input features. The proposed ensemble scheme has two innovative features that make it stand out among ensemble learning algorithms: (1) the ensemble makeup is optimized automatically by SAGA; and (2) a GRNN is used both for the base classifiers and for the top-level combiner classifier. Because it is built on GRNNs, the proposed ensemble is a dynamic weighting scheme, in contrast to existing ensemble approaches that rely on simple voting or static weighting. The basic idea of the dynamic weighting procedure is to give a higher reliability weight to those scenarios that are similar to the new one. The simulation results demonstrate the validity of the proposed ensemble model.
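
    A GRNN, the building block used throughout this ensemble, is essentially Nadaraya-Watson kernel regression: each prediction is a weighted average of the training targets, with weights given by a Gaussian kernel on the distance to each stored training pattern, which is what makes the weighting dynamic for every new query. The sketch below is a minimal illustrative implementation assuming a single smoothing parameter sigma; it is not the thesis's code.

        import numpy as np

        class GRNN:
            """Minimal Generalized Regression Neural Network (kernel regression)."""

            def __init__(self, sigma: float = 0.5):
                self.sigma = sigma          # smoothing (kernel bandwidth) parameter

            def fit(self, X, y):
                self.X = np.asarray(X, dtype=float)
                self.y = np.asarray(y, dtype=float)
                return self

            def predict(self, Xq):
                preds = []
                for q in np.atleast_2d(np.asarray(Xq, dtype=float)):
                    # Squared distance from the query to every stored pattern
                    d2 = np.sum((self.X - q) ** 2, axis=1)
                    # Gaussian kernel: closer training patterns get larger weights,
                    # so the weighting adapts dynamically to each new query.
                    w = np.exp(-d2 / (2.0 * self.sigma ** 2))
                    preds.append(np.dot(w, self.y) / (np.sum(w) + 1e-12))
                return np.array(preds)

        # An ensemble in the spirit of the abstract would train one GRNN per
        # feature subset chosen by SAGA and fuse their outputs with another GRNN.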

    Data mining and knowledge discovery: a guided approach based on monotone Boolean functions

    This dissertation deals with an important problem in Data Mining and Knowledge Discovery (DM & KD), and in Information Technology (IT) in general. It addresses the problem of efficiently learning monotone Boolean functions via membership queries to oracles. The monotone Boolean function can be thought of as a phenomenon, such as breast cancer or a computer crash, together with a set of predictor variables. The oracle can be thought of as an entity that knows the underlying monotone Boolean function and provides a Boolean response to each query. In practice, it may take the shape of a human expert, or it may be the outcome of performing tasks such as running experiments or searching large databases. Monotone Boolean functions have general knowledge representation power and are inherently frequent in applications. A key goal of this dissertation is to demonstrate the wide spectrum of important real-life applications that can be analyzed by using the newly proposed computational approaches. The applications of breast cancer diagnosis, computer crashing, college acceptance policies, and record linkage in databases are used here to demonstrate this point and illustrate the algorithmic details. Monotone Boolean functions have the added benefit of being intuitive. This property is perhaps the most important in learning environments, especially when human interaction is involved, since people tend to make better use of knowledge they can easily interpret, understand, validate, and remember. The main goal of this dissertation is to design new algorithms that minimize the average number of queries needed to completely reconstruct monotone Boolean functions defined on the finite set of vectors V = {0,1}^n. The optimal query selections are found via a recursive algorithm in exponential time (in the size of V). The optimality conditions are then summarized in the simple form of evaluative criteria, which are near optimal and take only polynomial time to compute. Extensive unbiased empirical results show that the evaluative criterion approach is far superior to any of the existing methods. In fact, the reduction in the average number of queries increases exponentially with the number of variables n, and faster than exponentially with the oracle's error rate.
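
    The query-saving idea rests on monotonicity: if the oracle answers 1 on a vector, every componentwise-greater vector must also map to 1, and a 0 answer propagates downward in the same way. The sketch below is a minimal illustrative learner that reconstructs a monotone Boolean function on {0,1}^n while skipping queries already implied by earlier answers; it uses a simple fixed query order rather than the dissertation's optimized evaluative criteria, and the example rule is hypothetical.

        from itertools import product

        def reconstruct_monotone(oracle, n):
            """Learn a monotone Boolean function f: {0,1}^n -> {0,1}.

            oracle(v) returns f(v); monotonicity lets each answer fix many vectors.
            """
            known = {}                                   # vector (tuple) -> 0/1
            vectors = list(product((0, 1), repeat=n))    # all 2^n vectors
            queries = 0
            for v in vectors:
                if v in known:
                    continue                             # value already implied
                value = oracle(v)
                queries += 1
                # Propagate: f(v)=1 forces f(u)=1 for all u >= v (componentwise);
                # f(v)=0 forces f(u)=0 for all u <= v.
                for u in vectors:
                    ge = all(ui >= vi for ui, vi in zip(u, v))
                    le = all(ui <= vi for ui, vi in zip(u, v))
                    if value == 1 and ge:
                        known[u] = 1
                    elif value == 0 and le:
                        known[u] = 0
            return known, queries

        # Example with a hypothetical "at least two symptoms present" rule:
        # f, q = reconstruct_monotone(lambda v: int(sum(v) >= 2), n=4)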