
    Code generator for integrating warehouse XML data sources.

    XML, the eXtensible Markup Language, has been recognized as the standard for data representation and exchange on the World Wide Web, and vast amounts of XML data are available on the web. Because this information is stored on separate web pages, it is hard to combine pieces of information for decision support purposes. Data warehouse integration provides a solution by bringing the different XML source data into a single format with meaningful information for decision support systems. A data warehouse is a large integrated database organized around the major subjects of an enterprise for the purpose of decision support querying. Many enterprises build their own data warehouse systems from scratch in varying formats, which makes it important to build data warehouse systems that are more efficient, more reliable, cost-effective, and easy to use. Building a code generator that creates a program to automatically integrate XML data sources into a target data warehouse is one solution, yet there is little research showing the use of recent XML techniques in code generators for data warehouse XML data integration. This thesis proposes a Warehouse Integrator code generator for XML (WIG4X), which integrates XML data sources into a target data warehouse by first generating Java programs for extracting, cleaning, and loading XML data into the warehouse. The WIG4X system also generates the programs for creating XML views from the data warehouse. An XML schema mapping strategy is employed for structural integration of each XML data source into the data warehouse, using a first-order logic-like language similar to that used in INFOMASTER. Content integration is handled through XML data extraction, conversion constraints, data cleaning, and data loading. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2001 .L57. Source: Masters Abstracts International, Volume: 40-06, page: 1549. Adviser: Christie Ezeife. Thesis (M.Sc.)--University of Windsor (Canada), 2002.
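
    The thesis generates Java programs; purely as an illustration of the extract-clean-load step such a generator might emit, here is a minimal Python sketch. The element names, table layout, and cleaning rules are assumptions, not WIG4X's actual output.

```python
# Hypothetical sketch of an extract-clean-load step for one XML source.
# Element names ("order", "id", "customer", "amount") and the warehouse
# table layout are assumed for illustration only.
import sqlite3
import xml.etree.ElementTree as ET

def load_orders(xml_path: str, db_path: str) -> None:
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS fact_orders (
                        order_id TEXT PRIMARY KEY,
                        customer TEXT,
                        amount   REAL)""")
    root = ET.parse(xml_path).getroot()
    for order in root.iter("order"):
        oid = (order.findtext("id") or "").strip()
        cust = (order.findtext("customer") or "UNKNOWN").strip()
        try:
            amount = float(order.findtext("amount", "0"))
        except ValueError:                 # simple content-cleaning rule
            amount = 0.0
        if oid:                            # conversion constraint: id required
            conn.execute("INSERT OR REPLACE INTO fact_orders VALUES (?, ?, ?)",
                         (oid, cust, amount))
    conn.commit()
    conn.close()
```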

    DATA CURATION FOR MODELING TALL FESCUE BIOMASS DYNAMICS WITH DSSAT-CSM

    While models for predicting forage production are available to aid management decisions for some forage crops, there is limited research on a yield model designed specifically for tall fescue (Schedonorus arundinaceus). Therefore, our objective was to adapt an existing perennial forage model, the Decision Support System for Agrotechnology Transfer Cropping Systems Model (DSSAT-CSM), for predicting forage biomass of tall fescue in the southern Great Plains. Evaluating model performance first requires extensive data manipulation and cleaning. In this project, a cohesive dataset combining biomass, weather, soil, and management data was structured into the DSSAT standard file format to be used in future tall fescue crop modeling analysis.
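
    As a rough illustration of this kind of curation step, the sketch below writes daily weather records into a DSSAT-style .WTH layout. The column specification here is simplified and assumed; the exact format expected by DSSAT-CSM should be taken from the DSSAT documentation.

```python
# Illustrative only: writing curated daily weather into a simplified
# DSSAT-style .WTH layout (column spec assumed, not authoritative).
import pandas as pd

def write_wth(df: pd.DataFrame, path: str, station: str = "STAT") -> None:
    # df is expected to carry columns: date, srad, tmax, tmin, rain
    with open(path, "w") as f:
        f.write(f"*WEATHER DATA : {station}\n\n")
        f.write("@DATE  SRAD  TMAX  TMIN  RAIN\n")
        for _, r in df.iterrows():
            yyddd = pd.Timestamp(r["date"]).strftime("%y%j")   # YYDDD date code
            f.write(f"{yyddd} {r['srad']:5.1f} {r['tmax']:5.1f} "
                    f"{r['tmin']:5.1f} {r['rain']:5.1f}\n")
```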

    A Comparison of Decision Tree with Logistic Regression Model for Prediction of Worst Non-Financial Payment Status in Commercial Credit

    Credit risk prediction is an important problem in the financial services domain. While machine learning techniques such as Support Vector Machines and Neural Networks have been used for improved predictive modeling, the outcomes of such models are not readily explainable and are therefore difficult to apply within financial regulations. In contrast, Decision Trees are easy to explain and provide an easy-to-interpret visualization of model decisions. The aim of this paper is to predict worst non-financial payment status among businesses and to evaluate Decision Tree model performance against a traditional Logistic Regression model for this task. The dataset for analysis is provided by Equifax and includes over 300 potential predictors from more than 11 million unique businesses. After a data discovery phase, including imputation, cleaning, and transformation of potential predictors, Decision Tree and Logistic Regression models were built on the same finalized analysis dataset. Evaluated on the ROC index and the Kolmogorov-Smirnov statistic, the Decision Tree performed as well as the Logistic Regression model.
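
    A minimal sketch of this kind of comparison is shown below. The Equifax data is proprietary, so a synthetic dataset stands in, and the evaluation uses a generic ROC AUC and a KS statistic computed between the score distributions of the two classes; the paper's exact feature set and modeling choices are not reproduced here.

```python
# Hedged sketch: comparing a decision tree and logistic regression on a
# synthetic binary target, scored by ROC AUC and the KS statistic.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, model in [("DecisionTree", DecisionTreeClassifier(max_depth=6, random_state=0)),
                    ("LogisticRegression", LogisticRegression(max_iter=1000))]:
    p = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    ks = ks_2samp(p[y_te == 1], p[y_te == 0]).statistic   # KS between class score distributions
    print(f"{name}: ROC AUC={roc_auc_score(y_te, p):.3f}  KS={ks:.3f}")
```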

    Comparison of Classification Algorithm for Crop Decision based on Environmental Factors using Machine Learning

    Crop decision is a complex process that plays a vital role in agriculture. Various biotic and abiotic factors affect this decision; crucial environmental factors include nitrogen, phosphorus, potassium, pH, temperature, humidity, and rainfall. Machine learning algorithms can predict the crop suited to these environmental conditions. The process involves steps such as feature selection, data cleaning, and training/testing splits, and uses algorithms such as Logistic Regression, Decision Tree, Support Vector Machine, K-Nearest Neighbour, Naïve Bayes, and Random Forest. This paper presents a comparison based on accuracy, across various training and testing splits, to choose the best algorithm. The comparison is carried out with two tools: Google Colab, using Python and its libraries to implement the machine learning algorithms, and WEKA, a workbench used here for pre-processing and for comparing the algorithms.
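
    A minimal Python sketch of such a comparison follows. The input file name, the column names (the seven factors plus a crop label), and the chosen splits are assumptions for illustration, not the paper's dataset.

```python
# Hedged sketch: accuracy comparison of the listed classifiers over several
# train/test splits. "crop_data.csv" and its columns are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("crop_data.csv")              # N, P, K, pH, temp, humidity, rainfall, crop
X, y = df.drop(columns=["crop"]), df["crop"]

models = {
    "LogReg": LogisticRegression(max_iter=2000),
    "DecisionTree": DecisionTreeClassifier(),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "NaiveBayes": GaussianNB(),
    "RandomForest": RandomForestClassifier(),
}
for test_size in (0.2, 0.3, 0.4):              # alternative train/test splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=42)
    for name, m in models.items():
        acc = m.fit(X_tr, y_tr).score(X_te, y_te)
        print(f"split {1 - test_size:.0%}/{test_size:.0%}  {name}: accuracy={acc:.3f}")
```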

    Declarative Data Cleaning : Language, Model, and Algorithms

    Projet CARAVEL. The problem of data cleaning, which consists of removing inconsistencies and errors from original data sets, is well known in the area of decision support systems and data warehouses. However, for non-conventional applications, such as the migration of largely unstructured data into structured data, or the integration of heterogeneous scientific data sets in interdisciplinary fields (e.g., in environmental science), existing ETL (Extraction Transformation Loading) and data cleaning tools for writing data cleaning programs are insufficient. The main challenge is the design of a data flow graph that effectively generates clean data and performs efficiently on large sets of input data. The difficulty comes from (i) the lack of a clear separation between the logical specification of data transformations and their physical implementation, and (ii) the lack of explanation of cleaning results and of user interaction facilities to tune a data cleaning program. This paper addresses these two problems and presents a language, an execution model, and algorithms that enable users to express data cleaning specifications declaratively and perform the cleaning efficiently. As an example, we use a set of bibliographic references used to construct the Citeseer Web site; the underlying data integration problem is to derive structured and clean textual records so that meaningful queries can be performed. Experimental results report on the assessment of the proposed framework for data cleaning.
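
    The following toy Python sketch is not the paper's language; it only illustrates the idea the paper argues for, keeping the logical specification of cleaning rules separate from the engine that executes them and retaining an explanation trail, applied to bibliographic-style records.

```python
# Toy illustration of declarative-style cleaning: rules are declared as data,
# and a separate engine applies them and records which rules fired.
import re

# Logical specification: ordered (name, rule) pairs; a rule returns a cleaned
# record, or None to reject the record.
RULES = [
    ("normalize_ws", lambda rec: {**rec, "title": re.sub(r"\s+", " ", rec["title"]).strip()}),
    ("lowercase_title", lambda rec: {**rec, "title": rec["title"].lower()}),
    ("drop_missing_year", lambda rec: rec if rec.get("year") else None),
]

def clean(records):
    """Execution engine: apply each rule in turn, keeping an explanation trail."""
    for rec in records:
        trail = []
        for name, rule in RULES:
            rec = rule(rec)
            trail.append(name)
            if rec is None:          # record rejected by this rule
                break
        if rec is not None:
            yield {**rec, "_applied": trail}

refs = [{"title": "  Declarative   Data Cleaning ", "year": 2001},
        {"title": "Unknown report", "year": None}]
print(list(clean(refs)))
```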

    XXXVIII Jornadas de Automática

    Producción Científica. This work presents a decision-support tool that addresses a model-based optimization approach for online load allocation and scheduling of cleaning operations in an evaporation network. The aim is to improve resource efficiency by supplying the optimal solution for a given production goal. The approach includes a semi-automatic update of the evaporator models, based on historical data, for minimal modelling effort. The problem is formulated via mixed-integer programming and integrated into the plant supervision systems. Production constraints, concerns about practical implementation, and visualization preferences are also taken into account in the design of the prototypical tool. MINECO/FEDER Grant DPI2015-70975 (INOPTCON). EU H2020-SPIRE Grant Agreement nº 723575 (CoPro).
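
    To give a flavour of the mixed-integer formulation idea (not the authors' actual model), the toy PuLP sketch below allocates load across two evaporators and schedules one cleaning slot per unit; all plant data and cost coefficients are invented for illustration.

```python
# Toy MILP sketch: load allocation plus one cleaning slot per evaporator.
# Capacities, demands, and the objective weights are assumed values.
import pulp

periods = range(4)
units = ["EV1", "EV2"]
capacity = {"EV1": 60.0, "EV2": 50.0}        # t/h, assumed
demand = [45, 40, 48, 42]                    # total load per period, assumed

prob = pulp.LpProblem("evaporator_cleaning", pulp.LpMinimize)
load = pulp.LpVariable.dicts("load", (units, periods), lowBound=0)
clean = pulp.LpVariable.dicts("clean", (units, periods), cat="Binary")

# Each unit must be cleaned exactly once over the horizon.
for u in units:
    prob += pulp.lpSum(clean[u][t] for t in periods) == 1

for t in periods:
    # Demand must be met; a unit being cleaned contributes no load that period.
    prob += pulp.lpSum(load[u][t] for u in units) >= demand[t]
    for u in units:
        prob += load[u][t] <= capacity[u] * (1 - clean[u][t])

# Objective: penalize late cleaning (a crude proxy for fouling losses)
# plus a small cost on total processed load.
prob += (pulp.lpSum(t * clean[u][t] for u in units for t in periods)
         + 0.01 * pulp.lpSum(load[u][t] for u in units for t in periods))

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({u: [t for t in periods if clean[u][t].value() == 1] for u in units})
```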

    QUALITATIVE CLEANING METHODS ON DISTRIBUTED IOT DATASETS

    Data analysis encompasses a set of individual steps that allow a typically large data set to be remodeled so that actionable information can be extracted from it and used to support decision-making. Data generated from multiple distributed sources is usually dirty by default, and dirty data often leads to inaccurate or incomplete data analysis. As a result, without first performing data cleaning, wrong or fatally flawed business decisions are inevitable. IoT describes a network of physical and virtual objects containing software, electrical components, and sensors that exchange data with other connected devices over the internet. The data generated from these sensors is distributed by design, and my aim for this thesis is to explore qualitative data cleaning methods, such as integrity constraints and functional dependency violations, to perform error detection and in-place error repairing on the distributed data sets generated from these devices. This approach is relatively new, since most of the prior data cleaning research in this domain has focused on quantitative techniques such as outlier detection. The next goal for my thesis will be to perform exploratory data analysis on the data sets from these IoT sources using data wrangling tools on open source frameworks, such as Optimus under Apache Spark, to handle the unstructured and semi-structured formats of the data generated from these sources. The end goal is to produce clean data from these sources so that insights can be gained to support decision-making for the purpose of product improvement.
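
    As a small, hedged example of one qualitative check in this direction, the pandas sketch below detects violations of an assumed functional dependency device_id -> location in a sensor-reading table and applies a naive in-place repair. Column names are invented, and the sketch uses pandas rather than the Optimus/Spark stack named in the abstract.

```python
# Detecting and repairing violations of the (assumed) functional dependency
# device_id -> location in IoT sensor readings.
import pandas as pd

readings = pd.DataFrame({
    "device_id": ["d1", "d1", "d2", "d2", "d3"],
    "location":  ["lab", "lab", "roof", "yard", "lab"],   # d2 violates the FD
    "temp_c":    [21.5, 21.7, 5.0, 5.2, 22.0],
})

# Integrity constraint: a device should always report from a single location.
violations = (readings.groupby("device_id")["location"]
                      .nunique()
                      .loc[lambda s: s > 1])
print(violations)            # device ids whose rows violate the dependency

# Naive in-place repair: keep each device's most frequent reported location.
repair = readings.groupby("device_id")["location"].agg(lambda s: s.mode().iloc[0])
readings["location"] = readings["device_id"].map(repair)
```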

    The Role of Metadata for Effective Data Warehouse

    Metadata is an efficient means of managing a Data Warehouse (DW). It is also an effective tool for reducing the time needed to answer queries. In addition, it provides integration and standardization capabilities, which lead to faster, clearer, and more accurate decision-making at the right time. This paper defines the metadata concept and describes the use of metadata in data cleaning, where it identifies the sources and the types of fields and helps choose the appropriate algorithm. Metadata is also useful in Decision Support Systems (DSS), where it improves the efficiency of analysis and reduces query response time.
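
    An illustrative sketch of the cleaning idea described here, letting field-level metadata choose the cleaning routine, is shown below. The field names, types, and cleaning functions are invented; a real warehouse would hold this information in its metadata repository.

```python
# Metadata-driven cleaning sketch: field metadata selects the cleaning routine.
# All names and rules are hypothetical.
from datetime import datetime

METADATA = {                      # would normally live in the DW metadata repository
    "customer_name": {"type": "text",    "source": "crm"},
    "order_date":    {"type": "date",    "source": "erp"},
    "amount":        {"type": "numeric", "source": "erp"},
}

CLEANERS = {
    "text":    lambda v: " ".join(str(v).split()).title(),
    "date":    lambda v: datetime.strptime(v, "%Y-%m-%d").date(),
    "numeric": lambda v: float(str(v).replace(",", "")),
}

def clean_row(row: dict) -> dict:
    """Apply the cleaner chosen by each field's metadata type."""
    return {col: CLEANERS[METADATA[col]["type"]](val) for col, val in row.items()}

print(clean_row({"customer_name": "  ada   lovelace ",
                 "order_date": "2024-05-01",
                 "amount": "1,250.50"}))
```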

    THE ROLE OF THE SCHOOL ENVIRONMENT IN STRENGTHENING THE ENVIRONMENTAL CARE CHARACTER OF STUDENTS AT SMAN 111 JAKARTA

    This study aims to describe the role of the school environment in strengthening the environmental care character of students at SMA Negeri 111 Jakarta. The research takes a qualitative approach with descriptive methods. Data collection techniques included interviews, observation, and documentation, and the validity of the data was established through triangulation of sources and techniques. Data analysis used the interactive model of Miles and Huberman with three stages, namely data reduction, data display, and conclusion drawing and verification, presented in a qualitative descriptive manner. The results show that the role of the school environment in strengthening the character of caring for the environment is manifested in several ways, including (1) providing exemplary habituation, (2) habituation of maintaining cleanliness and environmental sustainability, (3) the availability of supporting facilities, including cleaning equipment, garbage disposal sites, toilets, and clean water, as well as slogans or posters about caring for the environment in various corners of the school, and (4) support for the Adiwiyata program through clean Fridays, a waste bank, and compost making.

    Will We Connect Again? Machine Learning for Link Prediction in Mobile Social Networks

    In this paper we examine link prediction for two types of data sets with mobility data, namely call data records (from the MIT Reality Mining project) and location-based social networking data (from the companies Gowalla and Brightkite). These data sets contain location information, which we incorporate into the features used for prediction. We also examine different strategies for data cleaning, in particular thresholding based on the amount of social interaction. We investigate the machine learning algorithms Decision Tree, Naïve Bayes, Support Vector Machine, and Logistic Regression. Generally, we find that our feature selection and filtering of the data sets have a major impact on the accuracy of link prediction, both for Reality Mining and Gowalla. Experimentally, the Decision Tree and Logistic Regression classifiers performed best.
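
    A minimal sketch in the spirit of this pipeline is given below: build pairwise features (including a simple location-overlap feature), apply an interaction-count threshold as the cleaning step, and compare two of the classifiers used in the paper. The feature definitions, threshold, and synthetic labels are assumptions, not the paper's actual setup.

```python
# Hedged link-prediction sketch on synthetic pairwise features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([
    rng.poisson(3, n),            # calls/check-ins between the pair so far
    rng.integers(0, 2, n),        # shared home-location indicator
    rng.random(n),                # Jaccard similarity of visited places
])
y = ((X[:, 0] > 2) & (X[:, 2] > 0.4)).astype(int)   # synthetic "link recurs" label

# Cleaning step analogous to the paper's: threshold on interaction count.
mask = X[:, 0] >= 1
X, y = X[mask], y[mask]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
for name, m in [("DecisionTree", DecisionTreeClassifier(max_depth=5)),
                ("LogisticRegression", LogisticRegression())]:
    acc = accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
    print(f"{name}: accuracy={acc:.3f}")
```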