Towards a Domain-Specific Comparative Analysis of Data Mining Tools
Advancement in technology has brought widespread adoption and utilization of data mining tools. Successful implementation of data mining requires a careful assessment of the various data mining tools. Although several works have compared data mining tools based on usability, open-source availability, integration for statistical analysis, big/small scale, and data visualization, none of them has suggested tools for specific industry sectors. This paper provides a comparative study of various data mining tools based on popularity and usage among industry sectors such as business, education, and healthcare. The factors used in the comparison are performance and scalability, data access, data preparation, data exploration and visualization, advanced modeling capabilities, programming language, operating system, interfaces, ease of use, and price/license. The following popular data mining tools are assessed: SAS Enterprise Miner, KNIME, and R for business; Moodle Learning Analytics, Blackboard Analytics, and Canvas for education; and RapidMiner, IBM Watson Health, and Tableau for healthcare. The paper also discusses the critical issues and challenges associated with the adoption of data mining tools. Furthermore, it suggests possible solutions to help industries choose the data mining tool that best covers their respective requirements.
On data integration workflows for an effective management of multidimensional petroleum digital ecosystems in Arabian Gulf Basins
Data integration of multiple heterogeneous datasets from multidimensional petroleum digital ecosystems is an effective way of extracting information and adding value to the knowledge domain across multiple producing onshore and offshore basins. At present, data from multiple basins are scattered and unusable for integration because of scale and format differences. Ontology-based warehousing and mining modeling are recommended for resolving the scaling and formatting issues of multidimensional datasets; seismic and well-domain datasets are described as cases in point. Issues such as semantics among different data dimensions and their associated attributes are also addressed by ontology modeling. Intelligent relationships are built among several petroleum system domains (structure, reservoir, source, and seal, for example) at global scale, facilitating the integration process among multiple dimensions in a data warehouse environment. For this purpose, integrated workflows are designed for capturing and modeling unknown relationships among petroleum system data attributes in interpretable knowledge domains. This study presents an effective approach to mining and interpreting data views drawn from warehoused exploration and production metadata, with special reference to Arabian onshore and offshore basins.
Quantifying discrepancies in opinion spectra from online and offline networks
Online social media such as Twitter are widely used for mining public opinions and sentiments on various issues and topics. The sheer volume of the data generated and the eager adoption by the online-savvy public are helping to raise the profile of online media as a convenient source of news and public opinions on social and political issues as well. Due to the uncontrollable biases in the population who heavily use the media, however, it is often difficult to measure how accurately the online sphere reflects the offline world at large, undermining the usefulness of online media. One way of identifying and overcoming the online-offline discrepancies is to apply a common analytical and modeling framework to comparable data sets from online and offline sources and to cross-analyze the patterns found therein. In this paper we study the political spectra constructed from Twitter and from legislators' voting records as an example to demonstrate the potential limits of online media as a source for accurate public opinion mining.
Mining Public Opinion about Economic Issues: Twitter and the U.S. Presidential Election
Opinion polls have been the bridge between public opinion and politicians in elections. However, developing surveys to disclose people's feedback with respect to economic issues is limited, expensive, and time-consuming. In recent years, social media such as Twitter have enabled people to share their opinions regarding elections and have provided a platform for collecting large amounts of social media data. This paper proposes a computational public opinion mining approach to explore the discussion of economic issues in social media during an election. Related studies use text mining methods independently for election analysis and prediction; this research combines two text mining methods: sentiment analysis and topic modeling. The proposed approach has been effectively deployed on millions of tweets to analyze the economic concerns of people during the 2012 US presidential election.
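The combined pipeline (sentiment analysis plus topic assignment per tweet) can be sketched with a toy lexicon-based version; the lexicon, topic keywords, and sample tweets below are illustrative assumptions, not the paper's data or models:

```python
from collections import Counter

# Tiny illustrative sentiment lexicon (an assumption, not the paper's lexicon)
POS = {"growth", "jobs", "recovery", "strong"}
NEG = {"debt", "unemployment", "crisis", "weak"}

# Illustrative economic topic keyword sets (also assumptions)
TOPICS = {
    "employment": {"jobs", "unemployment", "hiring"},
    "fiscal": {"debt", "deficit", "taxes"},
}

def sentiment(tokens):
    """Net sentiment: +1 per positive word, -1 per negative word."""
    return sum((t in POS) - (t in NEG) for t in tokens)

def topic(tokens):
    """Assign the topic whose keyword set overlaps the tweet most."""
    scores = {name: len(kw & set(tokens)) for name, kw in TOPICS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"

def mine(tweets):
    """Aggregate net sentiment per topic over a tweet stream."""
    by_topic = Counter()
    for text in tweets:
        tokens = text.lower().split()
        by_topic[topic(tokens)] += sentiment(tokens)
    return dict(by_topic)

tweets = [
    "strong jobs growth this quarter",
    "debt crisis looms over the deficit",
]
print(mine(tweets))  # → {'employment': 3, 'fiscal': -2}
```

A real deployment would replace the hand-built lexicon with a trained sentiment classifier and the keyword overlap with a topic model such as LDA, but the aggregation step is the same.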
Conservation science in NOAA’s National Marine Sanctuaries: description and recent accomplishments
This report describes cases relating to the management of national marine sanctuaries in which certain scientific information was required so managers could make decisions that effectively protected trust resources. The cases presented represent only a fraction of difficult issues that marine sanctuary managers deal with daily. They include, among others, problems related to wildlife disturbance, vessel routing, marine reserve placement, watershed management, oil spill response, and habitat restoration. Scientific approaches to address these problems vary significantly, and include literature surveys, data mining, field studies (monitoring, mapping, observations, and measurement), geospatial and biogeographic analysis, and modeling. In most cases there is also an element of expert consultation and collaboration among multiple partners, agencies with resource protection responsibilities, and other users and stakeholders. The resulting management responses may involve direct intervention (e.g., for spill response or habitat restoration issues), proposal of boundary alternatives for marine sanctuaries or reserves, changes in agency policy or regulations, making recommendations to other agencies with resource protection responsibilities, proposing changes to international or domestic shipping rules, or development of new education or outreach programs. (PDF contains 37 pages.)
Challenging Issues of Spatio-Temporal Data Mining
The spatio-temporal database (STDB) has received considerable attention during the past few years, due to the emergence of numerous applications (e.g., flight control systems, weather forecasting, mobile computing) that demand efficient management of moving objects. These applications record objects' geographical locations (sometimes also shapes) at various timestamps and support queries that explore their historical and future (predictive) behaviors. The STDB significantly extends the traditional spatial database, which deals only with stationary data and hence is inapplicable to moving objects, whose dynamic behavior requires re-investigation of numerous topics including data modeling, indexes, and the related query algorithms. In many application areas, huge amounts of data are generated that explicitly or implicitly contain spatial or spatio-temporal information. However, the ability to analyze these data remains inadequate, and the need for adapted data mining tools is a major challenge. In this paper, we present the challenging issues of spatio-temporal data mining.
Keywords: database, data mining, spatial, temporal, spatio-temporal
Examining Granular Computing from a Modeling Perspective
In this paper, we use a set of unified components to conduct granular modeling for problem solving paradigms in several fields of computing. Each identified component may represent a potential research direction in the field of granular computing. A granular computing model for information analysis is proposed. The model may suggest that granular computing is an instrument for implementing perception based computing based on numeric computing. In addition, a novel granular language modeling technique is proposed for information extraction from web pages. This paper also suggests that the study of data mining in the framework of granular computing may address the issues of interpretability and usage of discovered patterns
Analysis of WEKA data mining algorithms Bayes net, random forest, MLP and SMO for heart disease prediction system: A case study in Iraq
Data mining is defined as a search through large amounts of data for valuable information. Association rules, grouping, clustering, prediction, and sequence modeling are among the most essential and general strategies for data extraction. The processing of data plays a major role in disease detection in the healthcare industry. A variety of examinations is normally required to diagnose a patient; using data mining strategies, however, the number of examinations can be decreased. This reduction is crucial in terms of time and results. Heart disease is a life-threatening disorder, and the healthcare burden it imposes is immense given the prevalence of such conditions and the variety of situations in which they occur. Today, the hidden information in healthcare data is important for decision making. For the prediction of cardiovascular problems, the Weka 3.8.3 tool is used in this analysis to evaluate data mining algorithms: sequential minimal optimization (SMO), multilayer perceptron (MLP), random forest, and Bayes net. The results combine the prediction accuracy, the receiver operating characteristic (ROC) curve, and the PRC value. Bayes net (94.5%) and random forest (94%) show better performance than the SMO and MLP methods.
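A comparable experiment can be sketched outside Weka with scikit-learn; the synthetic dataset below is a stand-in (the study's Iraqi clinical data is not reproduced here), so the scores will differ from the reported 94-94.5%:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the heart-disease data; 13 features echo the
# attribute count of common public heart datasets (an assumption).
X, y = make_classification(n_samples=500, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Two of the four algorithms compared in the study (their Weka
# counterparts are RandomForest and MultilayerPerceptron).
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "mlp": MLPClassifier(max_iter=1000, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # ROC AUC mirrors the "ROC area" metric Weka reports per classifier.
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: ROC AUC = {auc:.3f}")
```

Weka's SMO and BayesNet have scikit-learn analogues (`SVC` with a linear kernel, and naive Bayes variants) that could be added to the same loop.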
RANDOMIZATION BASED PRIVACY PRESERVING CATEGORICAL DATA ANALYSIS
The success of data mining relies on the availability of high quality data. To ensure quality data mining, effective information sharing between organizations becomes a vital requirement in today's society. Since data mining often involves sensitive information of individuals, the public has expressed a deep concern about their privacy. Privacy-preserving data mining is a study of eliminating privacy threats while, at the same time, preserving useful information in the released data for data mining.
This dissertation investigates data utility and privacy of randomization-based models in privacy preserving data mining for categorical data. For the analysis of data utility in the randomization model, we first investigate the accuracy analysis for association rule mining in market basket data. Then we propose a general framework to conduct theoretical analysis on how the randomization process affects the accuracy of various measures adopted in categorical data analysis.
We also examine data utility when randomization mechanisms are not provided to data miners to achieve better privacy. We investigate how various objective association measures between two variables may be affected by randomization. We then extend it to multiple variables by examining the feasibility of hierarchical loglinear modeling. Our results provide a reference to data miners about what they can and cannot do with certainty upon randomized data directly, without knowledge about the original distribution of the data and the distortion information.
Data privacy and data utility are commonly considered as a pair of conflicting requirements in privacy preserving data mining applications. In this dissertation, we investigate privacy issues in randomization models. In particular, we focus on attribute disclosure under linking attack in data publishing. We propose efficient solutions to determine optimal distortion parameters such that we can maximize utility preservation while still satisfying privacy requirements. We compare our randomization approach with l-diversity and anatomy in terms of utility preservation (under the same privacy requirements) from three aspects (reconstructed distributions, accuracy of answering queries, and preservation of correlations). Our empirical results show that randomization incurs significantly smaller utility loss.
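One common randomization mechanism for categorical data works along these lines (a minimal sketch; the dissertation's actual distortion model and parameter optimization are not reproduced here): each value is kept with probability p and otherwise replaced by a uniformly random category, and the true category proportions are reconstructed from the perturbed reports:

```python
import random
from collections import Counter

def randomize(value, categories, p, rng):
    """Keep the true category with probability p; otherwise report a
    category drawn uniformly at random (possibly the true one)."""
    return value if rng.random() < p else rng.choice(categories)

def estimate_distribution(reported, categories, p):
    """Unbiased reconstruction of true proportions: since
    P(report = c) = p*pi_c + (1-p)/k for k categories,
    pi_c = (observed_c - (1-p)/k) / p."""
    k = len(categories)
    n = len(reported)
    obs = Counter(reported)
    return {c: (obs[c] / n - (1 - p) / k) / p for c in categories}

rng = random.Random(42)  # seeded for reproducibility
categories = ["A", "B", "C"]
# Hypothetical true data: 60% A, 30% B, 10% C
true_data = ["A"] * 6000 + ["B"] * 3000 + ["C"] * 1000
reported = [randomize(v, categories, p=0.7, rng=rng) for v in true_data]
est = estimate_distribution(reported, categories, p=0.7)
print({c: round(v, 2) for c, v in est.items()})  # close to {'A': 0.6, 'B': 0.3, 'C': 0.1}
```

The utility/privacy trade-off the dissertation studies shows up directly in p: larger p preserves the distribution more accurately but reveals more about individual records.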
Application of Data Mining in Air Traffic Forecasting
The main goal of the study centers on developing a model for air traffic forecasting using off-the-shelf data mining and machine learning techniques. Although data-driven modeling has been extensively applied in the aviation sector, little research has been done in the area of air traffic forecasting. This study is inspired by previous research focused on improving the Federal Aviation Administration (FAA) Terminal Area Forecast (TAF) methodology, which historically assumed that the US air transportation system (ATS) network structure was static. Recent developments use data mining algorithms to predict the likelihood of previously un-connected airport-pairs becoming connected in the future, and the likelihood of connected airport-pairs becoming un-connected. Despite the innovation of this research, it does not focus on improving the FAA's existing methodology for forecasting future air traffic levels on existing routes, which is based on relatively simple regression and growth models. We investigate different approaches for improving and developing new features within the existing data mining applications in air traffic forecasting. We focus particularly on predicting detailed traffic information for the US ATS. Initially, a 2-stage log-log model is applied to establish the significance of different inputs and to identify issues of endogeneity and multicollinearity, while maintaining the simplicity of current models. Although the model shows high goodness of fit, it tested positive for both of these issues and also presented problems with causality. To address these issues, a 3-stage model that is under development is introduced. This model employs logistic regression and discrete choice modeling. As part of future work, machine learning techniques such as clustering and neural networks will be applied to improve this model's performance.
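The log-log stage can be illustrated with a minimal ordinary-least-squares fit in log space; the single predictor and the 0.8 elasticity in the synthetic data below are hypothetical assumptions, not the study's actual covariates or estimates:

```python
import math

def fit_loglog(x, y):
    """Fit log(y) = a + b*log(x) by closed-form OLS; b is the elasticity
    of y with respect to x (a constant in a log-log specification)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / \
        sum((u - mx) ** 2 for u in lx)
    a = my - b * mx
    return a, b

# Synthetic illustration: traffic generated as 3 * pop**0.8,
# so the fit should recover an elasticity of 0.8 exactly.
pops = [1e5, 5e5, 1e6, 5e6, 1e7]
traffic = [3.0 * p ** 0.8 for p in pops]
a, b = fit_loglog(pops, traffic)
print(round(b, 3))  # → 0.8
```

Endogeneity and multicollinearity, the issues the study diagnoses, arise when predictors are driven by the outcome or by each other; OLS coefficients like b then stop being interpretable as causal elasticities, which motivates the 3-stage model.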