12 research outputs found
A user preference perception model using data mining on a Web-based Environment
In a competitive environment, providing information and products that meet customer requirements and improve customer satisfaction is a key measure of a company's competitiveness. Customer Relationship Management (CRM) has gradually become an important issue in every business market. Using information technology, businesses can achieve one-to-one marketing more efficiently, with lower cost, labor and time.
In this paper, we propose a user preference perception model that uses data mining technology in a web-based environment. First, the users' web browsing records are aggregated. Second, fuzzy set theory and a sequential pattern mining algorithm are used to infer changes in the users' preferences over a period. After the test was completed, we used an on-line questionnaire to measure the satisfaction of all participants. The results show that satisfaction with receiving new information reached 72% among participants whose preferences had changed. This indicates that the proposed system can effectively perceive users' preference changes in a web environment.
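The abstract does not give the model's details, but the core idea of detecting preference change from browsing records can be illustrated with a minimal sketch. The function, window sizes and threshold below are all hypothetical stand-ins, not the paper's fuzzy-set or sequential-pattern method: it simply compares the share of visits per category between an early and a late browsing window.

```python
from collections import Counter

def preference_shift(early_visits, late_visits, threshold=0.2):
    """Flag categories whose share of visits changed by more than
    `threshold` between an early and a late browsing window.
    (Illustrative only; not the paper's fuzzy/sequential method.)"""
    def shares(visits):
        counts = Counter(visits)
        total = sum(counts.values())
        return {c: n / total for c, n in counts.items()}
    early, late = shares(early_visits), shares(late_visits)
    return {c for c in set(early) | set(late)
            if abs(late.get(c, 0.0) - early.get(c, 0.0)) > threshold}

# Example: a user drifting from "sports" pages toward "finance" pages.
early = ["sports"] * 8 + ["news"] * 2
late = ["finance"] * 6 + ["news"] * 2 + ["sports"] * 2
changed = preference_shift(early, late)
```

A real system would weight recent visits more heavily and mine ordered patterns rather than raw frequencies, which is where the fuzzy and sequential components of the paper's model come in.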
Mining Posets from Linear Orders
There has been much research on the combinatorial problem of generating the linear extensions of a given poset. This paper focuses on the reverse problem, where the input is a set of linear orders and the goal is to construct a poset, or set of posets, that generates the input. This problem finds applications in computational neuroscience, systems biology, paleontology, and physical plant engineering. Several algorithms are presented for efficiently finding a single poset that generates the input set of linear orders. The variation of the problem in which a minimum set of posets covering the input is sought is also explored. The problem is found to be polynomially solvable for one class of simple posets (kite(2) posets) but NP-complete for a related class (hammock(2,2,2) posets).
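The relationship between a poset and its linear extensions can be made concrete with a small brute-force sketch (assumed helper names; the paper's algorithms are far more efficient than this enumeration):

```python
from itertools import permutations

def linear_extensions(elements, less_than):
    """All total orders of `elements` consistent with the strict
    partial order given as a set of (a, b) pairs meaning a < b."""
    return [p for p in permutations(elements)
            if all(p.index(a) < p.index(b) for a, b in less_than)]

def generates(elements, less_than, orders):
    """Does the poset generate exactly the given set of linear orders?"""
    return set(linear_extensions(elements, less_than)) == set(orders)

# A 3-element poset with the single relation a < b has three linear
# extensions: every permutation that places a before b.
exts = linear_extensions("abc", {("a", "b")})
ok = generates("abc", {("a", "b")},
               {("a", "b", "c"), ("a", "c", "b"), ("c", "a", "b")})
```

The reverse problem studied in the paper starts from a set like the three orders above and asks for a poset (or a minimum set of posets) whose extensions reproduce it exactly.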
Classification of Clinical Tweets Using Apache Mahout
Title from PDF of title page, viewed on July 31, 2015. Thesis advisor: Praveen R. Rao. Includes vita and bibliographic references (pages 54-58). Thesis (M.S.)--School of Computing and Engineering, University of Missouri--Kansas City, 2015.
There is an increasing amount of healthcare-related data available on Twitter. Due to Twitter's popularity, a large number of clinical tweets are posted on this microblogging platform every day. One interesting problem we face today is the classification of clinical tweets so that the classified tweets can be readily consumed by new healthcare applications. While there are several tools available to classify small datasets, the size of Twitter data demands new tools and techniques for fast and accurate classification.
Motivated by these reasons, we propose a new tool called Clinical Tweets
Classifier (CTC) to enable scalable classification of clinical content on Twitter.
CTC uses Apache Mahout, and in addition to keywords and hashtags in the tweets,
it also leverages the SNOMED CT clinical terminology and a new tweet influence
scoring scheme to construct high accuracy models for classification. CTC uses the
NaĂŻve Bayes algorithm. We trained four models based on different feature sets
such as hashtags, keywords, clinical terms from SNOMED CT, and so on. We
selected the training and test datasets based on the influence score of the tweets.
We validated the accuracy of these models using a large number of tweets.
Our results show that using SNOMED CT terms and a training dataset with more influential tweets yields the most accurate model for classification. We also tested the scalability of CTC using 100 million tweets on a small cluster.
Introduction -- Background and related work -- Design and framework -- Evaluation -- Conclusion and future work
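CTC itself is built on Apache Mahout, but the Naive Bayes text classification it relies on can be sketched in a few lines of plain Python. Everything here (the tiny dataset, labels, and tokenization) is invented for illustration; it is not CTC's feature pipeline, influence scoring, or SNOMED CT lookup:

```python
import math
from collections import Counter, defaultdict

def train_nb(labelled_tweets):
    """Multinomial Naive Bayes over whitespace tokens.
    Returns class priors, per-class token counts, and the vocabulary."""
    priors, counts, vocab = Counter(), defaultdict(Counter), set()
    for text, label in labelled_tweets:
        priors[label] += 1
        for tok in text.lower().split():
            counts[label][tok] += 1
            vocab.add(tok)
    return priors, counts, vocab

def classify(text, priors, counts, vocab):
    """Pick the class maximizing log P(class) + sum log P(token|class),
    with add-one (Laplace) smoothing."""
    total = sum(priors.values())
    def log_score(label):
        denom = sum(counts[label].values()) + len(vocab)
        s = math.log(priors[label] / total)
        for tok in text.lower().split():
            s += math.log((counts[label][tok] + 1) / denom)
        return s
    return max(priors, key=log_score)

# Hypothetical toy data; a real model would use hashtags, SNOMED CT
# terms, and influence-weighted training tweets as described above.
data = [("flu fever cough", "clinical"),
        ("new flu vaccine trial results", "clinical"),
        ("great game last night", "other"),
        ("concert tickets on sale", "other")]
model = train_nb(data)
label = classify("fever and cough this week", *model)
```

The add-one smoothing step is what keeps unseen tokens (like "week" above) from zeroing out a class's probability, which matters for short, noisy tweets.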
A Comparison of Machine Learning and Traditional Demand Forecasting Methods
Obtaining accurate forecasts has been a challenging task for many organizations, both public and private. Today, many firms choose to share their internal information with supply chain partners to increase planning efficiency and accuracy, in the hope of making appropriate critical decisions. However, forecast errors can still increase costs and reduce profits. As company datasets likely contain both trend and seasonal behavior, this motivates the need for computational resources to find the best parameters to use when forecasting their data. In this thesis, two industrial datasets are examined using both traditional and machine learning (ML) forecasting methods. The traditional methods considered are moving average, exponential smoothing, and autoregressive integrated moving average (ARIMA) models, while K-nearest neighbor, random forests, and neural networks were the ML techniques explored. Experimental results confirm the importance of performing a parametric grid search when using any forecasting method, as the output of this process directly determines the effectiveness of each model. In general, ML models are shown to be powerful tools for analyzing industrial datasets.
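The parametric grid search the abstract emphasizes can be sketched for the simplest of the traditional methods, exponential smoothing. The demand series and parameter grid below are made up for illustration; the thesis's experiments use real industrial datasets and a wider search:

```python
def ses_sse(series, alpha):
    """Sum of squared one-step-ahead errors for simple exponential
    smoothing with smoothing parameter alpha."""
    level, sse = series[0], 0.0
    for actual in series[1:]:
        sse += (actual - level) ** 2
        level = alpha * actual + (1 - alpha) * level
    return sse

def grid_search_alpha(series, grid):
    """Pick the smoothing parameter with the lowest in-sample SSE."""
    return min(grid, key=lambda a: ses_sse(series, a))

# Hypothetical trending demand: a high alpha tracks the trend best.
demand = [100, 102, 101, 120, 118, 121, 140, 138, 142, 160]
best_alpha = grid_search_alpha(demand, [i / 10 for i in range(1, 10)])
```

The same pattern (evaluate each parameter combination, keep the best by an error metric) extends to ARIMA orders or ML hyperparameters, which is why the grid search step dominates each model's effectiveness.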
Techniques for improving clustering and association rules mining from very large transactional databases
Clustering and association rules mining are two core data mining tasks that have been actively studied by the data mining community for nearly two decades. Though many clustering and association rules mining algorithms have been developed, no algorithm outperforms the others on all aspects, such as accuracy, efficiency, scalability, adaptability and memory usage. While more efficient and effective algorithms need to be developed for handling large-scale, complex stored datasets, emerging applications where data takes the form of streams pose new challenges for the data mining community. The existing techniques and algorithms for static stored databases cannot be applied to data streams directly; they need to be extended or modified, or new methods need to be developed, to process data streams.
In this thesis, algorithms have been developed for improving the efficiency and accuracy of clustering and association rules mining on very large, high-dimensional, high-cardinality, sparse transactional databases and data streams.
A new similarity measure suitable for clustering transactional data is defined, and an incremental clustering algorithm, INCLUS, is proposed using this similarity measure. The algorithm scans the database only once and produces clusters based on the user's expectations of similarity between transactions in a cluster, which is controlled by two user input parameters: a similarity threshold and a support threshold. Intensive testing has been performed to evaluate the effectiveness, efficiency, scalability and order insensitivity of the algorithm.
To extend INCLUS to transactional data streams, an equal-width time window model and an elastic time window model are proposed that allow mining of clustering changes in evolving data streams. The minimal width of the window is determined by the minimum clustering granularity for a particular application. Two algorithms, CluStream_EQ and CluStream_EL, based on the equal-width window model and the elastic window model respectively, are developed by incorporating these models into INCLUS. Each algorithm consists of an online micro-clustering component and an offline macro-clustering component. The online component writes summary statistics of a data stream to disk, and the offline component uses those summaries and other user input to discover changes in the data stream. The effectiveness and scalability of the algorithms are evaluated by experiments.
This thesis also looks into sampling techniques that can improve the efficiency of mining association rules in a very large transactional database. The sample size is derived based on the binomial distribution and the central limit theorem. The sample size used is smaller than that based on Chernoff bounds, but still provides the same approximation guarantees. The accuracy of the proposed sampling approach is theoretically analyzed and its effectiveness is experimentally evaluated on both dense and sparse datasets.
Applications of stratified sampling for association rules mining are also explored in this thesis. The database is first partitioned into strata based on the length of transactions, and simple random sampling is then performed on each stratum. The total sample size is determined by a formula derived in this thesis, and the sample size for each stratum is proportionate to the size of the stratum. The accuracy of transaction-size-based stratified sampling is experimentally compared with that of random sampling.
The thesis concludes with a summary of significant contributions and some pointers for further work.
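The abstract's claim that a CLT-based sample size beats a Chernoff-style bound can be illustrated with the standard textbook forms of both bounds. These are generic formulas for estimating a proportion (e.g. an itemset's support) within +/- eps at confidence 1 - delta, not necessarily the thesis's exact derivation:

```python
import math
from statistics import NormalDist

def clt_sample_size(eps, delta):
    """Normal-approximation sample size for a binomial proportion,
    worst case p = 0.5: n = z^2 / (4 * eps^2)."""
    z = NormalDist().inv_cdf(1 - delta / 2)
    return math.ceil(z * z / (4 * eps * eps))

def chernoff_sample_size(eps, delta):
    """Hoeffding/Chernoff-style bound: n = ln(2/delta) / (2 * eps^2)."""
    return math.ceil(math.log(2 / delta) / (2 * eps * eps))

# For a +/- 1% error at 95% confidence, the normal approximation
# needs roughly half as many transactions as the Chernoff bound.
n_clt = clt_sample_size(0.01, 0.05)
n_chernoff = chernoff_sample_size(0.01, 0.05)
```

The gap grows as delta shrinks, which is why a CLT-derived size can make sampling-based association rule mining substantially cheaper at the same nominal guarantee.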
Can we accelerate medicinal chemistry by augmenting the chemist with Big Data and artificial intelligence?
It is both the best of times and the worst of times to be a medicinal chemist. Massive amounts of data, combined with machine-learning and/or artificial intelligence (AI) tools to analyze them, can increase our capabilities. However, drug discovery faces severe economic pressure and a high level of societal need set against challenging targets. Here, we show how improving medicinal chemistry by better curating and exchanging knowledge can contribute to improving drug hunting in all disease areas. Although securing intellectual property (IP) is a critical task for medicinal chemists, it impedes the sharing of generic medicinal chemistry knowledge. Recent developments enable the sharing of knowledge both within and between organizations while securing IP. We also explore the effects of the structure of the corporate ecosystem within drug discovery on knowledge sharing.
Big Data and Causality
Causality analysis remains one of the fundamental research questions and the ultimate objective of a tremendous amount of scientific study. In line with the rapid progress of science and technology, the age of big data has significantly influenced causality analysis across disciplines, especially over the last decade, as the complexity and difficulty of identifying causality in big data have dramatically increased. Data mining, the process of uncovering hidden information from big data, is now an important tool for causality analysis and has been extensively exploited by scholars around the world. The primary aim of this paper is to provide a concise review of causality analysis in big data. To this end, the paper reviews recent significant applications of data mining techniques in causality analysis, covering a substantial quantity of research to date, presented in chronological order with an overview table of data mining applications in the causality analysis domain as a reference directory.
Biometric fusion methods for adaptive face recognition in computer vision
PhD thesis. Face recognition is a biometric method that uses various techniques to identify individuals based on facial information extracted from digital image data. Face recognition systems are widely used for security purposes but face challenging problems; this study proposes solutions to some of the most important of them. The aim of this thesis is to investigate the face recognition across-pose problem based on the image parameters of camera calibration. Three novel methods have been derived to address the challenges of face recognition and to infer camera parameters from images using a geometric approach based on perspective projection. The following techniques were used: a camera measurement technique (CMT) for camera calibration and Face Quadtree Decomposition (FQD), combined to develop the face camera measurement technique (FCMT) for human facial recognition.
A feature extraction and identity-matching algorithm for facial information has been created. The success and efficacy of the proposed algorithm are analysed in terms of robustness to noise, accuracy of distance measurement, and face recognition. To handle the intrinsic and extrinsic camera calibration parameters, a novel technique has been developed based on perspective projection, which uses different geometrical shapes to calibrate the camera. The novel measurement technique, CMT, enables the system to infer the real distance of regular and irregular objects from 2-D images. The proposed CMT feeds into FQD to measure the distances between facial points. Quadtree decomposition enhances the representation of edges and other singularities along curves of the face, and thus improves directional features in face detection across pose. The proposed FCMT system is a new combination of CMT and FQD for recognising faces in various poses.
The theoretical foundation of the proposed solutions has been thoroughly developed and discussed in detail. The results show that the proposed algorithms outperform existing face recognition algorithms, with a 2.5% improvement in recognition error rate compared with recent studies.
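The perspective projection underpinning CMT is the standard pinhole camera model, which can be sketched briefly. The focal length and object dimensions below are hypothetical; this is the textbook geometry, not the thesis's full calibration pipeline:

```python
def project(point3d, focal_length):
    """Pinhole projection of a camera-frame 3-D point onto the image
    plane: x' = f*X/Z, y' = f*Y/Z."""
    x, y, z = point3d
    return (focal_length * x / z, focal_length * y / z)

def infer_depth(real_height, pixel_height, focal_length):
    """Recover distance Z of an object of known height H whose image
    spans h pixels, by similar triangles: Z = f * H / h."""
    return focal_length * real_height / pixel_height

# Hypothetical numbers: a point at 2 m depth, f = 800 pixels.
u, v = project((0.5, 0.2, 2.0), 800.0)
# A 1.8 m object spanning 720 pixels is 2 m away.
z = infer_depth(1.8, 720.0, 800.0)
```

Inverting this projection to get real distances from 2-D images is exactly what requires the intrinsic (focal length, principal point) and extrinsic (pose) parameters that CMT calibrates.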
A novel knowledge discovery based approach for supplier risk scoring with application in the HVAC industry
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University London. This research has led to a novel methodology for the assessment and quantification of supply risks in the supply chain. The research builds on advanced knowledge discovery techniques and has resulted in a software implementation. The methodology developed and presented here resembles well-known consumer credit scoring methods, as it leads to a similar metric, or score, for assessing a supplier's reliability and the risk of conducting business with that supplier. However, the focus is on a wide range of operational metrics rather than just the financial ones that credit scoring techniques typically use.
The core of the methodology comprises the application of knowledge discovery techniques to extract the likelihood of possible risks from a range of available datasets. In combination with cross-impact analysis, those datasets are examined to establish the inter-relationships and mutual connections among several factors that are likely to contribute to the risks associated with particular suppliers. This approach is called conjugation analysis. The resulting parameters become the inputs to a logistic regression, which leads to a risk scoring model. The outcome of the process is a standardized risk score analogous to the well-known consumer risk scoring model better known as the FICO score.
The proposed methodology has been applied to an air-conditioning manufacturing company. Two models have been developed. The first identifies supply risks based on purchase order data and selected risk factors; with this model, the likelihoods of delivery failures, quality failures and cost failures are obtained. The second model builds on the first but also uses actual supplier performance data to identify the risks of conducting business with particular suppliers. Its target is to provide quantitative measures of an individual supplier's risk level.
The supplier risk scoring model was tested on data acquired from the company. It achieved 86.2% accuracy, with an area under the curve (AUC) of 0.863, well above the 0.5 threshold required for model validity, indicating the model's validity and reliability on future data. The numerical studies conducted with real-life datasets have demonstrated the effectiveness of the proposed methodology and system, as well as its future potential for industrial adoption.
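The logistic-regression-to-score step can be sketched generically. The feature names, coefficients, and base/scale constants below are hypothetical (the thesis does not publish its fitted model); the sketch only shows how a failure probability maps to a FICO-style score via log-odds:

```python
import math

def risk_probability(features, weights, bias):
    """Logistic regression: P(failure) = sigmoid(w . x + b)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def risk_score(p, base=600, scale=100):
    """Map probability to a credit-style score via log-odds scaling:
    lower risk of failure -> higher score (hypothetical constants)."""
    odds = (1 - p) / p
    return round(base + scale * math.log(odds))

# Hypothetical supplier features: late-delivery rate, defect rate,
# normalized order volume; weights are invented for illustration.
p = risk_probability([0.10, 0.05, 0.8], [3.0, 4.0, -1.0], -2.0)
score = risk_score(p)
```

Because the score is a monotone transform of the log-odds, ranking suppliers by score is equivalent to ranking them by predicted failure probability, which is what makes the AUC a natural validity measure for such a model.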