12 research outputs found

    A user preference perception model using data mining on a Web-based Environment

    In a competitive environment, providing the information and products that meet customer requirements and improve customer satisfaction is a key measure of a company's competitiveness. Customer Relationship Management (CRM) has gradually become an important issue in every business market. Using information technology, businesses can carry out one-to-one marketing more efficiently, at lower cost and with less labor and time. In this paper, we propose a user preference perception model that applies data mining technology in a web-based environment. First, the users' web browsing records are aggregated. Second, fuzzy set theory and a sequential pattern mining algorithm are used to infer changes in users' preferences over a period. After the test was completed, we used an online questionnaire to measure the satisfaction of all participants. The results show that satisfaction reached 72% among participants whose preferences had changed and who received new information accordingly. This indicates that the proposed system can effectively perceive changes in user preferences in a web environment.
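    The fuzzy-set step can be illustrated with a generic sketch: map raw browse counts per category to membership degrees in [0, 1] and flag categories whose degree shifts markedly between two periods. The membership function, the 0.5 change threshold, and the category labels are illustrative assumptions, not the paper's exact model.

```python
from collections import Counter

def fuzzy_interest(count, max_count):
    """Map a raw browse count to a fuzzy interest degree in [0, 1]."""
    return count / max_count if max_count else 0.0

def preference_change(period_a, period_b, threshold=0.5):
    """Return categories whose fuzzy interest degree shifted by at least
    `threshold` between two periods of browse records (category labels)."""
    ca, cb = Counter(period_a), Counter(period_b)
    ma, mb = max(ca.values(), default=0), max(cb.values(), default=0)
    changes = {}
    for cat in set(ca) | set(cb):
        delta = fuzzy_interest(cb[cat], mb) - fuzzy_interest(ca[cat], ma)
        if abs(delta) >= threshold:  # hypothetical change threshold
            changes[cat] = round(delta, 2)
    return changes
```

A user who browsed mostly "sports" in one period and mostly "news" in the next would be flagged for both categories, with opposite signs.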

    Mining Posets from Linear Orders

    There has been much research on the combinatorial problem of generating the linear extensions of a given poset. This paper focuses on the reverse problem, where the input is a set of linear orders and the goal is to construct a poset, or set of posets, that generates the input. This problem finds applications in computational neuroscience, systems biology, paleontology, and physical plant engineering. Several algorithms are presented for efficiently finding a single poset that generates the input set of linear orders. The variation of the problem that seeks a minimum set of posets covering the input is also explored. The problem is found to be polynomially solvable for one class of simple posets (kite(2) posets) but NP-complete for a related class (hammock(2,2,2) posets).
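    A common construction for this problem, sketched here generically (not necessarily the paper's algorithm), intersects the input linear orders: the pair (a, b) is kept only if a precedes b in every input order. If the linear extensions of the resulting relation coincide with the input set, a single generating poset exists.

```python
from itertools import permutations

def poset_from_linear_orders(orders):
    """Intersect a set of linear orders (lists over the same elements) into
    the set of precedence pairs (a, b) on which all the orders agree."""
    elems = orders[0]
    pos = [{e: i for i, e in enumerate(o)} for o in orders]
    return {(a, b) for a in elems for b in elems
            if a != b and all(p[a] < p[b] for p in pos)}

def linear_extensions(elems, relation):
    """Brute-force all linear extensions of a relation (small inputs only)."""
    return [list(p) for p in permutations(elems)
            if all(p.index(a) < p.index(b) for a, b in relation)]
```

For input {123, 213} the intersection keeps only "1 before 3" and "2 before 3", and that poset generates exactly the two input orders.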

    Classification of Clinical Tweets Using Apache Mahout

    Title from PDF of title page, viewed on July 31, 2015. Thesis advisor: Praveen R. Rao. Vita. Includes bibliographic references (pages 54-58). Thesis (M.S.)--School of Computing and Engineering, University of Missouri--Kansas City, 2015.
    There is an increasing amount of healthcare-related data available on Twitter. Due to Twitter's popularity, a large number of clinical tweets are posted on this microblogging platform every day. One interesting problem we face today is the classification of clinical tweets so that the classified tweets can be readily consumed by new healthcare applications. While several tools are available to classify small datasets, the size of Twitter data demands new tools and techniques for fast and accurate classification. Motivated by these reasons, we propose a new tool called Clinical Tweets Classifier (CTC) to enable scalable classification of clinical content on Twitter. CTC uses Apache Mahout and, in addition to keywords and hashtags in the tweets, leverages the SNOMED CT clinical terminology and a new tweet influence scoring scheme to construct high-accuracy classification models. CTC uses the Naïve Bayes algorithm. We trained four models based on different feature sets, such as hashtags, keywords, and clinical terms from SNOMED CT. We selected the training and test datasets based on the influence scores of the tweets and validated the accuracy of the models using a large number of tweets. Our results show that using SNOMED CT terms and a training dataset with more influential tweets yields the most accurate classification model. We also tested the scalability of CTC using 100 million tweets on a small cluster.
    Contents: Introduction -- Background and related work -- Design and framework -- Evaluation -- Conclusion and future work.
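    The classification core can be illustrated with a minimal from-scratch multinomial Naïve Bayes over token features such as keywords and hashtags. This is a generic sketch, not the Apache Mahout implementation, and the toy tweets and class labels are invented for illustration.

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens (keywords, #hashtags)."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def predict(self, doc):
        def log_prob(c):
            total = sum(self.word_counts[c].values())
            score = math.log(self.class_counts[c] / sum(self.class_counts.values()))
            for w in doc.lower().split():
                # Laplace smoothing over the shared vocabulary
                score += math.log((self.word_counts[c][w] + 1) / (total + len(self.vocab)))
            return score
        return max(self.classes, key=log_prob)
```

A classifier trained on a handful of clinical and non-clinical toy tweets then assigns an unseen tweet to the class whose token distribution fits it best.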

    A Comparison of Machine Learning and Traditional Demand Forecasting Methods

    Obtaining accurate forecasts has been a challenging task for many organizations, both public and private. Today, many firms choose to share their internal information with supply chain partners to increase planning efficiency and accuracy, in the hope of making appropriate critical decisions. However, forecast errors can still increase costs and reduce profits. Because company datasets likely contain both trend and seasonal behavior, computational resources are needed to find the best parameters to use when forecasting the data. In this thesis, two industrial datasets are examined using both traditional and machine learning (ML) forecasting methods. The traditional methods considered are moving average, exponential smoothing, and autoregressive integrated moving average (ARIMA) models, while K-nearest neighbors, random forests, and neural networks are the ML techniques explored. Experimental results confirm the importance of performing a parametric grid search when using any forecasting method, as the output of this process directly determines the effectiveness of each model. In general, ML models are shown to be powerful tools for analyzing industrial datasets.
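    The role of a parametric grid search can be illustrated with simple exponential smoothing, one of the traditional methods compared. This is a generic sketch with an invented series, not the thesis's experimental setup; the grid values and error criterion are assumptions.

```python
def exp_smooth_forecast(series, alpha):
    """One-step-ahead simple exponential smoothing forecasts."""
    level, forecasts = series[0], []
    for y in series:
        forecasts.append(level)  # forecast issued before observing y
        level = alpha * y + (1 - alpha) * level
    return forecasts

def grid_search_alpha(series, grid):
    """Return the smoothing parameter minimizing in-sample squared error."""
    def sse(alpha):
        return sum((y - f) ** 2
                   for y, f in zip(series, exp_smooth_forecast(series, alpha)))
    return min(grid, key=sse)
```

On a steadily trending series, the search favors a high alpha that tracks the most recent observation; on a noisy flat series it would favor a low alpha, which is why the parameter must be tuned per dataset rather than fixed in advance.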

    Techniques for improving clustering and association rules mining from very large transactional databases

    Clustering and association rules mining are two core data mining tasks that have been actively studied by the data mining community for nearly two decades. Though many clustering and association rules mining algorithms have been developed, no algorithm is better than the others on all aspects, such as accuracy, efficiency, scalability, adaptability, and memory usage. While more efficient and effective algorithms need to be developed for handling large-scale, complex stored datasets, emerging applications where data takes the form of streams pose new challenges for the data mining community. Existing techniques and algorithms for static stored databases cannot be applied to data streams directly; they need to be extended or modified, or new methods need to be developed, to process data streams.
    In this thesis, algorithms have been developed for improving the efficiency and accuracy of clustering and association rules mining on very large, high-dimensional, high-cardinality, sparse transactional databases and data streams.
    A new similarity measure suitable for clustering transactional data is defined, and an incremental clustering algorithm, INCLUS, is proposed using this similarity measure. The algorithm scans the database only once and produces clusters based on the user's expectations of similarity between transactions in a cluster, which is controlled by two user input parameters: a similarity threshold and a support threshold. Intensive testing has been performed to evaluate the effectiveness, efficiency, scalability, and order insensitivity of the algorithm.
    To extend INCLUS to transactional data streams, an equal-width time window model and an elastic time window model are proposed that allow mining of clustering changes in evolving data streams. The minimal width of the window is determined by the minimum clustering granularity for a particular application. Two algorithms, CluStream_EQ and CluStream_EL, based on the equal-width window model and the elastic window model respectively, are developed by incorporating these models into INCLUS. Each algorithm consists of an online micro-clustering component and an offline macro-clustering component. The online component writes summary statistics of a data stream to disk, and the offline component uses those summaries and other user input to discover changes in the data stream. The effectiveness and scalability of the algorithms are evaluated by experiments.
    This thesis also looks into sampling techniques that can improve the efficiency of mining association rules in a very large transactional database. The sample size is derived based on the binomial distribution and the central limit theorem. The sample size used is smaller than that based on Chernoff bounds but still provides the same approximation guarantees. The accuracy of the proposed sampling approach is theoretically analyzed, and its effectiveness is experimentally evaluated on both dense and sparse datasets.
    Applications of stratified sampling for association rules mining are also explored in this thesis. The database is first partitioned into strata based on the length of transactions, and simple random sampling is then performed on each stratum. The total sample size is determined by a formula derived in this thesis, and the sample size for each stratum is proportional to the size of the stratum. The accuracy of transaction-size-based stratified sampling is experimentally compared with that of random sampling.
    The thesis concludes with a summary of significant contributions and some pointers for further work.
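    The contrast between Chernoff-bound and normal-approximation sample sizes can be sketched generically. These are the standard textbook bounds for estimating an itemset's support within error eps at confidence 1 - delta, not the thesis's exact derivation, and the worst-case p = 0.5 assumption is an illustrative simplification.

```python
import math
from statistics import NormalDist

def sample_size_chernoff(eps, delta):
    """Transactions needed so |estimated support - true support| <= eps
    with probability >= 1 - delta, via the Chernoff/Hoeffding bound."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def sample_size_clt(eps, delta):
    """Smaller size from the central-limit normal approximation,
    using the worst-case variance at p = 0.5."""
    z = NormalDist().inv_cdf(1 - delta / 2)
    return math.ceil(0.25 * z ** 2 / eps ** 2)
```

For eps = 0.01 and delta = 0.05 the normal-approximation size is roughly half the Chernoff-based one, which mirrors the thesis's motivation for a CLT-derived bound.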

    Can we accelerate medicinal chemistry by augmenting the chemist with Big Data and artificial intelligence?

    It is both the best of times and the worst of times to be a medicinal chemist. Massive amounts of data, combined with machine-learning and/or artificial intelligence (AI) tools to analyze them, can increase our capabilities. However, drug discovery faces severe economic pressure and a high level of societal need set against challenging targets. Here, we show how improving medicinal chemistry by better curating and exchanging knowledge can contribute to improving drug hunting in all disease areas. Although securing intellectual property (IP) is a critical task for medicinal chemists, it impedes the sharing of generic medicinal chemistry knowledge. Recent developments enable the sharing of knowledge both within and between organizations while securing IP. We also explore the effects of the structure of the corporate ecosystem within drug discovery on knowledge sharing.

    Big Data and Causality

    The file attached to this record is the author's final peer-reviewed version. The Publisher's final version can be found by following the DOI link.
    Causality analysis remains one of the fundamental research questions and the ultimate objective of a tremendous number of scientific studies. In line with the rapid progress of science and technology, the age of big data has significantly influenced causality analysis across various disciplines, especially over the last decade, because the complexity and difficulty of identifying causality among big data have dramatically increased. Data mining, the process of uncovering hidden information from big data, is now an important tool for causality analysis and has been extensively exploited by scholars around the world. The primary aim of this paper is to provide a concise review of causality analysis in big data. To this end, the paper reviews recent significant applications of data mining techniques in causality analysis, covering a substantial quantity of research to date, presented in chronological order with an overview table of data mining applications in the causality analysis domain as a reference directory.

    Biometric fusion methods for adaptive face recognition in computer vision

    PhD Thesis.
    Face recognition is a biometric method that uses different techniques to identify individuals based on facial information extracted from digital image data. Face recognition systems are widely used for security purposes but still face challenging problems, and solutions to some of the most important challenges are proposed in this study. The aim of this thesis is to investigate face recognition across pose based on the image parameters of camera calibration. Three novel methods are derived to address the challenges of face recognition and to infer camera parameters from images using a geometric approach based on perspective projection. Two techniques, a camera measurement technique (CMT) and Face Quadtree Decomposition (FQD), are combined to develop the face camera measurement technique (FCMT) for human facial recognition. A feature extraction and identity-matching algorithm for facial information has been created. The success and efficacy of the proposed algorithm are analysed in terms of robustness to noise, accuracy of distance measurement, and face recognition. To recover the intrinsic and extrinsic camera calibration parameters, a novel technique has been developed based on perspective projection, which uses different geometrical shapes to calibrate the camera. The parameters estimated by the novel measurement technique CMT enable the system to infer the real distance to regular and irregular objects from 2-D images. The proposed CMT feeds into FQD to measure the distance between facial points. Quadtree decomposition enhances the representation of edges and other singularities along curves of the face, and thus improves directional features for face detection across pose. The proposed FCMT system is a new combination of CMT and FQD that recognises faces in various poses.
    The theoretical foundation of the proposed solutions has been thoroughly developed and discussed in detail. The results show that the proposed algorithms outperform existing algorithms in face recognition, with a 2.5% improvement in the main error recognition rate compared with recent studies.
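    Quadtree decomposition as used in FQD can be sketched generically: split a square image block into four quadrants whenever its intensity variation exceeds a threshold, so that edges and curves end up covered by many small blocks while smooth regions stay coarse. This is a standard textbook construction on a toy image, not the thesis's FQD implementation; the intensity-range split criterion is an illustrative assumption.

```python
def quadtree(img, x, y, size, thresh, leaves):
    """Recursively split a size-by-size region of a grayscale image until
    the intensity range within a block is <= thresh; collect leaf blocks
    as (x, y, size) triples."""
    vals = [img[y + j][x + i] for j in range(size) for i in range(size)]
    if size == 1 or max(vals) - min(vals) <= thresh:
        leaves.append((x, y, size))
        return leaves
    half = size // 2
    for dx, dy in ((0, 0), (half, 0), (0, half), (half, half)):
        quadtree(img, x + dx, y + dy, half, thresh, leaves)
    return leaves
```

A uniform image stays a single leaf, while a single bright pixel forces its quadrant down to pixel-level leaves, concentrating detail exactly where the singularity lies.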