
    cANT WUM: WEB USERS CLASSIFICATION USING ANT COLONY OPTIMIZATION ALGORITHM

    Web Usage Mining (WUM) is the use of data mining methods to extract knowledge from web usage data. One function of WUM is to support Business Intelligence (BI), for which an important piece of information is the classification of web users, which can be used for acquisition, penetration, and user retention activities. There are two main problems in classifying web users. The first is the determination of antecedent attributes as terms of the classification rules, a major problem in the data mining classification task in general. The second is the preprocessing activity of preparing the supporting data for web user classification, which is the most difficult stage in WUM. For classification, we propose a method based on ant colony optimization (ACO) as a distributed intelligent system, using heuristic functions tailored to the problem domain. The proposed heuristic functions for web user classification are based on web usage data and use the entropy of the antecedent candidate and the information gain of the attributes for the total number of web user accesses and the average access duration per user. For preprocessing, we propose a data preparation method that supports the needs of web user classification. The data consist of web access log data, web user profile data, and web transaction data. The preprocessing activity consists of parsing, data cleansing, and extraction of web user sessions using a heuristic method based on web page access timeouts and differences in the web browser agent. Testing is done by comparing the performance of the proposed algorithm with the Ant-Miner, cAnt-Miner, and Continuous Ant-Miner algorithms. The results on four web datasets show that the proposed algorithm performs better in terms of rule accuracy and rule simplicity.
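    As a rough illustration of the Ant-Miner-style heuristic the abstract describes, the sketch below scores candidate antecedent terms (attribute = value) by the entropy of the classes they cover. The toy session records and the attribute names (total_accesses, avg_duration) are illustrative assumptions, not the authors' cANT-WUM data or code.

# A minimal sketch of an entropy / information-gain heuristic for scoring
# candidate rule terms, in the spirit of Ant-Miner-style ACO classifiers.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def term_heuristic(rows, attribute, value, class_key="user_class"):
    """Heuristic quality of the term (attribute == value): log2(#classes)
    minus the entropy of the classes covered by the term (higher is better)."""
    covered = [r[class_key] for r in rows if r[attribute] == value]
    if not covered:
        return 0.0
    k = len({r[class_key] for r in rows})          # number of classes overall
    return math.log2(k) - entropy(covered)

# Toy, discretised web-usage records: one row per user session summary.
rows = [
    {"total_accesses": "high", "avg_duration": "long",  "user_class": "loyal"},
    {"total_accesses": "high", "avg_duration": "short", "user_class": "loyal"},
    {"total_accesses": "low",  "avg_duration": "short", "user_class": "casual"},
    {"total_accesses": "low",  "avg_duration": "long",  "user_class": "casual"},
]

for attr in ("total_accesses", "avg_duration"):
    for val in sorted({r[attr] for r in rows}):
        print(attr, "=", val, "->", round(term_heuristic(rows, attr, val), 3))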

    Association Rule Based Classification

    In this thesis, we focus on the construction of classification models based on association rules. Although association rules have been used predominantly for data exploration and description, interest in using them for prediction has grown rapidly in the data mining community. In order to mine only rules that can be used for classification, we modified the well-known association rule mining algorithm Apriori to handle user-defined input constraints. We considered constraints that require the presence or absence of particular items, or that limit the number of items, in the antecedents and/or the consequents of the rules. We developed a characterization of those itemsets that can potentially form rules satisfying the given constraints. This characterization allows us to prune, during itemset construction, itemsets for which neither they nor any of their supersets will form valid rules, which improves the time performance of itemset construction. Using this characterization, we implemented a classification system based on association rules and compared the performance of several model construction methods, including CBA, and several model deployment modes for making predictions. Although the data mining community has dealt only with the classification of single-valued attributes, there are several domains in which the classification target is set-valued. Hence, we enhanced our classification system with a novel approach to handle the prediction of set-valued class attributes. Since the traditional classification accuracy measure is inappropriate in this context, we developed an evaluation method for set-valued classification based on the E-Measure. Furthermore, we enhanced our algorithm to not rely on the typical support/confidence framework, instead mining for the best possible rules above a user-defined minimum confidence and within a desired range for the number of rules. This avoids long mining times that might produce large collections of rules with low predictive power. For this purpose, we developed a heuristic function to determine an initial minimum support and then adjusted it using a binary search strategy until a number of rules within the given range was obtained. We implemented all of the techniques described above in WEKA, an open source suite of machine learning algorithms, and used several datasets from the UCI Machine Learning Repository to test and evaluate them.
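    The binary-search adjustment of the minimum support can be sketched roughly as follows. The mine_rules callable and the monotone rule-count behaviour of the toy stand-in are hypothetical assumptions, not the thesis implementation built on Apriori/WEKA.

# A minimal sketch: binary-search the minimum support until the number of
# mined rules falls inside a desired range, at a fixed minimum confidence.
def tune_min_support(mine_rules, target_range, lo=0.0, hi=1.0, max_iter=20):
    """Return (min_support, rules) with len(rules) inside target_range,
    or the last attempt if the search budget is exhausted."""
    low, high = target_range
    support, rules = hi, []
    for _ in range(max_iter):
        support = (lo + hi) / 2
        rules = mine_rules(min_support=support)
        if low <= len(rules) <= high:
            break
        if len(rules) > high:      # too many rules: raise the support threshold
            lo = support
        else:                      # too few rules: lower the support threshold
            hi = support
    return support, rules

# Toy stand-in miner: pretend lowering support monotonically yields more rules.
def fake_miner(min_support):
    return ["rule"] * int(200 * (1.0 - min_support))

support, rules = tune_min_support(fake_miner, target_range=(40, 60))
print(f"min_support={support:.3f}, rules mined={len(rules)}")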

    Analysis of transit quality of service through segmentation and classification tree techniques

    Perceptions about the quality of service differ widely among public transport (PT) users. Users’ perceptions are heterogeneous for many reasons: the qualitative aspects of the PT service, users’ socio-economic characteristics, and the diversity of tastes and attitudes towards PT. By analysing different groups of users who share a common characteristic (e.g. socio-economic or travel behaviour), it is possible to homogenise user opinions about the quality of service. This paper studies quality as perceived by users of the metropolitan transit system of Granada (Spain) through a classification tree technique (classification and regression trees, CART) based on five market segmentations (gender, age, frequency of use, reason for travelling, and type of ticket). CART is a non-parametric method that has a number of advantages over methods that require a predefined underlying relationship between dependent and independent variables. The study is based on data gathered in several customer satisfaction surveys (non-research-oriented) conducted in the Granada metropolitan transit system. The models' outcomes show that some attributes are very important for almost all the market segments (punctuality and information), while others are not very relevant for any of the segments, most notably fare, despite the fact that fare was stated as very important by most of the passengers during the interview. This work was supported by the Consejería de Innovación, Ciencia y Economía of the Junta de Andalucía (Spain) through the Excellence Research Project “Q-METROBUS - Quality of service indicator for METROpolitan public BUS transport services”.
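    A rough sketch of the segmented classification-tree analysis, using scikit-learn's CART implementation on synthetic ratings. The attribute names, the segmentation variable, and the data are illustrative assumptions, not the Granada survey or the paper's models.

# A minimal sketch: fit one CART per market segment and read off which
# service attributes each tree ranks as important.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
attributes = ["punctuality", "information", "fare", "cleanliness"]

# Synthetic survey: 1-5 ratings per attribute, an overall-satisfaction label,
# and a "frequency of use" segmentation variable.
n = 400
X = rng.integers(1, 6, size=(n, len(attributes)))
y = (0.6 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.5, n) > 3).astype(int)
segment = rng.choice(["frequent", "occasional"], size=n)

for seg in ("frequent", "occasional"):
    mask = segment == seg
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[mask], y[mask])
    ranking = sorted(zip(attributes, tree.feature_importances_),
                     key=lambda kv: -kv[1])
    print(seg, "->", [(a, round(imp, 2)) for a, imp in ranking])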

    The applicability of a use value-based file retention method

    The determination of the relative value of files is important for an organization when determining a retrieval service level for its files and a corresponding file retention policy. This paper discusses, via a literature review, methods for developing file retention policies based on the use values of files, and on the basis of these results we propose an enhanced version of one of them. In a case study, we demonstrate how one can develop a customized file retention policy by testing causal relations between file parameters and the use value of files. This case shows that, contrary to suggestions in previous research, the file type has no significant relation to the value of a file and thus should be excluded from the retention policy in this case. The case study also shows a strong relation between the position of a file's user and the value of that file. Furthermore, we have improved the Information Value Questionnaire (IVQ) for the subjective valuation of files. However, the resulting method needs software support to be efficient in its application, so we developed a prototype for the automatic execution of a file retention policy. We conclude with a discussion.
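    A minimal sketch of the kind of statistical check described above: a one-way ANOVA asking whether a file parameter (file type, position of the file's user) is related to a subjective use-value score. The groups and scores are synthetic assumptions, not the case-study data or the IVQ itself.

# A minimal sketch: test whether file parameters relate to use-value scores.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)

# Hypothetical use-value scores (e.g. from an IVQ-style questionnaire, 1-10).
value_by_type = {
    "docx": rng.normal(6, 2, 50),
    "xlsx": rng.normal(6, 2, 50),
    "pdf":  rng.normal(6, 2, 50),           # file type: no built-in effect
}
value_by_position = {
    "manager": rng.normal(8, 1.5, 50),
    "analyst": rng.normal(6, 1.5, 50),
    "intern":  rng.normal(4, 1.5, 50),      # user position: clear built-in effect
}

for name, groups in (("file type", value_by_type), ("user position", value_by_position)):
    stat, p = f_oneway(*groups.values())
    print(f"{name}: F={stat:.2f}, p={p:.4f}",
          "-> keep in policy" if p < 0.05 else "-> exclude from policy")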

    PRESISTANT: Learning based assistant for data pre-processing

    Data pre-processing is one of the most time-consuming and relevant steps in a data analysis process (e.g., a classification task). A given data pre-processing operator (e.g., a transformation) can have a positive, negative, or zero impact on the final result of the analysis. Expert users have the required knowledge to find the right pre-processing operators. Non-experts, however, are overwhelmed by the number of pre-processing operators, and it is challenging for them to find operators that would positively impact their analysis (e.g., increase the predictive accuracy of a classifier). Existing solutions either assume that users have expert knowledge, or they recommend pre-processing operators that are only "syntactically" applicable to a dataset, without taking into account their impact on the final analysis. In this work, we aim to assist non-expert users by recommending data pre-processing operators ranked according to their impact on the final analysis. We developed a tool, PRESISTANT, that uses Random Forests to learn the impact of pre-processing operators on the performance (e.g., predictive accuracy) of five different classification algorithms: J48, Naive Bayes, PART, Logistic Regression, and Nearest Neighbor. Extensive evaluations of the recommendations provided by our tool show that PRESISTANT can effectively help non-experts achieve improved results in their analytical tasks.
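    The ranking idea can be sketched roughly as follows: a Random Forest meta-model predicts the accuracy gain each operator would produce from dataset meta-features, and candidate operators are then ranked for a new dataset by that prediction. The meta-features, operator list, and impact signal below are synthetic assumptions, not the PRESISTANT system.

# A minimal sketch of meta-learning the impact of pre-processing operators.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
operators = ["standardize", "discretize", "log_transform", "pca"]

# Meta-examples: [dataset meta-features..., operator id] -> observed accuracy gain.
n = 300
meta_features = rng.random((n, 3))                 # e.g. skewness, #classes, sparsity
op_id = rng.integers(0, len(operators), n)
gain = (0.1 * meta_features[:, 0] * (op_id == 2)   # synthetic "impact" signal
        - 0.05 * (op_id == 3)
        + rng.normal(0, 0.01, n))

X = np.column_stack([meta_features, op_id])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, gain)

# Rank operators for one new (synthetic) dataset by predicted impact.
new_dataset = rng.random(3)
preds = {op: model.predict([[*new_dataset, i]])[0] for i, op in enumerate(operators)}
for op, g in sorted(preds.items(), key=lambda kv: -kv[1]):
    print(f"{op:15s} predicted accuracy gain: {g:+.3f}")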

    Recommending with an Agenda: Active Learning of Private Attributes using Matrix Factorization

    Recommender systems leverage user demographic information, such as age and gender, to personalize recommendations and better place their targeted ads. Oftentimes, users do not volunteer this information due to privacy concerns, or due to a lack of initiative in filling out their online profiles. We illustrate a new threat in which a recommender learns private attributes of users who do not voluntarily disclose them. We design both passive and active attacks that solicit ratings for strategically selected items, and could thus be used by a recommender system to pursue this hidden agenda. Our methods are based on a novel usage of Bayesian matrix factorization in an active learning setting. Evaluations on multiple datasets illustrate that such attacks are indeed feasible and use significantly fewer rated items than static inference methods. Importantly, they succeed without sacrificing the quality of recommendations to users. Comment: This is the extended version of a paper that appeared in ACM RecSys 201
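    A very rough sketch of the inference pipeline the abstract describes, using plain SVD factors and a logistic classifier instead of Bayesian matrix factorization, and a crude group-separation score in place of the paper's active-learning criterion. All of it is a labelled simplification on synthetic ratings, not the paper's attack.

# A minimal sketch: infer a private attribute from latent rating factors and
# pick the most "revealing" items to solicit next.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n_users, n_items = 200, 50
gender = rng.integers(0, 2, n_users)               # the "private" attribute

# Synthetic ratings in which a handful of items correlate with the attribute.
ratings = rng.normal(3, 1, (n_users, n_items))
ratings[:, :5] += gender[:, None] * 1.5            # items 0-4 are revealing

# Passive inference: latent factors -> private-attribute classifier.
factors = TruncatedSVD(n_components=8, random_state=0).fit_transform(ratings)
clf = LogisticRegression(max_iter=1000).fit(factors[:150], gender[:150])
print("held-out attribute accuracy:", clf.score(factors[150:], gender[150:]))

# Crude "active" item selection: solicit ratings for the items whose observed
# ratings separate the attribute groups the most.
separation = np.abs(ratings[gender == 1].mean(0) - ratings[gender == 0].mean(0))
print("most revealing items to solicit:", np.argsort(-separation)[:5])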

    Preference Networks: Probabilistic Models for Recommendation Systems

    Recommender systems are important to help users select relevant and personalised information from the massive amounts of data available. We propose a unified framework called Preference Network (PN) that jointly models various types of domain knowledge for the task of recommendation. The PN is a probabilistic model that systematically combines content-based filtering and collaborative filtering into a single conditional Markov random field. Once estimated, it serves as a probabilistic database that supports various useful queries such as rating prediction and top-N recommendation. To handle the challenging problem of learning large networks of users and items, we employ a simple but effective pseudo-likelihood with regularisation. Experiments on movie rating data demonstrate the merits of the PN. Comment: In Proc. of 6th Australasian Data Mining Conference (AusDM), Gold Coast, Australia, pages 195--202, 200
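    Regularised pseudo-likelihood estimation, the learning strategy mentioned above, can be sketched on a toy pairwise model as follows: each node's conditional log-likelihood given its neighbours is maximised by gradient ascent with an L2 penalty on the pairwise weights. The Ising-style binary variables and block structure are illustrative assumptions, not the Preference Network's user-item graph.

# A minimal sketch of regularised pseudo-likelihood for a pairwise MRF.
import numpy as np

rng = np.random.default_rng(4)
n_vars, n_samples = 6, 500

# Toy data: variables 0-2 tend to agree with each other, as do variables 3-5.
block = rng.choice([-1, 1], size=(n_samples, 2))
X = np.column_stack([block[:, [0]].repeat(3, 1), block[:, [1]].repeat(3, 1)])
X = np.where(rng.random(X.shape) < 0.1, -X, X)      # 10% noise flips

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
W = np.zeros((n_vars, n_vars))                      # symmetric pairwise weights
lr, l2 = 0.05, 0.01

for _ in range(200):
    S = X @ W                                        # s_i = sum_j W_ij x_j per sample
    # Ascent direction for sum_i log sigmoid(2 * x_i * s_i), averaged over samples.
    R = 2 * X * sigmoid(-2 * X * S)                  # per-node residual terms
    grad = (R.T @ X) / n_samples
    grad = (grad + grad.T) / 2                       # keep W symmetric
    np.fill_diagonal(grad, 0.0)                      # no self-edges
    W += lr * (grad - l2 * W)                        # L2-regularised update

print("learned pairwise weights (rounded):")
print(np.round(W, 2))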

    Listening between the Lines: Learning Personal Attributes from Conversations

    Open-domain dialogue agents must be able to converse about many topics while incorporating knowledge about the user into the conversation. In this work we address the acquisition of such knowledge, for personalization in downstream Web applications, by extracting personal attributes from conversations. This problem is more challenging than the established task of information extraction from scientific publications or Wikipedia articles, because dialogues often give only implicit cues about the speaker. We propose methods for inferring personal attributes, such as profession, age, or family status, from conversations using deep learning. Specifically, we propose several Hidden Attribute Models, which are neural networks leveraging attention mechanisms and embeddings. Our methods are trained on a per-predicate basis to output rankings of object values for a given subject-predicate combination (e.g., ranking the doctor and nurse professions high when speakers talk about patients, emergency rooms, etc.). Experiments with various conversational texts, including Reddit discussions, movie scripts, and a collection of crowdsourced personal dialogues, demonstrate the viability of our methods and their superior performance compared to state-of-the-art baselines. Comment: published in WWW'1
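    A toy sketch of the general architecture described above (word embeddings, attention pooling, and a per-attribute scoring layer), written in PyTorch. The vocabulary, the two candidate professions, and the two training utterances are illustrative assumptions, not the paper's Hidden Attribute Models or data.

# A minimal sketch: attention over word embeddings, then scores over the
# candidate values of one attribute (here, "profession").
import torch
import torch.nn as nn

vocab = {w: i for i, w in enumerate(
    ["<pad>", "patients", "emergency", "room", "homework", "students", "grading"])}
professions = ["doctor", "teacher"]

class HiddenAttributeSketch(nn.Module):
    def __init__(self, vocab_size, n_values, dim=16):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.attn = nn.Linear(dim, 1)                 # attention score per word
        self.score = nn.Linear(dim, n_values)         # one score per candidate value

    def forward(self, tokens):                        # tokens: (batch, seq_len)
        e = self.emb(tokens)                          # (batch, seq, dim)
        a = torch.softmax(self.attn(e).squeeze(-1), dim=-1)   # attention weights
        pooled = (a.unsqueeze(-1) * e).sum(dim=1)     # attention-weighted average
        return self.score(pooled)                     # ranking scores over values

def encode(words, length=4):
    ids = [vocab[w] for w in words][:length]
    return ids + [0] * (length - len(ids))            # pad to fixed length

X = torch.tensor([encode(["patients", "emergency", "room"]),
                  encode(["homework", "students", "grading"])])
y = torch.tensor([0, 1])                               # doctor, teacher

model = HiddenAttributeSketch(len(vocab), len(professions))
opt = torch.optim.Adam(model.parameters(), lr=0.05)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(X), y)
    loss.backward()
    opt.step()

scores = model(X).detach()
for utterance, s in zip(["medical talk", "school talk"], scores):
    ranking = [professions[int(i)] for i in s.argsort(descending=True)]
    print(utterance, "->", ranking)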