
    Memory-based preferential choice in large option spaces

    Whether adding songs to a playlist or groceries to a shopping basket, everyday decisions often require us to choose from a vast set of options. Laboratory studies of preferential choice have made considerable progress in describing how people navigate fixed sets of options, yet questions remain about how well this generalises to more complex, everyday choices. In this thesis, I ask how people navigate large option spaces, focusing particularly on how long-term memory supports decisions. In the first project, I explore how large option spaces are structured in the mind. A topic model trained on the purchasing patterns of consumers uncovered an intuitive set of themes centred primarily around goals (e.g., tomatoes go well in a salad), suggesting that representations are geared to support action. In the second project, I explore how such representations are queried during memory-based decisions, where options must be retrieved from memory. Using a large dataset of over 100,000 online grocery shops, results revealed that consumers query multiple systems of associative memory when determining what to choose next. Attending to certain knowledge sources, as estimated by a cognitive model, predicted important retrieval errors, such as the propensity to forget or add unwanted products. In the final project, I ask how preferences could be learned and represented in large option spaces, where most options are untried. A cognitive model of sequential decision making is proposed that learns preferences over choice attributes, allowing preferences to generalise to unseen options by virtue of their similarity to previous choices. This model explains the reduced exploration behaviour observed in the supermarket as well as preferential choices in more controlled laboratory settings. Overall, this suggests that consumers depend on associative systems in long-term memory when navigating large spaces of options, enabling inferences about the conceptual properties and subjective value of novel options.
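    To illustrate the first project's topic-modelling step in a concrete, if simplified, way: treating each shopper's basket as a "document" whose tokens are products lets a standard LDA implementation recover goal-like themes. The sketch below uses scikit-learn with invented baskets and an arbitrary number of topics; the thesis does not state which implementation or settings were used.

```python
# Minimal sketch: an LDA topic model over shopping baskets (invented data).
# Each basket is treated as a "document" whose tokens are product identifiers.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

baskets = [
    "tomatoes lettuce cucumber olive_oil",   # salad-like basket (illustrative)
    "pasta tomatoes garlic parmesan",
    "nappies baby_wipes formula_milk",
    "beer crisps salsa tortilla_chips",
]

# Bag-of-products matrix: rows are baskets, columns are product counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(baskets)

# Fit the topic model; the number of topics is an arbitrary illustrative choice.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Inspect the most probable products per topic -- the "themes" the abstract refers to.
products = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [products[i] for i in weights.argsort()[::-1][:3]]
    print(f"topic {k}: {top}")
```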

    Clickstream Data Analysis: A Clustering Approach Based on Mixture Hidden Markov Models

    Nowadays, the availability of devices such as laptops and cell phones enables one to browse the web at any time and place. As a consequence, a company needs to have a website so as to maintain or increase customer loyalty and reach potential new customers. Besides acting as a virtual point of sale, the company portal allows it to obtain insights on potential customers through clickstream data, web-generated data that track users' accesses and activities on websites. However, these data are not easy to handle as they are complex, unstructured and limited by a lack of clear information about user intentions and goals. Clickstream data analysis is a suitable tool for managing the complexity of these datasets, yielding a cleaned and processed sequential dataset ready for identifying and analysing patterns. Analysing clickstream data is important for companies as it enables them to understand differences in web user behaviour while users explore websites, how they move from one page to another and what they select, in order to define business strategies targeting specific types of potential customers. To obtain this level of insight it is pivotal to understand how to exploit the hidden information contained in clickstream data. This work presents the cleaning and pre-processing procedures for clickstream data which are needed to obtain a structured sequential dataset, and analyses these sequences through the application of Mixtures of discrete-time Hidden Markov Models (MHMMs), a statistical tool suitable for clickstream data analysis and profile identification that has not been widely used in this context. Specifically, the hidden Markov process uses a time-varying latent variable to handle uncertainty and groups observed states together based on unknown similarity; applying MHMMs therefore entails identifying both the number of mixture components, which relate to the subpopulations, and the number of latent states for each latent Markov chain. Information Criteria (IC) are generally used for model selection in mixture hidden Markov models and, although their performance has been widely studied for mixture models and hidden Markov models, they have received little attention in the MHMM context. The most widely used criterion is BIC, even though its performance for these models depends on factors such as the number of components and the sequence length. Another class of model selection criteria is the Classification Criteria (CC). They were defined specifically for clustering purposes and rely on an entropy measure to account for the separability between groups. These criteria are clearly the best option for our purpose, but their application as model selection tools for MHMMs requires the definition of a suitable entropy measure. In the light of these considerations, this work proposes a classification criterion based on an integrated classification likelihood approach for MHMMs that accounts for the two latent classes in the model: the subpopulations and the hidden states. This criterion is a modified ICL BIC, a classification criterion that was originally defined in the mixture model context and used in hidden Markov models. ICL BIC is a suitable score to identify the number of classes (components or states) and, thus, to extend it to MHMMs we defined a joint entropy accounting for both a component-related entropy and a state-related conditional entropy.
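    Written out as a sketch, and assuming the standard ICL-BIC construction carries over (the thesis's exact notation and definitions may differ), the proposed criterion might take the form:

```latex
% Sketch only: assumes the standard ICL-BIC construction; the thesis's exact
% notation and definitions may differ.
\mathrm{ICL\mbox{-}BIC}_{\mathrm{MHMM}}
  = -2\log\hat{L} \;+\; \nu\log n
    \;+\; 2\big[\,\mathrm{EN}(\hat{z}) + \mathrm{EN}(\hat{s}\mid\hat{z})\,\big],
\qquad
\mathrm{EN}(\hat{z}) = -\sum_{i=1}^{n}\sum_{k=1}^{K}\hat{z}_{ik}\log\hat{z}_{ik},
```

    where \(\hat{L}\) is the maximised likelihood, \(\nu\) the number of free parameters, \(n\) the number of sequences, \(\mathrm{EN}(\hat{z})\) the component-related entropy of the posterior component memberships and \(\mathrm{EN}(\hat{s}\mid\hat{z})\) the state-related conditional entropy of the hidden state paths within each component.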
    The thesis presents a Monte Carlo simulation study to compare the performance of selection criteria, the results of which point out the limitations of the most commonly used information criteria and demonstrate that the proposed criterion outperforms them in identifying components and states, especially for short sequences, which are quite common in website accesses. The proposed selection criterion was applied to real clickstream data collected from the website of a Sicilian company operating in the hospitality sector. The data were modelled by an MHMM, identifying clusters related to the browsing behaviour of web users which provided essential indications for developing new business strategies. This thesis is structured as follows: after an introduction to the main topics in Chapter 1, we present the clickstream data and their cleaning and pre-processing steps in Chapter 2; Chapter 3 illustrates the structure and estimation algorithms of mixture hidden Markov models; Chapter 4 presents a review of model selection criteria and the definition of the proposed ICL BIC for MHMMs; the real clickstream data analysis follows in Chapter 5.

    The Utilization of Data Analysis Techniques in Predicting Student Performance in Massive Open Online Courses (MOOCs)

    The growth of the Internet has enabled the popularity of open online learning platforms to increase over the years. This has led to the inception of Massive Open Online Courses (MOOCs) that enrol millions of people from all over the world. Such courses operate under the concept of open learning, where content does not have to be delivered via the standard mechanisms that institutions employ, such as physically attending lectures. Instead, learning occurs online via recorded lecture material and online tasks. This shift has allowed more people to gain access to education, regardless of their learning background. However, despite these advancements in delivering education, completion rates for MOOCs are low. In order to investigate this issue, the paper explores the impact that technology has on open learning and identifies how data about student performance can be captured to predict trends so that at-risk students can be identified before they drop out. In achieving this, subjects surrounding student engagement and performance in MOOCs and data analysis techniques are explored to investigate how technology can be used to address this issue. The paper is then concluded with our approach to predicting behaviour and a case study of the eRegister system, which has been developed to capture and analyse data. Keywords: Open Learning; Prediction; Data Mining; Educational Systems; Massive Open Online Course; Data Analysis.
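    As a toy illustration of the kind of prediction discussed above, the sketch below fits a logistic regression on hypothetical engagement features to flag at-risk students; the features, data and threshold are invented and are not taken from the paper or the eRegister system.

```python
# Toy sketch: flagging at-risk MOOC students from invented engagement features.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: weekly logins, videos watched, average quiz score (0-1).
X = np.array([
    [12, 8, 0.9],
    [ 1, 0, 0.2],
    [ 5, 3, 0.6],
    [ 0, 1, 0.1],
])
y = np.array([0, 1, 0, 1])  # 1 = dropped out in a past cohort (hypothetical labels)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Estimated drop-out probability for a new student; flag if above 0.5.
risk = model.predict_proba([[2, 1, 0.3]])[0, 1]
print("at risk" if risk > 0.5 else "on track", round(risk, 2))
```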

    A novel defense mechanism against web crawler intrusion

    Web robots, also known as crawlers or spiders, are used by search engines, hackers and spammers to gather information about web pages. Timely detection and prevention of unwanted crawlers increases the privacy and security of websites. In this research, a novel method to identify web crawlers is proposed to prevent unwanted crawlers from accessing websites. The proposed method suggests a five-factor identification process to detect unwanted crawlers. This study provides the pretest and posttest results along with a systematic evaluation of web pages with the proposed identification technique versus web pages without it. An experiment was performed with repeated measures for two groups, each containing ninety web pages. The outputs of the logistic regression analysis of the treatment and control groups confirm the novel five-factor identification process as an effective mechanism to prevent unwanted web crawlers. The study concludes that the proposed five-factor identification process is a highly effective technique, as demonstrated by the successful experimental outcome.
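    The abstract does not name the five identification factors, so the sketch below substitutes hypothetical request-level features and mirrors only the logistic-regression style of analysis mentioned above; it is an illustration, not the paper's method.

```python
# Illustration only: classifying sessions as crawler vs. human with logistic
# regression. The five features are hypothetical stand-ins, not the paper's
# actual five identification factors.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Per-session features: requests/minute, robots.txt fetched (0/1),
# share of requests without a referrer, share of HEAD requests,
# distinct pages per minute.
X = np.array([
    [120, 1, 0.95, 0.30, 80],   # crawler-like session
    [  3, 0, 0.10, 0.00,  2],   # human-like session
    [ 90, 1, 0.80, 0.20, 60],
    [  5, 0, 0.20, 0.01,  4],
])
y = np.array([1, 0, 1, 0])      # 1 = crawler

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([[100, 1, 0.9, 0.25, 70]]))   # expected output: [1] (crawler)
```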

    Maximum likelihood estimation of a finite mixture of logistic regression models in a continuous data stream

    In marketing we are often confronted with a continuous stream of responses to marketing messages. Such streaming data provide invaluable information regarding message effectiveness and segmentation. However, streaming data are hard to analyze using conventional methods: their high volume and the fact that they are continuously augmented mean that it takes considerable time to analyze them. We propose a method for estimating a finite mixture of logistic regression models which can be used to cluster customers based on a continuous stream of responses. This method, which we coin oFMLR, allows segments to be identified in data streams or extremely large static datasets. Unlike black-box algorithms, oFMLR provides model estimates that are directly interpretable. We first introduce oFMLR, explaining in passing general topics such as online estimation and the EM algorithm, making this paper a high-level overview of possible methods of dealing with large data streams in marketing practice. Next, we discuss model convergence, identifiability, and relations to alternative, Bayesian, methods; we also identify more general issues that arise from dealing with continuously augmented data sets. Finally, we introduce the oFMLR [R] package and evaluate the method by numerical simulation and by analyzing a large customer clickstream dataset. Comment: 1 figure; working paper including the [R] package.
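    For intuition, here is a minimal online (stochastic) EM sketch for a finite mixture of logistic regressions; it follows the general idea of processing one response at a time, but the step sizes, initialization and update rules are generic assumptions rather than the exact oFMLR estimator.

```python
# Generic online EM sketch for a finite mixture of logistic regressions.
# Not the exact oFMLR estimator: step sizes, initialization and the stochastic
# M-step are illustrative assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class OnlineMixtureLogit:
    def __init__(self, n_features, n_components=2, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(n_components, n_features))  # coefficients per segment
        self.pi = np.full(n_components, 1.0 / n_components)              # segment weights
        self.lr = lr
        self.t = 0

    def update(self, x, y):
        """Process a single (x, y) observation from the stream."""
        self.t += 1
        p = sigmoid(self.W @ x)                    # P(y = 1 | x, segment k)
        lik = np.where(y == 1, p, 1.0 - p)         # per-segment likelihood of y
        r = self.pi * lik
        r /= r.sum()                               # E-step: segment responsibilities
        # Stochastic M-step: running average for the mixing weights and a
        # responsibility-weighted gradient step for each segment's coefficients.
        self.pi += (r - self.pi) / self.t
        self.W += self.lr * (r * (y - p))[:, None] * x

# Stream a few synthetic responses through the estimator.
rng = np.random.default_rng(1)
model = OnlineMixtureLogit(n_features=3)
true_w = np.array([2.0, -1.0, 0.5])
for _ in range(1000):
    x = rng.normal(size=3)
    y = int(rng.random() < sigmoid(x @ true_w))
    model.update(x, y)
print(model.pi, model.W)
```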

    Automating website profiling for a deep web search engine

    Thesis (M. Eng.) -- Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Cataloged from PDF version of thesis. Includes bibliographical references (p. 53-55). The deep web consists of information on the internet that resides in databases or is dynamically generated. It is believed that the deep web represents a large percentage of the total contents on the web, but is currently not indexed by traditional search engines. The Morpheus project is designed to solve this problem by making information in the deep web searchable. This requires a large repository of content sources to be built up, where each source is represented in Morpheus by a profile or wrapper. This research proposes an approach to automating the creation of wrappers by relying on the average internet user to identify relevant sites. A wrapper generation system was created based on this approach. It comprises two components: the clickstream recorder saves characteristic data for websites identified by users, and the wrapper constructor converts these data into wrappers for the Morpheus system. After completing the implementation of this system, user tests were conducted, which verified that the system is effective and has good usability. By Jeffrey W. Yuan. M.Eng.
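    The two-component design described above (clickstream recorder plus wrapper constructor) can be pictured with a small data-model sketch; every class, field and heuristic below is hypothetical and not taken from the Morpheus codebase.

```python
# Hypothetical sketch of the two-component idea: a recorder collects click
# events on a site, and a constructor turns them into a profile ("wrapper").
# All names and the selection heuristic are invented, not Morpheus internals.
from dataclasses import dataclass
from collections import Counter

@dataclass
class ClickEvent:
    url: str          # page on which the interaction happened
    element: str      # form field or link the user interacted with
    value: str = ""   # e.g. the query text typed into a search box

@dataclass
class SiteWrapper:
    domain: str
    search_form_url: str
    query_fields: list

class ClickstreamRecorder:
    def __init__(self):
        self.events = []

    def record(self, event: ClickEvent) -> None:
        self.events.append(event)

def construct_wrapper(domain: str, events: list) -> SiteWrapper:
    """Pick the most frequently used form URL and its input fields as the profile."""
    form_urls = Counter(e.url for e in events if e.value)
    fields = sorted({e.element for e in events if e.value})
    return SiteWrapper(domain, form_urls.most_common(1)[0][0], fields)

# Example: two recorded searches on a hypothetical site.
rec = ClickstreamRecorder()
rec.record(ClickEvent("https://example.org/search", "q", "deep web"))
rec.record(ClickEvent("https://example.org/search", "q", "databases"))
print(construct_wrapper("example.org", rec.events))
```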

    Temporal models for mining, ranking and recommendation in the Web

    Due to their first-hand, diverse and evolution-aware reflection of nearly all areas of life, heterogeneous temporal datasets, i.e., the Web, collaborative knowledge bases and social networks, have emerged as gold mines for content analytics of many sorts. In those collections, time plays an essential role in many crucial information retrieval and data mining tasks, ranging from user intent understanding and document ranking to advanced recommendations. There are two semantically close and important constituents when modeling along the time dimension, i.e., entity and event. Time crucially serves as the context for changes driven by happenings and phenomena (events) that relate to people, organizations or places (so-called entities) in our social lives. Thus, determining what users expect, or in other words, resolving the uncertainty confounded by temporal changes, is a compelling task for supporting consistent user satisfaction. In this thesis, we address the aforementioned issues and propose temporal models that capture the temporal dynamics of such entities and events to serve the end tasks. Specifically, we make the following contributions in this thesis: (1) Query recommendation and document ranking in the Web - we address the issues of suggesting entity-centric queries and of ranking effectiveness surrounding the happening time period of an associated event. In particular, we propose a multi-criteria optimization framework that facilitates the combination of multiple temporal models to smooth out the abrupt changes when transitioning between event phases for the former, and a probabilistic approach for search result diversification of temporally ambiguous queries for the latter. (2) Entity relatedness in Wikipedia - we study the long-term dynamics of Wikipedia as a global memory place for high-impact events, specifically the reviving memories of past events. Additionally, we propose a neural network-based approach to measure the temporal relatedness of entities and events. The model engages different latent representations of an entity (i.e., from time, the link-based graph and content) and uses the collective attention from user navigation as supervision. (3) Graph-based ranking and temporal anchor-text mining in Web Archives - we tackle the problem of discovering important documents along the time-span of Web Archives, leveraging the link graph. Specifically, we combine the problems of relevance, temporal authority, diversity and time in a unified framework; the model accounts for the incomplete link structure and the natural time lag in Web Archives when mining temporal authority. (4) Methods for enhancing predictive models at an early stage in social media and the clinical domain - we investigate several methods to control model instability and enrich the contexts of predictive models during the "cold-start" period, and demonstrate their effectiveness for rumor detection and blood glucose prediction respectively. Overall, the findings presented in this thesis demonstrate the importance of tracking the temporal dynamics surrounding salient events and entities for IR applications. We show that determining such changes in time-based patterns and trends in prevalent temporal collections can better satisfy user expectations, and boost ranking and recommendation effectiveness over time.
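    As a purely illustrative reading of contribution (3), the unified ranking framework could be pictured as a weighted combination of the criteria listed there; all symbols, the linear form and the exponential recency term below are assumptions, not taken from the thesis.

```latex
% Illustrative only: the actual framework is not specified in the abstract;
% the linear form, the recency term and all symbols are assumptions.
\mathrm{score}(d \mid q, t) =
  \lambda_1\,\mathrm{rel}(d, q)
  + \lambda_2\,\mathrm{auth}(d, t)
  + \lambda_3\,\mathrm{div}(d \mid S)
  + \lambda_4\, e^{-\gamma (t - t_d)},
\qquad \sum_{j} \lambda_j = 1,
```

    where \(\mathrm{rel}\) is query relevance, \(\mathrm{auth}\) a temporal authority score from the link graph, \(\mathrm{div}\) a diversity term relative to the already selected set \(S\), and \(t_d\) the document's timestamp.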