
    Investigation of Heterogeneous Approach to Fact Invention of Web Users’ Web Access Behaviour

    The World Wide Web consists of a huge volume of data of different types. Web mining is the field of data mining concerned with the web, its many services, and its large number of users; web user mining is in turn one of the fields of web mining. Information about users' web access is collected in several ways, the most common being the web log file; other techniques include browser agents, user authentication, web reviews, web ratings, web rankings, and tracking cookies. Users find it difficult to retrieve the information they need in time because the huge volume of structured and unstructured information increases the complexity of the web. Web usage mining is therefore important for purposes such as organizing a website, business and maintenance services, website personalization, and reducing network bandwidth. This paper provides an analysis of web usage mining techniques.
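
    As a concrete illustration of the most common collection technique named above, here is a minimal sketch (ours, not the paper's) of parsing one entry of a web server log in the Common Log Format; the sample line and field names are illustrative.

```python
import re

# Common Log Format: host ident authuser [date] "method path protocol" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_log_line(line):
    """Return the fields of one log entry as a dict, or None if malformed."""
    match = CLF_PATTERN.match(line)
    return match.groupdict() if match else None

sample = '203.0.113.7 - alice [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
print(parse_log_line(sample))
```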

    Improved Pre-Processing Stages in Web Usage Mining Using Web Log

    Enormous growth in the web persists, both in the number of web sites and in the number of users. This growth generates large volumes of data during users' interactions with web sites, which are recorded in web logs. Web site owners need to understand their users, and can do so by analyzing these web logs. Web mining helps in comprehending a range of concepts from diverse fields. Web Usage Mining (WUM) is a recent research field that corresponds to the process of Knowledge Discovery in Databases (KDD). It comprises three main categories: pre-processing, pattern discovery, and pattern analysis. WUM extracts behavioral data from web users' data and, where possible, from web site information (structure and content). In this paper, we propose a customized, application-specific methodology for pre-processing web logs and combining WUM with Association Rule Mining.
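
    To make the pre-processing category concrete, the following sketch (our illustration, not the paper's methodology) shows two typical steps applied to parsed log entries: cleaning out non-page requests and grouping one user's requests into sessions by an inactivity timeout. The entry fields ('status', 'path', 'time') and the 30-minute threshold are assumptions.

```python
from datetime import timedelta

SESSION_TIMEOUT = timedelta(minutes=30)                   # assumed inactivity threshold
NOISE_SUFFIXES = ('.css', '.js', '.png', '.gif', '.ico')  # embedded resources

def clean(entries):
    """Keep only successful page requests, dropping embedded resources.

    Each entry is assumed to be a dict such as
    {'status': 200, 'path': '/index.html', 'time': datetime(...)}.
    """
    return [e for e in entries
            if e['status'] == 200 and not e['path'].endswith(NOISE_SUFFIXES)]

def sessionize(entries):
    """Split one user's time-ordered requests into sessions at inactivity gaps."""
    sessions, current = [], []
    for e in sorted(entries, key=lambda e: e['time']):
        if current and e['time'] - current[-1]['time'] > SESSION_TIMEOUT:
            sessions.append(current)
            current = []
        current.append(e)
    if current:
        sessions.append(current)
    return sessions
```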

    DCU-TCD@LogCLEF 2010: re-ranking document collections and query performance estimation

    This paper describes the collaborative participation of Dublin City University and Trinity College Dublin in LogCLEF 2010. Two sets of experiments were conducted. First, different aspects of the TEL query logs were analysed after extracting user sessions of consecutive queries on a topic. Within a session, the relation between a query's length (number of terms) and position (first query or a later reformulation) and query performance estimators such as query scope, IDF-based measures, the simplified query clarity score, and the average inverse document collection frequency was examined. Results of this analysis suggest that only some estimator values correlate with query length or position in the TEL logs (e.g. the similarity score between collection and query). Second, the relation between three attributes was investigated: the user's country (detected from the IP address), the query language, and the interface language. The investigation explored the influence of these three attributes on the user's collection selection. It also involved assigning different weights to the three attributes in a scoring function used to re-rank the collections displayed to the user according to language and country. The results of the collection re-ranking show a significant improvement in Mean Average Precision (MAP) over the original collection ranking of TEL. The results also indicate that the query language and the interface language have more influence than the user's country on the collections selected by users.
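
    As an illustration of one of the pre-retrieval estimators named above, the sketch below computes the average inverse document frequency (IDF) of a query's terms; the collection statistics are invented toy numbers, not TEL data.

```python
import math

def avg_idf(query_terms, doc_freq, num_docs):
    """Average inverse document frequency of a query's terms; higher values
    suggest a more discriminative (and often easier) query."""
    idfs = [math.log(num_docs / doc_freq.get(t, 1)) for t in query_terms]
    return sum(idfs) / len(idfs)

# toy collection statistics (illustrative only)
df = {'medieval': 40, 'manuscripts': 25, 'the': 90000}
print(avg_idf(['medieval', 'manuscripts'], df, 100000))  # specific query -> high
print(avg_idf(['the'], df, 100000))                      # vague query -> low
```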

    Learning user behaviours from website visit profiling

    This project consists of the design and implementation of a program that analyses the traffic and users of web servers through their records, or logs. In particular, the project emphasises the automatic generation of models for analysing user behaviour.

    You, the Web and Your Device: Longitudinal Characterization of Browsing Habits

    Understanding how people interact with the web is key for a variety of applications, e.g., from the design of effective web pages to the definition of successful online marketing campaigns. Browsing behavior has traditionally been represented and studied by means of clickstreams, i.e., graphs whose vertices are web pages and whose edges are the paths followed by users. Obtaining large and representative data from which to extract clickstreams is however challenging. The evolution of the web raises the question of whether browsing behavior is changing and, in consequence, whether the properties of clickstreams are changing. This paper presents a longitudinal study of clickstreams from 2013 to 2016. We evaluate an anonymized dataset of HTTP traces captured in a large ISP to which thousands of households are connected. We first propose a methodology to identify the actual URLs requested by users from the massive set of requests automatically fired by browsers when rendering web pages. Then, we characterize web usage patterns and clickstreams, taking into account both the temporal evolution and the impact of the device used to explore the web. Our analyses precisely quantify various aspects of clickstreams and uncover interesting patterns, such as the typically short paths followed by people while navigating the web, the fast-increasing trend in browsing from mobile devices, and the different roles of search engines and social networks in promoting content. Finally, we contribute a dataset of anonymized clickstreams to the community to foster new studies (anonymized clickstreams are available to the public at http://bigdata.polito.it/clickstream).
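
    The paper's URL-identification methodology is not reproduced here, but the following simplified sketch conveys the idea of separating deliberate user clicks from the requests browsers fire automatically while rendering a page; the two-second rendering window and the record fields are our assumptions.

```python
from datetime import timedelta

RENDER_WINDOW = timedelta(seconds=2)  # assumed: requests fired this soon after a
                                      # click are treated as page-rendering traffic

def extract_clicks(requests):
    """Keep the HTML requests that look like deliberate user clicks.

    Each request is assumed to be a dict with 'time' (datetime),
    'content_type' (str) and 'url' (str), in chronological order.
    """
    clicks, last_click = [], None
    for r in requests:
        if not r['content_type'].startswith('text/html'):
            continue  # images, scripts and styles are never clicks
        if last_click is not None and r['time'] - last_click < RENDER_WINDOW:
            continue  # probably auto-fired while rendering the previous page
        clicks.append(r['url'])
        last_click = r['time']
    return clicks
```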

    Analysis of Clickstream Data

    This thesis is concerned with providing further statistical development in the area of web usage analysis to explore web browsing behaviour patterns. We received two data sources: web log files and operational data files for the websites, which contained information on online purchases. There are many research questions regarding web browsing behaviour. Specifically, we focused on the depth-of-visit metric and implemented an exploratory analysis of this feature using clickstream data. Due to the large volume of data available in this context, we chose to present effect size measures along with all statistical analyses of the data. We introduce two new robust measures of effect size for two-sample comparison studies in non-normal situations, specifically where the difference between two populations is due to the shape parameter. The proposed effect sizes perform adequately for non-normal data, as well as when two distributions differ in their shape parameters. We then focus on conversion analysis, investigating the causal relationship between general clickstream information and online purchasing using a logistic regression approach. The aim is to find a classifier by assigning the probability of the event of online shopping on an e-commerce website. We also develop an application of a mixture of hidden Markov models (MixHMM) to model web browsing behaviour using sequences of web pages viewed by users of an e-commerce website. The mixture of hidden Markov models is fitted in a Bayesian framework using Gibbs sampling. We address the slow mixing of Gibbs sampling in high-dimensional models, using over-relaxed Gibbs sampling as well as a forward-backward EM algorithm to obtain an adequate sample from the posterior distributions of the parameters. The MixHMM offers the advantage of clustering users based on their browsing behaviour, and also gives an automatic classification of web pages based on the probability of a page being viewed by visitors to the website.
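
    As a sketch of the conversion-analysis step, the snippet below fits a logistic regression mapping simple session-level clickstream features to a purchase probability; the features and data are invented for illustration, and scikit-learn stands in for whatever tooling the thesis used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# invented session-level features: [pages viewed, minutes on site, product pages]
X = np.array([[3, 2.0, 1], [12, 15.5, 6], [1, 0.5, 0], [8, 9.0, 4]])
y = np.array([0, 1, 0, 1])  # 1 = session ended in an online purchase

model = LogisticRegression().fit(X, y)
new_session = np.array([[10, 12.0, 5]])
print(model.predict_proba(new_session)[0, 1])  # estimated purchase probability
```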

    Assessing Post Usage for Measuring the Quality of Forum Posts

    It has become difficult to discover quality content within forum websites due to the increasing amount of User-Generated Content (UGC) on the Web. Many existing websites have relied on their users to explicitly rate content quality. The main problem with this approach is that the majority of content often receives insufficient ratings. Current automated content-rating solutions have evaluated linguistic features of UGC but are less effective across different types of online communities. We propose a novel approach that assesses post usage to measure the quality of forum posts. Post usage can be viewed as implicit user ratings derived from usage behaviour. The proposed model is validated against an operational forum, using the Matthews Correlation Coefficient to measure performance. Our model serves as a basis for exploring content usage to measure content quality in forums and other Web 2.0 platforms.
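
    The Matthews Correlation Coefficient used for validation can be computed directly from a binary confusion matrix, as in this small sketch (the counts are illustrative):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient: +1 perfect, 0 chance-level, -1 inverse."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# illustrative confusion-matrix counts for a post-quality classifier
print(mcc(tp=40, tn=35, fp=10, fn=15))  # ~0.50
```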