12 research outputs found

    Clustering of Web Users Using Session-based Similarity Measures

    One important research topic in web usage mining is the clustering of web users based on their common properties. Informative knowledge obtained from web user clusters has been used in many applications, such as the prefetching of pages between web clients and proxies. This paper presents an approach for measuring similarity of interests among web users from their past access behaviors. The similarity measures are based on the user sessions extracted from the users' access logs. A multi-level scheme for clustering a large number of web users is proposed as an extension to the method proposed in our previous work (2001). Experiments were conducted, and the results show that our clustering method is capable of grouping web users with similar interests.
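    To make the session-based measure concrete, the sketch below computes cosine similarity between users' aggregated page-visit profiles. It is a minimal illustration under assumed representations (page-count vectors), not the exact measures or the multi-level scheme from the paper.

```python
# Minimal sketch: each user is represented by a page-visit frequency vector
# aggregated over sessions; similarity of interest is the cosine between them.
import math
from collections import Counter

def user_profile(sessions):
    """Aggregate a user's sessions (each a list of page URLs) into page counts."""
    profile = Counter()
    for session in sessions:
        profile.update(session)
    return profile

def cosine_similarity(p, q):
    """Cosine similarity between two page-count profiles."""
    common = set(p) & set(q)
    dot = sum(p[page] * q[page] for page in common)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

# Example: two users with partly overlapping interests
u1 = user_profile([["/news", "/sports"], ["/news", "/weather"]])
u2 = user_profile([["/news", "/sports", "/sports"]])
print(cosine_similarity(u1, u2))
```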

    Similarity-aware Web Content Management and Document Pre-fetching

    Web caching is intended to reduce network traffic, server load, and user-perceived retrieval latency. Web pre-fetching, which can be considered active caching, builds on regular Web caching to further minimize a Web user's access delay. To be effective, however, pre-fetching techniques must be able to predict subsequent Web accesses with minimal computational overhead. This paper presents a similarity-based mechanism to support similarity-aware Web document pre-fetching between proxy caches and browsing clients. We first define a set of measures to assess similarities between Web documents, and then propose a multi-cache architecture that caches Web documents based on those similarities. A predictor is developed to support the similarity-aware document pre-fetching algorithm. Preliminary experiments have shown that our predictor offers superior performance when compared with some existing prediction algorithms.
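    As an illustration of similarity-driven pre-fetching, the sketch below ranks cached documents by Jaccard similarity over keyword sets and pre-fetches the closest matches to the document just requested. The keyword representation, threshold, and toy corpus are assumptions for the example; the paper defines its own similarity measures and multi-cache architecture.

```python
# Minimal sketch of similarity-aware pre-fetching, with Jaccard similarity
# over document keyword sets standing in for the paper's measures.
def jaccard(a, b):
    """Jaccard similarity between two keyword sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def predict_prefetch(requested, corpus, k=2, threshold=0.2):
    """Return up to k documents most similar to the one just requested."""
    scored = sorted(
        ((jaccard(corpus[requested], kw), url)
         for url, kw in corpus.items() if url != requested),
        reverse=True)
    return [url for score, url in scored[:k] if score >= threshold]

corpus = {
    "/a.html": {"cache", "proxy", "latency"},
    "/b.html": {"cache", "prefetch", "latency"},
    "/c.html": {"football", "scores"},
}
print(predict_prefetch("/a.html", corpus))  # -> ['/b.html']
```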

    Agent-based Similarity-aware Web Document Pre-fetching

    This paper presents an agent-based similarity-aware Web document pre-fetching scheme that is built on the similarity-aware Web caching architecture. A set of agents is employed to carry out duties such as document similarity detection, identification of relevant access patterns, document prediction, and network traffic monitoring for document pre-fetching. Preliminary simulations have been conducted to evaluate the proposed scheme, and the results show that the new pre-fetching scheme outperforms existing Web document pre-fetching algorithms.
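    The division of labour among agents might be pictured as in the sketch below: a prediction agent proposes candidate documents, while a traffic-monitoring agent gates pre-fetching on spare network capacity. The classes, interfaces, and threshold are purely illustrative, not the paper's design.

```python
# Illustrative sketch of cooperating agents: prediction proposes, traffic
# monitoring decides whether pre-fetching may proceed.
class TrafficAgent:
    """Monitors link utilisation; permits pre-fetching below a load threshold."""
    def __init__(self, max_load=0.8):
        self.max_load = max_load
        self.load = 0.0
    def allows_prefetch(self):
        return self.load < self.max_load

class PredictionAgent:
    """Ranks candidate documents by a similarity score supplied elsewhere."""
    def __init__(self, scores):
        self.scores = scores  # {url: similarity to the current document}
    def candidates(self, k=2):
        return sorted(self.scores, key=self.scores.get, reverse=True)[:k]

def prefetch(prediction_agent, traffic_agent):
    """Pre-fetch top candidates only when the network has spare capacity."""
    return prediction_agent.candidates() if traffic_agent.allows_prefetch() else []

traffic = TrafficAgent()
traffic.load = 0.5
print(prefetch(PredictionAgent({"/b.html": 0.7, "/c.html": 0.1}), traffic))
```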

    Noise Reduction In Web Data: A Learning Approach Based On Dynamic User Interests

    One of the prominent challenges internet users encounter is the abundance of extraneous material in web content, which impedes the efficient retrieval of information relevant to their evolving interests. In the research literature, noise is commonly defined as any data that does not contribute to the intended analysis. This study analyses primary webpage content and proposes noise-reduction techniques for web data, with the emphasis on noise in the content itself and in the arrangement of data on a page. Not all data in a dataset is relevant: the aim is to preserve the primary content that matches a user's specific interests while minimising extraneous information, since naive noise elimination risks removing useful information along with the noise. Reducing the loss of valuable information, relative to the interest level indicated in a web user profile, in turn enhances the quality of that profile. To this end, the proposed Noise Web Data Learning (NWDL) tool/algorithm learns dynamic user interests and removes noise data in the context of dynamic user behaviour. To ascertain the efficacy of the proposed work, an experimental design is presented and the results are compared against algorithms presently employed for reducing noisy web data. The experimental findings indicate that the proposed work, by examining the dynamic evolution of user interest before removing extraneous data, makes a significant contribution to enhancing the quality of a web user profile: it reduces content volume without eliminating beneficial information.
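    A minimal sketch of interest-based noise filtering is given below: a keyword profile decays over time so that interests can drift, and page blocks with too little overlap with the profile are dropped as noise. The profile representation, decay factor, and threshold are assumptions for illustration; NWDL's actual learning procedure is not reproduced here.

```python
# Minimal sketch: a decaying keyword profile models dynamic user interest,
# and page blocks scoring below a threshold are filtered out as noise.
def update_profile(profile, visited_keywords, decay=0.9):
    """Decay old interests, then reinforce keywords from the latest visit."""
    profile = {kw: w * decay for kw, w in profile.items()}
    for kw in visited_keywords:
        profile[kw] = profile.get(kw, 0.0) + 1.0
    return profile

def filter_noise(blocks, profile, threshold=0.5):
    """Keep page blocks whose keyword overlap with the profile is strong enough."""
    kept = []
    for block_keywords in blocks:
        score = sum(profile.get(kw, 0.0) for kw in block_keywords)
        if score >= threshold:
            kept.append(block_keywords)
    return kept

profile = {}
profile = update_profile(profile, {"python", "clustering"})
blocks = [{"clustering", "kmeans"}, {"advert", "banner"}]
print(filter_noise(blocks, profile))  # the advert block is dropped as noise
```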

    Discovering user access pattern based on probabilistic latent factor model

    There has been an increased demand for characterizing user access patterns using web mining techniques, since the informative knowledge extracted from web server log files can offer benefits not only for web site structure improvement but also for a better understanding of user navigational behavior. In this paper, we present a web usage mining method that utilizes web usage and page linkage information to capture user access patterns based on the Probabilistic Latent Semantic Analysis (PLSA) model. A specific probabilistic model analysis algorithm, the EM algorithm, is applied to the integrated usage data to infer the latent semantic factors and to generate user session clusters that reveal user access patterns. Experiments have been conducted on a real-world data set to validate the effectiveness of the proposed approach. The results show that the presented method is capable of characterizing the latent semantic factors and generating user profiles in terms of weighted page vectors, which may reflect the common access interests exhibited by users within the same session cluster. © 2005, Australian Computer Society, Inc
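    For concreteness, the sketch below fits a small PLSA model to a session-page count matrix with EM, then assigns each session to its dominant latent factor, mirroring the clustering step described. Matrix shapes, initialisation, and iteration count are illustrative choices, not the paper's exact setup.

```python
# Minimal PLSA sketch: P(s,p) = sum_z P(z) P(s|z) P(p|z), fit by EM on a
# session-by-page co-occurrence matrix.
import numpy as np

def plsa(counts, n_factors=2, n_iter=50, seed=0):
    """counts: (sessions x pages) matrix. Returns P(z), P(s|z), P(p|z)."""
    rng = np.random.default_rng(seed)
    n_s, n_p = counts.shape
    p_z = np.full(n_factors, 1.0 / n_factors)
    p_s_z = rng.random((n_factors, n_s)); p_s_z /= p_s_z.sum(axis=1, keepdims=True)
    p_p_z = rng.random((n_factors, n_p)); p_p_z /= p_p_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibility of each latent factor for each (session, page)
        joint = p_z[:, None, None] * p_s_z[:, :, None] * p_p_z[:, None, :]
        post = joint / joint.sum(axis=0, keepdims=True)          # shape (z, s, p)
        # M-step: re-estimate the distributions from expected counts
        weighted = post * counts[None, :, :]
        p_s_z = weighted.sum(axis=2); p_s_z /= p_s_z.sum(axis=1, keepdims=True)
        p_p_z = weighted.sum(axis=1); p_p_z /= p_p_z.sum(axis=1, keepdims=True)
        p_z = weighted.sum(axis=(1, 2)); p_z /= p_z.sum()
    return p_z, p_s_z, p_p_z

# Sessions cluster by the latent factor that best explains them.
counts = np.array([[4, 3, 0, 0], [3, 4, 0, 1], [0, 0, 5, 4]], dtype=float)
p_z, p_s_z, p_p_z = plsa(counts)
print(p_s_z.argmax(axis=0))  # factor assignment per session
```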

    A signature-based indexing method for efficient content-based retrieval of relative temporal patterns


    Distributed detection of anomalous internet sessions

    Financial service providers are moving many services online, reducing their costs and facilitating customers' interaction. Unfortunately, criminals have quickly found several ways to evade most security measures applied to browsers and banking sites. The use of highly dangerous malware has become the most significant threat, and traditional signature-detection methods are nowadays easily circumvented due to the number of new samples and the use of sophisticated evasion techniques. Antivirus vendors and malware experts are pushed to seek new methodologies to improve the identification and understanding of malicious applications' behavior and their targets. Financial institutions are now playing an important role by deploying their own detection tools against malware that specifically affects their customers. However, most detection approaches tend to be based on sequences of bytes used to create new signatures. The approach in this thesis is instead based on new sources of information: the web logs generated from each banking session, normal browser execution, and customers' mobile phone behavior. The thesis can be divided into four parts. The first part introduces the thesis, presents the problems, and describes the methodology used in the experimentation. The second part describes our contributions to the research, which cover two areas:
    *Server side: web-log analysis. We first focus on the real-time detection of anomalies through the analysis of web logs, and on the challenges introduced by the amount of information generated daily. We propose different techniques to detect multiple threats by deploying per-user and global models in a graph-based environment that improves performance over a set of highly related data.
    *Customer side: browser analysis. We address the detection of malicious behavior from the other side of a banking session: the browser. Malware samples must interact with the browser in order to retrieve or add information, and this interaction interferes with the browser's normal behavior. We propose models capable of detecting unusual patterns of function calls in order to determine whether a given sample is targeting a specific financial entity.
    In the third part, we adapt our approaches to mobile phones and critical-infrastructure environments. The latest online banking attack techniques circumvent protection schemes such as password verification codes sent via SMS: Man-in-the-Mobile attacks compromise mobile devices and gain access to SMS traffic, and once the Transaction Authentication Number is obtained, criminals are free to make fraudulent transfers. We propose to model the behavior of the applications related to messaging services in order to automatically detect suspicious actions. Real-time detection of unwanted SMS forwarding can improve the effectiveness of second-channel authentication and builds on the detection techniques applied to browsers and Web servers. Finally, we describe a possible adaptation of our techniques to an area outside the scope of online banking: critical infrastructures, an environment with similar features, since the applications involved can also be profiled. Just like financial entities, critical infrastructures are experiencing an increase in the number of cyber attacks, but the sophistication of the malware samples utilised forces new detection approaches. The aim of this last proposal is to demonstrate the validity of our approach in different scenarios.
    Finally, we conclude with a summary of our findings and directions for future work.
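    As a toy illustration of the server-side idea, the sketch below keeps a per-user running model of session features and flags sessions whose features deviate strongly from that user's history. The features and z-score rule are assumptions for the example; the thesis' graph-based per-user and global models are considerably richer.

```python
# Minimal sketch of per-user anomaly scoring over web-log sessions using a
# running mean/variance model per feature (Welford's online algorithm).
import math
from collections import defaultdict

class PerUserModel:
    """Tracks running mean/variance of session features for one user."""
    def __init__(self):
        self.n = 0
        self.mean = defaultdict(float)
        self.m2 = defaultdict(float)
    def update(self, features):
        self.n += 1
        for k, x in features.items():
            d = x - self.mean[k]
            self.mean[k] += d / self.n
            self.m2[k] += d * (x - self.mean[k])
    def score(self, features):
        """Largest absolute z-score across features; high means anomalous."""
        if self.n < 2:
            return 0.0
        return max(abs(x - self.mean[k]) / (math.sqrt(self.m2[k] / (self.n - 1)) or 1.0)
                   for k, x in features.items())

model = PerUserModel()
for s in [{"requests": 20, "duration": 300}, {"requests": 25, "duration": 320}]:
    model.update(s)
print(model.score({"requests": 200, "duration": 30}))  # large score -> suspicious
```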

    Workload characterization and customer interaction at e-commerce web servers

    Electronic commerce servers have a significant presence in today's Internet. Corporations want to maintain high availability, sufficient capacity, and satisfactory performance for their E-commerce Web systems, and want to provide satisfactory services to customers. Workload characterization and the analysis of customers' interactions with Web sites are the bases upon which to analyze server performance, plan system capacity, manage system resources, and personalize services at the Web site. To date, little empirical evidence has been published that identifies the workload characteristics of E-commerce systems and the behaviours of customers. This thesis analyzes the Web access logs at public Web sites for three organizations: a car rental company, an IT company, and the Computer Science department of the University of Saskatchewan. In these case studies, the characteristics of Web workloads are explored at the request level, function level, resource level, and session level; customers' interactions with Web sites are analyzed by identifying and characterizing session groups. The main E-commerce Web workload characteristics and performance implications are: i) requests for dynamic Web objects are an important part of the workload, and should be characterized separately since the system processes them differently; ii) some popular image files, which are embedded in the same Web page, are always requested together, so if these files are requested and sent in a bundle, a system will greatly reduce the overhead of processing requests for them; iii) the percentage of requests for each Web page category tends to be stable in the workload when the time scale is large enough, which is helpful in forecasting workload composition; iv) the Secure Socket Layer (SSL) protocol is heavily used, and most Web objects are either requested primarily through SSL or primarily not through SSL; and v) session groups with different characteristics are identified in all logs, and their analysis may be helpful in improving system performance, maximizing the revenue throughput of the system, providing better services to customers, and managing and planning system resources. A hybrid clustering algorithm, a combination of the minimum spanning tree method and the k-means clustering algorithm, is proposed to identify session clusters. Session clusters obtained using the three session representations (Pages Requested, Navigation Pattern, and Resource Usage) are similar enough that the different representations can be used interchangeably to produce similar groupings. A grouping based on one session representation is believed to be sufficient to answer questions about server performance, resource management, capacity planning, and Web site personalization that would previously have required multiple different groupings. Grouping by Pages Requested is recommended since it is the simplest, and data on the Web pages requested is relatively easy to obtain from HTTP logs.
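    A minimal sketch of a hybrid MST/k-means procedure in the spirit described: the k-1 longest edges of the minimum spanning tree over session vectors are cut to form initial groups, whose centroids then seed standard k-means. The session vectors and the exact seeding rule are assumptions for illustration and may differ from the thesis' algorithm.

```python
# Sketch: MST (Prim's algorithm) seeds k-means with well-separated clusters.
import numpy as np

def mst_edges(X):
    """Prim's algorithm: return MST edges (i, j, dist) over points X."""
    n = len(X)
    in_tree = [0]
    edges = []
    dists = np.linalg.norm(X - X[0], axis=1)   # distance of each point to the tree
    nearest = np.zeros(n, dtype=int)           # tree node achieving that distance
    for _ in range(n - 1):
        dists[in_tree] = np.inf                # mask points already in the tree
        j = int(np.argmin(dists))
        edges.append((nearest[j], j, dists[j]))
        in_tree.append(j)
        new_d = np.linalg.norm(X - X[j], axis=1)
        nearest[new_d < dists] = j
        dists = np.minimum(dists, new_d)
    return edges

def hybrid_cluster(X, k=2, n_iter=10):
    n = len(X)
    edges = sorted(mst_edges(X), key=lambda e: e[2])
    keep = edges[: n - k]                      # cut the k-1 longest edges
    parent = list(range(n))                    # union-find for components
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i, j, _ in keep:
        parent[find(i)] = find(j)
    labels_init = np.array([find(i) for i in range(n)])
    centroids = np.array([X[labels_init == c].mean(axis=0)
                          for c in np.unique(labels_init)])
    for _ in range(n_iter):                    # refine with standard k-means
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return labels

X = np.array([[0, 0], [0, 1], [1, 0], [9, 9], [9, 8], [8, 9]], float)
print(hybrid_cluster(X, k=2))  # two well-separated session groups
```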