85 research outputs found

    An Improved Focused Crawler: Using Web Page Classification and Link Priority Evaluation

    A focused crawler is topic-specific and aims to selectively collect web pages relevant to a given topic from the Internet. However, the performance of current focused crawlers is easily degraded by the structure of web page environments and by pages that span multiple topics. During crawling, a highly relevant region may be ignored owing to the low overall relevance of its page, and anchor text or link context may misguide the crawler. To address these problems, this paper proposes a new focused crawler. First, we build a web page classifier based on an improved term-weighting approach (ITFIDF) in order to obtain highly relevant web pages. In addition, the paper introduces a link evaluation approach, link priority evaluation (LPE), which combines a content block partition algorithm with a joint feature evaluation (JFE) strategy to better judge the relevance between the URLs on a web page and the given topic. The experimental results demonstrate that the classifier using ITFIDF outperforms TFIDF, and that our focused crawler is superior to focused crawlers based on breadth-first, best-first, anchor-text-only, link-context-only, and content-block-partition strategies in terms of harvest rate and target recall. In conclusion, our methods are significant and effective for focused crawling.
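    For illustration, the sketch below shows how block-level link scoring can drive a best-first focused crawl: extracted links are queued with a priority that combines page-, content-block-, and anchor-level relevance, which is the general idea behind combining content block partition with link evaluation. The linear weights, the 0.5 relevance cut-off, and the `fetch`/`classify_text` callables are assumptions of this sketch, not the paper's ITFIDF or LPE/JFE formulas.

```python
import heapq
from urllib.parse import urljoin

def link_priority(page_rel: float, block_rel: float, anchor_rel: float) -> float:
    # Illustrative linear combination of page-, block-, and anchor-level
    # relevance; the paper's LPE/JFE formulas are not reproduced here.
    return 0.2 * page_rel + 0.5 * block_rel + 0.3 * anchor_rel

def focused_crawl(seeds, fetch, classify_text, max_pages=1000):
    """Best-first focused crawl driven by a priority queue of scored links.

    `fetch(url)` is assumed to return (page_text, blocks), where `blocks` is a
    list of (block_text, [(anchor_text, href), ...]) pairs; `classify_text`
    returns a topic-relevance score in [0, 1], e.g. from a TF-IDF-style
    classifier.
    """
    frontier = [(-1.0, url) for url in seeds]   # max-heap via negated scores
    heapq.heapify(frontier)
    seen, harvested = set(seeds), []

    while frontier and len(harvested) < max_pages:
        _neg_score, url = heapq.heappop(frontier)
        page_text, blocks = fetch(url)
        page_rel = classify_text(page_text)
        if page_rel > 0.5:                      # illustrative relevance cut-off
            harvested.append(url)
        for block_text, block_links in blocks:  # score links per content block
            block_rel = classify_text(block_text)
            for anchor_text, href in block_links:
                target = urljoin(url, href)
                if target in seen:
                    continue
                seen.add(target)
                score = link_priority(page_rel, block_rel, classify_text(anchor_text))
                heapq.heappush(frontier, (-score, target))
    return harvested
```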

    An Implementation of a Dynamic Partitioning Scheme for Web Pages

    In this paper, we introduce a method for the dynamic partitioning of web pages. The algorithm is first illustrated by manually partitioning a web page, and then an implementation of the algorithm in PHP is described. The method produces a partitioned web page consisting of small pieces, or fragments, which can be retrieved concurrently using AJAX or similar technology. The goal of this research is to increase the performance of web page delivery by decreasing the latency of web page retrieval.
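    As a rough sketch of the retrieval side, the snippet below fetches hypothetical page fragments concurrently and reassembles them. The paper's implementation uses PHP on the server and AJAX in the browser; this sketch uses a Python thread pool only to illustrate the concurrent-retrieval idea, and the fragment URLs are made up.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Hypothetical fragment endpoints produced by the partitioning step; in the
# paper these would be served by PHP and fetched with AJAX.
FRAGMENT_URLS = [
    "https://example.com/page.php?fragment=header",
    "https://example.com/page.php?fragment=body",
    "https://example.com/page.php?fragment=footer",
]

def fetch_fragment(url: str) -> str:
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")

def fetch_page_in_fragments(urls) -> str:
    # All fragments are requested concurrently; the page is assembled once
    # every fragment has arrived, so total latency is close to the slowest
    # fragment rather than the sum of all fragments.
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        fragments = list(pool.map(fetch_fragment, urls))
    return "".join(fragments)

if __name__ == "__main__":
    print(len(fetch_page_in_fragments(FRAGMENT_URLS)))
```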

    Exploration of Dynamic Web Page Partitioning for Increased Web Page Delivery Performance

    The increasing use of the Internet and the demand for real-time information have increased the amount of dynamically generated content residing in more complex distributed environments. The performance of delivering these web pages has been improved through traditional techniques such as caching and newer techniques such as pre-fetching. In this research, we explore the dynamic partitioning of web page content using concurrent AJAX requests to improve web page delivery performance for resource-intensive synchronous web content. The focus is on enterprise web applications in which a page's data and processing are not local to one web server; instead, the page issues requests to other systems such as databases, web services, and legacy systems. In such environments, the dynamic partitioning method can make the greatest performance gains by allowing the web server to run requests for partitions of a page in parallel while the other systems return the requested data. This differs from traditional uses of AJAX, where AJAX provides a richer user experience by making a web application appear to be a desktop application on the user's machine, and where AJAX requests are often initiated by a user action such as a mouse click or key press, or are used to poll the server periodically for updates. In this research we studied the performance of a manually partitioned page, built a dynamic parser to perform dynamic partitioning, and analyzed the performance results for two types of applications: one where most processing is local, and another where processing depends on other systems such as databases, web services, and legacy systems. The results presented show that there are definite performance gains in using a partitioning scheme to deliver a web page to the user faster.
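    The expected benefit of running partition requests in parallel can be sketched as follows: when each partition waits on a remote system, concurrent execution brings total latency close to the slowest partition rather than the sum of all partitions. The partition names and delays below are illustrative stand-ins for database, web service, and legacy-system calls, not measurements from this study.

```python
import asyncio
import time

# Simulated backend work for each partition of a page; the delays stand in for
# a database query, a web-service call, and a legacy-system request.
PARTITIONS = [("orders", 0.30), ("inventory", 0.25), ("reports", 0.40)]

async def render_partition(name: str, backend_delay: float) -> str:
    await asyncio.sleep(backend_delay)      # waiting on a remote system
    return f"<div id='{name}'>...</div>"

async def render_sequential() -> float:
    start = time.perf_counter()
    for name, delay in PARTITIONS:
        await render_partition(name, delay)
    return time.perf_counter() - start      # ~sum of delays (about 0.95 s)

async def render_parallel() -> float:
    start = time.perf_counter()
    await asyncio.gather(*(render_partition(n, d) for n, d in PARTITIONS))
    return time.perf_counter() - start      # ~max of delays (about 0.40 s)

if __name__ == "__main__":
    print("sequential:", asyncio.run(render_sequential()))
    print("parallel:  ", asyncio.run(render_parallel()))
```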

    Automatic Generation of Thematically Focused Information Portals from Web Data

    Finding the desired information on the Web is often a hard and time-consuming task. This thesis presents a methodology for the automatic generation of thematically focused portals from Web data. The key component of the proposed Web retrieval framework is the thematically focused Web crawler, which is interested only in a specific, typically small, set of topics. The focused crawler uses classification methods to filter fetched documents and to identify the Web sources most likely to be relevant for further downloads. We show that the human effort required to prepare a focused crawl can be minimized by automatically extending the training dataset with additional training samples, coined archetypes. This thesis introduces the combination of classification results with link-based authority ranking for selecting archetypes, together with periodic re-training of the classifier. We also explain the architecture of the focused Web retrieval framework and discuss the results of comprehensive use-case studies and evaluations with the prototype system BINGO!. Furthermore, the thesis addresses aspects of crawl postprocessing, such as refinement of the topic structure and restrictive document filtering. We introduce postprocessing methods and meta methods that are applied in a restrictive manner, i.e. by leaving out uncertain documents rather than assigning them to inappropriate topics or clusters with low confidence. We also introduce a methodology for collaborative crawl postprocessing by multiple cooperating users in a distributed environment, such as a peer-to-peer overlay network. An important aspect of a thematically focused Web portal is the ranking of search results. This thesis addresses search personalization by aggregating explicit or implicit feedback from multiple users and capturing topic-specific search patterns in profiles. Furthermore, we consider advanced link-based authority ranking algorithms that exploit crawl-specific information, such as classification confidence grades for particular documents. This is achieved by weighting the edges in the link graph of the crawl and by adding virtual links between highly relevant documents of the topic. The results of our systematic evaluation on multiple reference collections and real Web data show the viability of the proposed methodology.
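    As a minimal sketch of the ranking idea described above, the function below runs a PageRank-style iteration over a crawl's link graph with edges weighted by the classifier's confidence in the target document, and with virtual links added between highly confident documents. The specific weighting, the 0.9 threshold for virtual links, and the handling of dangling nodes are assumptions of this sketch, not the thesis's exact formulation.

```python
import numpy as np

def confidence_weighted_pagerank(links, confidence, damping=0.85, iters=50,
                                 hot_threshold=0.9):
    """Authority ranking over a crawl graph with confidence-weighted edges.

    `links` maps each document id to the ids it links to; `confidence` holds
    the classifier's topic-confidence per document, in [0, 1].
    """
    docs = sorted(confidence)
    idx = {d: i for i, d in enumerate(docs)}
    n = len(docs)
    w = np.zeros((n, n))
    for src, targets in links.items():
        for tgt in targets:
            w[idx[tgt], idx[src]] = confidence[tgt]   # weight edge by target confidence
    # Virtual links between highly relevant documents of the topic.
    hot = [d for d in docs if confidence[d] >= hot_threshold]
    for a in hot:
        for b in hot:
            if a != b:
                w[idx[b], idx[a]] = max(w[idx[b], idx[a]], confidence[b])
    col_sums = w.sum(axis=0)
    col_sums[col_sums == 0] = 1.0          # dangling nodes: their rank mass is simply dropped here
    w = w / col_sums                       # column-normalize outgoing weights
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        rank = (1 - damping) / n + damping * (w @ rank)
    return {d: float(rank[idx[d]]) for d in docs}
```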

    A complex situation in data recovery

    The research considers an unusual situation in data recovery. Data recovery is the process of recovering data from recording media that are not accessible by normal means. Provided that the data has not been overwritten and the recording medium is not physically damaged, this is usually a relatively simple process of either repairing the file system so that the file(s) may be accessed as usual, or finding the data on the medium and copying it directly from the medium into normal file(s). The data in this recovery situation is recorded by specialist call centre recording equipment and is stored on the recording medium in a proprietary format whereby simultaneous conversations are multiplexed together and can only be accessed by using associated metadata records. The value of the recorded data may be very high, especially in the financial sector, where it may be considered a legal audit of business transactions. When a failure occurs and data needs to be recovered, both the data and the metadata information must be recreated before a single call can be replayed. A key component in accessing this information is the location metadata that identifies the location of the required components on the medium. If the metadata is corrupted, incomplete, or wrong, then a repair cannot proceed until it is corrected. This research focuses on the problem of verifying this location metadata. Initially it was believed that only a small set of errors would exist, and work centred on detecting these errors by presenting the information to engineers in an at-a-glance image. When the extent of the possible errors was realised, an attempt was made to deduce the location metadata by exploring the content of the recorded medium. Although successful in one instance, the process was not able to distinguish between current and previous uses. Eventually, insights gained from exploring the recording application's source code permitted an intelligent trial-and-error process that deduced the underlying medium-apportioning formula. It was then possible to incorporate this formula into the heuristics generating the at-a-glance image, creating an artefact that could verify the location metadata for any given repair. After discovering the formula, the research returned to the media exploration and produced the disk fingerprinting technique, which gave valuable insights into error states in call centre recording and provided a new way of seeing the contents of a hard drive. This research provided the following contributions: (1) a means by which the recording systems' location metadata can be verified and repaired; (2) as a result of this verification, greater automation of the recovery process before human verification is required; and (3) the disk fingerprinting process, which has already given insights into the recording system's problems and provides a new way of seeing the contents of recording media.
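    The disk fingerprinting idea can be sketched, under assumptions, as a block-by-block scan of a drive image that classifies each block into a coarse category (for example by entropy), so that the whole medium can be rendered as an at-a-glance map. The block size, the entropy threshold, and the categories below are illustrative; the thesis's actual technique is not reproduced here.

```python
import math
from collections import Counter

BLOCK_SIZE = 4096  # bytes per block; an assumed granularity, not the thesis's value

def block_entropy(block: bytes) -> float:
    """Shannon entropy of a block in bits per byte (0 = uniform fill, 8 = random)."""
    if not block:
        return 0.0
    counts = Counter(block)
    total = len(block)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def fingerprint(image_path: str):
    """Classify each block of a drive image into a coarse category.

    The categories (empty / structured / dense) are illustrative; a real
    fingerprint could also match known format signatures.
    """
    categories = []
    with open(image_path, "rb") as img:
        while block := img.read(BLOCK_SIZE):
            if block.count(0) == len(block):
                categories.append("empty")
            elif block_entropy(block) < 6.0:
                categories.append("structured")   # text, metadata, sparse data
            else:
                categories.append("dense")        # compressed, encrypted, or audio-like
    return categories
```

    Rendering one coloured pixel per block, row by row, would yield the kind of at-a-glance image referred to above.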

    The InfoSec Handbook

    Computer science

    Fine-grained, Content-agnostic Network Traffic Analysis for Malicious Activity Detection

    The rapid evolution of malicious activities in network environments necessitates the development of more effective and efficient detection and mitigation techniques. Traditional traffic analysis (TA) approaches have demonstrated limited efficacy and performance in detecting various malicious activities, resulting in a pressing need for more advanced solutions. To fill this gap, this dissertation proposes several new fine-grained network traffic analysis (FGTA) approaches. These approaches focus on (1) detecting previously hard-to-detect malicious activities by deducing fine-grained, detailed application-layer information in privacy-preserving manners, (2) enhancing usability by providing more explainable results and better adaptability to different network environments, and (3) combining network traffic data with endpoint information to provide users with more comprehensive and accurate protection. We begin by conducting a comprehensive survey of existing FGTA approaches. We then propose CJ-Sniffer, a privacy-aware cryptojacking detection system that efficiently detects cryptojacking traffic. CJ-Sniffer is the first approach to distinguish cryptojacking traffic from user-initiated cryptocurrency mining traffic, allowing for a level of fine-grained traffic discrimination that has proven challenging to achieve with traditional TA methodologies. Next, we introduce BotFlowMon, a learning-based, content-agnostic approach for detecting online social network (OSN) bot traffic, which has posed a significant challenge to traditional TA strategies. BotFlowMon is an FGTA approach that relies only on content-agnostic flow-level data as input and utilizes novel algorithms and techniques to distinguish social bot traffic from real OSN user traffic. To enhance the usability of FGTA-based attack detection, we propose a learning-based DDoS detection approach that emphasizes both explainability and adaptability, providing network administrators with insightful explanatory information and with models that adapt to new network environments. Finally, we present a reinforcement learning-based defense against L7 DDoS attacks that combines network traffic data with endpoint information: it actively monitors and analyzes the victim server and applies different strategies under different conditions to protect the server while minimizing collateral damage to legitimate requests. Our evaluation results demonstrate that the proposed approaches achieve high accuracy and efficiency in detecting and mitigating various malicious activities while maintaining privacy-preserving features, providing explainable and adaptable results, or providing comprehensive application-layer situational awareness. This dissertation significantly advances the fields of FGTA and malicious activity detection. This dissertation includes published and unpublished co-authored materials.
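    As a content-agnostic illustration of the flow-level input such approaches work from, the sketch below turns a single flow record (packet timestamps and sizes only, no payloads) into a fixed-length feature vector that a classifier could consume. The `FlowRecord` type and the chosen statistics are assumptions for this sketch, not the feature sets used by CJ-Sniffer or BotFlowMon.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List

@dataclass
class FlowRecord:
    """Content-agnostic flow-level record (no payloads), e.g. derived from NetFlow/IPFIX."""
    timestamps: List[float]   # per-packet arrival times within the flow
    sizes: List[int]          # per-packet sizes in bytes
    dst_port: int

def flow_features(flow: FlowRecord) -> List[float]:
    """Turn one flow into a fixed-length feature vector for a classifier.

    Duration, volume, packet-size statistics, and inter-arrival statistics are
    a common content-agnostic choice of features; they are illustrative only.
    """
    gaps = [b - a for a, b in zip(flow.timestamps, flow.timestamps[1:])] or [0.0]
    return [
        flow.timestamps[-1] - flow.timestamps[0],  # flow duration
        len(flow.sizes),                           # packet count
        sum(flow.sizes),                           # total bytes
        mean(flow.sizes),
        pstdev(flow.sizes),
        mean(gaps),
        pstdev(gaps),
        float(flow.dst_port),
    ]
```

    Such vectors would then be fed to a standard supervised classifier trained on labelled benign and malicious flows.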
