85 research outputs found

    An Improved Focused Crawler: Using Web Page Classification and Link Priority Evaluation

    A focused crawler is topic-specific and aims to selectively collect web pages relevant to a given topic from the Internet. However, the performance of current focused crawlers is easily degraded by the structure of web page environments and by pages that span multiple topics. During crawling, a highly relevant region may be ignored owing to the low overall relevance of its page, and anchor text or link context may misguide the crawler. To address these problems, this paper proposes a new focused crawler. First, we build a web page classifier based on an improved term-weighting approach (ITFIDF) in order to obtain highly relevant web pages. In addition, the paper introduces a link evaluation approach, link priority evaluation (LPE), which combines a content block partition algorithm with a joint feature evaluation (JFE) strategy to better judge the relevance between the URLs on a web page and the given topic. The experimental results demonstrate that the classifier using ITFIDF outperforms TFIDF, and that our focused crawler is superior to focused crawlers based on breadth-first, best-first, anchor-text-only, link-context-only, and content-block-partition strategies in terms of harvest rate and target recall. In conclusion, our methods are significant and effective for focused crawling.
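    For illustration, the sketch below shows how block-level link scoring can drive a best-first focused crawl: extracted links are queued with a priority that combines page-, content-block-, and anchor-level relevance, which is the general idea behind combining content block partition with link evaluation. The linear weights, the 0.5 relevance cut-off, and the `fetch`/`classify_text` callables are assumptions of this sketch, not the paper's ITFIDF or LPE/JFE formulas.

```python
import heapq
from urllib.parse import urljoin

def link_priority(page_rel: float, block_rel: float, anchor_rel: float) -> float:
    # Illustrative linear combination of page-, block-, and anchor-level
    # relevance; the paper's LPE/JFE formulas are not reproduced here.
    return 0.2 * page_rel + 0.5 * block_rel + 0.3 * anchor_rel

def focused_crawl(seeds, fetch, classify_text, max_pages=1000):
    """Best-first focused crawl driven by a priority queue of scored links.

    `fetch(url)` is assumed to return (page_text, blocks), where `blocks` is a
    list of (block_text, [(anchor_text, href), ...]) pairs; `classify_text`
    returns a topic-relevance score in [0, 1], e.g. from a TF-IDF-style
    classifier.
    """
    frontier = [(-1.0, url) for url in seeds]   # max-heap via negated scores
    heapq.heapify(frontier)
    seen, harvested = set(seeds), []

    while frontier and len(harvested) < max_pages:
        _neg_score, url = heapq.heappop(frontier)
        page_text, blocks = fetch(url)
        page_rel = classify_text(page_text)
        if page_rel > 0.5:                      # illustrative relevance cut-off
            harvested.append(url)
        for block_text, block_links in blocks:  # score links per content block
            block_rel = classify_text(block_text)
            for anchor_text, href in block_links:
                target = urljoin(url, href)
                if target in seen:
                    continue
                seen.add(target)
                score = link_priority(page_rel, block_rel, classify_text(anchor_text))
                heapq.heappush(frontier, (-score, target))
    return harvested
```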

    An Implementation of a Dynamic Partitioning Scheme for Web Pages

    In this paper, we introduce a method for the dynamic partitioning of web pages. The algorithm is first illustrated by manually partitioning a web page, and then an implementation of the algorithm in PHP is described. The method produces a partitioned web page consisting of small pieces, or fragments, which can be retrieved concurrently using AJAX or similar technology. The goal of this research is to increase the performance of web page delivery by decreasing the latency of web page retrieval.
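    As a rough sketch of the retrieval side, the snippet below fetches hypothetical page fragments concurrently and reassembles them. The paper's implementation uses PHP on the server and AJAX in the browser; this sketch uses a Python thread pool only to illustrate the concurrent-retrieval idea, and the fragment URLs are made up.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Hypothetical fragment endpoints produced by the partitioning step; in the
# paper these would be served by PHP and fetched with AJAX.
FRAGMENT_URLS = [
    "https://example.com/page.php?fragment=header",
    "https://example.com/page.php?fragment=body",
    "https://example.com/page.php?fragment=footer",
]

def fetch_fragment(url: str) -> str:
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")

def fetch_page_in_fragments(urls) -> str:
    # All fragments are requested concurrently; the page is assembled once
    # every fragment has arrived, so total latency is close to the slowest
    # fragment rather than the sum of all fragments.
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        fragments = list(pool.map(fetch_fragment, urls))
    return "".join(fragments)

if __name__ == "__main__":
    print(len(fetch_page_in_fragments(FRAGMENT_URLS)))
```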

    Exploration of Dynamic Web Page Partitioning for Increased Web Page Delivery Performance

    The increasing use of the Internet and the demand for real-time information have increased the amount of dynamically generated content residing in more complex distributed environments. The performance of delivering these web pages has been improved through traditional techniques such as caching and newer techniques such as pre-fetching. In this research, we explore the dynamic partitioning of web page content using concurrent AJAX requests to improve web page delivery performance for resource-intensive synchronous web content. The focus is on enterprise web applications in which a page's data and processing are not local to one web server; instead, the page issues requests to other systems such as databases, web services, and legacy systems. In such environments, the dynamic partitioning method can make the greatest performance gains by allowing the web server to run requests for partitions of a page in parallel while the other systems return the requested data. This differs from traditional uses of AJAX, where AJAX provides a richer user experience by making a web application appear to be a desktop application on the user's machine, and where AJAX requests are often initiated by a user action such as a mouse click or key press, or are used to poll the server periodically for updates. In this research we studied the performance of a manually partitioned page, built a dynamic parser to perform dynamic partitioning, and analyzed the performance results for two types of applications: one where most processing is local, and another where processing depends on other systems such as databases, web services, and legacy systems. The results presented show that there are definite performance gains in using a partitioning scheme to deliver a web page to the user faster.
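    The expected benefit of running partition requests in parallel can be sketched as follows: when each partition waits on a remote system, concurrent execution brings total latency close to the slowest partition rather than the sum of all partitions. The partition names and delays below are illustrative stand-ins for database, web service, and legacy-system calls, not measurements from this study.

```python
import asyncio
import time

# Simulated backend work for each partition of a page; the delays stand in for
# a database query, a web-service call, and a legacy-system request.
PARTITIONS = [("orders", 0.30), ("inventory", 0.25), ("reports", 0.40)]

async def render_partition(name: str, backend_delay: float) -> str:
    await asyncio.sleep(backend_delay)      # waiting on a remote system
    return f"<div id='{name}'>...</div>"

async def render_sequential() -> float:
    start = time.perf_counter()
    for name, delay in PARTITIONS:
        await render_partition(name, delay)
    return time.perf_counter() - start      # ~sum of delays (about 0.95 s)

async def render_parallel() -> float:
    start = time.perf_counter()
    await asyncio.gather(*(render_partition(n, d) for n, d in PARTITIONS))
    return time.perf_counter() - start      # ~max of delays (about 0.40 s)

if __name__ == "__main__":
    print("sequential:", asyncio.run(render_sequential()))
    print("parallel:  ", asyncio.run(render_parallel()))
```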

    Automatic Generation of Thematically Focused Information Portals from Web Data

    Finding the desired information on the Web is often a hard and time-consuming task. This thesis presents a methodology for the automatic generation of thematically focused portals from Web data. The key component of the proposed Web retrieval framework is the thematically focused Web crawler, which is interested only in a specific, typically small, set of topics. The focused crawler uses classification methods to filter fetched documents and to identify the Web sources most likely to be relevant for further downloads. We show that the human effort required to prepare a focused crawl can be minimized by automatically extending the training dataset with additional training samples, coined archetypes. This thesis introduces the combination of classification results with link-based authority ranking for selecting archetypes, together with periodic re-training of the classifier. We also explain the architecture of the focused Web retrieval framework and discuss the results of comprehensive use-case studies and evaluations with the prototype system BINGO!. Furthermore, the thesis addresses aspects of crawl postprocessing, such as refinement of the topic structure and restrictive document filtering. We introduce postprocessing methods and meta methods that are applied in a restrictive manner, i.e. by leaving out uncertain documents rather than assigning them to inappropriate topics or clusters with low confidence. We also introduce a methodology for collaborative crawl postprocessing by multiple cooperating users in a distributed environment, such as a peer-to-peer overlay network. An important aspect of a thematically focused Web portal is the ranking of search results. This thesis addresses search personalization by aggregating explicit or implicit feedback from multiple users and capturing topic-specific search patterns in profiles. Furthermore, we consider advanced link-based authority ranking algorithms that exploit crawl-specific information, such as classification confidence grades for particular documents. This is achieved by weighting the edges in the link graph of the crawl and by adding virtual links between highly relevant documents of the topic. The results of our systematic evaluation on multiple reference collections and real Web data show the viability of the proposed methodology.
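    As a minimal sketch of the ranking idea described above, the function below runs a PageRank-style iteration over a crawl's link graph with edges weighted by the classifier's confidence in the target document, and with virtual links added between highly confident documents. The specific weighting, the 0.9 threshold for virtual links, and the handling of dangling nodes are assumptions of this sketch, not the thesis's exact formulation.

```python
import numpy as np

def confidence_weighted_pagerank(links, confidence, damping=0.85, iters=50,
                                 hot_threshold=0.9):
    """Authority ranking over a crawl graph with confidence-weighted edges.

    `links` maps each document id to the ids it links to; `confidence` holds
    the classifier's topic-confidence per document, in [0, 1].
    """
    docs = sorted(confidence)
    idx = {d: i for i, d in enumerate(docs)}
    n = len(docs)
    w = np.zeros((n, n))
    for src, targets in links.items():
        for tgt in targets:
            w[idx[tgt], idx[src]] = confidence[tgt]   # weight edge by target confidence
    # Virtual links between highly relevant documents of the topic.
    hot = [d for d in docs if confidence[d] >= hot_threshold]
    for a in hot:
        for b in hot:
            if a != b:
                w[idx[b], idx[a]] = max(w[idx[b], idx[a]], confidence[b])
    col_sums = w.sum(axis=0)
    col_sums[col_sums == 0] = 1.0          # dangling nodes: their rank mass is simply dropped here
    w = w / col_sums                       # column-normalize outgoing weights
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        rank = (1 - damping) / n + damping * (w @ rank)
    return {d: float(rank[idx[d]]) for d in docs}
```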

    A complex situation in data recovery

    The research considers an unusual situation in data recovery. Data recovery is the process of recovering data from recording media that are not accessible by normal means. Provided that the data has not been overwritten and the recording medium is not physically damaged, this is usually a relatively simple process of either repairing the file system so that the file(s) may be accessed as usual, or finding the data on the medium and copying it directly from the medium into normal file(s). The data in this recovery situation is recorded by specialist call centre recording equipment and is stored on the recording medium in a proprietary format whereby simultaneous conversations are multiplexed together and can only be accessed by using associated metadata records. The value of the recorded data may be very high, especially in the financial sector, where it may be considered a legal audit of business transactions. When a failure occurs and data needs to be recovered, both the data and the metadata information must be recreated before a single call can be replayed. A key component in accessing this information is the location metadata that identifies the location of the required components on the medium. If the metadata is corrupted, incomplete, or wrong, then a repair cannot proceed until it is corrected. This research focuses on the problem of verifying this location metadata. Initially it was believed that only a small set of errors would exist, and work centred on detecting these errors by presenting the information to engineers in an at-a-glance image. When the extent of the possible errors was realised, an attempt was made to deduce the location metadata by exploring the content of the recorded medium. Although successful in one instance, the process was not able to distinguish between current and previous uses. Eventually, insights gained from exploring the recording application's source code permitted an intelligent trial-and-error process that deduced the underlying medium-apportioning formula. It was then possible to incorporate this formula into the heuristics generating the at-a-glance image, creating an artefact that could verify the location metadata for any given repair. After discovering the formula, the research returned to the media exploration and produced the disk fingerprinting technique, which gave valuable insights into error states in call centre recording and provided a new way of seeing the contents of a hard drive. This research provided the following contributions: (1) a means by which the recording systems' location metadata can be verified and repaired; (2) as a result of this verification, greater automation of the recovery process before human verification is required; and (3) the disk fingerprinting process, which has already given insights into the recording system's problems and provides a new way of seeing the contents of recording media.
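    The disk fingerprinting idea can be sketched, under assumptions, as a block-by-block scan of a drive image that classifies each block into a coarse category (for example by entropy), so that the whole medium can be rendered as an at-a-glance map. The block size, the entropy threshold, and the categories below are illustrative; the thesis's actual technique is not reproduced here.

```python
import math
from collections import Counter

BLOCK_SIZE = 4096  # bytes per block; an assumed granularity, not the thesis's value

def block_entropy(block: bytes) -> float:
    """Shannon entropy of a block in bits per byte (0 = uniform fill, 8 = random)."""
    if not block:
        return 0.0
    counts = Counter(block)
    total = len(block)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def fingerprint(image_path: str):
    """Classify each block of a drive image into a coarse category.

    The categories (empty / structured / dense) are illustrative; a real
    fingerprint could also match known format signatures.
    """
    categories = []
    with open(image_path, "rb") as img:
        while block := img.read(BLOCK_SIZE):
            if block.count(0) == len(block):
                categories.append("empty")
            elif block_entropy(block) < 6.0:
                categories.append("structured")   # text, metadata, sparse data
            else:
                categories.append("dense")        # compressed, encrypted, or audio-like
    return categories
```

    Rendering one coloured pixel per block, row by row, would yield the kind of at-a-glance image referred to above.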

    The InfoSec Handbook

    Computer science

    Fine-grained, Content-agnostic Network Traffic Analysis for Malicious Activity Detection

    The rapid evolution of malicious activities in network environments necessitates the development of more effective and efficient detection and mitigation techniques. Traditional traffic analysis (TA) approaches have demonstrated limited efficacy and performance in detecting various malicious activities, resulting in a pressing need for more advanced solutions. To fill this gap, this dissertation proposes several new fine-grained network traffic analysis (FGTA) approaches. These approaches focus on (1) detecting previously hard-to-detect malicious activities by deducing fine-grained, detailed application-layer information in privacy-preserving manners, (2) enhancing usability by providing more explainable results and better adaptability to different network environments, and (3) combining network traffic data with endpoint information to provide users with more comprehensive and accurate protection. We begin by conducting a comprehensive survey of existing FGTA approaches. We then propose CJ-Sniffer, a privacy-aware cryptojacking detection system that efficiently detects cryptojacking traffic. CJ-Sniffer is the first approach to distinguish cryptojacking traffic from user-initiated cryptocurrency mining traffic, allowing for a level of fine-grained traffic discrimination that has proven challenging to achieve with traditional TA methodologies. Next, we introduce BotFlowMon, a learning-based, content-agnostic approach for detecting online social network (OSN) bot traffic, which has posed a significant challenge to traditional TA strategies. BotFlowMon is an FGTA approach that relies only on content-agnostic flow-level data as input and utilizes novel algorithms and techniques to distinguish social bot traffic from real OSN user traffic. To enhance the usability of FGTA-based attack detection, we propose a learning-based DDoS detection approach that emphasizes both explainability and adaptability, providing network administrators with insightful explanatory information and with models that adapt to new network environments. Finally, we present a reinforcement learning-based defense against L7 DDoS attacks that combines network traffic data with endpoint information: it actively monitors and analyzes the victim server and applies different strategies under different conditions to protect the server while minimizing collateral damage to legitimate requests. Our evaluation results demonstrate that the proposed approaches achieve high accuracy and efficiency in detecting and mitigating various malicious activities while maintaining privacy-preserving features, providing explainable and adaptable results, or providing comprehensive application-layer situational awareness. This dissertation significantly advances the fields of FGTA and malicious activity detection. This dissertation includes published and unpublished co-authored materials.
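    As a content-agnostic illustration of the flow-level input such approaches work from, the sketch below turns a single flow record (packet timestamps and sizes only, no payloads) into a fixed-length feature vector that a classifier could consume. The `FlowRecord` type and the chosen statistics are assumptions for this sketch, not the feature sets used by CJ-Sniffer or BotFlowMon.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List

@dataclass
class FlowRecord:
    """Content-agnostic flow-level record (no payloads), e.g. derived from NetFlow/IPFIX."""
    timestamps: List[float]   # per-packet arrival times within the flow
    sizes: List[int]          # per-packet sizes in bytes
    dst_port: int

def flow_features(flow: FlowRecord) -> List[float]:
    """Turn one flow into a fixed-length feature vector for a classifier.

    Duration, volume, packet-size statistics, and inter-arrival statistics are
    a common content-agnostic choice of features; they are illustrative only.
    """
    gaps = [b - a for a, b in zip(flow.timestamps, flow.timestamps[1:])] or [0.0]
    return [
        flow.timestamps[-1] - flow.timestamps[0],  # flow duration
        len(flow.sizes),                           # packet count
        sum(flow.sizes),                           # total bytes
        mean(flow.sizes),
        pstdev(flow.sizes),
        mean(gaps),
        pstdev(gaps),
        float(flow.dst_port),
    ]
```

    Such vectors would then be fed to a standard supervised classifier trained on labelled benign and malicious flows.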
