
    An Optimal Trade-off between Content Freshness and Refresh Cost

    Caching is an effective mechanism for reducing bandwidth usage and alleviating server load. However, the use of caching entails a compromise between content freshness and refresh cost. Refreshing too often keeps content very fresh but consumes more system resources; conversely, refreshing too rarely saves resources but lets cached content grow stale. To address this freshness-cost problem, we formulate the refresh scheduling problem with a generic cost model and use this model to determine an optimal refresh frequency that gives the best trade-off between refresh cost and content freshness. We prove the existence and uniqueness of an optimal refresh frequency under the assumptions that content updates arrive according to a Poisson process and that the age-related cost increases monotonically as freshness decreases. In addition, we provide an analytic comparison of system performance under fixed refresh scheduling and random refresh scheduling, showing that, for the same average refresh frequency, the two scheduling policies are mathematically equivalent in terms of the long-run average cost.
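
    The abstract does not spell out the cost model, but a minimal numerical sketch under the stated assumptions (Poisson content updates and an age-related cost that grows as the copy gets staler) illustrates the trade-off. The Python snippet below assumes a fixed cost per refresh and a cost linear in the age of the cached copy, and searches for the refresh period that minimizes the long-run average cost; the age formula and all parameter values are illustrative assumptions, not the paper's exact model.

        import math

        def long_run_cost(T, lam, c_refresh, c_age):
            # Refresh cost per unit time plus a linear age-related cost.
            # With Poisson(lam) updates and a fixed refresh period T, the
            # time-averaged age of the cached copy is
            #   T/2 - 1/lam + (1 - exp(-lam*T)) / (lam**2 * T).
            avg_age = T / 2 - 1 / lam + (1 - math.exp(-lam * T)) / (lam ** 2 * T)
            return c_refresh / T + c_age * avg_age

        def optimal_period(lam, c_refresh, c_age):
            # Coarse log-spaced grid search for the period minimizing the cost.
            candidates = [10 ** (k / 100) for k in range(-300, 301)]  # ~0.001 .. 1000
            return min(candidates, key=lambda T: long_run_cost(T, lam, c_refresh, c_age))

        # Illustrative numbers: updates arrive about once per hour on average,
        # each refresh costs 5 units, staleness costs 1 unit per hour of age.
        T_opt = optimal_period(lam=1.0, c_refresh=5.0, c_age=1.0)
        print(T_opt, long_run_cost(T_opt, lam=1.0, c_refresh=5.0, c_age=1.0))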

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed for one domain in other domains.
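
    The survey covers many extraction techniques; as one minimal, self-contained illustration of wrapper-style extraction, the Python sketch below uses the standard library's html.parser to pull (anchor text, URL) pairs out of a page. The sample markup and the choice of anchors as the target records are illustrative assumptions, not a technique taken from the survey.

        from html.parser import HTMLParser

        class LinkExtractor(HTMLParser):
            # Collects (anchor text, href) pairs from an HTML document.
            def __init__(self):
                super().__init__()
                self.records = []
                self._href = None
                self._text = []

            def handle_starttag(self, tag, attrs):
                if tag == "a":
                    self._href = dict(attrs).get("href")
                    self._text = []

            def handle_data(self, data):
                if self._href is not None:
                    self._text.append(data)

            def handle_endtag(self, tag):
                if tag == "a" and self._href is not None:
                    self.records.append(("".join(self._text).strip(), self._href))
                    self._href = None

        extractor = LinkExtractor()
        extractor.feed("<ul><li><a href='/a'>First item</a></li><li><a href='/b'>Second</a></li></ul>")
        print(extractor.records)  # [('First item', '/a'), ('Second', '/b')]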

    CLEAR: a credible method to evaluate website archivability

    Web archiving is crucial to ensure that cultural, scientific and social heritage on the web remains accessible and usable over time. A key aspect of the web archiving process is optimal data extraction from target websites. This procedure is difficult for reasons such as website complexity, the plethora of underlying technologies and, ultimately, the open-ended nature of the web. The purpose of this work is to establish the notion of Website Archivability (WA) and to introduce the Credible Live Evaluation of Archive Readiness (CLEAR) method to measure WA for any website. Website Archivability captures the core aspects of a website that are crucial in diagnosing whether it can be archived with completeness and accuracy. An appreciation of the archivability of a website should provide archivists with a valuable tool when assessing the possibilities of archiving material, and should influence web design professionals to consider the implications of their design decisions on the likelihood that their sites can be archived. A prototype application, archiveready.com, has been built to demonstrate the viability of the proposed method for assessing Website Archivability.
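
    The abstract does not enumerate the CLEAR facets, so the following is only a toy sketch of the general idea of automatically scoring archive readiness: probe a few signals that crawlers and archivers commonly rely on (site reachability, robots.txt, a sitemap) and combine them into a score. The chosen facets, equal weighting, and score scale are assumptions for illustration, not the CLEAR method itself.

        from urllib.parse import urljoin
        from urllib.request import Request, urlopen

        def _reachable(url):
            # True if the URL answers with a 2xx status within 10 seconds.
            try:
                req = Request(url, headers={"User-Agent": "archivability-check/0.1"})
                with urlopen(req, timeout=10) as resp:
                    return 200 <= resp.status < 300
            except OSError:
                return False

        def archivability_score(base_url):
            # Toy facets: the site is reachable, exposes robots.txt, exposes a sitemap.
            checks = {
                "reachable": _reachable(base_url),
                "robots_txt": _reachable(urljoin(base_url, "/robots.txt")),
                "sitemap": _reachable(urljoin(base_url, "/sitemap.xml")),
            }
            return 100.0 * sum(checks.values()) / len(checks), checks

        print(archivability_score("https://example.org/"))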

    A Brief History of Web Crawlers

    Web crawlers visit internet applications, collect data, and learn about new web pages from the pages they visit. Web crawlers have a long and interesting history. Early web crawlers collected statistics about the web. In addition to collecting statistics about the web and indexing applications for search engines, modern crawlers can be used to perform accessibility and vulnerability checks on an application. The rapid expansion of the web and the complexity added to web applications have made the process of crawling very challenging. Throughout the history of web crawling, many researchers and industrial groups have addressed the different issues and challenges that web crawlers face, and various solutions have been proposed to reduce the time and cost of crawling. Performing an exhaustive crawl remains a challenging problem, and automatically capturing the model of a modern web application and extracting data from it is another open question. What follows is a brief history of the different techniques and algorithms used from the early days of crawling up to the present. We introduce criteria to evaluate the relative performance of web crawlers and, based on these criteria, plot the evolution of web crawlers and compare their performance.
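
    As a minimal illustration of the classic crawl loop surveyed here (a frontier queue, fetching, link extraction, and duplicate avoidance), the Python sketch below implements a toy breadth-first crawler. It deliberately ignores robots.txt, politeness delays, and JavaScript-rendered content, which are among the very challenges the paper discusses; the regex-based link extraction is an illustrative simplification.

        import re
        from collections import deque
        from urllib.parse import urljoin, urldefrag
        from urllib.request import urlopen

        HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

        def crawl(seed, max_pages=20):
            frontier = deque([seed])   # URLs waiting to be fetched (breadth-first)
            seen = {seed}              # avoid revisiting pages
            pages = {}                 # url -> raw HTML
            while frontier and len(pages) < max_pages:
                url = frontier.popleft()
                try:
                    with urlopen(url, timeout=10) as resp:
                        html = resp.read().decode("utf-8", errors="replace")
                except OSError:
                    continue
                pages[url] = html
                for href in HREF_RE.findall(html):
                    link, _ = urldefrag(urljoin(url, href))  # resolve relative links, drop fragments
                    if link.startswith("http") and link not in seen:
                        seen.add(link)
                        frontier.append(link)
            return pages

        # pages = crawl("https://example.org/")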

    Cloud WorkBench - Infrastructure-as-Code Based Cloud Benchmarking

    To deploy their applications optimally, users of Infrastructure-as-a-Service clouds need to evaluate the costs and performance of different combinations of cloud configurations to find out which combination provides the best service level for their specific application. Unfortunately, benchmarking cloud services is cumbersome and error-prone. In this paper, we propose an architecture and concrete implementation of a cloud benchmarking Web service, which fosters the definition of reusable and representative benchmarks. In contrast to existing work, our system is based on the notion of Infrastructure-as-Code, a state-of-the-art concept for defining IT infrastructure in a reproducible, well-defined, and testable way. We demonstrate our system with an illustrative case study, in which we measure and compare the disk IO speeds of different instance and storage types in Amazon EC2.
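
    Cloud WorkBench itself defines benchmarks as Infrastructure-as-Code rather than as ad-hoc scripts, and its actual API is not reproduced here. Purely to illustrate the kind of measurement the case study reports, the Python sketch below estimates sequential disk write throughput on an instance by timing block writes followed by an fsync; the file path, data volume, and block size are arbitrary assumptions.

        import os
        import time

        def disk_write_throughput(path="/tmp/io_benchmark.bin", total_mb=256, block_kb=1024):
            # Write total_mb of data in block_kb blocks and return an estimate in MB/s.
            block = os.urandom(block_kb * 1024)
            blocks = (total_mb * 1024) // block_kb
            start = time.perf_counter()
            with open(path, "wb") as f:
                for _ in range(blocks):
                    f.write(block)
                f.flush()
                os.fsync(f.fileno())  # flush to the device so the page cache is not all we measure
            elapsed = time.perf_counter() - start
            os.remove(path)
            return total_mb / elapsed

        print(f"sequential write: {disk_write_throughput():.1f} MB/s")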

    Design and Analysis of a Dynamically Configured Log-based Distributed Security Event Detection Methodology

    Military and defense organizations rely upon the security of data stored in, and communicated through, their cyber infrastructure to fulfill their mission objectives. It is essential to identify threats to the cyber infrastructure in a timely manner, so that mission risks can be recognized and mitigated. Centralized event logging and correlation is a proven method for identifying threats to cyber resources. However, centralized event logging is inflexible and does not scale well, because it consumes excessive network bandwidth and imposes significant storage and processing requirements on the central event log server. In this paper, we present a flexible, distributed event correlation system designed to overcome these limitations by distributing the event correlation workload across the network of event-producing systems. To demonstrate the utility of the methodology, we model and simulate centralized, decentralized, and hybrid log analysis environments over three accountability levels and compare their performance in terms of detection capability, network bandwidth utilization, database query efficiency, and configurability. The results show that, compared to centralized event correlation, dynamically configured distributed event correlation provides increased flexibility, a significant reduction in network traffic in low- and medium-accountability environments, and a decrease in database query execution time in the high-accountability case.
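
    As a minimal sketch of the distribution idea (not the authors' implementation), each event-producing host can apply local rules to its own log stream and forward only matching events, so the central correlator sees a fraction of the raw traffic. The toy Python example below forwards authentication failures and flags a source address that fails on several distinct hosts; the log format, field positions, and threshold are assumptions made up for illustration.

        from collections import defaultdict

        # Local rule on each host: forward only authentication failures
        # (the "AUTH_FAIL ... from <ip>" log format is an assumption).
        def local_filter(host, log_lines):
            for line in log_lines:
                if "AUTH_FAIL" in line:
                    src = line.split()[-1]  # assume the source address is the last field
                    yield {"host": host, "event": "AUTH_FAIL", "src": src}

        # Central correlation: flag a source seen failing on several distinct hosts.
        def correlate(forwarded_events, host_threshold=3):
            hosts_per_src = defaultdict(set)
            for ev in forwarded_events:
                hosts_per_src[ev["src"]].add(ev["host"])
            return [src for src, hosts in hosts_per_src.items() if len(hosts) >= host_threshold]

        logs = {
            "web01": ["AUTH_FAIL user=bob from 10.0.0.9", "GET /index.html from 10.0.0.5"],
            "web02": ["AUTH_FAIL user=root from 10.0.0.9"],
            "db01":  ["AUTH_FAIL user=admin from 10.0.0.9"],
        }
        forwarded = [ev for host, lines in logs.items() for ev in local_filter(host, lines)]
        print(correlate(forwarded))  # ['10.0.0.9']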

    An integrating text retrieval framework for Digital Ecosystems Paradigm

    The purpose of this research is to provide effective information retrieval services for digital "organisms" in a digital ecosystem by leveraging the power of Web search technology. A novel integrating digital ecosystem search framework (itself a new digital organism) is proposed which combines Web search technology with traditional database searching techniques to provide economic organisms with comprehensive, dynamic, and organization-oriented information retrieval ranging from the Internet to the personal (semantic) desktop.

    Carbon information disclosure of enterprises and their value creation through market liquidity and cost of equity capital

    Purpose: Drawing on asymmetric information and stakeholder theories, this paper investigates two mechanisms, namely market liquidity and cost of equity capital, by which the carbon information disclosure of enterprises can benefit their value creation. Design/methodology/approach: In this research, web crawler technology is employed to study the link between carbon information disclosure and enterprise value creation, and the carbon information data are drawn from all companies listed in the Chinese A-share market. Findings: The results show that carbon information disclosure has a significant positive influence on enterprise value creation, which is embodied in the relationship between the quantity and depth of carbon information disclosure and enterprise value creation; market liquidity and cost of equity capital play a partially mediating role, while the influence of the quality and concentration of carbon information disclosure on enterprise value creation is not statistically significant. Research limitations/implications: This paper explains in depth the influence path and mechanism linking carbon information disclosure and enterprise value creation, and answers the question of whether carbon information disclosure affects enterprise value creation in China. Practical implications: The finding that carbon information disclosure contributes positively to enterprise value creation suggests that managers can reap more financial benefits by disclosing more carbon information and investing in carbon emissions management, so managers should strengthen the management of carbon information disclosure behavior. Originality/value: The paper gives a different perspective on the influence of carbon information disclosure on enterprise value creation, and suggests a new direction for understanding carbon information disclosure behavior.