
    How Much of the Web Is Archived?

    Although the Internet Archive's Wayback Machine is the largest and best-known web archive, a number of public web archives have emerged in recent years. With varying resources, audiences, and collection development policies, these archives overlap with each other to varying degrees. While individual archives can be measured in terms of the number of URIs, the number of copies per URI, and their intersection with other archives, to date there has been no answer to the question "How much of the Web is archived?" We study this question by approximating the Web using sample URIs from DMOZ, Delicious, Bitly, and search engine indexes, and counting how many copies of the sample URIs exist in various public web archives. Each sample set carries its own bias. The results from our sample sets indicate that 35%-90% of the Web has at least one archived copy, 17%-49% has between 2-5 copies, 1%-8% has 6-10 copies, and 8%-63% has more than 10 copies in public web archives. The number of URI copies varies as a function of time, but no more than 31.3% of URIs are archived more than once per month.
    Comment: This is the long version of the short paper by the same title published at JCDL'11. 10 pages, 5 figures, 7 tables. Version 2 includes minor typographical corrections.
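
    The counting step can be reproduced against at least one of the surveyed archives. Below is a minimal Python sketch using the Internet Archive's public CDX API; the sample URIs are stand-ins for the DMOZ/Delicious/Bitly/search-engine samples, and the buckets mirror the ranges reported above.

        import requests

        CDX = "http://web.archive.org/cdx/search/cdx"

        def copy_count(uri):
            # One CDX row per capture; the first row is a column header.
            resp = requests.get(CDX, params={"url": uri, "output": "json"}, timeout=30)
            resp.raise_for_status()
            rows = resp.json() if resp.text.strip() else []
            return max(len(rows) - 1, 0)

        def bucket(n):
            # Buckets mirror the ranges reported in the abstract.
            if n == 0:
                return "unarchived"
            if n == 1:
                return "1 copy"
            if n <= 5:
                return "2-5 copies"
            if n <= 10:
                return "6-10 copies"
            return "more than 10 copies"

        for uri in ["example.com", "wikipedia.org"]:  # stand-ins for sampled URIs
            print(uri, "->", bucket(copy_count(uri)))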

    Towards Adversarial Malware Detection: Lessons Learned from PDF-based Attacks

    Malware still constitutes a major threat in the cybersecurity landscape, due in part to the widespread use of infection vectors such as documents. These infection vectors hide embedded malicious code from the victim users, facilitating the use of social engineering techniques to infect their machines. Research has shown that machine-learning algorithms provide effective detection mechanisms against such threats, but the existence of an arms race in adversarial settings has recently challenged such systems. In this work, we focus on malware embedded in PDF files as a representative case of this arms race. We start by providing a comprehensive taxonomy of the different approaches used to generate PDF malware and of the corresponding learning-based detection systems. We then categorize threats specifically targeted against learning-based PDF malware detectors, using a well-established framework from the field of adversarial machine learning. This framework allows us to categorize known vulnerabilities of learning-based PDF malware detectors and to identify novel attacks that may threaten such systems, along with the potential defense mechanisms that can mitigate the impact of such threats. We conclude by discussing how these findings highlight promising research directions towards tackling the more general challenge of designing robust malware detectors in adversarial settings.
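
    To ground the discussion, here is a minimal Python sketch of the kind of static, keyword-count features that many learning-based PDF detectors build on (in the spirit of tools like PDFiD); the keyword list is an illustrative subset, not the paper's taxonomy.

        import re

        # Structural keywords whose raw counts commonly serve as features
        # (illustrative subset; real detectors use many more).
        KEYWORDS = [b"/JavaScript", b"/JS", b"/OpenAction", b"/AA",
                    b"/Launch", b"/EmbeddedFile", b"/ObjStm"]

        def extract_features(path):
            with open(path, "rb") as f:
                data = f.read()
            # Count each keyword in the raw byte stream; evasive samples
            # obfuscate these markers, which is the arms race at issue.
            return [len(re.findall(re.escape(k), data)) for k in KEYWORDS]

        # A vector like this would be fed to a trained classifier (e.g., a
        # random forest); adversarial attacks perturb the file so the vector
        # crosses the decision boundary while the payload still executes.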

    Archiving scientific data

    We present an archiving technique for hierarchical data with key structure. Our approach is based on the notion of timestamps, whereby an element appearing in multiple versions of the database is stored only once, along with a compact description of the versions in which it appears. The basic idea of timestamping was introduced by Driscoll et al. in the context of persistent data structures, where one wishes to track the sequence of changes made to a data structure. We extend this idea to develop an archiving tool for XML data that is capable of providing meaningful change descriptions and can also efficiently support a variety of basic functions concerning the evolution of data, such as retrieval of any specific version from the archive and querying the temporal history of any element. This is in contrast to diff-based approaches, where such operations may require undoing a large number of changes or significant reasoning with the deltas. Surprisingly, our archiving technique does not incur any significant space overhead when contrasted with other approaches. Our experimental results support this and also show that the compacted archive file interacts well with other compression techniques. Finally, another useful property of our approach is that the resulting archive is itself XML and hence can directly leverage existing XML tools.
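
    A toy Python sketch of the timestamping idea follows: an element shared across versions is stored once and annotated with the versions in which it appears. The flat, key-attribute layout is a simplification of the paper's handling of nested XML with key structure.

        import xml.etree.ElementTree as ET

        def merge_version(archive, version_root, vnum):
            # Index already-archived elements by their key.
            by_key = {el.get("key"): el for el in archive}
            for el in version_root:
                kept = by_key.get(el.get("key"))
                if kept is None:                      # first appearance: store once
                    kept = ET.SubElement(archive, el.tag, key=el.get("key"))
                    kept.text = el.text
                    kept.set("versions", str(vnum))
                else:                                 # seen before: extend timestamp
                    kept.set("versions", kept.get("versions") + "," + str(vnum))

        archive = ET.Element("archive")
        v1 = ET.fromstring('<db><item key="a">x</item><item key="b">y</item></db>')
        v2 = ET.fromstring('<db><item key="a">x</item><item key="c">z</item></db>')
        merge_version(archive, v1, 1)
        merge_version(archive, v2, 2)
        # Retrieving version 2 now means selecting elements whose "versions"
        # list contains 2 -- no deltas need to be undone.
        print(ET.tostring(archive).decode())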

    Automatic discovery of high-level provenance using semantic similarity

    As interest in provenance grows within the Semantic Web community, it is being recognized as a useful tool across many domains. However, existing automatic provenance collection techniques are not universally applicable. Most existing methods either rely on (low-level) observed provenance or require that the user disclose formal workflows. In this paper, we propose a new approach for the automatic discovery of provenance at multiple levels of granularity. To accomplish this, we detect entity derivations, relying on clustering algorithms, linked data, and semantic similarity. The resulting derivations are structured in compliance with the Provenance Data Model (PROV-DM). While the proposed approach is deliberately kept general, allowing adaptation to many use cases, we provide an implementation for one of these use cases, namely discovering the sources of news articles. With this implementation, we were able to detect 73% of the original sources of 410 news stories, at 68% precision. Lastly, we discuss possible improvements and future work.
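
    The derivation-detection step can be sketched in a few lines of Python; the TF-IDF/cosine measure, the mini-corpus, and the 0.5 cutoff below are illustrative stand-ins for the paper's semantic-similarity component.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        articles = {  # invented mini-corpus standing in for the news stories
            "story":  "Parliament passed the new data protection bill today.",
            "source": "The data protection bill was passed by parliament.",
            "other":  "Local team wins the regional football championship.",
        }

        names = list(articles)
        tfidf = TfidfVectorizer().fit_transform(articles.values())
        sim = cosine_similarity(tfidf)

        THRESHOLD = 0.5  # assumed cutoff; tuned per use case in practice
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                if sim[i, j] >= THRESHOLD:
                    # PROV-DM derivation; publication timestamps would orient
                    # the edge, so assume names[j] is the older entity here.
                    print(f"prov:wasDerivedFrom({names[i]}, {names[j]})")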

    The missing link: theoretical reflections on decision reconstruction

    In this paper, we address theoretical considerations on the problem of decision reconstruction, defined as the process that allows an individual or group of individuals, whether internal or external to the organization, to understand how a group, using a group support system (GSS), reached a previous decision. We also analyze the implications of decision reconstruction for both GSS research and knowledge management. We present an information model whose constituent elements address not only GSS decision-making but also GSS decision reconstruction. Using a GSS prototype based on the proposed model, we conducted a preliminary test to analyze how different people act when reconstructing decisions. In the process, we detected limitations of the approach and propose a solution to overcome them.

    Product Differentiation for Software-as-a-Service Providers

    The market for the new provisioning type Software-as-a-Service (SaaS) has reached a significant size and still shows enormous growth rates. By varying the size of SaaS products, providers can improve their market position and profits by successfully acting in the tension area of customer acquisition, pricing, and costs. We first elaborate differences concerning product differentiation between classic software provisioning models and SaaS. Then, we introduce a micro-economic decision model to maximize the return of a provider by finding an optimal granularity, i.e., by varying the size of services. This paper makes two contributions in this context: (1) it provides a conceptual foundation for product differentiation within the scope of SaaS, and (2) it presents the first implementation of variable reproduction costs for web-based software offers. The model is illustrated by a real-world case with data from a SaaS provider.
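
    As a toy numeric illustration of such a decision model (the demand and cost functions below are invented placeholders, not the paper's calibrated model), the optimal granularity emerges where the marginal revenue of an extra service variant no longer covers its reproduction cost:

        def profit(n, market=1000, price=10.0, cost_per_variant=400.0):
            # More variants capture more of the market, with diminishing
            # returns ...
            customers = market * (1 - 0.5 ** n)
            revenue = customers * price
            # ... but each variant adds reproduction/maintenance cost.
            return revenue - cost_per_variant * n

        best = max(range(1, 11), key=profit)
        print("optimal granularity:", best, "profit:", round(profit(best), 2))
        # With these placeholder numbers the optimum is n = 4 variants.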

    Gathering Knowledge from Social Knowledge Management Environments: Validation of an Anticipatory Standard

    Knowledge management increasingly takes place in social environments supported by social software. This directly changes the way knowledge workers interact and the way information and communication technology is used. Recent studies, striving to provide more appropriate support for knowledge work, face challenges when eliciting knowledge from user activities and maintaining its situatedness in context. Corresponding solutions in such social environments are not interoperable due to a lack of appropriate standards. To bridge this gap, we propose and validate a first specification of an anticipatory standard in this field. We illustrate its application and utility by analyzing three scenarios. As the main result, we analyze the lessons learned and provide insights into further research and development of our approach. In doing so, we aim to stimulate discussion and build support for this initiative towards establishing standards in the domain of knowledge management.