2,632 research outputs found

    A Brief History of Web Crawlers

    Web crawlers visit internet applications, collect data, and learn about new web pages from the pages they visit. Web crawlers have a long and interesting history. Early web crawlers collected statistics about the web. In addition to collecting statistics and indexing applications for search engines, modern crawlers can be used to perform accessibility and vulnerability checks on an application. The rapid expansion of the web, and the complexity added to web applications, have made crawling a very challenging process. Throughout the history of web crawling, many researchers and industrial groups have addressed the different issues and challenges that web crawlers face, and different solutions have been proposed to reduce the time and cost of crawling. How to perform an exhaustive crawl remains a challenging question, as does automatically capturing the model of a modern web application and extracting data from it. What follows is a brief history of the different techniques and algorithms used from the early days of crawling to the present day. We introduce criteria to evaluate the relative performance of web crawlers, and based on these criteria we plot the evolution of web crawlers and compare their performance.
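The frontier-driven crawl this abstract describes (visit a page, collect its data, learn new pages from its links) can be sketched as a breadth-first traversal. This is an illustrative sketch, not any specific crawler from the literature; the injected `fetch` callable is an assumption that keeps the sketch network-agnostic.

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets of <a> tags from one HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl starting from `seed`.

    `fetch(url)` returns the page's HTML. The frontier queue holds
    discovered-but-unvisited URLs; `visited` prevents re-crawling,
    and `max_pages` bounds the crawl (an exhaustive crawl is rarely
    feasible, as the abstract notes).
    """
    frontier = deque([seed])
    visited = set()
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = fetch(url)
        except Exception:
            continue  # unreachable page: skip, keep crawling
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in visited:
                frontier.append(link)
    return pages
```

A real deployment would add URL normalization, robots.txt checks, and politeness delays; swapping the queue for a priority queue turns this into a focused crawler.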

    Libraries and Museums in the Flat World: Are They Becoming Virtual Destinations?

    In his recent book, “The World Is Flat”, Thomas L. Friedman reviews the impact of networks on globalization. The emergence of the Internet, web browsers, computer applications talking to each other through the Internet, and open source software, among others, made the world flatter and created an opportunity for individuals to collaborate and compete globally. Friedman predicts that “connecting all the knowledge centers on the planet together into a single global network could usher in an amazing era of prosperity and innovation”. Networking is also changing the ways in which libraries and museums provide access to information sources and services. In the flat world, libraries and museums are no longer only a physical “place”: they are becoming “virtual destinations”. This paper discusses the implications of this transformation for the digitization and preservation of, and access to, cultural heritage resources.

    Doing blog research: the computational turn

    Blogs and other online platforms for personal writing, such as LiveJournal, have been of interest to researchers across the social sciences and humanities for a decade now. Although growth in the uptake of blogging has stalled somewhat since the heyday of blogs in the early 2000s, blogging continues to be a major genre of Internet-based communication. Indeed, even as mass participation has moved on to Facebook, Twitter, and other more recent communication phenomena, what the wave of mass adoption has left behind is a slightly smaller but all the more solidly established blogosphere of engaged and committed participants. Blogs are now an accepted part of institutional, group, and personal communications strategies (Bruns and Jacobs, 2006); in style and substance, they are situated between the more static information provided by conventional websites and webpages and the continuous newsfeeds provided through Facebook and Twitter updates. Blogs provide a vehicle for authors (and their commenters) to think through given topics in the space of a few hundred to a few thousand words – expanding, perhaps, on shorter tweets, and possibly leading to the publication of more fully formed texts elsewhere. They are also a very flexible medium: they readily provide the functionality to include images, audio, video, and other additional materials – as well as the fundamental tool of blogging, the hyperlink itself. This chapter appeared in the Sage collection Research Methods & Methodologies in Education, edited by James Arthur, Michael Waring, Robert Coe, and Larry V. Hedges. This version is a pre-print edition of the chapter.

    Data Scraping as a Cause of Action: Limiting Use of the CFAA and Trespass in Online Copying Cases

    In recent years, online platforms have used claims such as the Computer Fraud and Abuse Act (“CFAA”) and trespass to curb data scraping, the copying of web content accomplished using robots or web crawlers. However, as the term “data scraping” implies, the content typically copied is data or information that is not protected by intellectual property law, and the means by which the copying occurs is not considered hacking. Trespass and the CFAA are both concerned with authorization, but in data scraping cases these torts are used in a way that implies that real property norms exist on the Internet, a misleading and harmful analogy. To correct this imbalance, the CFAA must be interpreted in its native context, that of computers, computer networks, and the Internet, and given contextual meaning. Alternatively, the CFAA should be amended. Because data scraping is fundamentally copying, copyright offers the correct means for litigating data scraping cases. This Note additionally offers proposals for creating enforceable terms of service online and for strengthening copyright to make it applicable to user-based online platforms.

    Scraping the Social? Issues in live social research

    What makes scraping methodologically interesting for social and cultural research? This paper seeks to contribute to debates about digital social research by exploring how a ‘medium-specific’ technique for online data capture may be rendered analytically productive for social research. As a device that is currently being imported into social research, scraping has the capacity to re-structure social research in at least two ways. Firstly, as a technique that is not native to social research, scraping risks introducing ‘alien’ methodological assumptions into social research (such as a preoccupation with freshness). Secondly, to scrape is to risk importing into our inquiry categories that are prevalent in the social practices enabled by the media: scraping makes already formatted data available for social research. Scraped data, and online social data more generally, tend to come with ‘external’ analytics already built in. This circumstance is often approached as a ‘problem’ with online data capture, but we propose it may be turned into a virtue, insofar as data formats that have currency in the areas under scrutiny may serve as a source of social data themselves. Scraping, we propose, makes it possible to render the traffic between the object and process of social research analytically productive. It enables a form of ‘real-time’ social research, in which the formats and life cycles of online data may lend structure to the analytic objects and findings of social research. By way of a conclusion, we demonstrate this point in an exercise of online issue profiling, and more particularly, by relying on Twitter to profile the issue of ‘austerity’. Here we distinguish between two forms of real-time research: those dedicated to monitoring live content (which terms are current?) and those concerned with analysing the liveliness of issues (which topics are happening?).
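The closing distinction between monitoring live content and analysing liveliness can be made concrete with a small sketch. The data layout of `(topic, timestamp)` pairs and the window-share measure below are illustrative assumptions, not the paper's actual analytics: the first function ranks topics by sheer volume ("which terms are current?"), the second by how much of each topic's activity falls in the most recent time window ("which topics are happening?").

```python
from collections import Counter

def current_topics(posts, top=3):
    """Monitoring live content: rank topics by total volume of mentions.
    `posts` is an iterable of (topic, timestamp) pairs."""
    counts = Counter(topic for topic, _ in posts)
    return [topic for topic, _ in counts.most_common(top)]

def topic_liveliness(posts, now, window=3600.0):
    """Analysing liveliness: for each topic, the share of its mentions
    that fall inside the most recent `window` seconds before `now`.
    A topic with few mentions overall can still score high if all of
    its activity is happening right now."""
    total = Counter(topic for topic, _ in posts)
    recent = Counter(topic for topic, ts in posts if now - ts <= window)
    return {topic: recent[topic] / total[topic] for topic in total}
```

The two rankings can disagree, which is the point of the distinction: a high-volume topic may be dormant while a smaller one is lively.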
