
    Design, implementation and experiment of a YeSQL Web Crawler

    We describe a novel, "focusable", scalable, distributed web crawler based on GNU/Linux and PostgreSQL that we designed to be easily extensible and have released under a GNU public licence. We also report a first use case: an analysis of Twitter streams about the French 2012 presidential elections and the URLs they contain.
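    The distinctive design choice here is keeping the crawl frontier in PostgreSQL rather than in a bespoke in-memory queue. Below is a minimal sketch of that idea, assuming a hypothetical frontier table with unique url, priority and claimed columns; the paper's actual schema and locking strategy are not reproduced here.

```python
# Sketch of a PostgreSQL-backed crawl frontier shared by distributed workers.
# Table name and columns are assumptions for illustration only; the table is
# assumed to have a UNIQUE constraint on url.
import psycopg2

conn = psycopg2.connect("dbname=crawler")

def enqueue(url, priority=0):
    """Add a URL to the frontier; silently skip duplicates."""
    with conn, conn.cursor() as cur:
        cur.execute(
            """INSERT INTO frontier (url, priority, claimed)
               VALUES (%s, %s, FALSE)
               ON CONFLICT (url) DO NOTHING""",
            (url, priority),
        )

def claim_next():
    """Atomically claim the highest-priority unclaimed URL, so several
    workers can share one frontier without double-fetching."""
    with conn, conn.cursor() as cur:
        cur.execute(
            """UPDATE frontier SET claimed = TRUE
               WHERE url = (
                   SELECT url FROM frontier
                   WHERE NOT claimed
                   ORDER BY priority DESC
                   FOR UPDATE SKIP LOCKED
                   LIMIT 1)
               RETURNING url"""
        )
        row = cur.fetchone()
        return row[0] if row else None
```

    Letting the database handle deduplication and row locking (FOR UPDATE SKIP LOCKED) is what makes a plain SQL store workable as a distributed queue.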

    Buzz monitoring in word space

    This paper discusses the task of tracking mentions of some topically interesting textual entity in a continuously and dynamically changing flow of text, such as a news feed, the output of an Internet crawler or a similar text source - a task sometimes referred to as buzz monitoring. Standard approaches from the field of information access for identifying salient textual entities are reviewed, and it is argued that the dynamics of buzz monitoring call for more sophisticated analysis mechanisms than typical text-analysis tools provide today. The notion of word space is introduced, and it is argued that word spaces can be used to select the most salient markers of topicality and to find the associations those observations engender, and that they constitute an attractive foundation for building a representation well suited to tracking and monitoring mentions of the entity under consideration.
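    A rough illustration of the word-space idea: build co-occurrence vectors over a sliding window and rank other terms by cosine similarity to the monitored entity. Production word-space models (e.g. random indexing) replace raw counts with fixed-dimensional random projections; this toy version, with made-up function names, only shows the principle.

```python
# Toy word space: sliding-window co-occurrence counts plus cosine similarity,
# used to surface terms distributionally associated with a tracked entity.
from collections import Counter, defaultdict
import math

def build_space(tokens, window=2):
    """Map each word to a Counter of its neighbors within the window."""
    space = defaultdict(Counter)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                space[w][tokens[j]] += 1
    return space

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def associations(space, entity, n=5):
    """Rank other words by distributional similarity to the tracked entity."""
    target = space[entity]
    sims = {w: cosine(target, vec) for w, vec in space.items() if w != entity}
    return sorted(sims, key=sims.get, reverse=True)[:n]
```

    In a monitoring setting the space would be rebuilt or incrementally updated as new text streams in, so the association lists drift with the buzz.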

    Linking Audiences to News: A Network Analysis of Chicago Websites

    The mass media model, which sustained news and information in communities like Chicago for decades, is being replaced by a "new news ecosystem" consisting of hundreds of websites, podcasts, video streams and mobile applications. In 2009, The Chicago Community Trust set out to understand this ecosystem, assess its health and make investments in improving the flow of news and information in Chicagoland. This report is one of the products of the Trust's local information initiative, Community News Matters. "Linking Audiences to News: A Network Analysis of Chicago Websites" is one of the first -- perhaps the first -- research projects seeking to understand a local news ecosystem.

    A Proposed Architecture for Continuous Web Monitoring Through Online Crawling of Blogs

    Timely awareness of what is posted on the Web can greatly help psychologists, marketers and political analysts to become familiar with society's different needs, analyse them, make decisions and act accordingly. The sheer volume of information on the Web makes it infeasible to continuously investigate the whole Web online. Focusing on selected blogs limits the working domain and makes continuous online crawling possible. This article proposes an architecture that continuously crawls the relevant blogs online using a focused crawler, as sketched below, and investigates and analyses the data obtained. Fetching is driven by the latest announcements from ping servers. A weighted graph is formed around the important key phrases, so that the focused crawler can fetch the complete texts of the related Web pages based on that graph.
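    A minimal sketch of the focused-crawling step, assuming a hypothetical set of weighted key phrases; the ping-server notification machinery the article describes is omitted, and unfetched links are cheaply pre-scored by their URL string alone.

```python
# Focused crawler sketch: fetch pages in order of a score computed from
# weighted key phrases, so the crawl stays near the target topic.
# WEIGHTED_PHRASES and the seed URL are made up for illustration.
import heapq
import re
import requests

WEIGHTED_PHRASES = {"election": 3.0, "candidate": 2.0, "poll": 1.0}

def score(text):
    """Sum of phrase weights times phrase frequency in the text."""
    text = text.lower()
    return sum(w * len(re.findall(re.escape(p), text))
               for p, w in WEIGHTED_PHRASES.items())

def focused_crawl(seed, max_pages=20):
    frontier = [(0.0, seed)]   # min-heap of (negated score, url)
    seen = {seed}
    while frontier and max_pages > 0:
        _, url = heapq.heappop(frontier)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        max_pages -= 1
        print(f"{score(html):6.1f}  {url}")
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                # Pre-score the URL string itself until the page is fetched.
                heapq.heappush(frontier, (-score(link), link))
```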

    Bots, Seeds and People: Web Archives as Infrastructure

    The field of web archiving provides a unique mix of human and automated agents collaborating to preserve the web. Centuries-old theories of archival appraisal are being transplanted into the sociotechnical environment of the World Wide Web with varying degrees of success. The work of archivists and bots in contact with the material of the web presents a distinctive and understudied CSCW-shaped problem. To investigate this space we conducted semi-structured interviews with archivists and technologists who were directly involved in selecting content from the web for archives. These interviews identified thematic areas that inform the appraisal process in web archives, some of which are encoded in heuristics and algorithms. Making the infrastructure of web archives legible to the archivist, the automated agents and the future researcher is presented as a challenge to the CSCW and archival communities.

    Scraping the Social? Issues in live social research

    What makes scraping methodologically interesting for social and cultural research? This paper seeks to contribute to debates about digital social research by exploring how a ‘medium-specific’ technique for online data capture may be rendered analytically productive for social research. As a device currently being imported into social research, scraping has the capacity to re-structure social research in at least two ways. Firstly, as a technique that is not native to social research, scraping risks introducing ‘alien’ methodological assumptions into social research (such as a preoccupation with freshness). Secondly, to scrape is to risk importing into our inquiry categories that are prevalent in the social practices enabled by the media: scraping makes available already formatted data for social research. Scraped data, and online social data more generally, tend to come with ‘external’ analytics already built in. This circumstance is often approached as a ‘problem’ with online data capture, but we propose it may be turned into a virtue, insofar as data formats that have currency in the areas under scrutiny may serve as a source of social data themselves. Scraping, we propose, makes it possible to render traffic between the object and process of social research analytically productive. It enables a form of ‘real-time’ social research, in which the formats and life cycles of online data may lend structure to the analytic objects and findings of social research. By way of a conclusion, we demonstrate this point in an exercise of online issue profiling, and more particularly by relying on Twitter to profile the issue of ‘austerity’. Here we distinguish between two forms of real-time research: those dedicated to monitoring live content (which terms are current?) and those concerned with analysing the liveliness of issues (which topics are happening?).
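    As a toy illustration of the ‘already formatted data’ point, the sketch below adopts a format native to the medium, the hashtag, as its analytic unit rather than imposing an external coding scheme. The URL is a placeholder, not the authors' data source.

```python
# Profile an issue by counting the hashtags a page already carries: the
# medium's own formatting (#term) supplies the analytic categories.
import re
from collections import Counter
import requests

def hashtag_profile(url):
    """Fetch a page and return a frequency count of the hashtags in it."""
    html = requests.get(url, timeout=10).text
    tags = re.findall(r"#\w+", html)
    return Counter(t.lower() for t in tags)

# e.g. hashtag_profile("https://example.org/austerity-feed")  # placeholder URL
```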

    Who are we talking about? Identifying scientific populations online

    In this paper, we begin to address the question of which scientists are online. Prior studies have shown that Web users are only a segmented reflection of the actual off-line population, and thus when studying online behaviors we need to be explicit about the representativeness of the sample under study to accurately relate trends to populations. When studying social phenomena on the Web, the identification of individuals is essential to be able to generalize about specific segments of a population off-line. Specifically, we present a method, tailored to the domain of science, for assessing the online activity of a known set of actors. We apply the method to a population of Dutch computer scientists and their coauthors. The results, when combined with metadata about the set, provide insights into the representativeness of the sample of interest. The study shows that scientists of above-average tenure and performance are overrepresented online, suggesting that when studying the online behaviors of scientists we are commenting specifically on the behaviors of above-average-performing scientists. Given this finding, metrics of the Web behaviors of scientists may provide a key tool for measuring knowledge production and innovation at a faster rate than traditional, delayed bibliometric studies.
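    The outer loop of such a method might look like the sketch below: for each actor in the known set, query the Web with the name plus a disambiguating affiliation and record whether any result appears to be about that person. The search() function is a deliberately unimplemented stand-in; the paper does not prescribe a particular search backend, and all names in the usage example are hypothetical.

```python
# Assess online presence for a known set of actors. search() is a stub for
# whatever web-search API is available in practice.
def search(query):
    """Placeholder returning a list of result snippets for a query."""
    raise NotImplementedError("plug in a real search backend here")

def online_presence(scientists):
    """scientists: iterable of (name, affiliation) pairs."""
    presence = {}
    for name, affiliation in scientists:
        hits = search(f'"{name}" {affiliation}')
        # Crude relevance check: the full name must occur in a snippet.
        presence[name] = any(name.lower() in h.lower() for h in hits)
    return presence

# e.g. online_presence([("A. Researcher", "TU Delft")])  # hypothetical input
```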