Libraries and Museums in the Flat World: Are They Becoming Virtual Destinations?
In his recent book, “The World is Flat”, Thomas L. Friedman reviews the impact of networks on globalization. The emergence of the Internet, web browsers, computer applications talking to each other through the Internet, and open-source software, among others, made the world flatter and created an opportunity for individuals to collaborate and compete globally. Friedman predicts that “connecting all the knowledge centers on the planet together into a single global network…could usher in an amazing era of prosperity and innovation”. Networking is also changing the ways in which libraries and museums provide access to information sources and services. In the flat world, libraries and museums are no longer only a physical “place”: they are becoming “virtual destinations”. This paper discusses the implications of this transformation for the digitization and preservation of, and access to, cultural heritage resources.
JISC Preservation of Web Resources (PoWR) Handbook
Handbook of web preservation produced by the JISC-PoWR project, which ran from April to November 2008.
The handbook specifically addresses digital preservation issues that are relevant to the UK HE/FE web management community.
The project was undertaken jointly by UKOLN at the University of Bath and the ULCC Digital Archives department.
Scraping SERPs for Archival Seeds: It Matters When You Start
Event-based collections are often started with a web search, but the search
results you find on Day 1 may not be the same as those you find on Day 7. In
this paper, we consider collections that originate from extracting URIs
(Uniform Resource Identifiers) from Search Engine Result Pages (SERPs).
Specifically, we seek to provide insight about the retrievability of URIs of
news stories found on Google, and to answer two main questions: first, can one
"refind" the same URI of a news story (for the same query) from Google after a
given time? Second, what is the probability of finding a story on Google over a
given period of time? To answer these questions, we issued seven queries to
Google every day for over seven months (2017-05-25 to 2018-01-12) and collected
links from the first five SERPs to generate seven collections for each query.
The queries represent public interest stories: "healthcare bill," "manchester
bombing," "london terrorism," "trump russia," "travel ban," "hurricane harvey,"
and "hurricane irma." We tracked each URI in all collections over time to
estimate the discoverability of URIs from the first five SERPs. Our results
showed that the daily average rate at which stories were replaced on the
default Google SERP ranged from 0.21 - 0.54, and the weekly rate from 0.39 -
0.79, suggesting the fast replacement of older stories by newer ones. The
probability of finding the same URI of a news story after one day from the
initial appearance on the SERP ranged from 0.34 - 0.44. After a week, the
probability of finding the same news stories diminishes rapidly to 0.01 - 0.11.
Our findings suggest that due to the difficulty in retrieving the URIs of news
stories from Google, collection building that originates from search engines
should begin as soon as possible in order to capture the first stages of
events, and should persist in order to capture the evolution of the events...Comment: This is an extended version of the ACM/IEEE Joint Conference on
Digital Libraries (JCDL 2018) full paper:
https://doi.org/10.1145/3197026.3197056. Some of the figure numbers have
change
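The two measurements described in the abstract, the daily replacement rate and the probability of refinding a URI after a given lag, can be computed roughly as follows. This is a minimal sketch: the `serps_by_day` layout and the function names are illustrative assumptions, not taken from the paper's actual code.

```python
# Sketch of the two SERP-turnover measurements. `serps_by_day` maps an
# ISO date string to the set of URIs extracted from the first five SERPs
# for one query on that day (a hypothetical layout for illustration).

def daily_replacement_rate(serps_by_day):
    """Fraction of day-N URIs absent on day N+1, averaged over all days."""
    days = sorted(serps_by_day)
    rates = []
    for prev, curr in zip(days, days[1:]):
        old = serps_by_day[prev]
        if old:
            rates.append(len(old - serps_by_day[curr]) / len(old))
    return sum(rates) / len(rates) if rates else 0.0

def refind_probability(serps_by_day, lag):
    """Probability that a URI seen on one day is still present `lag` days later."""
    days = sorted(serps_by_day)
    hits = total = 0
    for i, day in enumerate(days[:-lag] if lag else days):
        later = serps_by_day[days[i + lag]]
        for uri in serps_by_day[day]:
            total += 1
            hits += uri in later
    return hits / total if total else 0.0
```

A replacement rate near 0.5 would mean roughly half of one day's stories are gone from the default SERP by the next day, matching the range the study reports.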
Data Scraping as a Cause of Action: Limiting Use of the CFAA and Trespass in Online Copying Cases
In recent years, online platforms have used claims such as the Computer Fraud and Abuse Act (“CFAA”) and trespass to curb data scraping, or copying of web content accomplished using robots or web crawlers. However, as the term “data scraping” implies, the content typically copied is data or information that is not protected by intellectual property law, and the means by which the copying occurs is not considered to be hacking. Trespass and the CFAA are both concerned with authorization, but in data scraping cases, these torts are used in such a way that implies that real property norms exist on the Internet, a misleading and harmful analogy.
To correct this imbalance, the CFAA must be interpreted in its native context, that of computers, computer networks, and the Internet, and given contextual meaning. Alternatively, the CFAA should be amended. Because data scraping is fundamentally copying, copyright offers the correct means for litigating data scraping cases. This Note additionally offers proposals for creating enforceable terms of service online and for strengthening copyright to make it applicable to user-based online platforms.
CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines
Based on the information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS Think-Tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and socio-economic perspective.
The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives for measuring the performance of multimedia search engines.
From a socio-economic perspective, we inventory the impact and legal consequences of these technical advances and point out future directions of research.
Online Search And Society: Could Your Best Friend Be Your Worst Enemy?
Online search is becoming the main source individuals
use to find information about sports, politics, health, religion,
world issues, and other subjects that shape our views on
the world and how we live our lives. Of all internet users, 92%
use online search, and they do so on both desktop and mobile,
with an average of 129 searches per month per person.
Search is designed to keep users engaged and serviced with
speed and brevity. As search engine usage increases around
the world and its impact on behaviours becomes more of a
concern, we must understand how the design of search
engine algorithms might be affecting society’s ability to shape the
way we see the world. Is commerce compromising community
in user experience and design? Are we unknowingly being
sent into echo chambers by predictive and personalized
search algorithms? Is the fast and wide internet actually narrowing
the doors of perception we have been walking through
online for the last 30 years?
It is the right time for thorough exploratory research to better
understand the current and potential future impacts and
implications of search on society and citizens. I will employ
a literature review, first party participant research and document
a chronology of knowledge discovery and capture
in context to searching, sharing and storing of information,
along with a horizon scanning exercise with a focus on trends
research. The first-party human-based research will involve
the segmentation of Digital Natives and Digital Immigrants to
explore whether there are patterns emerging within distinct
age groups. These methods will be deployed and findings will
be analyzed to ascertain what the issues might be and whether
people understand the complexities, powers, and abilities
of search engines.
The Corpus Expansion Toolkit: finding what we want on the web
This thesis presents the Corpus Expansion Toolkit (CET), a generally applicable toolkit that allows researchers to build domain-specific corpora from the web. The main purpose of the work presented in this thesis, and of the development of the CET, is to provide a solution to discovering desired content on the web from possibly unknown locations or a poorly defined domain. Using an iterative process, the CET is able to solve the problem of discovering domain-specific online content and expand a corpus using only a very small number of example documents or characteristic phrases taken from the target domain. Using a human-in-the-loop strategy and a chain of discrete software components, the CET also allows the concept of a domain to be iteratively defined using the very online resources used to expand the original corpus. The CET combines feature extraction, search, web crawling, and machine learning methods to collect, store, filter, and perform information extraction on collected documents. Using a small number of example ‘seed’ documents, the CET is able to expand the original corpus by finding more relevant documents from the web, and it provides a number of tools to support their analysis. This thesis presents a case-study-based methodology that introduces the various contributions and components of the CET through the discussion of five case studies, covering a wide variety of domains and requirements to which the CET has been applied. These case studies illustrate three main use cases, listed below, where the CET is applicable:
1. Domain known – source known
2. Domain known – source unknown
3. Domain unknown – source unknown
First, use cases where the sites for document collection are known and the topic of research is clearly defined. Second, instances where the topic of research is clearly defined but where to find relevant documents on the web is unknown. Third, the most extreme use case, where the domain is poorly defined or unknown to the researcher and the location of the information is also unknown. This thesis presents a solution that allows researchers to begin with very little information on a specific topic, iteratively build a clear conception of a domain, and translate it into a computational system.
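The iterative expand-and-filter loop the abstract describes can be sketched as follows. This is a simplified illustration under stated assumptions: the frequency-based term extraction, the pluggable `search` hook, and the term-overlap relevance filter are stand-ins for the CET's actual feature-extraction, crawling, and machine-learning components.

```python
# Illustrative sketch of seed-driven corpus expansion: extract characteristic
# terms from the current corpus, search for candidates, keep relevant ones,
# and repeat. Not the CET's real implementation.
from collections import Counter
import re

def top_terms(docs, n=10):
    """The n most frequent words (4+ letters) across the current corpus."""
    counts = Counter()
    for doc in docs:
        counts.update(re.findall(r"[a-z]{4,}", doc.lower()))
    return [term for term, _ in counts.most_common(n)]

def relevance(doc, terms):
    """Fraction of characteristic terms present in a candidate document."""
    words = set(re.findall(r"[a-z]{4,}", doc.lower()))
    return sum(t in words for t in terms) / len(terms)

def expand_corpus(seeds, search, rounds=3, threshold=0.3):
    """Iteratively grow the corpus from seed documents."""
    corpus = list(seeds)
    for _ in range(rounds):
        terms = top_terms(corpus)
        candidates = search(terms)  # stand-in for SERP scraping / crawling
        added = [d for d in candidates
                 if relevance(d, terms) >= threshold and d not in corpus]
        if not added:
            break  # the domain definition has stabilized
        corpus.extend(added)
    return corpus
```

Recomputing the characteristic terms each round is what lets the growing corpus itself refine the definition of the domain, mirroring the human-in-the-loop refinement the thesis describes.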