
    A Brief History of Web Crawlers

    Web crawlers visit internet applications, collect data, and learn about new web pages from the pages they visit. Web crawlers have a long and interesting history. Early web crawlers collected statistics about the web. In addition to collecting statistics and indexing applications for search engines, modern crawlers can be used to perform accessibility and vulnerability checks on an application. The rapid expansion of the web, and the complexity added to web applications, have made crawling a very challenging process. Throughout the history of web crawling, many researchers and industrial groups have addressed the different issues and challenges that web crawlers face, and different solutions have been proposed to reduce the time and cost of crawling. Performing an exhaustive crawl remains a challenging problem, as does automatically capturing the model of a modern web application and extracting data from it. What follows is a brief history of the different techniques and algorithms used from the early days of crawling up to recent days. We introduce criteria to evaluate the relative performance of web crawlers, and based on these criteria we plot the evolution of web crawlers and compare their performance.
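The core loop the abstract describes (visit a page, extract links, discover new pages) can be sketched as a breadth-first traversal. This is a minimal illustration, not any particular crawler from the survey; the in-memory `site` map stands in for HTTP fetching so the sketch is self-contained:

```python
from html.parser import HTMLParser
from collections import deque

class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags on one page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(site, seed):
    """Breadth-first crawl over an in-memory map of url -> HTML.

    A real crawler would fetch each URL over HTTP, respect robots.txt,
    and bound the frontier; this sketch keeps only the discovery logic.
    """
    seen = {seed}
    frontier = deque([seed])
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)
        parser = LinkExtractor()
        parser.feed(site.get(url, ""))
        for link in parser.links:
            if link in site and link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

# Hypothetical three-page site used only for illustration.
site = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a>',
    "/b": "",
}
print(crawl(site, "/"))  # ['/', '/a', '/b']
```

The `seen` set is what keeps an exhaustive crawl from revisiting pages; the open problems the abstract mentions (dynamic, script-generated pages) arise precisely because modern applications expose states that no static `href` extraction can discover.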

    I Know Why You Went to the Clinic: Risks and Realization of HTTPS Traffic Analysis

    Revelations of large-scale electronic surveillance and data mining by governments and corporations have fueled increased adoption of HTTPS. We present a traffic analysis attack against over 6000 webpages spanning the HTTPS deployments of 10 widely used, industry-leading websites in areas such as healthcare, finance, legal services and streaming video. Our attack identifies individual pages in the same website with 89% accuracy, exposing personal details including medical conditions, financial and legal affairs and sexual orientation. We examine evaluation methodology and reveal accuracy variations as large as 18% caused by assumptions affecting caching and cookies. We present a novel defense reducing attack accuracy to 27% with a 9% traffic increase, and demonstrate significantly increased effectiveness of prior defenses in our evaluation context, inclusive of enabled caching, user-specific cookies and pages within the same website.
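Attacks of this kind exploit the fact that TLS hides content but not transfer sizes, so distinct pages leave distinct size footprints. The toy sketch below is not the paper's classifier; it only illustrates the general idea of matching an observed trace of encrypted-response sizes against profiles recorded in a training crawl (all page names and byte counts here are hypothetical):

```python
def fingerprint_distance(a, b):
    """L1 distance between two equal-length traces of observed transfer sizes."""
    return sum(abs(x - y) for x, y in zip(a, b))

def identify_page(observed, profiles):
    """Return the profiled page whose size trace is closest to the observed one.

    `profiles` maps page name -> list of encrypted-response sizes recorded
    while crawling the site as the attacker (hypothetical training data).
    """
    return min(profiles, key=lambda page: fingerprint_distance(observed, profiles[page]))

# Hypothetical profiles for two pages on the same HTTPS site.
profiles = {
    "condition-a": [1450, 3020, 512],
    "condition-b": [1450, 980, 2200],
}
print(identify_page([1460, 3000, 530], profiles))  # condition-a
```

The abstract's point about caching and cookies maps directly onto this sketch: a cached resource drops out of the observed trace and a user-specific cookie perturbs sizes, so evaluation assumptions about both can swing measured accuracy substantially.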

    EveTAR: Building a Large-Scale Multi-Task Test Collection over Arabic Tweets

    This article introduces a new language-independent approach for creating a large-scale high-quality test collection of tweets that supports multiple information retrieval (IR) tasks without running a shared-task campaign. The adopted approach (demonstrated over Arabic tweets) designs the collection around significant (i.e., popular) events, which enables the development of topics that represent frequent information needs of Twitter users for which rich content exists. That inherently facilitates the support of multiple tasks that generally revolve around events, namely event detection, ad-hoc search, timeline generation, and real-time summarization. The key highlights of the approach include diversifying the judgment pool via interactive search and multiple manually-crafted queries per topic, collecting high-quality annotations via crowd-workers for relevancy and in-house annotators for novelty, filtering out low-agreement topics and inaccessible tweets, and providing multiple subsets of the collection for better availability. Applying our methodology to Arabic tweets resulted in EveTAR, the first freely-available tweet test collection for multiple IR tasks. EveTAR includes a crawl of 355M Arabic tweets and covers 50 significant events for which about 62K tweets were judged with substantial average inter-annotator agreement (Kappa value of 0.71). We demonstrate the usability of EveTAR by evaluating existing algorithms in the respective tasks. Results indicate that the new collection can support reliable ranking of IR systems that is comparable to similar TREC collections, while providing strong baseline results for future studies over Arabic tweets.
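The reported agreement figure (Kappa of 0.71) is Cohen's kappa, which discounts the agreement two annotators would reach by chance. A minimal computation of the statistic, assuming two annotators labeling the same items (the labels below are made-up, not EveTAR data):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over categories of the product of marginal rates.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Toy example: 1 = relevant, 0 = not relevant.
print(cohens_kappa([1, 1, 0, 1], [1, 0, 0, 1]))  # 0.5
```

On the conventional Landis–Koch scale, values between 0.61 and 0.80 are read as "substantial" agreement, which matches how the abstract characterizes 0.71.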

    Tailored retrieval of health information from the web for facilitating communication and empowerment of elderly people

    A patient, nowadays, acquires health information from the Web mainly through a “human-to-machine” communication process with a generic search engine. This, in turn, affects, positively or negatively, his/her empowerment level and the “human-to-human” communication process that occurs between a patient and a healthcare professional such as a doctor. A generic communication process can be modelled by considering its syntactic-technical, semantic-meaning, and pragmatic-effectiveness levels, and an efficacious communication occurs when all the communication levels are fully addressed. In the case of retrieval of health information from the Web, although a generic search engine is able to work at the syntactic-technical level, the semantic and pragmatic aspects are left to the user, and this can be challenging, especially for elderly people. This work presents a custom search engine, FACILE, that works at all three communication levels and allows users to overcome the challenges encountered during the search process. A patient can specify his/her information requirements in a simple way and FACILE will retrieve the “right” amount of Web content in a language that he/she can easily understand. This facilitates the comprehension of the retrieved information and positively affects the empowerment process and communication with healthcare professionals.

    Evaluation of the Impact of Engineering Education Research Grants Using Software Tools: A Foundation

    The goal of our project was to provide the NSF with a software suite that evaluates the impact of engineering education research grants. We assisted the NSF by identifying interactions that influence the impact of grants and any measurable data within these interactions. We presented three deliverables and examined software tools to collect, organize, analyze, and visualize the quantifiable data within the interactions. Our endeavors serve as a framework for future investigation into grant impact evaluation.