7,419 research outputs found
A Brief History of Web Crawlers
Web crawlers visit internet applications, collect data, and learn about new
web pages from visited pages. Web crawlers have a long and interesting history.
Early web crawlers collected statistics about the web. In addition to
collecting statistics about the web and indexing the applications for search
engines, modern crawlers can be used to perform accessibility and vulnerability
checks on the application. The rapid expansion of the web and the complexity added
to web applications have made the process of crawling a very challenging one.
Throughout the history of web crawling, many researchers and industrial groups
have addressed different issues and challenges that web crawlers face. Different
solutions have been proposed to reduce the time and cost of crawling.
Performing an exhaustive crawl remains a challenging problem. Additionally,
capturing the model of a modern web application and extracting data from it
automatically is another open question. What follows is a brief history of
the different techniques and algorithms used from the early days of crawling up
to the present day. We introduce criteria to evaluate the relative performance of
web crawlers. Based on these criteria we plot the evolution of web crawlers and
compare their performance
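
The abstract above describes the core loop of a crawler: visit a page, collect it, and learn about new pages from the ones already visited. A minimal sketch of that loop follows, assuming a breadth-first visiting order and a hypothetical in-memory link graph (`TOY_WEB`) in place of real HTTP requests. The names `crawl`, `fetch_links`, and `max_pages` are illustrative and are not taken from the paper.

```python
from collections import deque


def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first crawl starting at `seed`.

    `fetch_links(url)` is an assumed callback returning the links found on a page.
    Returns the pages in the order they were visited.
    """
    visited = []            # pages visited so far, in order
    seen = {seed}           # pages already queued, to avoid revisiting
    frontier = deque([seed])
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        # Learn about new pages from the page just visited.
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical toy "web": each key is a page, each value the links on that page.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}


def toy_fetch_links(url):
    return TOY_WEB.get(url, [])


print(crawl("a", toy_fetch_links))  # ['a', 'b', 'c', 'd']
```

The `seen` set is what keeps the loop from revisiting pages when links form a cycle, and `max_pages` bounds the crawl, since the abstract notes that an exhaustive crawl is itself a hard problem.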
I Know Why You Went to the Clinic: Risks and Realization of HTTPS Traffic Analysis
Revelations of large scale electronic surveillance and data mining by
governments and corporations have fueled increased adoption of HTTPS. We
present a traffic analysis attack against over 6000 webpages spanning the HTTPS
deployments of 10 widely used, industry-leading websites in areas such as
healthcare, finance, legal services and streaming video. Our attack identifies
individual pages in the same website with 89% accuracy, exposing personal
details including medical conditions, financial and legal affairs and sexual
orientation. We examine evaluation methodology and reveal accuracy variations
as large as 18% caused by assumptions affecting caching and cookies. We present
a novel defense reducing attack accuracy to 27% with a 9% traffic increase, and
demonstrate significantly increased effectiveness of prior defenses in our
evaluation context, inclusive of enabled caching, user-specific cookies and
pages within the same website
EveTAR: Building a Large-Scale Multi-Task Test Collection over Arabic Tweets
This article introduces a new language-independent approach for creating a
large-scale high-quality test collection of tweets that supports multiple
information retrieval (IR) tasks without running a shared-task campaign. The
adopted approach (demonstrated over Arabic tweets) designs the collection
around significant (i.e., popular) events, which enables the development of
topics that represent frequent information needs of Twitter users for which
rich content exists. That inherently facilitates the support of multiple tasks
that generally revolve around events, namely event detection, ad-hoc search,
timeline generation, and real-time summarization. The key highlights of the
approach include diversifying the judgment pool via interactive search and
multiple manually-crafted queries per topic, collecting high-quality
annotations via crowd-workers for relevancy and in-house annotators for
novelty, filtering out low-agreement topics and inaccessible tweets, and
providing multiple subsets of the collection for better availability. Applying
our methodology on Arabic tweets resulted in EveTAR, the first
freely-available tweet test collection for multiple IR tasks. EveTAR includes a
crawl of 355M Arabic tweets and covers 50 significant events for which about
62K tweets were judged with substantial average inter-annotator agreement
(Kappa value of 0.71). We demonstrate the usability of EveTAR by evaluating
existing algorithms in the respective tasks. Results indicate that the new
collection can support reliable ranking of IR systems that is comparable to
similar TREC collections, while providing strong baseline results for future
studies over Arabic tweets
Tailored retrieval of health information from the web for facilitating communication and empowerment of elderly people
A patient, nowadays, acquires health information from the Web mainly through a “human-to-machine”
communication process with a generic search engine. This, in turn, affects, positively or negatively, his/her
empowerment level and the “human-to-human” communication process that occurs between a patient and a
healthcare professional such as a doctor. A generic communication process can be modelled by considering
its syntactic-technical, semantic-meaning, and pragmatic-effectiveness levels and an efficacious
communication occurs when all the communication levels are fully addressed. In the case of retrieval of health
information from the Web, although a generic search engine is able to work at the syntactic-technical level,
the semantic and pragmatic aspects are left to the user and this can be challenging, especially for elderly
people. This work presents a custom search engine, FACILE, that works at the three communication levels
and allows users to overcome the challenges encountered during the search process. A patient can specify his/her
information requirements in a simple way and FACILE will retrieve the “right” amount of Web content in a
language that he/she can easily understand. This facilitates the comprehension of the found information and
positively affects the empowerment process and communication with healthcare professionals
Evaluation of the Impact of Engineering Education Research Grants Using Software Tools: A Foundation
The goal of our project was to provide the NSF with a software suite that evaluates the impact of engineering education research grants. We assisted the NSF by identifying interactions that influence the impact of grants and any measurable data within these interactions. We presented three deliverables and examined software tools to collect, organize, analyze, and visualize the quantifiable data within the interactions. Our endeavors serve as a framework for future investigation into grant impact evaluation