
    On the Change in Archivability of Websites Over Time

    As web technologies evolve, web archivists work to keep up so that our digital history is preserved. Recent advances in web technologies have introduced client-side executed scripts that load data without a referential identifier or that require user interaction (e.g., content that loads only when the page is scrolled). These advances have made automated methods for capturing web pages more difficult. Because of the evolving schemes for publishing web pages, along with the progressive capability of web preservation tools, the archivability of pages on the web has varied over time. In this paper we show that the archivability of a web page can be deduced from the type of page being archived, which aligns with that page's accessibility with respect to dynamic content. We show concrete examples of when these technologies were introduced by referencing mementos of pages that have persisted through a long evolution of available technologies. Identifying the reasons these web pages could not be archived in the past, with respect to accessibility, serves as a guide for ensuring that content meant to have longevity is published using good-practice methods that make it available for preservation. Comment: 12 pages, 8 figures, Theory and Practice of Digital Libraries (TPDL) 2013, Valletta, Malta
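
    A rough way to see the problem this abstract describes is to measure how much of a page's visible content survives a fetch that does not execute JavaScript. The following is a minimal heuristic sketch in Python, not the paper's method: it counts static text against script tags in the raw HTML, and the URL and the interpretation of the ratio are illustrative assumptions.

        # Crude archivability heuristic: pages whose content arrives via
        # client-side scripts yield little static text in the raw HTML.
        # Illustrative sketch only; not the method used in the paper.
        import urllib.request
        from html.parser import HTMLParser

        class TextAndScriptCounter(HTMLParser):
            def __init__(self):
                super().__init__()
                self.in_script = False
                self.text_chars = 0
                self.script_tags = 0

            def handle_starttag(self, tag, attrs):
                if tag == "script":
                    self.in_script = True
                    self.script_tags += 1

            def handle_endtag(self, tag):
                if tag == "script":
                    self.in_script = False

            def handle_data(self, data):
                if not self.in_script:
                    self.text_chars += len(data.strip())

        def static_text_ratio(url):
            """Visible text per script tag; a low value suggests content
            that a non-executing crawler would miss."""
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
            parser = TextAndScriptCounter()
            parser.feed(html)
            return parser.text_chars / max(1, parser.script_tags)

        if __name__ == "__main__":
            print(static_text_ratio("https://example.com/"))  # placeholder URL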

    Enhance Crawler For Efficiently Harvesting Deep Web Interfaces

    The web is changing quickly and the size of web resources is growing, so efficiency has become a challenging problem for crawling such data. Hidden-web content is data that cannot be indexed by search engines because it sits behind searchable web interfaces. The proposed system aims to provide a framework for a focused crawler that efficiently gathers hidden-web interfaces. First, the crawler performs site-based searching for center pages with the help of web search engines, avoiding visits to a large number of pages. To get more specific results, the proposed focused crawler ranks websites, giving higher priority to those more relevant to a given search. The crawler achieves fast in-site searching by watching for the most relevant links with an adaptive link ranking, as sketched below. We also incorporate a spell checker to correct the input, and we apply reverse searching with incremental site prioritizing for wide coverage of hidden-web sites.
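
    The in-site search step can be read as a best-first crawl over a priority queue whose link scores adapt to which anchors have paid off so far. Below is a minimal Python sketch of that idea under stated assumptions: the token-overlap scorer, the seed priors, and the fetch_links and is_search_interface helpers are hypothetical stand-ins, not the authors' implementation.

        # Sketch of adaptive link ranking for fast in-site searching: links
        # whose anchor text resembles previously fruitful links are visited
        # first. fetch_links and is_search_interface are hypothetical helpers.
        import heapq
        from collections import Counter

        def score(anchor_text, reward_counts):
            # A token not yet rewarded scores 0 under Counter's default.
            return sum(reward_counts[t] for t in anchor_text.lower().split())

        def adaptive_crawl(seed_links, fetch_links, is_search_interface, budget=100):
            reward_counts = Counter({"search": 1, "query": 1, "advanced": 1})  # priors
            heap = [(-score(text, reward_counts), url, text) for url, text in seed_links]
            heapq.heapify(heap)
            seen = {url for _, url, _ in heap}
            found = []
            while heap and budget > 0:
                _, url, text = heapq.heappop(heap)
                budget -= 1
                if is_search_interface(url):
                    found.append(url)
                    # Adapt: reinforce tokens from the link that paid off.
                    reward_counts.update(text.lower().split())
                for new_url, new_text in fetch_links(url):  # (url, anchor_text) pairs
                    if new_url not in seen:
                        seen.add(new_url)
                        heapq.heappush(heap, (-score(new_text, reward_counts), new_url, new_text))
            return found

    One caveat of this sketch: scores are fixed when a link is enqueued, so a fuller implementation would periodically re-rank the frontier as the reward counts drift.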

    A Brief History of Web Crawlers

    Web crawlers visit internet applications, collect data, and learn about new web pages from the pages they visit. Web crawlers have a long and interesting history. Early web crawlers collected statistics about the web. In addition to collecting statistics about the web and indexing applications for search engines, modern crawlers can be used to perform accessibility and vulnerability checks on an application. The rapid expansion of the web, and the complexity added to web applications, have made crawling a very challenging process. Throughout the history of web crawling, many researchers and industrial groups have addressed the different issues and challenges that web crawlers face, and different solutions have been proposed to reduce the time and cost of crawling. Performing an exhaustive crawl remains a challenging problem, and automatically capturing the model of a modern web application and extracting data from it is another open question. What follows is a brief history of the different techniques and algorithms used from the early days of crawling up to the present. We introduce criteria to evaluate the relative performance of web crawlers, and based on these criteria we plot the evolution of web crawlers and compare their performance.

    A Review on Web Crawling System for Web Databases

    As the deep web grows at a rapid pace, there has been increased interest in techniques that help efficiently locate deep-web interfaces. However, due to the large volume of web resources and the dynamic nature of the deep web, achieving wide coverage and high efficiency is a challenging problem. This work proposes a three-stage framework for efficiently harvesting deep-web interfaces. In the first stage, the web crawler performs site-based searching for center pages with the help of web search engines, avoiding visits to a large number of pages (see the sketch below). In this paper we survey how web crawlers work and what approaches are available in existing systems from various researchers.
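
    The site-based searching step amounts to asking a search engine for likely hub pages before crawling, keeping one representative page per site. A minimal Python sketch, assuming a hypothetical search_engine(query, n) wrapper, since no specific search API is named in the abstract:

        # Sketch of site-based searching for center pages: query a search
        # engine with topic keywords and keep one representative page per
        # site, so the crawler avoids visiting large numbers of pages.
        # search_engine(query, n) is a hypothetical API returning URLs.
        from urllib.parse import urlparse

        def find_center_pages(topic_keywords, search_engine, per_query=20):
            centers = {}
            for keyword in topic_keywords:
                for url in search_engine(f"{keyword} search database", per_query):
                    site = urlparse(url).netloc
                    # The first hit per site serves as its center page.
                    centers.setdefault(site, url)
            return list(centers.values())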

    An Efficient Method for Deep Web Crawler based on Accuracy

    As the deep web grows at a very fast pace, there has been increased interest in techniques that help efficiently locate deep-web interfaces. However, due to the large volume of web resources and the dynamic nature of the deep web, achieving wide coverage and high efficiency is a challenging issue. We propose a three-stage framework for efficiently harvesting deep-web interfaces. Experimental results on a set of representative domains show the agility and accuracy of the proposed crawler framework, which efficiently retrieves deep-web interfaces from large-scale sites and achieves higher harvest rates than other crawlers using a Naïve Bayes algorithm. In this paper we also survey how web crawlers work and what methodologies are available in existing systems from different researchers.
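
    Naïve Bayes lends itself to ranking candidate sites by how relevant their homepage text looks. The sketch below shows one plausible shape for such a ranker using scikit-learn's MultinomialNB; the training examples are invented placeholders, not data from the paper.

        # Illustrative Naive Bayes ranker for candidate sites, scored by
        # homepage text. Training examples are invented placeholders.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline

        train_texts = [
            "advanced search query form database records",  # deep-web site
            "search our catalog by author title keyword",   # deep-web site
            "welcome to my photo blog holiday pictures",    # surface site
            "company news press releases contact us",       # surface site
        ]
        train_labels = [1, 1, 0, 0]  # 1 = likely hosts a searchable interface

        model = make_pipeline(CountVectorizer(), MultinomialNB())
        model.fit(train_texts, train_labels)

        def rank_sites(site_texts):
            """Order candidate sites by predicted probability of relevance."""
            probs = model.predict_proba(site_texts)[:, 1]
            return sorted(zip(probs, site_texts), reverse=True)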

    Efficient Deep-Web-Harvesting Using Advanced Crawler

    Because of heavy internet usage, a large amount of diverse data is spread across the web, and providing access to a particular item, or searching for the most relevant data, is very challenging for a search engine: fetching relevant data that matches the user's need consumes considerable time. To reduce the time spent searching for the most relevant data, we propose the "Advanced Crawler". In this approach, results collected from different web search engines are combined in a meta-search approach: multiple search engines are queried for the user's query, their results are aggregated into one pool, and two-stage crawling is then performed on those URLs, in which site locating and in-site exploring are carried out to reach the most relevant sites with the help of page-ranking and reverse-searching techniques. The system works in both online and offline modes.
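
    The meta-search step, fanning the user's query out to several engines and merging the results into one pool before the two-stage crawl, can be sketched as below. The per-engine query functions are hypothetical placeholders, and duplicates are merged with simple reciprocal-rank fusion, which the abstract does not specify.

        # Sketch of the meta-search aggregation step: fan the query out to
        # several engines, merge their ranked lists, and de-duplicate URLs.
        # The engine callables are hypothetical placeholders.
        def meta_search(query, engines, per_engine=10):
            """engines: callables f(query, n) -> ranked list of result URLs."""
            scores = {}
            for engine in engines:
                for rank, url in enumerate(engine(query, per_engine)):
                    # Reciprocal-rank fusion: early hits from any engine count more.
                    scores[url] = scores.get(url, 0.0) + 1.0 / (rank + 1)
            # One merged pool, best first, ready for the site-locating and
            # in-site-exploring stages of the two-stage crawl.
            return [url for url, _ in sorted(scores.items(), key=lambda kv: -kv[1])]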