In this paper, the goal is to harvest all documents matching a given (entity) query from a deep web source. The objective is to retrieve all information about, for instance, "Denzel Washington", "Iran Nuclear Deal", or "FC Barcelona" from data hidden behind web forms. Policies of web search engines usually do not allow access to all of the search results matching a given query. They limit the number of returned documents and the number of user requests. In this work, we propose a new approach which automatically collects information related to a given query from a search engine, given the search engine's limitations. The approach minimizes the number of queries that need to be sent by applying information from a large external corpus. The new approach outperforms existing approaches when tested on Google, measuring the total number of unique documents found per query.

Hiemstra, Djoerd

Khelghati, Mohammadreza

van Keulen, Maurice

University of Twente Research Information

Harvesting All Matching Information To A Given Query From a Deep Website

Mohammadreza Khelghati, Djoerd Hiemstra, and Maurice van Keulen
{s.m.khelghati, d.hiemstra, m.vankeulen}@utwente.nl
Databases Group, University of Twente, Enschede, Netherlands

Abstract. In this paper, the goal is harvesting all documents matching a given (entity) query from a deep web source. The objective is to retrieve all information about, for instance, “Denzel Washington”, “Iran Nuclear Deal”, or “FC Barcelona” from data hidden behind web forms. Policies of web search engines usually do not allow accessing all of the search results matching a given query. They limit the number of returned documents and the number of user requests. In this work, we propose a new approach which automatically collects information related to a given query from a search engine, given the search engine’s limitations. The approach minimizes the number of queries that need to be sent by applying information from a large external corpus. The new approach outperforms existing approaches when tested on Google, measuring the total number of unique documents found per query.

1 Introduction

The goal of this research is to harvest all documents matching a given (entity) query from a deep web source. For instance, we aim at retrieving all information about “Denzel Washington”, “Iran Nuclear Deal”, or “FC Barcelona” from data hidden behind web forms. However, policies of search engines usually do not allow accessing all of the search results matching a given query. They limit the number of returned documents (#ResultsLimited) and the number of user requests (#RequestsLimited).

Given these search engine limitations, we propose a new approach which automatically collects information related to a given query from a search engine. To do so, we rely on search refinement techniques to uncover results beyond what a search engine allows a user to access directly due to the #ResultsLimited and #RequestsLimited limitations.
These techniques are typically based on adding extra terms to the initial query to obtain refined search results. We propose an approach which refines search results for the purpose of achieving full data coverage.

In this approach, queries should be reformulated with the aim of obtaining as many new results as possible for each query. Maximizing the number of new results means submitting queries which return as many documents as the #ResultsLimited limitation allows while minimizing the number of duplicates. Minimizing duplicates is complicated by the presence of ranking bias and query bias [3]. A search engine’s ranking algorithm (e.g. Google PageRank) and the selection of the initial query favour some documents over others in the returned results.

To meet this challenge, techniques from Deep Web harvesting [1, 12, 10, 4, 2], Query-Based Sampling [5, 3], Topical Crawling [14, 16], and Query Expansion [6, 13, 7–9] are studied. Based on these studies, several approaches are suggested, implemented, and compared in this paper. We test our approaches on Google, which claims to search 100 PB of Web data (60 trillion URLs)¹. Google imposes both #ResultsLimited and #RequestsLimited, and ranking bias through its PageRank algorithm.

2 Suggested Approach

To reach this data coverage, we send automatically generated queries to a search engine’s API with the goal of retrieving all documents that contain a given entity with a minimum number of query submissions. We compare the approaches by their capabilities to deal with #ResultsLimited and #RequestsLimited. The comparison is based on the average number of queries submitted to retrieve all documents for a given query.

We distinguish two kinds of approaches. Section 2.1 describes ideal approaches, for which we estimate the number of queries needed in ideal (simulated) conditions.
Section 2.2 describes approaches in which queries are reformulated by using an external corpus.

2.1 Ideal Approaches

The approach described in this section is ideal but not easily realized. It is investigated solely to provide a baseline against which the introduced approaches can be compared.

Oracle Perfect Approach. To achieve full data coverage on a given entity in a search system with the #ResultsLimited and #RequestsLimited limitations, the perfect approach is the one which returns not only the maximum possible number of documents but also only unique ones for each request. To have complete coverage in this situation, it suffices to send only |CollectionSizeForQuery| / allowedDocsToBeVisited requests. In reality, this is not easily achievable. To do so, one needs to know the exact mechanism of the search engine’s ranking algorithm. Then, one might be able to divide the collection into exactly |CollectionSizeForQuery| / allowedDocsToBeVisited sub-collections. In addition to knowledge of the ranking algorithm, additional information might be needed. For instance, if a ranking algorithm is based on term frequencies, all the term frequencies need to be known beforehand. This kind of information is only accessible with full index access.

¹ Official Google Blog: http://googleblog.blogspot.nl/2008/07/we-knew-web-was-big.html

2.2 List-Based Query Generation Approach

In these approaches, the terms to be added to the seed query are selected from a list of words. This list is generated from an external corpus and includes the frequencies in that corpus. In this paper, this list is extracted from the ClueWeb09 dataset, which is a web crawl containing nearly 500 million English pages [15]. Selecting terms from the list of terms and their corresponding document frequencies can be performed by different methods.
In the following, these methods are further explained and studied.

List-Based Most/Least Frequent Approach. Although primitive, choosing the most or least frequent words from a list is a possible option for selecting terms. As the ClueWeb dataset is not a topic-specific corpus, the most frequent words from this corpus are highly likely to be general in other non-topic-specific corpora as well.

Pre-determined Frequency Based Approach. While submitting the most frequent terms increases the chance of reaching the maximum number of returned results, and submitting the least frequent ones increases the probability of generating fewer duplicates, it is of great interest to investigate whether a term frequency can be found that creates a trade-off between these two. To do so, statistical formulas are applicable. If events A and B are independent, then the probability of both occurring is the product of the probabilities of each occurring (P(A & B) = P(A) * P(B)). With samples smaller than 10 percent of the collection, we can treat the posing of two queries as statistically independent events (“The 10% Condition”). Then, the probability of an overlap between two queries equals the product of the probabilities of each query. This is shown in Formula 1.

$$\frac{|\mathit{MatchingDocs} \cap \mathit{ReturnedDocs}|}{|\mathit{SearchEngine}|} = \frac{|\mathit{MatchingDocs}|}{|\mathit{SearchEngine}|} \cdot \frac{|\mathit{ReturnedDocs}|}{|\mathit{SearchEngine}|}$$

$$|\mathit{ReturnedDocs}| = \frac{l \cdot |\mathit{SearchEngine}|}{|\mathit{MatchingDocs}|} \quad \text{where } |\mathit{MatchingDocs} \cap \mathit{ReturnedDocs}| = l \qquad (1)$$

With knowledge of the targeted search engine’s index size and the number of documents matching the seed query, Formula 1 determines the frequency of another query for which the overlap of this query and the seed query equals the number of documents allowed to be visited. This means that, with information on the seed query, the returned results, and the search engine size, a term can be found to formulate a new query returning at least the same number of results that are allowed to be visited.
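As a rough illustration, Formula 1 can be turned into a small calculation. The sketch below is not the authors' implementation: the function names, the rescaling helper, and the term list with its frequencies are hypothetical, and choosing the term whose frequency is nearest to the target is only one plausible reading of how a term is selected from the list. The numeric inputs are the illustrative values used in the paper's worked example.

```python
def target_returned_docs(l, search_engine_size, matching_docs):
    """Formula 1 solved for |ReturnedDocs|: l * |SearchEngine| / |MatchingDocs|."""
    return l * search_engine_size / matching_docs


def scale_to_corpus(frequency, search_engine_size, corpus_size):
    """Rescale a frequency in the search engine to an external corpus of another size."""
    return frequency * corpus_size / search_engine_size


def closest_term(term_frequencies, target_frequency):
    """Return the term whose document frequency is nearest to the target (an assumption)."""
    return min(term_frequencies, key=lambda t: abs(term_frequencies[t] - target_frequency))


# l = 100 visible results, a search engine of 10^9 documents,
# and 4 * 10^5 documents matching the seed query.
in_engine = target_returned_docs(l=100, search_engine_size=10**9, matching_docs=4 * 10**5)

# External corpus of 5 * 10^8 documents.
in_corpus = scale_to_corpus(in_engine, search_engine_size=10**9, corpus_size=5 * 10**8)

# Hypothetical term list with made-up document frequencies in the external corpus.
terms = {"the": 400_000_000, "market": 3_000_000, "harbour": 130_000, "zeugma": 900}

print(in_engine)                       # 250000.0
print(in_corpus)                       # 125000.0
print(closest_term(terms, in_corpus))  # harbour
```

Keeping the rescaling in its own function separates the independence assumption behind Formula 1 from the additional assumption that term frequencies in the external corpus are proportional to those in the search engine.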
Choosing the term in this way avoids the permanent presence of the same highly ranked documents among the results and creates a higher chance of collecting more new documents in each trial. If the size of the search engine is unknown, it can be estimated, as discussed in [11], by using only a small number of samples generated from the search engine.

As pointed out, applying this formula to our case requires information on the document frequencies of terms. To obtain this information from the targeted search system, we would have to download all its content and count the document frequencies of all terms. If this were possible, there would be no need for these approaches. Instead, we can use pre-computed term document frequencies from an external corpus. In this paper, as we test our approaches on Google, we use the ClueWeb dataset. However, the size difference between ClueWeb and Google should be taken into account when applying the formula. The easiest solution is to include the different sizes in the calculation. For example, assuming SizeSearchEngine = 10^9, the number of English documents in ClueWeb to be 5 × 10^8, limitedResults = 100, and |MatchingDocuments| for a given query to be 4 × 10^5, the following calculation provides a term document frequency that has a higher chance of resulting in samples of our desired size:

$$\frac{100}{10^9} = \frac{4 \times 10^5}{10^9} \cdot \frac{x}{5 \times 10^8} \implies x = 125000$$

In this paper, we refer to this approach as the LB-FixedFreq. approach.

3 Experiments and Results

3.1 Experiments Settings

Test Search Engine. In these experiments, Google, as the biggest web search engine with one of the most complicated ranking algorithms, is considered as our test search engine. As the only feature necessary for applying any of the suggested approaches is support for a keyword-search interface, targeting Google does not limit our findings to Google.

Entities Test Set. In our experiments, we used four different entities (“Vitol”, “Ed Brinksma”, “PhD Comics”, and “Fireworks Disaster”) to test and compare the suggested approaches.
We tried to include entities of different types: Company, Person, Topic, and Event. In addition to the difference in type, we tried to cover queries with different estimated result set sizes.

3.2 Results

In this section, the results of applying the approaches introduced in Section 2 to the test entities (Section 3.1) are presented. Figure 1 compares the performance of all the approaches for one of the entities in the test set. This is a straightforward task, as it only requires comparing the number of documents retrieved by each approach. However, to compare the approaches’ performance on all the test cases, we calculate their average distances from the Oracle Perfect approach. In Figure 3, the performance of all the approaches for all the entities in the test set is compared with the Oracle Perfect approach.

[Figure 1: retrieved documents for one entity plotted against n, the number of submitted queries, for LB-FixedFreq., LB-LeastFreq., LB-MostFreq., and Oracle Perfect.]
Fig. 1. Performance of all approaches for one entity

[Figure 2: two panels, duplicates and sample size, each plotted against n, the number of submitted queries, for LB-LeastFreq., LB-MostFreq., and LB-FixedFreq.]
Fig. 2. Sample Sizes and Duplicates For Approaches For One Entity

As shown in Fig. 3, the LB-FixedFreq. approach performs better than the LB-MostFreq. and LB-LeastFreq. approaches. This approach submits queries which result in fewer duplicates than the LB-MostFreq. approach while having bigger sample sizes than the LB-LeastFreq. approach. This is observable from Figure 2. The right image in this figure shows the number of duplicates resulting from submitting all the queries formulated by adding a term to the initial query (given entity). The left picture shows the corresponding sample size for each of these queries.
From comparing these two images, we can conclude that a trade-off between big sample sizes and the number of duplicates is the key to the LB-FixedFreq. approach’s better performance. In this approach, finding a specific frequency leads to a trade-off between sample sizes and the number of duplicates.

[Figure 3: performance (% difference from the ideal approach) plotted against n, the number of submitted queries, for LB-FixedFreq., LB-LeastFreq., LB-MostFreq., and Oracle Perfect.]
Fig. 3. Average Performance For All The Entities

4 Conclusion and Future Work

In this work, we assessed different query generation mechanisms for harvesting a web data source, as a step towards achieving full data coverage on a given (entity) query. From the experiments, we found that the key to success in these approaches is to send queries which return the maximum possible number of results with the minimum possible number of documents already downloaded in previous query submissions. To achieve this, we suggested three different approaches based on different frequencies. Among these approaches, LB-FixedFreq. performed better than the others.

Future Work. In addition to the frequency of terms extracted from an external corpus, we can include terms present in the previously retrieved documents to select the best next query to submit. The frequency of these terms could also be applied for a more efficient query expansion technique.

5 Acknowledgement

This publication was supported by the Dutch national program COMMIT.

References

1. Manuel Álvarez, Juan Raposo, Alberto Pan, Fidel Cacheda, Fernando Bellas, and Víctor Carneiro.
Deepbot: a focused crawler for accessing hidden web content. In Proceedings of the 3rd international workshop on Data enginering issues in E-commerce and services: In conjunction with ACM Conference on Electronic Commerce (EC ’07), DEECS ’07, pages 18–25, New York, NY, USA, 2007. ACM.
2. Luciano Barbosa and Juliana Freire. Siphoning hidden-web data through keyword-based interfaces. In SBBD, pages 309–321, 2004.
3. Krishna Bharat and Andrei Broder. A technique for measuring the relative size and overlap of public web search engines. Comput. Netw. ISDN Syst., 30:379–388, April 1998.
4. Michael Cafarella. Extracting and Querying a Comprehensive Web Database. In Proceedings of the Conference on Innovative Data Systems Research (CIDR), 2009.
5. James P. Callan and Margaret E. Connell. Query-based sampling of text databases. ACM Trans. Inf. Syst., 19(2):97–130, 2001.
6. Guihong Cao, Jian-Yun Nie, Jianfeng Gao, and Stephen Robertson. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’08, pages 243–250, New York, NY, USA, 2008. ACM.
7. Claudio Carpineto and Giovanni Romano. A survey of automatic query expansion in information retrieval. ACM Comput. Surv., 44(1):1:1–1:50, January 2012.
8. Kevyn Collins-Thompson and Jamie Callan. Estimation and use of uncertainty in pseudo-relevance feedback. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’07, pages 303–310, New York, NY, USA, 2007. ACM.
9. Ben He and Iadh Ounis. Combining fields for query expansion and adaptive query expansion. Inf. Process. Manage., 43(5):1294–1307, September 2007.
10. Yeye He, Dong Xin, Venkatesh Ganti, Sriram Rajaraman, and Nirav Shah. Crawling deep web entity pages. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM ’13, pages 355–364, New York, NY, USA, 2013.
ACM.
11. Mohammadreza Khelghati, Djoerd Hiemstra, and Maurice van Keulen. Size estimation of non-cooperative data collections. IIWAS ’12, pages 239–246, New York, NY, USA, 2012. ACM.
12. Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Halevy. Google’s Deep Web crawl. Proc. VLDB Endow., 1(2):1241–1252, August 2008.
13. Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.
14. Filippo Menczer, Gautam Pant, and Padmini Srinivasan. Topical web crawlers: Evaluating adaptive algorithms. ACM Transactions on Internet Technology, 4:http://dollar.biz.ui, 2004.
15. The Lemur Project. A dataset to support research on information retrieval and related human language technologies. http://lemurproject.org/clueweb09.php, 2014.
16. Sergej Sizov, Martin Theobald, Stefan Siersdorfer, Gerhard Weikum, Jens Graupmann, Michael Biwer, and Patrick Zimmer. The bingo! system for information portal generation and expert web search. In CIDR, 2003.


Contains fulltext: 227668.pdf (publisher's version) (Open Access)
KDWEB 201
