36,320 research outputs found

    What Web Template Extractor Should I Use? A Benchmarking and Comparison for Five Template Extractors

    Full text link
    "© ACM, 2019. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in PUBLICATION, {VOL 13, ISS 2, (APR 2019)} http://doi.acm.org/10.1145/3316810"[EN] A Web template is a resource that implements the structure and format of a website, making it ready for plugging content into already formatted and prepared pages. For this reason, templates are one of the main development resources for website engineers, because they increase productivity. Templates are also useful for the final user, because they provide uniformity and a common look and feel for all webpages. However, from the point of view of crawlers and indexers, templates are an important problem, because templates usually contain irrelevant information, such as advertisements, menus, and banners. Processing and storing this information leads to a waste of resources (storage space, bandwidth, etc.). It has been measured that templates represent between 40% and 50% of data on the Web. Therefore, identifying templates is essential for indexing tasks. There exist many techniques and tools for template extraction, but, unfortunately, it is not clear at all which template extractor should a user/system use, because they have never been compared, and because they present different (complementary) features such as precision, recall, and efficiency. In this work, we compare the most advanced template extractors. We implemented and evaluated five of the most advanced template extractors in the literature. To compare all of them, we implemented a workbench, where they have been integrated and evaluated. Thanks to this workbench, we can provide a fair empirical comparison of all methods using the same benchmarks, technology, implementation language, and evaluation criteria.This work has been partially supported by the EU (FEDER) and the Spanish Ministerio de Ciencia, Innovacion y Universidades/AEI under grant TIN2016-76843-C4-1-R and by the Generalitat Valenciana under grants PROMETEO-II/2015/013 (SmartLogic) and Prometeo/2019/098 (DeepTrust).Alarte, J.; Silva, J.; Tamarit Muñoz, S. (2019). What Web Template Extractor Should I Use? A Benchmarking and Comparison for Five Template Extractors. ACM Transactions on the Web. 13(2):9:1-9:19. https://doi.org/10.1145/3316810S9:19:19132Alarte, J., Insa, D., Silva, J., & Tamarit, S. (2015). TeMex. Proceedings of the 24th International Conference on World Wide Web - WWW ’15 Companion. doi:10.1145/2740908.2742835Julián Alarte David Insa Josep Silva and Salvador Tamarit. 2016. Site-Level Web Template Extraction Based on DOM Analysis. Springer International Publishing Cham 36--49. Julián Alarte David Insa Josep Silva and Salvador Tamarit. 2016. Site-Level Web Template Extraction Based on DOM Analysis. Springer International Publishing Cham 36--49.Alassi, D., & Alhajj, R. (2013). Effectiveness of template detection on noise reduction and websites summarization. Information Sciences, 219, 41-72. doi:10.1016/j.ins.2012.07.022Bar-Yossef, Z., & Rajagopalan, S. (2002). Template detection via data mining and its applications. Proceedings of the eleventh international conference on World Wide Web - WWW ’02. doi:10.1145/511446.511522Chakrabarti, D., Kumar, R., & Punera, K. (2007). Page-level template detection via isotonic smoothing. Proceedings of the 16th international conference on World Wide Web - WWW ’07. doi:10.1145/1242572.1242582Chen, L., Ye, S., & Li, X. (2006). Template detection for large scale search engines. Proceedings of the 2006 ACM symposium on Applied computing - SAC ’06. doi:10.1145/1141277.1141534Gibson, D., Punera, K., & Tomkins, A. (2005). The volume and evolution of web page templates. Special interest tracks and posters of the 14th international conference on World Wide Web - WWW ’05. doi:10.1145/1062745.1062763Kim, C., & Shim, K. (2011). TEXT: Automatic Template Extraction from Heterogeneous Web Pages. IEEE Transactions on Knowledge and Data Engineering, 23(4), 612-626. doi:10.1109/tkde.2010.140Barbara Ann Kitchenham David Budgen and Pearl Brereton. 2015. Evidence-Based Software Engineering and Systematic Reviews. Chapman 8 Hall/CRC. Barbara Ann Kitchenham David Budgen and Pearl Brereton. 2015. Evidence-Based Software Engineering and Systematic Reviews. Chapman 8 Hall/CRC.Kołcz, A., & Yih, W. (s. f.). Site-Independent Template-Block Detection. Lecture Notes in Computer Science, 152-163. doi:10.1007/978-3-540-74976-9_17Kohlschütter, C. (2009). A densitometric analysis of web template content. Proceedings of the 18th international conference on World wide web - WWW ’09. doi:10.1145/1526709.1526909Jing Li and C. I. Ezeife. 2006. Cleaning web pages for effective web content mining. In Database and Expert Systems Applications Stéphane Bressan Josef Küng and Roland Wagner (Eds.). Springer Berlin 560--571. 10.1007/11827405_55 Jing Li and C. I. Ezeife. 2006. Cleaning web pages for effective web content mining. In Database and Expert Systems Applications Stéphane Bressan Josef Küng and Roland Wagner (Eds.). Springer Berlin 560--571. 10.1007/11827405_55Bing Liu. 2006. Web Data Mining: Exploring Hyperlinks Contents and Usage Data (Data-Centric Systems and Applications). Springer-Verlag New York Inc. Secaucus NJ. Bing Liu. 2006. Web Data Mining: Exploring Hyperlinks Contents and Usage Data (Data-Centric Systems and Applications). Springer-Verlag New York Inc. Secaucus NJ.Liu, L., Han, W., Buttler, D., Pu, C., & Tang, W. (1999). An XJML-based wrapper generator for Web information extraction. Proceedings of the 1999 ACM SIGMOD international conference on Management of data - SIGMOD ’99. doi:10.1145/304182.304570Ma, L., Goharian, N., Chowdhury, A., & Chung, M. (2003). Extracting unstructured data from template generated web documents. Proceedings of the twelfth international conference on Information and knowledge management - CIKM ’03. doi:10.1145/956863.956961Manjula, R., & Chilambuchelvan, A. (2013). Extracting templates from Web pages. 2013 International Conference on Green Computing, Communication and Conservation of Energy (ICGCE). doi:10.1109/icgce.2013.6823541Christopher D. Manning Prabhakar Raghavan and Hinrich SchÃijtze. 2008. Introduction to Information Retrieval. Cambridge University Press New York NY. Christopher D. Manning Prabhakar Raghavan and Hinrich SchÃijtze. 2008. Introduction to Information Retrieval. Cambridge University Press New York NY.Meng, X., Hu, D., & Li, C. (2003). Schema-guided wrapper maintenance for web-data extraction. Proceedings of the fifth ACM international workshop on Web information and data management - WIDM ’03. doi:10.1145/956699.956701Nguyen, D. Q., Nguyen, D. Q., Pham, S. B., & Bui, T. D. (2009). A Fast Template-Based Approach to Automatically Identify Primary Text Content of a Web Page. 2009 International Conference on Knowledge and Systems Engineering. doi:10.1109/kse.2009.39Schäfer, R. (2016). Accurate and efficient general-purpose boilerplate detection for crawled web corpora. Language Resources and Evaluation, 51(3), 873-889. doi:10.1007/s10579-016-9359-2Sivakumar, P. (2015). Effectual Web Content Mining using Noise Removal from Web Pages. Wireless Personal Communications, 84(1), 99-121. doi:10.1007/s11277-015-2596-7Song, D., Sun, F., & Liao, L. (2013). A hybrid approach for content extraction with text density and visual importance of DOM nodes. Knowledge and Information Systems, 42(1), 75-96. doi:10.1007/s10115-013-0687-xR. Uma and B. Latha. 2018. Noise elimination from web pages for efficacious information retrieval. Cluster Comput. (Mar. 2018). https://link.springer.com/article/10.1007/s10586-018-2366-x#citeas. R. Uma and B. Latha. 2018. Noise elimination from web pages for efficacious information retrieval. Cluster Comput. (Mar. 2018). https://link.springer.com/article/10.1007/s10586-018-2366-x#citeas.Uzun, E., Agun, H. V., & Yerlikaya, T. (2013). A hybrid approach for extracting informative content from web pages. Information Processing & Management, 49(4), 928-944. doi:10.1016/j.ipm.2013.02.005Vieira, K., da Costa Carvalho, A. L., Berlt, K., de Moura, E. S., da Silva, A. S., & Freire, J. (2009). On Finding Templates on Web Collections. World Wide Web, 12(2), 171-211. doi:10.1007/s11280-009-0059-3Vieira, K., da Silva, A. S., Pinto, N., de Moura, E. S., Cavalcanti, J. M. B., & Freire, J. (2006). A fast and robust method for web page template detection and removal. Proceedings of the 15th ACM international conference on Information and knowledge management - CIKM ’06. doi:10.1145/1183614.1183654Thijs Vogels Octavian-Eugen Ganea and Carsten Eickhoff. 2018. Web2Text: Deep structured boilerplate removal. CoRR abs/1801.02607 (2018). Retrieved from http://arxiv.org/abs/1801.02607. Thijs Vogels Octavian-Eugen Ganea and Carsten Eickhoff. 2018. Web2Text: Deep structured boilerplate removal. CoRR abs/1801.02607 (2018). Retrieved from http://arxiv.org/abs/1801.02607.Wang, Y., Fang, B., Cheng, X., Guo, L., & Xu, H. (2008). Incremental web page template detection. Proceeding of the 17th international conference on World Wide Web - WWW ’08. doi:10.1145/1367497.1367749Yi, L., Liu, B., & Li, X. (2003). Eliminating noisy information in Web pages for data mining. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’03. doi:10.1145/956750.956785Zheng, S., Song, R., Wen, J.-R., & Giles, C. L. (2009). Efficient record-level wrapper induction. Proceeding of the 18th ACM conference on Information and knowledge management - CIKM ’09. doi:10.1145/1645953.1645962Zheng, S., Song, R., Wen, J.-R., & Wu, D. (2007). Joint optimization of wrapper generation and template detection. Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’07. doi:10.1145/1281192.128128

    TeMex: The Web Template Extractor

    Full text link
    "© ACM} 2015. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM, In Proceedings of the 24th International Conference on World Wide Web (pp. 155-158), http://dx.doi.org/10.1145/2740908.2742835This paper presents and describes TeMex, a site-level web template extractor. TeMex is fully automatic, and it can work with online webpages without any preprocessing stage (no information about the template or the associated webpages is needed) and, more importantly, it does not need a prede- fined set of webpages to perform the analysis. TeMex only needs a URL. Contrarily to previous approaches, it includes a mechanism to identify webpage candidates that share the same template. This mechanism increases both recall and precision, and it also reduces the amount of webpages loaded and processed. We describe the tool and its internal architecture, and we present the results of its empirical evaluation.This work has been partially supported by the EU (FEDER) and the Spanish Ministerio de Economía y Competitividad (Secretaría de Estado de Investigación, Desarrollo e Innovación) under Grant TIN2013-44742-C4-1-R and by the Generalitat Valenciana under Grant PROMETEOII/2015/013. David Insa was partially supported by the Spanish Ministerio de Educación under FPU Grant AP2010-4415. Salvador Tamarit was partially supported by research project POLCA, Programming Large Scale Heterogeneous Infrastructures (610686), funded by the European Union, STREP FP7.Alarte, J.; Insa Cabrera, D.; Silva Galiana, JF.; Tamarit Muñoz, S. (2015). TeMex: The Web Template Extractor. ACM. https://doi.org/10.1145/2740908.2742835SOverlay extension. Available from URL: https://developer.mozilla.org/en-US/Add-ons/Overlay_Extensions, 2005.J. Alarte, D. Insa, J. Silva, and S. Tamarit. Automatic Detection of Webpages that Share the Same Web Template. In M. H. ter Beek and A. Ravara, editors, Proceedings of the 10th International Workshop on Automated Specification and Verification of Web Systems (WWV 14), volume 163 of Electronic Proceedings in Theoretical Computer Science, pages 2--15. Open Publishing Association, July 2014.J. Alarte, D. Insa, J. Silva, and S. Tamarit. A Benchmark Suite for Template Detection and Content Extraction. CoRR, abs/1409.6182, 2014.Z. Bar-Yossef and S. Rajagopalan. Template detection via data mining and its applications. In Proceedings of the 11th International Conference on World Wide Web (WWW'02), pages 580--591, New York, NY, USA, 2002. ACM.M. Baroni, F. Chantree, A. Kilgarriff, and S. Sharoff. Cleaneval: a Competition for Cleaning Web Pages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC'08), pages 638--643. European Language Resources Association, may 2008.D. Gibson, K. Punera, and A. Tomkins. The volume and evolution of web page templates. In A. Ellis and T. Hagino, editors, Proceedings of the 14th International Conference on World Wide Web (WWW'05), pages 830--839. ACM, may 2005.T. Gottron. Evaluating content extraction on HTML documents. In V. Grout, D. Oram, and R. Picking, editors, Proceedings of the 2nd International Conference on Internet Technologies and Applications (ITA'07), pages 123--132. National Assembly for Wales, sep 2007.D. d. C. Reis, P. B. Golgher, A. S. Silva, and A. H. F. Laender. Automatic web news extraction using tree edit distance. In Proceedings of the 13th International Conference on World Wide Web (WWW'04), pages 502--511, New York, NY, USA, 2004. ACM.K. Vieira, A. L. da Costa Carvalho, K. Berlt, E. S. de Moura, A. S. da Silva, and J. Freire. On finding templates on web collections. World Wide Web, 12(2):171--211, 2009.K. Vieira, A. S. da Silva, N. Pinto, E. S. de Moura, J. a. M. B. Cavalcanti, and J. Freire. A fast and robust method for web page template detection and removal. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM'06), pages 258--267, New York, NY, USA, 2006. ACM.T. Weninger, W. Henry Hsu, and J. Han. CETR: Content Extraction via Tag Ratios. In M. Rappa, P. Jones, J. Freire, and S. Chakrabarti, editors, Proceedings of the 19th International Conference on World Wide Web (WWW'10), pages 971--980. ACM, apr 2010.L. Yi, B. Liu, and X. Li. Eliminating noisy information in web pages for data mining. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data mining (KDD'03), pages 296--305, New York, NY, USA, 2003. ACM

    Mining Object Parts from CNNs via Active Question-Answering

    Full text link
    Given a convolutional neural network (CNN) that is pre-trained for object classification, this paper proposes to use active question-answering to semanticize neural patterns in conv-layers of the CNN and mine part concepts. For each part concept, we mine neural patterns in the pre-trained CNN, which are related to the target part, and use these patterns to construct an And-Or graph (AOG) to represent a four-layer semantic hierarchy of the part. As an interpretable model, the AOG associates different CNN units with different explicit object parts. We use an active human-computer communication to incrementally grow such an AOG on the pre-trained CNN as follows. We allow the computer to actively identify objects, whose neural patterns cannot be explained by the current AOG. Then, the computer asks human about the unexplained objects, and uses the answers to automatically discover certain CNN patterns corresponding to the missing knowledge. We incrementally grow the AOG to encode new knowledge discovered during the active-learning process. In experiments, our method exhibits high learning efficiency. Our method uses about 1/6-1/3 of the part annotations for training, but achieves similar or better part-localization performance than fast-RCNN methods.Comment: Published in CVPR 201
    • …
    corecore