101,577 research outputs found

    What Web Template Extractor Should I Use? A Benchmarking and Comparison for Five Template Extractors

    Full text link
    "© ACM, 2019. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in PUBLICATION, {VOL 13, ISS 2, (APR 2019)} http://doi.acm.org/10.1145/3316810"[EN] A Web template is a resource that implements the structure and format of a website, making it ready for plugging content into already formatted and prepared pages. For this reason, templates are one of the main development resources for website engineers, because they increase productivity. Templates are also useful for the final user, because they provide uniformity and a common look and feel for all webpages. However, from the point of view of crawlers and indexers, templates are an important problem, because templates usually contain irrelevant information, such as advertisements, menus, and banners. Processing and storing this information leads to a waste of resources (storage space, bandwidth, etc.). It has been measured that templates represent between 40% and 50% of data on the Web. Therefore, identifying templates is essential for indexing tasks. There exist many techniques and tools for template extraction, but, unfortunately, it is not clear at all which template extractor should a user/system use, because they have never been compared, and because they present different (complementary) features such as precision, recall, and efficiency. In this work, we compare the most advanced template extractors. We implemented and evaluated five of the most advanced template extractors in the literature. To compare all of them, we implemented a workbench, where they have been integrated and evaluated. Thanks to this workbench, we can provide a fair empirical comparison of all methods using the same benchmarks, technology, implementation language, and evaluation criteria.This work has been partially supported by the EU (FEDER) and the Spanish Ministerio de Ciencia, Innovacion y Universidades/AEI under grant TIN2016-76843-C4-1-R and by the Generalitat Valenciana under grants PROMETEO-II/2015/013 (SmartLogic) and Prometeo/2019/098 (DeepTrust).Alarte, J.; Silva, J.; Tamarit Muñoz, S. (2019). What Web Template Extractor Should I Use? A Benchmarking and Comparison for Five Template Extractors. ACM Transactions on the Web. 13(2):9:1-9:19. https://doi.org/10.1145/3316810S9:19:19132Alarte, J., Insa, D., Silva, J., & Tamarit, S. (2015). TeMex. Proceedings of the 24th International Conference on World Wide Web - WWW ’15 Companion. doi:10.1145/2740908.2742835Julián Alarte David Insa Josep Silva and Salvador Tamarit. 2016. Site-Level Web Template Extraction Based on DOM Analysis. Springer International Publishing Cham 36--49. Julián Alarte David Insa Josep Silva and Salvador Tamarit. 2016. Site-Level Web Template Extraction Based on DOM Analysis. Springer International Publishing Cham 36--49.Alassi, D., & Alhajj, R. (2013). Effectiveness of template detection on noise reduction and websites summarization. Information Sciences, 219, 41-72. doi:10.1016/j.ins.2012.07.022Bar-Yossef, Z., & Rajagopalan, S. (2002). Template detection via data mining and its applications. Proceedings of the eleventh international conference on World Wide Web - WWW ’02. doi:10.1145/511446.511522Chakrabarti, D., Kumar, R., & Punera, K. (2007). Page-level template detection via isotonic smoothing. Proceedings of the 16th international conference on World Wide Web - WWW ’07. doi:10.1145/1242572.1242582Chen, L., Ye, S., & Li, X. (2006). Template detection for large scale search engines. Proceedings of the 2006 ACM symposium on Applied computing - SAC ’06. doi:10.1145/1141277.1141534Gibson, D., Punera, K., & Tomkins, A. (2005). The volume and evolution of web page templates. Special interest tracks and posters of the 14th international conference on World Wide Web - WWW ’05. doi:10.1145/1062745.1062763Kim, C., & Shim, K. (2011). TEXT: Automatic Template Extraction from Heterogeneous Web Pages. IEEE Transactions on Knowledge and Data Engineering, 23(4), 612-626. doi:10.1109/tkde.2010.140Barbara Ann Kitchenham David Budgen and Pearl Brereton. 2015. Evidence-Based Software Engineering and Systematic Reviews. Chapman 8 Hall/CRC. Barbara Ann Kitchenham David Budgen and Pearl Brereton. 2015. Evidence-Based Software Engineering and Systematic Reviews. Chapman 8 Hall/CRC.Kołcz, A., & Yih, W. (s. f.). Site-Independent Template-Block Detection. Lecture Notes in Computer Science, 152-163. doi:10.1007/978-3-540-74976-9_17Kohlschütter, C. (2009). A densitometric analysis of web template content. Proceedings of the 18th international conference on World wide web - WWW ’09. doi:10.1145/1526709.1526909Jing Li and C. I. Ezeife. 2006. Cleaning web pages for effective web content mining. In Database and Expert Systems Applications Stéphane Bressan Josef Küng and Roland Wagner (Eds.). Springer Berlin 560--571. 10.1007/11827405_55 Jing Li and C. I. Ezeife. 2006. Cleaning web pages for effective web content mining. In Database and Expert Systems Applications Stéphane Bressan Josef Küng and Roland Wagner (Eds.). Springer Berlin 560--571. 10.1007/11827405_55Bing Liu. 2006. Web Data Mining: Exploring Hyperlinks Contents and Usage Data (Data-Centric Systems and Applications). Springer-Verlag New York Inc. Secaucus NJ. Bing Liu. 2006. Web Data Mining: Exploring Hyperlinks Contents and Usage Data (Data-Centric Systems and Applications). Springer-Verlag New York Inc. Secaucus NJ.Liu, L., Han, W., Buttler, D., Pu, C., & Tang, W. (1999). An XJML-based wrapper generator for Web information extraction. Proceedings of the 1999 ACM SIGMOD international conference on Management of data - SIGMOD ’99. doi:10.1145/304182.304570Ma, L., Goharian, N., Chowdhury, A., & Chung, M. (2003). Extracting unstructured data from template generated web documents. Proceedings of the twelfth international conference on Information and knowledge management - CIKM ’03. doi:10.1145/956863.956961Manjula, R., & Chilambuchelvan, A. (2013). Extracting templates from Web pages. 2013 International Conference on Green Computing, Communication and Conservation of Energy (ICGCE). doi:10.1109/icgce.2013.6823541Christopher D. Manning Prabhakar Raghavan and Hinrich SchÃijtze. 2008. Introduction to Information Retrieval. Cambridge University Press New York NY. Christopher D. Manning Prabhakar Raghavan and Hinrich SchÃijtze. 2008. Introduction to Information Retrieval. Cambridge University Press New York NY.Meng, X., Hu, D., & Li, C. (2003). Schema-guided wrapper maintenance for web-data extraction. Proceedings of the fifth ACM international workshop on Web information and data management - WIDM ’03. doi:10.1145/956699.956701Nguyen, D. Q., Nguyen, D. Q., Pham, S. B., & Bui, T. D. (2009). A Fast Template-Based Approach to Automatically Identify Primary Text Content of a Web Page. 2009 International Conference on Knowledge and Systems Engineering. doi:10.1109/kse.2009.39Schäfer, R. (2016). Accurate and efficient general-purpose boilerplate detection for crawled web corpora. Language Resources and Evaluation, 51(3), 873-889. doi:10.1007/s10579-016-9359-2Sivakumar, P. (2015). Effectual Web Content Mining using Noise Removal from Web Pages. Wireless Personal Communications, 84(1), 99-121. doi:10.1007/s11277-015-2596-7Song, D., Sun, F., & Liao, L. (2013). A hybrid approach for content extraction with text density and visual importance of DOM nodes. Knowledge and Information Systems, 42(1), 75-96. doi:10.1007/s10115-013-0687-xR. Uma and B. Latha. 2018. Noise elimination from web pages for efficacious information retrieval. Cluster Comput. (Mar. 2018). https://link.springer.com/article/10.1007/s10586-018-2366-x#citeas. R. Uma and B. Latha. 2018. Noise elimination from web pages for efficacious information retrieval. Cluster Comput. (Mar. 2018). https://link.springer.com/article/10.1007/s10586-018-2366-x#citeas.Uzun, E., Agun, H. V., & Yerlikaya, T. (2013). A hybrid approach for extracting informative content from web pages. Information Processing & Management, 49(4), 928-944. doi:10.1016/j.ipm.2013.02.005Vieira, K., da Costa Carvalho, A. L., Berlt, K., de Moura, E. S., da Silva, A. S., & Freire, J. (2009). On Finding Templates on Web Collections. World Wide Web, 12(2), 171-211. doi:10.1007/s11280-009-0059-3Vieira, K., da Silva, A. S., Pinto, N., de Moura, E. S., Cavalcanti, J. M. B., & Freire, J. (2006). A fast and robust method for web page template detection and removal. Proceedings of the 15th ACM international conference on Information and knowledge management - CIKM ’06. doi:10.1145/1183614.1183654Thijs Vogels Octavian-Eugen Ganea and Carsten Eickhoff. 2018. Web2Text: Deep structured boilerplate removal. CoRR abs/1801.02607 (2018). Retrieved from http://arxiv.org/abs/1801.02607. Thijs Vogels Octavian-Eugen Ganea and Carsten Eickhoff. 2018. Web2Text: Deep structured boilerplate removal. CoRR abs/1801.02607 (2018). Retrieved from http://arxiv.org/abs/1801.02607.Wang, Y., Fang, B., Cheng, X., Guo, L., & Xu, H. (2008). Incremental web page template detection. Proceeding of the 17th international conference on World Wide Web - WWW ’08. doi:10.1145/1367497.1367749Yi, L., Liu, B., & Li, X. (2003). Eliminating noisy information in Web pages for data mining. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’03. doi:10.1145/956750.956785Zheng, S., Song, R., Wen, J.-R., & Giles, C. L. (2009). Efficient record-level wrapper induction. Proceeding of the 18th ACM conference on Information and knowledge management - CIKM ’09. doi:10.1145/1645953.1645962Zheng, S., Song, R., Wen, J.-R., & Wu, D. (2007). Joint optimization of wrapper generation and template detection. Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’07. doi:10.1145/1281192.128128

    An Emotional Analysis of False Information in Social Media and News Articles

    Full text link
    [EN] Fake news is risky since it has been created to manipulate the readers' opinions and beliefs. In this work, we compared the language of false news to the real one of real news from an emotional perspective, considering a set of false information types (propaganda, hoax, clickbait, and satire) from social media and online news articles sources. Our experiments showed that false information has different emotional patterns in each of its types, and emotions play a key role in deceiving the reader. Based on that, we proposed a LSTM neural network model that is emotionally-infused to detect false news.The work of the second author was partially funded by the Spanish MICINN under the research project MISMISFAKEnHATE on Misinformation and Miscommunication in social media: FAKEnews and HATE speech (PGC2018-096212B-C31).Ghanem, BHH.; Rosso, P.; Rangel, F. (2020). An Emotional Analysis of False Information in Social Media and News Articles. ACM Transactions on Internet Technology. 20(2):1-18. https://doi.org/10.1145/3381750S118202Magda B. Arnold. 1960. Emotion and Personality. Columbia University Press. Magda B. Arnold. 1960. Emotion and Personality. Columbia University Press.Bhatt, G., Sharma, A., Sharma, S., Nagpal, A., Raman, B., & Mittal, A. (2018). Combining Neural, Statistical and External Features for Fake News Stance Identification. Companion of the The Web Conference 2018 on The Web Conference 2018 - WWW ’18. doi:10.1145/3184558.3191577Castillo, C., Mendoza, M., & Poblete, B. (2011). Information credibility on twitter. Proceedings of the 20th international conference on World wide web - WWW ’11. doi:10.1145/1963405.1963500Chakraborty, A., Paranjape, B., Kakarla, S., & Ganguly, N. (2016). Stop Clickbait: Detecting and preventing clickbaits in online news media. 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). doi:10.1109/asonam.2016.7752207Ekman, P. (1992). An argument for basic emotions. Cognition and Emotion, 6(3-4), 169-200. doi:10.1080/02699939208411068Ghanem, B., Rosso, P., & Rangel, F. (2018). Stance Detection in Fake News A Combined Feature Representation. Proceedings of the First Workshop on Fact Extraction and VERification (FEVER). doi:10.18653/v1/w18-5510Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735-1780. doi:10.1162/neco.1997.9.8.1735Karadzhov, G., Nakov, P., Màrquez, L., Barrón-Cedeño, A., … Koychev, I. (2017). Fully Automated Fact Checking Using External Sources. RANLP 2017 - Recent Advances in Natural Language Processing Meet Deep Learning. doi:10.26615/978-954-452-049-6_046Kumar, S., West, R., & Leskovec, J. (2016). Disinformation on the Web. Proceedings of the 25th International Conference on World Wide Web. doi:10.1145/2872427.2883085Li, X., Meng, W., & Yu, C. (2011). T-verifier: Verifying truthfulness of fact statements. 2011 IEEE 27th International Conference on Data Engineering. doi:10.1109/icde.2011.5767859Nyhan, B., & Reifler, J. (2010). When Corrections Fail: The Persistence of Political Misperceptions. Political Behavior, 32(2), 303-330. doi:10.1007/s11109-010-9112-2Plutchik, R. (2001). The Nature of Emotions. American Scientist, 89(4), 344. doi:10.1511/2001.4.344Popat, K., Mukherjee, S., Strötgen, J., & Weikum, G. (2016). Credibility Assessment of Textual Claims on the Web. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. doi:10.1145/2983323.2983661Poria, S., Gelbukh, A., Hussain, A., Howard, N., Das, D., & Bandyopadhyay, S. (2013). Enhanced SenticNet with Affective Labels for Concept-Based Opinion Mining. IEEE Intelligent Systems, 28(2), 31-38. doi:10.1109/mis.2013.4Rangel, F., & Rosso, P. (2016). On the impact of emotions on author profiling. Information Processing & Management, 52(1), 73-92. doi:10.1016/j.ipm.2015.06.003Rashkin, H., Choi, E., Jang, J. Y., Volkova, S., & Choi, Y. (2017). Truth of Varying Shades: Analyzing Language in Fake News and Political Fact-Checking. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/d17-1317Ruchansky, N., Seo, S., & Liu, Y. (2017). CSI. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. doi:10.1145/3132847.3132877Tausczik, Y. R., & Pennebaker, J. W. (2009). The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods. Journal of Language and Social Psychology, 29(1), 24-54. doi:10.1177/0261927x09351676Volkova, S., Shaffer, K., Jang, J. Y., & Hodas, N. (2017). Separating Facts from Fiction: Linguistic Models to Classify Suspicious and Trusted News Posts on Twitter. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). doi:10.18653/v1/p17-2102Zhao, Z., Resnick, P., & Mei, Q. (2015). Enquiring Minds. Proceedings of the 24th International Conference on World Wide Web. doi:10.1145/2736277.274163

    Cashtag piggybacking: uncovering spam and bot activity in stock microblogs on Twitter

    Full text link
    Microblogs are increasingly exploited for predicting prices and traded volumes of stocks in financial markets. However, it has been demonstrated that much of the content shared in microblogging platforms is created and publicized by bots and spammers. Yet, the presence (or lack thereof) and the impact of fake stock microblogs has never systematically been investigated before. Here, we study 9M tweets related to stocks of the 5 main financial markets in the US. By comparing tweets with financial data from Google Finance, we highlight important characteristics of Twitter stock microblogs. More importantly, we uncover a malicious practice - referred to as cashtag piggybacking - perpetrated by coordinated groups of bots and likely aimed at promoting low-value stocks by exploiting the popularity of high-value ones. Among the findings of our study is that as much as 71% of the authors of suspicious financial tweets are classified as bots by a state-of-the-art spambot detection algorithm. Furthermore, 37% of them were suspended by Twitter a few months after our investigation. Our results call for the adoption of spam and bot detection techniques in all studies and applications that exploit user-generated content for predicting the stock market

    Challenges to describe QoS requirements for web services quality prediction to support web services interoperability in electronic commerce

    Get PDF
    Quality of service (QoS) is significant and necessary for web service applications quality assurance. Furthermore, web services quality has contributed to the successful implementation of Electronic Commerce (EC) applications. However, QoS is still the big issue for web services research and remains one of the main research questions that need to be explored. We believe that QoS should not only be measured but should also be predicted during the development and implementation stages. However, there are challenges and constraints to determine and choose QoS requirements for high quality web services. Therefore, this paper highlights the challenges for the QoS requirements prediction as they are not easy to identify. Moreover, there are many different perspectives and purposes of web services, and various prediction techniques to describe QoS requirements. Additionally, the paper introduces a metamodel as a concept of what makes a good web service

    Cache Serializability: Reducing Inconsistency in Edge Transactions

    Full text link
    Read-only caches are widely used in cloud infrastructures to reduce access latency and load on backend databases. Operators view coherent caches as impractical at genuinely large scale and many client-facing caches are updated in an asynchronous manner with best-effort pipelines. Existing solutions that support cache consistency are inapplicable to this scenario since they require a round trip to the database on every cache transaction. Existing incoherent cache technologies are oblivious to transactional data access, even if the backend database supports transactions. We propose T-Cache, a novel caching policy for read-only transactions in which inconsistency is tolerable (won't cause safety violations) but undesirable (has a cost). T-Cache improves cache consistency despite asynchronous and unreliable communication between the cache and the database. We define cache-serializability, a variant of serializability that is suitable for incoherent caches, and prove that with unbounded resources T-Cache implements this new specification. With limited resources, T-Cache allows the system manager to choose a trade-off between performance and consistency. Our evaluation shows that T-Cache detects many inconsistencies with only nominal overhead. We use synthetic workloads to demonstrate the efficacy of T-Cache when data accesses are clustered and its adaptive reaction to workload changes. With workloads based on the real-world topologies, T-Cache detects 43-70% of the inconsistencies and increases the rate of consistent transactions by 33-58%.Comment: Ittay Eyal, Ken Birman, Robbert van Renesse, "Cache Serializability: Reducing Inconsistency in Edge Transactions," Distributed Computing Systems (ICDCS), IEEE 35th International Conference on, June~29 2015--July~2 201
    • …
    corecore