
    Tutorial: Legality and Ethics of Web Scraping

    Researchers and practitioners often use various tools and technologies to automatically retrieve data from the Web (often referred to as web scraping) when conducting their projects. Unfortunately, they often overlook the legality and ethics of using these tools to collect data. Failure to pay due attention to these aspects of web scraping can result in serious ethical controversies and lawsuits. Accordingly, we review legal literature together with the literature on ethics and privacy to identify broad areas of concern, together with a list of specific questions that researchers and practitioners engaged in web scraping need to address. Reflecting on these questions and concerns can help researchers and practitioners decrease the likelihood of ethical and legal controversies in their work.

    Web Data Extraction in Audit Data Analytics: Technology Artifact Development from a Design Science Research Perspective

    The growing implementation of Information and Communication Technology (ICT) as part of organizations' internal controls has led auditors to develop Audit Data Analytics (ADA) as a body of knowledge and practice for obtaining audit evidence and other information from collections of electronic data at every stage of audit work. At the same time, organizations increasingly present their data through web-based applications. Given the role of web pages as a source of data (audit evidence), techniques for extracting data from web pages, known as web data extraction, have emerged. Using a design science research methodology, this study proposes artifacts in the form of a model and an instantiation of web data extraction for implementing ADA. The results are expected to add to the audit practice literature an artifact instantiating the use of web data extraction to acquire data as audit evidence from web pages, whether from intranet- or internet-based applications. The study also contributes a practical framework for implementing web data extraction as part of ADA in audit work. In addition, it is expected to serve as a reference for the use of design science research methodology, which has so far seen little application in audit research in Indonesia.

    Information Protection in Dark Web Drug Markets Research

    In recent years, there have been increasingly conflicting calls for more government surveillance online and, paradoxically, for increased protection of the privacy and anonymity of individuals. Many corporations and groups globally have come under fire both for sharing data with law enforcement agencies and for refusing to cooperate with those agencies in order to protect their customers. In this study, we focus on Dark Web drug trading sites as an exemplary case of problematic areas of information protection, and ask what practices should be followed when gathering data from the Dark Web. Using lessons from an ongoing research project, we outline best practices for protecting the safety of the people under study on these sites without compromising the quality of research data gathering.

    Web Scraping in the R Language: A Tutorial

    Information Systems researchers can now more easily access vast amounts of data on the World Wide Web to answer both familiar and new questions with more rigor, precision, and timeliness. The main goal of this tutorial is to explain how Information Systems researchers can automatically "scrape" data from the web using the R programming language. The article provides a conceptual overview of the web scraping process, then discusses two R packages useful for web scraping, "rvest" and "xml2", with simple examples of their use. The tutorial concludes with a complex web scraping task: retrieving data from Bayt.com, a leading employment website in the Middle East.
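    The fetch-then-parse workflow the tutorial describes for rvest/xml2 can be sketched in Python using only the standard library (a hypothetical illustration, not the tutorial's own code; the HTML string, tag names, and the `job-title` class are invented stand-ins for a fetched page):

    ```python
    # Minimal scrape-parse sketch: extract the text of every
    # <h2 class="job-title"> element from an HTML document.
    # A real scrape would first download the page (e.g. with urllib).
    from html.parser import HTMLParser

    class JobTitleParser(HTMLParser):
        """Collect the text content of h2 elements with class 'job-title'."""
        def __init__(self):
            super().__init__()
            self.titles = []
            self._in_title = False

        def handle_starttag(self, tag, attrs):
            if tag == "h2" and ("class", "job-title") in attrs:
                self._in_title = True

        def handle_endtag(self, tag):
            if tag == "h2":
                self._in_title = False

        def handle_data(self, data):
            if self._in_title and data.strip():
                self.titles.append(data.strip())

    html = """
    <html><body>
      <h2 class="job-title">Data Analyst</h2>
      <p>Posted yesterday</p>
      <h2 class="job-title">Web Developer</h2>
    </body></html>
    """

    parser = JobTitleParser()
    parser.feed(html)
    print(parser.titles)  # ['Data Analyst', 'Web Developer']
    ```

    In R, the equivalent steps would be `read_html()` followed by element selection and text extraction with rvest; the structure of the task is the same in either language.
    
    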

    Two Decades of Laws and Practice Around Screen Scraping in the Common Law World and Its Open Banking Watershed Moment

    Screen scraping—a technique using an agent to collect, parse, and organize data from the web in an automated manner—has found countless applications over the past two decades. It is now employed in settings ranging from targeted advertising, price aggregation, budgeting apps, and website preservation to academic research and journalism. However, this tool has raised enormous controversy in the age of big data. This article takes a comparative law approach to explore two sets of analytical issues in three common law jurisdictions: the United States, the United Kingdom, and Australia. As a first step, the article maps out the trajectory of relevant laws and jurisprudence around screen scraping legality in those jurisdictions, focusing on five selected issue areas—"digital trespass" statutes, tort, intellectual property rights, contract, and data protection. Our findings reveal some divergence in the way each country addresses the legality of screen scraping. Despite that divergence, a sea change may come amid the trend of data sharing under the banner of "Open Banking" in the coming years. This article argues that, to the extent these data sharing initiatives enable information flow between entities, they could reduce the demand for screen scraping generally, thereby bringing some level of convergence. Yet this convergence is qualified by the institutional design of data sharing schemes—whether or not a scheme explicitly addresses screen scraping (as in Australia and the United Kingdom) and whether there is a top-down, government-mandated data-sharing regime (as in the United States).

    Protecting Publicly Available Data With Machine Learning Shortcuts

    Machine-learning (ML) shortcuts or spurious correlations are artifacts in datasets that lead to very good training and test performance but severely limit the model's generalization capability. Such shortcuts are insidious because they go unnoticed due to good in-domain test performance. In this paper, we explore the influence of different shortcuts and show that even simple shortcuts are difficult to detect by explainable AI methods. We then exploit this fact and design an approach to defend online databases against crawlers: providers such as dating platforms, clothing manufacturers, or used car dealers have to deal with a professionalized crawling industry that grabs and resells data points on a large scale. We show that a deterrent can be created by deliberately adding ML shortcuts. Such augmented datasets are then unusable for ML use cases, which deters crawlers and the unauthorized use of data from the internet. Using real-world data from three use cases, we show that the proposed approach renders such collected data unusable, while the shortcut is at the same time difficult to notice in human perception. Thus, our proposed approach can serve as a proactive protection against illegitimate data crawling.
    Comment: Published at BMVC 202
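    The core idea—serving data with a deliberately injected, label-correlated shortcut so that models trained on crawled copies fail to generalize—can be illustrated with a toy sketch (hypothetical, not the authors' method; the record fields and the `pad` marker feature are invented for illustration):

    ```python
    # Toy illustration of a deliberate ML shortcut: every publicly
    # served record carries a hidden feature that perfectly predicts
    # its label. A model fit on crawled copies can "solve" the task
    # by reading the marker alone, learning nothing that transfers
    # to clean data.
    import random

    random.seed(0)

    def add_shortcut(record, label):
        """Return a copy of the record with a label-correlated marker."""
        marked = dict(record)
        marked["pad"] = 1.0 if label == 1 else 0.0  # spurious feature
        return marked

    labels = [0, 1, 0, 1]
    served = [add_shortcut({"price": random.uniform(10, 100)}, y)
              for y in labels]

    # On the crawled dataset, the injected marker alone separates
    # the classes perfectly, i.e. it is a trivial shortcut.
    shortcut_only = all((r["pad"] == 1.0) == (y == 1)
                        for r, y in zip(served, labels))
    print(shortcut_only)  # True
    ```

    The paper's setting uses image data, where the shortcut can be made imperceptible to humans; the tabular marker above is only the simplest possible analogue of that idea.
    
    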

    Productivity, Digital Footprint and Sustainability in the Textile and Clothing Industry

    In recent years, there has been a shift from the linear economic model on which the textile and clothing industry is based to a more sustainable model. However, to date, limited research on the relationship between sustainability commitment and firm productivity has focused on the textile and clothing industry. This study addresses this gap and aims to explore whether the digital footprint of small and medium-sized textile companies, in terms of their sustainable performance, is related to their productivity. To this end, the paper proposes an innovative model to monitor companies' commitment to sustainability issues by analyzing online data retrieved from their corporate websites. This information is merged with balance sheet data to examine the impact of sustainability practices, capital, and human capital on productivity. The estimated firm total factor productivity is explained as a function of the sustainability digital footprint measures and additional control variables for a sample of 315 textile firms located in the region of Comunidad Valenciana, Spain.
    This work was partially funded by MCIN/AEI/10.13039/501100011033 under grant PID2019-107765RB-I00.
    Domenech, J.; Garcia-Bernabeu, A.; Diaz-Garcia, P. (2023). Productivity, Digital Footprint and Sustainability in the Textile and Clothing Industry. Editorial Universitat Politècnica de València. 319-326. https://doi.org/10.4995/CARMA2023.2023.1644631932

    Unfair Collection: Reclaiming Control of Publicly Available Personal Information from Data Scrapers

    Rising enthusiasm for consumer data protection in the United States has resulted in several states advancing legislation to protect the privacy of their residents' personal information. But even the newly enacted California Privacy Rights Act (CPRA)—the most comprehensive data privacy law in the country—leaves a wide-open gap for internet data scrapers to extract, share, and monetize consumers' personal information while circumventing regulation. Allowing scrapers to evade privacy regulations comes with potentially disastrous consequences for individuals and society at large. This Note argues that even publicly available personal information should be protected from bulk collection and misappropriation by data scrapers. California should reform its privacy legislation to align with the European Union's General Data Protection Regulation (GDPR), which requires data scrapers to provide notice to data subjects upon the collection of their personal information regardless of its public availability. This reform could lay the groundwork for future legislation at the federal level.

    Views on Balancing Multiple Interests in the Regulation of Enterprise Data Crawling under the Anti-Unfair Competition Law

    Rapid innovation and the widespread application of information and digital technologies are continually reshaping classic business models and market behaviors. While stimulating entrepreneurship and unleashing the dividends of technological innovation, the Internet economy has also given rise to many new forms of unfair competition. Data crawling accounts for a significant share of Internet traffic because it is a cost-effective data collection strategy; it promotes data sharing, but it also makes the regulation of unfair competition more difficult. In regulating data crawling under the Anti-Unfair Competition Law, difficulties should be assessed from the standpoint of the interests at stake, fully weighing whether crawling harms the interests of operators, consumers, and the public. The method of interest measurement can also be used to coordinate these interests, so as to maintain competitive order and balance multiple interests.