
    Practice of Robots Exclusion Protocol in Bhutan

    Get PDF
    Most search engines rely on web robots to collect information from the web. Because the web is open and largely unregulated, robots can easily crawl and index the contents of websites, yet not every site owner wants their websites and web pages indexed by web crawlers. Crawling activity can be regulated and managed by deploying the Robots Exclusion Protocol (REP) in a file called robots.txt on the server. The method is a de-facto standard, and most ethical robots follow the rules specified in the file. Whether the many websites in Bhutan use robots.txt files to regulate such bots is not known, since no study has been carried out to date. The main aim of this paper is to investigate the use of robots.txt files on the websites of various organizations in Bhutan and, further, to analyze the content of the file where it exists. A total of 50 websites from sectors such as colleges, government ministries, autonomous agencies, corporations and newspaper agencies were selected to check usage of the file. The files found were further analyzed for file size, the types of robots specified, and correct use of the file. The results showed that almost 70% of the websites investigated use the default robots.txt file created by the Joomla and WordPress Content Management Systems (CMS), which shows that the file is in use. On the other hand, the file is not taken seriously: almost 70% of these files lack the key directives that would define which resources are allowed or denied to the various types of robots on the web. Approximately 30% of the URLs included in the study have no REP file on their web server, giving all types of web robots unregulated access to their resources. Keywords: Crawler, robots.txt, search engines, robots exclusion protocol, indexing. DOI: 10.7176/JEP/11-35-01. Publication date: December 31st 202
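    As a rough illustration of the kind of survey the paper describes, the sketch below checks whether a site serves a robots.txt file, reports its size and declared user-agent groups, and asks Python's standard robots.txt parser whether a generic crawler may fetch the root page. The domains listed are placeholders, not sites from the study.

```python
# Sketch: check whether a site serves a robots.txt file and inspect its rules,
# roughly mirroring the kind of survey described above. The domain list is a
# placeholder, not taken from the paper.
import urllib.request
import urllib.robotparser

DOMAINS = ["www.example.edu.bt", "www.example.gov.bt"]  # hypothetical sample

def inspect_robots(domain: str) -> None:
    url = f"https://{domain}/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:
        print(f"{domain}: no robots.txt reachable ({exc})")
        return

    print(f"{domain}: robots.txt found, {len(body)} bytes")
    # Count the user-agent groups declared in the file.
    agents = [line.split(":", 1)[1].strip()
              for line in body.splitlines()
              if line.lower().startswith("user-agent:")]
    print(f"  user-agent groups: {agents or ['(none)']}")

    # Ask the standard parser whether a generic crawler may fetch the root page.
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(body.splitlines())
    print(f"  '*' may fetch '/': {rp.can_fetch('*', f'https://{domain}/')}")

if __name__ == "__main__":
    for d in DOMAINS:
        inspect_robots(d)
```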

    Robots Welcome? Ethical and Legal Considerations for Web Crawling and Scraping

    Get PDF
    Web crawlers are widely used software programs designed to automatically search the online universe to find and collect information. The data that crawlers provide helps make sense of the vast and often chaotic nature of the Web. Crawlers find the websites and content that power search engines and online marketplaces. As people and organizations put an ever-increasing amount of information online, tech companies and researchers deploy more advanced algorithms that feed on that data. Even governments and law enforcement now use crawlers to carry out their missions. Despite the ubiquity of crawlers, their use is ambiguously regulated, largely by online social norms whereby webpage headers signal whether automated “robots” are welcome to crawl a site. As courts take on the issues raised by web crawlers, user privacy hangs in the balance. In August 2017, the Northern District of California granted a preliminary injunction in such a case, deciding that LinkedIn’s website must remain open to such crawlers. In March 2018, the District Court for the District of Columbia granted standing for an as-applied challenge to the Computer Fraud and Abuse Act brought by a group of academic researchers and a news organization. The Court allowed them to proceed with a case in which they allege that, by making violation of a website's Terms of Service a crime, the law effectively prohibits web crawling and infringes on their First Amendment rights. In addition, the news media is inundated with stories like Cambridge Analytica, in which web crawlers were used to scrape data from millions of Facebook accounts for political purposes. This paper discusses the history of web crawlers in the courts as well as the uses of such programs by a wide array of actors. It addresses ethical and legal issues surrounding the crawling and scraping of data posted online for uses not intended by the original poster or by the website on which the information is hosted. The article further suggests that stronger rules are necessary to protect users' initial expectations about how their data will be used, as well as their privacy.

    Web Site Metadata

    Full text link
    The currently established formats through which a Web site can publish metadata about its pages, the robots.txt file and sitemaps, focus on telling crawlers where to go and where not to go on a site. This is sufficient as input for crawlers, but it does not allow Web sites to publish richer metadata about their structure, such as the navigational structure. This paper looks at the availability of Web site metadata on today's Web in terms of the available information resources and quantitative aspects of their contents. Such an analysis of the available Web site metadata not only makes it easier to understand what data is available today; it also serves as the foundation for investigating what kinds of information retrieval processes could be driven by that data, and what additional data Web sites could provide if they had richer formats for publishing metadata.
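    As a small illustration of the richer of the two metadata formats discussed above, the sketch below pulls the URL entries out of an XML sitemap using only the standard library; the sitemap location is a placeholder.

```python
# Sketch: extract the URL entries from an XML sitemap, one of the two site
# metadata formats discussed above. The sitemap location is a placeholder.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://www.example.org/sitemap.xml"  # hypothetical
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(url: str) -> list[str]:
    with urllib.request.urlopen(url, timeout=10) as resp:
        root = ET.fromstring(resp.read())
    # <urlset> holds page entries; <sitemapindex> holds nested sitemaps.
    return [loc.text.strip() for loc in root.findall(".//sm:loc", NS) if loc.text]

if __name__ == "__main__":
    for u in sitemap_urls(SITEMAP_URL)[:20]:
        print(u)
```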

    A novel defense mechanism against web crawler intrusion

    Get PDF
    Web robots, also known as crawlers or spiders, are used by search engines, hackers and spammers to gather information about web pages. Timely detection and prevention of unwanted crawlers increases the privacy and security of websites. In this research, a novel method is proposed to identify web crawlers and prevent unwanted crawlers from accessing websites. The proposed method uses a five-factor identification process to detect unwanted crawlers. The study provides pretest and posttest results, along with a systematic evaluation of web pages protected by the proposed identification technique versus web pages without it. An experiment was performed with repeated measures for two groups, each containing ninety web pages. The results of a logistic regression analysis of the treatment and control groups confirm that the novel five-factor identification process is an effective mechanism for preventing unwanted web crawlers. The study concludes that the proposed process, built on five distinct identifiers, is a highly effective technique, as demonstrated by the successful outcome.
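    The abstract does not name the five identification factors, so the sketch below only mirrors the shape of the evaluation: a logistic regression model fitted to synthetic per-visitor features (the feature names are illustrative stand-ins) for two groups of ninety observations.

```python
# Sketch: fit a logistic regression model over per-visitor features to separate
# crawler-like sessions from human ones. The five factors in the paper are not
# named in the abstract, so the features here are illustrative stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 180  # two groups of ninety, echoing the experiment size above

# Hypothetical features: request rate, share of HEAD requests, robots.txt hit,
# image-to-page ratio, and average inter-request delay.
X_human = rng.normal([2.0, 0.05, 0.1, 3.0, 8.0], 1.0, size=(n // 2, 5))
X_bot = rng.normal([30.0, 0.40, 0.9, 0.2, 0.3], 1.0, size=(n // 2, 5))
X = np.vstack([X_human, X_bot])
y = np.array([0] * (n // 2) + [1] * (n // 2))  # 1 = crawler-like

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```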

    A Comparison of Internet Marketing Methods Utilized By Higher Education Institutions

    Get PDF
    Use of Internet marketing techniques in higher education to attract prospective students is relatively new. While the research is recent, several studies identify what is most valuable to students seeking information on college web sites. Higher education now faces increasing competition from for-profit schools and reduced funding from typical sources. This study examines how institutions in two different Carnegie classifications use Internet marketing techniques and identifies whether there is a difference in how much they use them. Content techniques and search engine optimization (SEO) techniques were examined for 56 higher education institutions. Of the 14 Internet marketing techniques studied, seven focused on the content that prospective students identify as valuable to them and the other seven focused on techniques that are recognized as important for improving web page visibility on search engines. The seven content-focused techniques were developed from multiple studies that identified what prospective students value most on a higher education web site; they include online applications, a cost calculator, online course information, admissions contact information, online visit requests, mail information requests, and student-focused navigation. The seven SEO-focused techniques were developed from journals, guidelines provided by search engines, and SEO experts; they include use of the H1 header tag, page titles, description meta tags, relevant keywords, user-friendly page addresses, sitemap.xml files, and robots.txt files.
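    A minimal sketch of how a handful of the SEO markers examined in the study could be probed automatically is shown below; it checks a page for a title tag, an H1 tag and a meta description, and tests whether robots.txt and sitemap.xml are served. The site URL is a placeholder.

```python
# Sketch: probe a site for a few of the SEO markers examined in the study
# (title, H1, meta description, robots.txt, sitemap.xml). The URL is a placeholder.
import urllib.request
from html.parser import HTMLParser

SITE = "https://www.example.edu"  # hypothetical institution homepage

class SEOScan(HTMLParser):
    def __init__(self):
        super().__init__()
        self.has_title = False
        self.has_h1 = False
        self.has_meta_description = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.has_title = True
        elif tag == "h1":
            self.has_h1 = True
        elif tag == "meta" and (dict(attrs).get("name") or "").lower() == "description":
            self.has_meta_description = True

def exists(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    with urllib.request.urlopen(SITE, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    scan = SEOScan()
    scan.feed(html)
    print("title tag:", scan.has_title)
    print("h1 tag:", scan.has_h1)
    print("meta description:", scan.has_meta_description)
    print("robots.txt:", exists(SITE + "/robots.txt"))
    print("sitemap.xml:", exists(SITE + "/sitemap.xml"))
```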

    The use of robots.txt and sitemaps in the Spanish public administration

    Get PDF
    A website's robots.txt and sitemap files are the main methods of regulating search engine crawler access to its content. This article explains the importance of these files and analyzes the robots.txt and sitemap files of more than 4,000 websites belonging to the Spanish public administration to determine how far they are used as a means of optimizing the sites for crawlers.
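    One concrete link between the two files analysed here is the Sitemap: directive inside robots.txt; the sketch below lists the sitemap declarations a robots.txt file exposes, using Python's standard parser. The domain is a placeholder.

```python
# Sketch: list the Sitemap: declarations that a robots.txt file exposes,
# i.e. the link between the two files analysed above. The domain is a placeholder.
import urllib.robotparser

DOMAIN = "www.example.gob.es"  # hypothetical public-administration site

rp = urllib.robotparser.RobotFileParser(f"https://{DOMAIN}/robots.txt")
rp.read()
sitemaps = rp.site_maps() or []  # None when no Sitemap: lines are present
print(f"{DOMAIN}: {len(sitemaps)} sitemap(s) declared")
for s in sitemaps:
    print(" ", s)
```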

    IN RE: Complaint of Judicial Misconduct

    Get PDF

    Prometheus: a generic e-commerce crawler for the study of business markets and other e-commerce problems

    Get PDF
    Master's dissertation in Computer Science. Continuous social and economic development has led over time to an increase in consumption, as well as to greater demand from consumers for better and cheaper products. Hence, the selling price of a product plays a fundamental role in the consumer's purchase decision. In this context, online stores must carefully analyse and define the best price for each product, based on several factors such as production/acquisition cost, positioning of the product (e.g. anchor product) and the strategies of competing companies. The work done by market analysts has changed drastically over the last years. As the number of Web sites grows exponentially, the number of E-commerce web sites has also grown. Web page classification becomes more important in fields like Web mining and information retrieval. Traditional classifiers are usually hand-crafted and non-adaptive, which makes them inappropriate for use in a broader context. We introduce an ensemble of methods, and the subsequent study of its results, to create a more generic and modular crawler and scraper for detection and information extraction on E-commerce web pages. The collected information may then be processed and used in pricing decisions. This framework goes by the name Prometheus and has the goal of extracting knowledge from E-commerce Web sites. The process requires crawling an online store and gathering product pages, which implies that, given a web page, the framework must be able to determine whether it is a product page. To achieve this we classify pages into three categories: catalogue, product and "spam". The page classification stage was addressed based on the HTML text as well as on the visual layout, featuring both traditional methods and Deep Learning approaches. Once a set of product pages has been identified, we proceed to the extraction of the pricing information. This is not a trivial task due to the disparity of approaches used to create web pages. Furthermore, most product pages are dynamic in the sense that they are truly a page for a family of related products. For instance, in a shoe store a particular model is probably available in a number of sizes and colours. Such a model may be displayed in a single dynamic web page, making it necessary for our framework to explore all the relevant combinations. This process is called scraping and is the last stage of the Prometheus framework.
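    The sketch below is a toy rendering of the crawl, classify and scrape pipeline described above, run against a small in-memory "store" so that the flow is executable end to end; the classifier and price extractor are deliberately naive stand-ins for the components discussed in the dissertation.

```python
# Sketch: skeleton of the crawl -> classify -> scrape pipeline described above,
# run here against a tiny in-memory "store" so the flow is executable. The
# classifier and price extractor are deliberately naive stand-ins.
import re
from collections import deque
from enum import Enum, auto

class PageKind(Enum):
    CATALOGUE = auto()
    PRODUCT = auto()
    SPAM = auto()

# Hypothetical three-page shop: a catalogue page linking to two product pages.
FAKE_STORE = {
    "/": '<a href="/shoe-1">shoe</a> <a href="/shoe-2">shoe</a>',
    "/shoe-1": "<h1>Runner</h1> add to cart <span class='price'>49.90</span>",
    "/shoe-2": "<h1>Trail</h1> add to cart <span class='price'>79.90</span>",
}

def classify(html: str) -> PageKind:
    # Stand-in for the HTML/visual-layout classifier discussed in the text.
    return PageKind.PRODUCT if "add to cart" in html else PageKind.CATALOGUE

def scrape_prices(html: str) -> list[float]:
    # Stand-in: a real scraper must also expand variant combinations
    # (sizes, colours) exposed by dynamic product pages.
    return [float(p) for p in re.findall(r"\d+\.\d{2}", html)]

def crawl(start: str = "/") -> list[float]:
    seen, queue, prices = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        html = FAKE_STORE.get(url, "")
        kind = classify(html)
        if kind is PageKind.PRODUCT:
            prices.extend(scrape_prices(html))
        elif kind is PageKind.CATALOGUE:
            for link in re.findall(r'href="([^"]+)"', html):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        # SPAM pages would be dropped without following their links.
    return prices

if __name__ == "__main__":
    print(crawl())  # [49.9, 79.9]
```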

    Web manifestations of knowledge-based innovation systems in the UK

    Get PDF
    Innovation is widely recognised as essential to the modern economy. The term knowledge-based innovation system has been used to refer to innovation systems which recognise the importance of an economy's knowledge base and of efficient interactions between important actors from the different sectors of society. Such interactions are thought to enable greater innovation by the system as a whole. Whilst it may not be possible to fully understand all the complex relationships involved within knowledge-based innovation systems, within the field of informetrics bibliometric methodologies have emerged that allow us to analyse some of the relationships that contribute to the innovation process. However, due to the limitations of traditional bibliometric sources it is important to investigate new potential sources of information, and the web is one such source. This thesis documents an investigation into the potential of the web to provide information about knowledge-based innovation systems in the United Kingdom. The link analysis methodologies that have previously been applied successfully to investigations of the academic community (Thelwall, 2004a) are applied here to organisations from different sectors of society to determine whether link analysis of the web can provide a new source of information about knowledge-based innovation systems in the UK. This study makes the case that data may be collected ethically to provide information about the interconnections between web sites of various sizes and from different sectors of society, that there are significant differences in the linking practices of web sites within different sectors, and that reciprocal links provide a better indication of collaboration than uni-directional web links. Most importantly, the study shows that the web provides new information about the relationships between organisations, rather than just a repetition of the same information from an alternative source. Whilst the study has shown that there is a lot of potential for the web as a source of information on knowledge-based innovation systems, the same richness that makes it such a potentially useful source makes large-scale studies very labour intensive.
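    The distinction between reciprocal and uni-directional links that the thesis relies on can be computed directly from a set of directed site-to-site links, as in the sketch below; the link data is invented for illustration.

```python
# Sketch: separate reciprocal from one-way links in a set of directed
# site-to-site links, the distinction the thesis argues matters for
# detecting collaboration. The link data below is made up for illustration.
links = {
    ("university-a.ac.uk", "firm-b.co.uk"),
    ("firm-b.co.uk", "university-a.ac.uk"),
    ("university-a.ac.uk", "agency-c.gov.uk"),
    ("firm-b.co.uk", "agency-c.gov.uk"),
}

reciprocal = {frozenset(pair) for pair in links if (pair[1], pair[0]) in links}
one_way = [pair for pair in links if frozenset(pair) not in reciprocal]

print("reciprocal pairs:", [tuple(sorted(p)) for p in reciprocal])
print("one-way links:", sorted(one_way))
```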