12 research outputs found

    Crawler Technology Based on Scrapy Framework

    Get PDF

    Perbandingan Metode Web Scraping Menggunakan CSS Selector dan Xpath Selector

    Get PDF
    Using data and news spread across the internet to improve the odds of business success through market trend analysis is very common today. Web crawling and web scraping have therefore become important, so that the collected data is complete and up to date. CSS Selector and XPath are two methods commonly used in the crawling process. The two methods differ in the amount of data retrieved, the size of the output file, and the processing time: XPath has the advantage in the amount of data retrieved and in processing time, which results in a larger output file. Memory usage during crawling, however, does not differ significantly between the two methods.
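    As a concrete illustration of the comparison above, the minimal Scrapy spider below extracts the same field once with a CSS selector and once with an XPath selector; the target URL, element classes and field names are hypothetical placeholders, not taken from the paper.

```python
import scrapy

class NewsSpider(scrapy.Spider):
    """Minimal sketch contrasting CSS and XPath selectors in Scrapy.
    The URL and element classes are hypothetical placeholders."""
    name = "news"
    start_urls = ["https://example.com/news"]

    def parse(self, response):
        for article in response.css("div.article"):
            yield {
                # CSS selector variant
                "title_css": article.css("h2.title::text").get(),
                # Equivalent XPath selector variant
                "title_xpath": article.xpath(".//h2[@class='title']/text()").get(),
            }
```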

    Application Research of Crawler and Data Analysis Based on Python

    Get PDF

    Pemanfaatan News Crawling Untuk Pembangunan Corpus Berita Menggunakan Scrapy dan Xpath

    Get PDF
    Linguistically, a language corpus is a collection of written (textual) or spoken language material used to test hypotheses about language structure. However, language corpora, especially Indonesian corpora, are still very scarce. This is because language corpora are rarely used for Natural Language Processing, and most studies still reuse the corpora used by previous research. In addition, constructing a corpus takes a long time and incurs large costs. To overcome this problem, this research proposes the development of a language corpus, especially an Indonesian corpus, using the Scrapy web crawling engine and guided XPath. With guided web crawling, a language corpus can be built that matches the needs of the research and is clean of unexpected code and links, without consuming much time and energy. The results show that the development of a news corpus using Scrapy and XPath successfully met the expected target: the resulting news corpus is divided into three categories, namely entertainment, community and culinary news. In addition, the tested parameters show that resource usage on the server computer is directly proportional to the number of items obtained and the file size; the more items obtained and successfully stored, the larger the file size and the more memory used. Thus, to limit memory usage on the server computer, the items taken during the scraping process can be restricted by limiting the number of links crawled by the spider or the number of items to be collected. Keywords: Language Corpus, Natural Language Processing, Scrapy, Web Crawling, XPath
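    The abstract's closing suggestion (bounding memory by limiting crawled links or collected items) maps directly onto Scrapy's built-in settings. The sketch below is one hedged way to do it; the URL, category label and selectors are assumptions, not the paper's actual configuration.

```python
import scrapy

class CorpusSpider(scrapy.Spider):
    """Sketch of a news-corpus spider that bounds resource use by capping
    crawled pages and scraped items. URLs and selectors are hypothetical."""
    name = "corpus"
    start_urls = ["https://example.com/entertainment"]
    custom_settings = {
        "CLOSESPIDER_ITEMCOUNT": 500,   # stop after 500 items are scraped
        "CLOSESPIDER_PAGECOUNT": 200,   # stop after 200 pages are fetched
        "DEPTH_LIMIT": 2,               # do not follow links deeper than 2 hops
    }

    def parse(self, response):
        # Guided XPath: keep only article body text, skipping nav/script noise.
        for node in response.xpath("//article"):
            yield {
                "category": "entertainment",
                "text": " ".join(node.xpath(".//p/text()").getall()),
            }
        # Follow pagination links (bounded by DEPTH_LIMIT above).
        for href in response.xpath("//a[@rel='next']/@href").getall():
            yield response.follow(href, callback=self.parse)
```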

    Financial Banking Dataset for Supervised Machine Learning Classification

    Get PDF
    Social media has opened new avenues and opportunities for financial banking institutions to improve the quality of their products and services and to understand and adapt to their customers' needs. By directly analyzing customer feedback, financial banking institutions can provide personalized products and services tailored to customer needs. This paper presents a research framework for the creation of a financial banking dataset to be used for sentiment classification with various machine learning methods and techniques. The dataset contains 2234 financial banking comments collected from Romanian financial banking social media via web scraping.
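    The paper does not specify its classifiers, but the hedged sketch below shows how such a comment dataset might feed a baseline sentiment classifier with scikit-learn; the CSV file name and column names ("comment", "sentiment") are assumptions, not the authors' schema.

```python
# Sketch: a baseline sentiment classifier over a scraped comment dataset.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("banking_comments.csv")  # hypothetical file and columns
X_train, X_test, y_train, y_test = train_test_split(
    df["comment"], df["sentiment"], test_size=0.2, random_state=42)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # word and bigram features
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```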

    Um Sistema de Aquisição e Análise de Dados para Extração de Conhecimento da Plataforma Ebit

    Full text link
    The development of the internet and the consequent change in forms of communication have strengthened online social networks, increasing people's involvement with this medium and making consumers of products and services better informed and more demanding toward companies. This context has given rise to Social CRM, which can be put into practice through electronic word-of-mouth platforms that enable web sharing of comments and evaluations about companies, defining their reputation. However, most electronic word-of-mouth platforms do not provide a means of extracting their information, making it difficult to analyze the data. To fill this gap, a system was developed to capture and automatically summarize the data of the companies registered on the eBit platform. Comment: in Portuguese; paper presented at the 15th International Conference on Information Systems & Technology Management.
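    The paper does not publish its implementation, so the sketch below shows one hedged way such a capture step could look for a review page that exposes no API, using requests and BeautifulSoup; the URL and CSS classes are hypothetical, and the eBit site's real markup will differ.

```python
# Sketch: capturing company ratings from an eWOM page without an API.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/company/acme", timeout=30)  # hypothetical URL
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

reviews = []
for card in soup.select("div.review"):          # hypothetical CSS classes
    reviews.append({
        "rating": card.select_one("span.rating").get_text(strip=True),
        "comment": card.select_one("p.comment").get_text(strip=True),
    })
print(f"captured {len(reviews)} reviews")
```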

    Extração automática de documentos médicos da web para análise textual

    Get PDF
    Master's dissertation in Biomedical Engineering (specialization in Medical Informatics). The scientific literature in biomedicine is a fundamental element in the process of obtaining knowledge, since it is the largest and most reliable source of information. With technological advances and increasing professional competition, the volume and diversity of scientific medical documents have increased considerably, preventing researchers from keeping up with the growth of the bibliography. To circumvent this situation and reduce the time professionals spend on data extraction and literature review, the concepts of web crawling, web scraping and natural language processing have emerged, which allow, respectively, the search, extraction and automatic processing of large amounts of text, covering a wider range of scientific documents than those normally analyzed manually.
    The work developed for this dissertation focused on crawling and collecting complete scientific documents from the field of biomedicine. As most web repositories do not provide the entire document for free, but only the abstract of the publication, it was important to select an appropriate database. For this reason, the crawled web pages were restricted to the domain of the BioMed Central repositories, which provide thousands of complete scientific papers in the field of biomedicine. The system architecture is divided into two main parts: an online phase and an offline phase. The first includes searching for and extracting the URLs of the candidate pages, collecting the desired text fields and storing them in a database. The second phase handles and cleans the collected documents, leaving them in a structured, valid format to be used as input to any text analysis system. For the first part, the Scrapy framework was used as the basis for the scraper, and the MongoDB document database was used to store the collected scientific publications. In the second step, for data cleaning and standardization, several of the many libraries and features offered by the Python language were used. To demonstrate the operation of the extraction system, the practical case of collecting scientific publications related to Obsessive Compulsive Disorder was studied. As a result of the entire procedure, a database with four document collections at different processing levels was obtained.
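    As a hedged sketch of the online phase described above, the Scrapy item pipeline below persists each collected publication in MongoDB via pymongo; the database name, collection name and item fields are assumptions, not the dissertation's actual schema. The pipeline would be enabled through Scrapy's ITEM_PIPELINES setting, and the offline phase would then read the same collection back for cleaning.

```python
# Sketch: a Scrapy item pipeline storing scraped publications in MongoDB.
import pymongo

class MongoPipeline:
    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.collection = self.client["biomed"]["raw_articles"]  # hypothetical names

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Each item is assumed to carry the extracted text fields,
        # e.g. {"url": ..., "title": ..., "abstract": ..., "body": ...}.
        self.collection.insert_one(dict(item))
        return item
```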

    Study of the long tail formation within an eWOM community: the case of Ciao UK

    Get PDF
    Continuous communication among people and ubiquitous online access are fundamental characteristics of online eWOM communities, which are facilitating the distribution of a broad range of products and services. eWOM communities have emerged to influence customers directly and create interest with efficacy and flexibility in spite of geographic boundaries (Duan, Gu and Whinston 2008). They provide rich and objective product information that influences customers' decision making (Gu, Tang and Whinston 2013; Kim and Gupta 2009; Zacharia, Moukas and Maes 2000), due to the credibility, empathy and relevance they offer to customers as opposed to the information provided by marketer-designed websites (Bickart and Schindler 2001). Through eWOM, users can freely post their reviews of any product or service and share those reviews with other users in order to better understand a product (Hennig-Thurau et al. 2004). Thus, through eWOM communities, a large audience of users can acquire knowledge from reviews of products and services that are less popular with the majority. In that respect, the distribution of product sales is changing due to the increase in product information available to consumers (Brynjolfsson, Hu and Smith 2010), facilitating the long tail phenomenon (Anderson 2004).
    Many authors have given a good account of the main idea behind the long tail in the sales distributions of product markets such as Amazon (Brynjolfsson, Hu and Smith 2003; Brynjolfsson, Hu and Smith 2010). This Thesis, however, goes further: it applies new methodologies (the elbow criterion) and extends others (the power-law distribution methods of Clauset, Shalizi and Newman 2009) to mathematically measure the long tail in other environments, such as the eWOM community Ciao. Whereas most eWOM studies focus only on the potential of eWOM to facilitate the long tail effect of finding rare or niche products (Hennig-Thurau, Gwinner, Walsh and Gremler 2004; Khammash and Griffiths 2011) and on how eWOM enables zero-cost dissemination of information about products (Odić, Tkalčič, Tasič and Košir 2013), few have noticed that each product type enclosed in the tail of the sales distribution might have a different impact. In this regard, the results of this Thesis indicate that vendors could adopt alternative product strategies depending on which niche product type (search or experience goods) forms the tail of the sales distribution. More specifically, this Thesis proposes an approach for detecting whether there is a long tail for each product type; cases should thus be differentiated when niche products represent a significant portion of overall product sales.
    Likewise, given the volume of user-generated content on the web and its speed of change, this Thesis presents two further contributions: first, the implementation of an effective web crawler that can gather and identify large amounts of user-generated content; second, the stages followed in this crawling process, namely the identification and collection of the relevant data and the maintenance of the gathered data. Social science therefore needs to develop adequate methodologies, such as the one outlined in this Thesis, to deal with huge amounts of data and to overcome the distance between technology and the social sciences.
    The methodology chosen in this Thesis triangulates two methods to identify the long tail. To compare all product types within the eWOM community Ciao UK, the power-law probability distribution function was used as a tool to measure the long tail; to further validate this method, the elbow criterion was also used to locate the optimal cut-off point that distinguishes the products characterized by the long tail. Furthermore, this Thesis outlines an architectural framework and methodology to gather user-generated data from the eWOM community Ciao UK. To that end, a new methodology describes the implementation of a web crawler from another disciplinary perspective: computer science.
    This Thesis aims to contribute to the study of the long tail phenomenon in an eWOM community and of the product types enclosed there. To this end, the following three hypotheses were tested. H1: The experience products from the distribution of product categories within an eWOM community are more likely to exhibit a long tail. H2: The search products from the distribution of product categories within an eWOM community are less likely to exhibit a long tail. H3: The distributions of product categories within an eWOM community that have high-frequency events or super-hits in the short head are not particularly associated with search or experience products.
    The results supported all three proposed hypotheses, and this Thesis presents important new findings. First, it is evidenced that products having a long tail are those with subjective evaluation standards, which are classified as experience products. Second, it is corroborated that search products, which have a high level of objective attributes in the total product assessment, do not encourage the long tail phenomenon. Third, there is a mixture of products when there are super-hits in the short head of the distribution; these are not particularly associated with search or experience products, since they contain either objective or subjective evaluation standards. Finally, it is also worth highlighting that not all categories fitting a power-law distribution are characterized by a long tail and, conversely, some of those having a long tail do not fit a power law. In general, the findings suggest the potential of eWOM to generate a long tail effect in which a large number of small-volume vendors coexist with a few high-volume ones. This Thesis has thus contributed to both theory and practice in three ways: (1) a methodology for collecting online user-generated data in the context of the social sciences; (2) two more accurate methods to identify niche products within an eWOM community, providing a deeper understanding of the long tail phenomenon and the product types involved; and (3) publications in refereed journal papers (indexed in JCR/JSCR) as well as conference papers related to the main topic of this Thesis.
    Premio Extraordinario de Doctorado U
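    As a hedged sketch of the triangulated methodology, the snippet below fits a power law with the `powerlaw` Python package (an implementation of Clauset, Shalizi and Newman's fitting method) and locates a cut-off with a simple elbow heuristic; the data here is synthetic, not the Ciao UK dataset.

```python
# Sketch: power-law fit plus an elbow cut-off on a rank-frequency curve.
import numpy as np
import powerlaw

# Synthetic stand-in for per-product review counts, sorted by rank.
counts = np.sort(np.random.zipf(2.0, 1000))[::-1]

# Clauset-Shalizi-Newman fit: estimates the exponent alpha and lower bound xmin.
fit = powerlaw.Fit(counts, discrete=True)
print("alpha:", fit.power_law.alpha, "xmin:", fit.power_law.xmin)

# Elbow criterion: the point on the rank-frequency curve farthest from the
# straight line joining its endpoints separates the short head from the tail.
x = np.arange(len(counts), dtype=float)
y = counts.astype(float)
line = y[0] + (y[-1] - y[0]) * x / x[-1]
elbow = int(np.argmax(np.abs(y - line)))
print("head/tail cut-off at rank", elbow)
```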

    Integration of RFID and Industrial WSNs to Create A Smart Industrial Environment

    Get PDF
    A smart environment is a physical space seamlessly embedded with sensors, actuators, displays, and computing devices, connected through communication networks for data collection, to enable various pervasive applications. Radio frequency identification (RFID) and wireless sensor networks (WSNs) can be used to create such smart environments, performing sensing, data acquisition, and communication functions, and thus connecting physical devices together to form a smart environment. This thesis first examines the features and requirements of a smart industrial environment. It then focuses on the realization of such an environment by integrating RFID and industrial WSNs. The ISA100.11a protocol is considered in particular for the WSNs, while High Frequency (HF) RFID is considered for identification. This thesis describes the design and implementation of the hardware and software architecture necessary for proper integration of the RFID and WSN systems. The hardware architecture focuses on the communication interface and the AI/AO interface circuit design, while the interface driver is implemented in embedded software. Through a web-based Human Machine Interface (HMI), industrial users can monitor the process parameters as well as send any necessary alarm information. In addition, a standard MongoDB database is designed, allowing access to historical and current data to gain a more in-depth understanding of the environment being created. The information can therefore be uploaded to an IoT cloud platform for easy access and storage. Four smart industrial environment scenarios are mimicked and tested in a laboratory to demonstrate the proposed integrated system. The experimental results show that the communication from the RFID reader to the WSN node and the real-time wireless transmission of the integrated system meet the design requirements. In addition, compared to a traditional wired PLC system, the measurement error of the integrated system is less than 1%. The experimental results are thus satisfactory, and the design specifications have been achieved.
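    As a hedged sketch of the database and alarm roles described above, the snippet below stores integrated RFID/WSN readings in MongoDB and flags a threshold alarm; the node names, document fields and process limit are illustrative assumptions, not the thesis design.

```python
# Sketch: persisting RFID/WSN readings to a MongoDB historian with alarms.
import datetime
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")
readings = client["smart_plant"]["readings"]  # hypothetical database/collection

def record_reading(tag_id: str, node_id: str, temperature_c: float) -> None:
    doc = {
        "rfid_tag": tag_id,             # item identity from the HF RFID reader
        "wsn_node": node_id,            # ISA100.11a node that sensed the value
        "temperature_c": temperature_c,
        "ts": datetime.datetime.utcnow(),
        "alarm": temperature_c > 80.0,  # hypothetical process limit
    }
    readings.insert_one(doc)
    if doc["alarm"]:
        print(f"ALARM: node {node_id} reported {temperature_c} degC")

record_reading("TAG-0001", "node-07", 83.5)
```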