37 research outputs found

    Web Scraping the Easy Way

    Web scraping refers to a software program that mimics human web surfing behavior by pointing to a website and collecting large amounts of data that would otherwise be difficult for a human to extract. A typical program will extract both unstructured and semi-structured data, as well as images, and convert the data into a structured format. Web scraping is commonly used to facilitate online price comparisons, aggregate contact information, extract online product catalog data, extract economic/demographic/statistical data, and create web mashups, among other uses. Additionally, in the era of big data, semantic analysis, and business intelligence, web scraping is the only option for data extraction as many individuals and organizations need to consume large amounts of data that reside on the web. Although many users and organizations program their own web scrapers, there are scores of freely available programs and web-browser add-ins that can facilitate web scraping. This paper demonstrates web scraping using a free program named Data Toolbar® to extract data from Amazon.com. It is hoped that the paper will expose academicians, students, and practitioners not only to the concept and necessity of web scraping, but to the available software as well.
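    As a rough illustration of converting semi-structured markup into a structured format, the sketch below parses a small product listing with Python's standard `html.parser`. It is not the Data Toolbar workflow the paper demonstrates; the markup, class names, and the `ProductParser` and `scrape_products` names are hypothetical, and a real scraper would fetch the page over HTTP first.

```python
from html.parser import HTMLParser

# Hypothetical product-listing markup, invented for illustration only.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">USB cable</span><span class="price">$4.99</span></li>
  <li class="product"><span class="name">Keyboard</span><span class="price">$24.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects one {'name': ..., 'price': ...} record per product item."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None  # field whose text we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.records.append({})  # start a new record
        elif tag == "span" and css_class in ("name", "price") and self.records:
            self._field = css_class

    def handle_data(self, data):
        if self._field is not None:
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape_products(html):
    """Return structured records with the price converted to a number."""
    parser = ProductParser()
    parser.feed(html)
    return [
        {"name": r["name"], "price": float(r["price"].lstrip("$"))}
        for r in parser.records
    ]


products = scrape_products(SAMPLE_HTML)
```

    Here `products` is a list of dictionaries that could be written straight to a spreadsheet or database table.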


    The Covid-19 pandemic has made people around the world more aware of the importance of maintaining their health and adopting healthier lifestyles. Clear, correct, and precise information is indispensable for understanding this respiratory virus, and the public widely uses digital media to find links about it. In this study, health articles about Covid-19 are collected from several sites by scraping, and the retrieved data are processed into automatic summaries using Latent Semantic Analysis (LSA), a method that helps uncover the hidden meaning in a collection of sentences. Summary formation is assisted by the cross method. The system also provides article search so that users can find the right information. The results show that LSA assisted by the cross method works well for automatic summarization: testing yielded average f-measure and recall values of 90.68% and 85%, with a training-to-test data ratio of 90:10. Data were collected during February-June 2020, yielding 120 training documents and 12 test documents. Testing was done with a compression rate of 30%.
    Keywords: automatic summary, health article, scraping, latent semantic analysis, singular value decomposition, cross method
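    The abstract does not spell out the computation, so the following is only a sketch of how LSA with a cross-method step is commonly described in the summarization literature, not necessarily the authors' exact procedure. The symbols are assumptions: A is a term-by-sentence matrix with m terms and n sentences, r is the number of retained concepts, and v_{kj} is the entry of V^T for concept k and sentence j.

```latex
% Singular value decomposition of the term-by-sentence matrix
A = U \Sigma V^{\top}, \qquad A \in \mathbb{R}^{m \times n}

% Cross-method preprocessing: within each concept row k of V^T,
% zero out sentences scoring at or below the row average
\bar{v}_k = \frac{1}{n} \sum_{j=1}^{n} v_{kj}, \qquad
\tilde{v}_{kj} =
\begin{cases}
  v_{kj} & \text{if } v_{kj} > \bar{v}_k \\
  0      & \text{otherwise}
\end{cases}

% Sentence score: sum of the surviving concept weights
\ell_j = \sum_{k=1}^{r} \tilde{v}_{kj}
```

    Under this reading, the sentences with the largest scores are selected for the summary. If the 30% compression rate is taken to mean that roughly 30% of the n sentences are kept, the summary would contain about 0.3n sentences.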

    An Overview On Web Scraping Techniques And Tools

    Since the emergence of the WWW, the landscape of internet users and data exchange has been changing rapidly. As ordinary people joined the internet and started to use it, many new techniques were promoted to boost the network. At the same time, new technologies were introduced to enhance computers and network facilities, which in turn lowered the cost of hardware and website-related costs. Because of these changes, a large number of users have joined and use internet facilities. Daily use of the internet has made a tremendous amount of data available online. Businesses, academicians, and researchers all share their advertisements and information on the internet so that they can connect with people quickly and easily. As a result of exchanging, sharing, and storing data on the internet, a new problem has arisen: how to handle such data overload, and how users can access the best information with the least effort. To address this issue, researchers identified a technique called web scraping. Web scraping is an important technique used to generate structured data from unstructured data available on the web. The structured data generated by scraping are then stored in a central database and analyzed in spreadsheets. Traditional copy-and-paste, text grepping and regular expression matching, HTTP programming, HTML parsing, DOM parsing, web scraping software, vertical aggregation platforms, semantic annotation recognition, and computer vision web-page analyzers are some of the common techniques used for data scraping. Previously, most users relied on copy-and-paste to gather and analyze data from the internet, but it is tedious: the user copies large amounts of data and stores them in files on a computer. Compared with this technique, web scraping software is the easiest scraping technique. Nowadays, many software tools for web scraping are available on the market. Our paper gives an overview of web scraping as an information extraction technique, the different techniques of web scraping, and some of the recent tools used for web scraping.
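    Of the techniques listed, regular expression matching is the simplest to show in a few lines. The sketch below is an illustration under stated assumptions, not an example from the paper: the page fragment, the pattern, and the `extract_prices` name are all hypothetical.

```python
import re

# Hypothetical page fragment; the markup and prices are invented for illustration.
PAGE = "<p>Laptop: <b>$899.00</b></p><p>Mouse: <b>$15.25</b></p>"

# Capture the label before the colon and the dollar amount inside <b>...</b>.
PRICE_PATTERN = re.compile(r"<p>([^:<]+):\s*<b>\$(\d+(?:\.\d{2})?)</b></p>")


def extract_prices(page):
    """Return (item, price) pairs found by regular expression matching."""
    return [
        (name.strip(), float(amount))
        for name, amount in PRICE_PATTERN.findall(page)
    ]


prices = extract_prices(PAGE)
```

    A pattern like this breaks as soon as the markup changes shape, for example when tags are nested differently or attributes are added, which is one reason HTML and DOM parsing appear as separate techniques in the list.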

    Web monitoring tool to support trend observatories: a Lattes case study.

    The internet currently produces a large volume of data every day. These data are used strategically to monitor a variety of subjects of interest to trend observatories. The observatories use Information and Communication Technology (ICT) to support the diagnoses they carry out. One of the ICT techniques used for this purpose is web scraping. The aim of this work is to propose software for monitoring public internet data based on the research themes of Embrapa Agroenergia. The Lattes curriculum database was used in the experiment to develop specialist software in two stages. The first stage consisted of extracting data from the internet through web scraping, and the second stage processed and transformed the raw data into information. The results provided insights that enable the identification of areas of interest, collaboration patterns, and the mapping of scientific production in Brazil.

    Designing a Web-Based E-Working Paper System Model to Improve the Quality of the Audit Process

    Audit plays a very significant role in controlling business processes across sectors. In this article, the main problem of the audit process is the still-low efficiency and effectiveness of current audit practice in Indonesia. Limited information technology is one key factor behind the less-than-optimal audit process. Few developments of digital, integrated audit systems applicable in Indonesia can be found. Problems such as the high license fees of audit system applications from abroad ultimately force many auditors to keep using very limited desktop-based audit systems. Therefore, this article provides a model analysis for developing a web-based electronic audit working paper system that will improve the quality of the audit process in Indonesia without requiring high license fees. The results of the study are a detailed proposal of the framework, the model, and the work plan for designing web-based electronic audit working papers that can be applied directly.

    Visualizing networks defined by links in Wikipedia articles

    Wikipedia is one of the most popular websites on the Internet, with more than 40 million articles across the languages in which it is available. Links in Wikipedia articles point to related articles and express the connections and dependencies between articles. In this context, this work presents an application to visualize a knowledge graph defined by links in English Wikipedia articles, starting from a base article. Since the number of links in a Wikipedia article is generally very large, the application uses a natural criterion for selecting links according to their relevance. Moreover, the graph nodes have hyperlinks to Wikipedia articles, which gives an alternative way to browse Wikipedia through a graph.
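    To make the idea of a link-defined graph concrete, here is a minimal sketch of a breadth-first expansion from a base article. Everything in it is an assumption for illustration: the `LINKS` data is invented, the `build_link_graph` name and its parameters are hypothetical, and "keep the first few links" is only a placeholder for the relevance criterion, which the abstract does not specify.

```python
from collections import deque

# Hypothetical link data standing in for links scraped from article pages.
LINKS = {
    "Graph theory": ["Leonhard Euler", "Vertex (graph theory)", "Mathematics"],
    "Leonhard Euler": ["Mathematics", "Basel"],
    "Vertex (graph theory)": ["Graph theory"],
    "Mathematics": ["Leonhard Euler"],
}


def build_link_graph(start, links, max_links=2, max_depth=2):
    """Breadth-first expansion from `start`, keeping at most `max_links` links per article."""
    edges = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        article, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand articles at the depth limit
        # Placeholder relevance criterion (an assumption): keep the first links only.
        for target in links.get(article, [])[:max_links]:
            edges.append((article, target))
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen, edges


nodes, edges = build_link_graph("Graph theory", LINKS)
```

    The resulting `nodes` and `edges` could be handed to any graph-drawing library; capping the links per article keeps the graph small enough to read.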


    In this era of rapid industrial development, managing the jobs in a project has become very convenient. Managers who create a project can now follow it up (to completion and reporting) online. Likewise, developers who receive a task want to finish it and report it to their manager easily and immediately. What makes this process much easier is that all of these activities can be done on smartphones. In practice, however, managers assign tasks to developers without knowing whether the developers can complete them on time. Therefore, this paper proposes the development of a Project Management application. The application aims to ensure that a project finishes on time. It can also serve as a platform for collecting proof of finished projects, since an upload function is available. This function lets users collect files such as photos, documents, or PDFs.

    Web Data Extraction in Audit Data Analytics: Developing Technology Artifacts from a Design Science Research Perspective

    The growing implementation of Information and Communication Technology (ICT) as part of organizations' internal control has pushed auditors to develop audit data analytics (ADA) as a framework of knowledge and practice for obtaining audit evidence and other information from sets of electronic data across all stages of audit work. At the same time, organizations increasingly present their data through web-based applications. With web pages serving as a data source (audit evidence), a technique for extracting data from web pages, known as web data extraction, has developed. Using the design science research methodology, this study proposes artifacts in the form of a model and an instantiation of web data extraction for implementing ADA. The results are expected to serve as an additional reference in audit practice: an artifact in the form of an instantiation of web data extraction for acquiring data as audit evidence from web pages, whether from intranet-based or internet-based applications. The study also contributes a practical framework for implementing web data extraction as part of ADA in carrying out audit work. In addition, the results are expected to become a reference for the use of the design science research methodology, which has not yet been widely applied in audit research in Indonesia.

    Monitoring System For Academic Activity Of Muhammadiyah Vocational School In Central Java

    Muhammadiyah is an Islamic organization with several levels of institutional networks, one of which is the Institution Center of Muhammadiyah in Central Java. It sets policy for all activities in Central Java province, including in the field of education. In implementing its education policy, the institution still has difficulty monitoring the outcomes of academic activities at Muhammadiyah Vocational Schools throughout Central Java. Examples include determining which schools should be given scholarships and which schools need repairs, as well as obtaining information about students, teachers and education personnel, and the condition of school facilities and infrastructure. This happens because there is no computerized system for monitoring academic activities, so the development of Muhammadiyah Vocational Schools has been less than optimal. Therefore, this study builds a web-based monitoring system. The system is built using the waterfall method, in which each phase must be completed before proceeding to the next. The monitoring system is expected to help the institution monitor the outcomes of academic activities at Muhammadiyah Vocational Schools. In addition, the system can help the general public search for basic education data and partial data on Muhammadiyah Vocational Schools throughout Central Java.