22,390 research outputs found

    Web document classification using topic modeling based document ranking

    In this paper, we propose a web document ranking method that uses topic modeling for effective information collection and classification. The proposed document ranking technique is applied to avoid duplicated crawling when crawling at high speed. Through it, redundant documents can be removed, documents can be classified efficiently, and the operation of the crawler service can be confirmed. The method enables rapid collection of large numbers of web documents, so users can efficiently search web pages whose data are constantly updated. In addition, the efficiency of data retrieval improves because new information is classified and delivered automatically. By extending the method to big-data-scale web pages and adapting it to a wider range of websites, we expect that more effective information retrieval will become possible.
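    The abstract does not name a concrete topic model or similarity measure, so the following is only a minimal sketch of topic-based ranking with de-duplication, assuming LDA from scikit-learn and cosine similarity over document-topic vectors; the function name, its parameters, and the duplicate threshold are illustrative, not the authors' implementation.

```python
# Sketch: rank crawled pages by topic similarity to a target document and
# drop near-duplicates.  LDA and the 0.95 duplicate threshold are assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

def rank_and_deduplicate(docs, target_doc, n_topics=20, dup_threshold=0.95):
    """Return (doc, score) pairs ranked by topic similarity, duplicates removed."""
    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs + [target_doc])
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    theta = lda.fit_transform(X)               # document-topic distributions
    doc_vecs, target_vec = theta[:-1], theta[-1:]

    scores = cosine_similarity(doc_vecs, target_vec).ravel()
    order = scores.argsort()[::-1]              # most relevant first

    kept, kept_vecs = [], []
    for i in order:
        # skip documents whose topic profile nearly matches one already kept
        if kept_vecs and cosine_similarity([doc_vecs[i]], kept_vecs).max() > dup_threshold:
            continue
        kept.append((docs[i], float(scores[i])))
        kept_vecs.append(doc_vecs[i])
    return kept
```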

    PEMBANGUNAN APLIKASI PENGUMPUL BERITA DARI MEDIA DARING MENGGUNAKAN WEB FRAMEWORK CODEIGNITER DAN FLASK

    Advances in information and communication technology have made access to news faster. Gathering news from online media is important for supporting various interests, and it needs to be automated to be efficient and effective. In this research, a news collector application is built to retrieve news from online news sites. The application is built using two web frameworks, Codeigniter and Flask. In addition, the Python-based Scrapy package is used as the tool for web crawling and web scraping. Black-box testing is used to evaluate system functionality, and based on its results it can be concluded that all functions of the news collector application run as the researchers expected.
    Keywords: news, web scraping, web crawling, Flask, Codeigniter, Scrapy
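    For illustration, a minimal Scrapy spider in the spirit of the described news collector is sketched below; the start URL and CSS selectors are hypothetical placeholders, since the abstract does not publish the application's spiders.

```python
# Illustrative news spider: follows article links from an index page and
# yields title/body items.  Site URL and selectors are assumptions.
import scrapy

class NewsSpider(scrapy.Spider):
    name = "news"
    start_urls = ["https://example-news-site.test/latest"]   # placeholder

    def parse(self, response):
        # extract headline links from the index page (selector is assumed)
        for href in response.css("article h2 a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "body": " ".join(response.css("div.article-body p::text").getall()),
        }
```

    A spider like this can be run standalone with `scrapy runspider news_spider.py -o news.json`, which is one way the Flask side of such an application could consume the collected items.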

    Analysis on Web Crawling Algorithms

    The World Wide Web (WWW), also referred to as the web, acts as a vital source of information, and searching it has become much easier nowadays thanks to search engines such as Google and Yahoo. A search engine is essentially a complex multi-program system that allows users to search for information available on the web, and for that purpose it uses web crawlers. A web crawler systematically browses the World Wide Web. Effective search helps avoid downloading and visiting irrelevant web pages, and to achieve this, web crawlers use different search algorithms. This paper reviews the different web crawling algorithms that determine the fate of a search system.
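    Since the review discusses crawling algorithms only in general terms, the sketch below contrasts the two frontier disciplines such surveys most often compare: breadth-first search over a FIFO queue and best-first search over a priority queue ordered by a relevance score. The `neighbours` and `score` callables are placeholders for a link extractor and a page-scoring function.

```python
# Sketch of two crawl-frontier strategies; both callables are assumptions.
import heapq
from collections import deque

def breadth_first_order(seed, neighbours):
    """Visit pages in FIFO order starting from seed."""
    seen, queue, order = {seed}, deque([seed]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        for nxt in neighbours(url):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

def best_first_order(seed, neighbours, score):
    """Visit pages in order of descending score(url)."""
    seen, heap, order = {seed}, [(-score(seed), seed)], []
    while heap:
        _, url = heapq.heappop(heap)
        order.append(url)
        for nxt in neighbours(url):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(heap, (-score(nxt), nxt))
    return order
```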

    An Improved Focused Crawler: Using Web Page Classification and Link Priority Evaluation

    A focused crawler is topic-specific and aims to selectively collect web pages relevant to a given topic from the Internet. However, the performance of current focused crawlers easily suffers from the surrounding page environment and from multi-topic web pages: in the crawling process, a highly relevant region may be ignored owing to the low overall relevance of its page, and anchor text or link context may misguide the crawler. To solve these problems, this paper proposes a new focused crawler. First, we build a web page classifier based on an improved term-weighting approach (ITFIDF) in order to obtain highly relevant web pages. In addition, this paper introduces a link evaluation approach, link priority evaluation (LPE), which combines a web page content block partition algorithm with a joint feature evaluation (JFE) strategy to better judge the relevance between the URLs on a web page and the given topic. The experimental results demonstrate that the classifier using ITFIDF outperforms TFIDF, and that our focused crawler is superior to focused crawlers based on breadth-first, best-first, anchor text only, link context only, and content block partition in terms of harvest rate and target recall. In conclusion, our methods are significant and effective for focused crawling.
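    The paper's ITFIDF weighting and JFE combination are not detailed in the abstract, so the sketch below only illustrates the general link-priority idea with plain TF-IDF: each outgoing link is scored by the topical similarity of its anchor text and of its surrounding content block, combined with an assumed fixed weight.

```python
# Sketch of link-priority scoring; the anchor/block weighting is an assumption,
# not the paper's LPE/JFE formulation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def link_priorities(topic_description, links, anchor_weight=0.6):
    """links: list of (url, anchor_text, block_text).  Returns (priority, url) pairs."""
    anchors = [a for _, a, _ in links]
    blocks = [b for _, _, b in links]
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform([topic_description] + anchors + blocks)
    topic_vec = X[0]
    n = len(links)
    anchor_sim = cosine_similarity(X[1:1 + n], topic_vec).ravel()
    block_sim = cosine_similarity(X[1 + n:], topic_vec).ravel()
    scores = anchor_weight * anchor_sim + (1 - anchor_weight) * block_sim
    return sorted(zip(scores, (u for u, _, _ in links)), reverse=True)
```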

    Crawling deep web entity pages

    Deep-web crawl is concerned with the problem of surfacing hidden content behind search interfaces on the Web. While many deep-web sites maintain document-oriented textual content (e.g., Wikipedia, PubMed, Twitter, etc.), which has traditionally been the focus of the deep-web literature, we observe that a significant portion of deep-web sites, including almost all online shopping sites, curate structured entities as opposed to text documents. Although crawling such entity-oriented content is clearly useful for a variety of purposes, existing crawling techniques optimized for document-oriented content are not best suited for entity-oriented sites. In this work, we describe a prototype system we have built that specializes in crawling entity-oriented deep-web sites. We propose techniques tailored to tackle important subproblems including query generation, empty page filtering and URL deduplication in the specific context of entity-oriented deep-web sites. These techniques are experimentally evaluated and shown to be effective.
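    As a rough illustration of two of the listed subproblems, the sketch below applies assumed heuristics rather than the prototype's actual rules: URL deduplication by canonicalising parameter order and dropping tracking parameters, and empty-page filtering by scanning for a site-specific "no results" marker.

```python
# Sketch of URL deduplication and empty-page filtering for an entity-oriented
# deep-web crawl; the ignored parameters and markers are assumptions.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

IGNORED_PARAMS = {"utm_source", "utm_medium", "sessionid", "ref"}   # assumed

def canonical_url(url):
    """Normalise a result-page URL so trivially different variants collapse."""
    parts = urlsplit(url)
    query = sorted((k, v) for k, v in parse_qsl(parts.query)
                   if k not in IGNORED_PARAMS)
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                       urlencode(query), ""))

def is_empty_result_page(html, markers=("no results", "0 items found")):
    """Heuristic empty-page filter; the markers are site-specific assumptions."""
    text = html.lower()
    return any(m in text for m in markers)

seen = set()

def should_crawl(url):
    """Skip URLs whose canonical form has already been fetched."""
    key = canonical_url(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```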

    Practical Guides for Data Retrieval in Deep Web Crawling

    Deep web crawling refers to the process of collecting documents that have been organized into a data source and can only be retrieved via a search interface, usually by sending a series of different queries to that interface. To deal with the difficulty of selecting a suitable set of queries, the crawling process can be implemented as stepwise refinement: documents are retrieved step by step, and in each step the query selection is adapted to the knowledge accumulated from the documents downloaded in previous steps. However, downloading the documents and learning from the resulting sample in order to improve the query selection takes considerable time and effort. Here we propose a cost-effective, data-driven method for setting the steps of this adaptive crawl of the deep web. Through an empirical study, we explore criteria for setting the lengths of the steps to best balance the trade-off between the cost of updating the sample and the improved quality of the selected queries. Derived from four existing data sets typically used for deep web crawling, these criteria provide practical guidelines for cost-effective stepwise refinement in iterative document retrieval.
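    The loop below is only a schematic of stepwise refinement, under assumptions the abstract does not state: a fixed step length, a guessed corpus-size estimate, and a greedy score that projects a term's frequency in the accumulated sample onto the whole source and subtracts the matches already downloaded. The paper's contribution is precisely the criteria for choosing the step length, which is hard-coded here.

```python
# Schematic stepwise query selection for deep web crawling; the scoring rule,
# step size, and corpus size estimate are assumptions for illustration.
def adaptive_crawl(search, candidate_terms, corpus_size_estimate=100_000,
                   step_size=5, budget=100):
    """search(term) -> iterable of (doc_id, doc_text) from the hidden source."""
    downloaded = {}                               # doc_id -> text
    remaining = set(candidate_terms)

    def expected_new_docs(term):
        # project the term's sample frequency onto the whole source, then
        # subtract the matching documents we already hold
        have = sum(term in text.lower() for text in downloaded.values())
        if not downloaded:
            return 1.0
        projected_total = have / len(downloaded) * corpus_size_estimate
        return projected_total - have

    issued = 0
    while issued < budget and remaining:
        # one refinement step: issue a small batch of queries, then re-learn
        # from the enlarged sample before choosing the next batch
        batch = sorted(remaining, key=expected_new_docs, reverse=True)[:step_size]
        for term in batch:
            remaining.discard(term)
            for doc_id, text in search(term):
                downloaded[doc_id] = text
            issued += 1
    return downloaded
```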

    CALA: Classifying Links Automatically based on their URL

    Web page classification refers to the problem of automatically assigning a web page to one or more classes after analysing its features. Automated web page classifiers have many applications, and many researchers have proposed techniques and tools to perform web page classification. Unfortunately, the existing tools have a number of drawbacks that make them unappealing for real-world scenarios, namely: they require a previous extensive crawling, they are supervised, they need to download a page before classifying it, or they are site-, language-, or domain-dependent. In this article, we propose CALA, a tool for URL-based web page classification. The strongest features of our tool are that it does not require a previous extensive crawling to achieve good classification results, it is unsupervised, it is based exclusively on URL features, which means that pages can be classified without downloading them, and it is site-, language-, and domain-independent, which makes it generally applicable. We have validated our tool with 22 real-world web sites from multiple domains and languages, and our conclusion is that CALA is very effective and efficient in practice.
    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Ciencia e Innovación TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-E; Ministerio de Economía y Competitividad TIN2013-40848-
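    As a toy illustration of URL-only classification in the spirit of CALA (its actual pattern-building algorithm is more elaborate), the sketch below reduces URLs to coarse path patterns, generalising numeric or long opaque segments to a wildcard, and groups pages by pattern so they can be bucketed without being downloaded.

```python
# Toy URL-pattern grouping; the wildcard heuristic is an assumption, not CALA's rule.
import re
from collections import defaultdict
from urllib.parse import urlsplit

def url_pattern(url):
    """Map a URL to a coarse pattern over its path segments."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    generalised = []
    for seg in segments:
        if re.fullmatch(r"\d+", seg) or len(seg) > 20:
            generalised.append("*")          # likely an identifier
        else:
            generalised.append(seg.lower())
    return "/" + "/".join(generalised)

def group_by_pattern(urls):
    groups = defaultdict(list)
    for url in urls:
        groups[url_pattern(url)].append(url)
    return dict(groups)

# Example: product detail pages and listing pages fall into different buckets.
urls = [
    "https://shop.test/product/12345",
    "https://shop.test/product/67890",
    "https://shop.test/category/books",
]
print(group_by_pattern(urls))
```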