
    Automatic Genre Classification in Web Pages Applied to Web Comments

    Automatic Web comment detection could significantly facilitate information retrieval systems, e.g., a focused Web crawler. In this paper, we propose a text genre classifier for Web text segments as an intermediate step for Web comment detection in Web pages. Different feature types and classifiers are analyzed for this purpose. We compare this two-level approach to state-of-the-art techniques operating on the whole Web page text and show that accuracy can be improved significantly. Finally, we illustrate the applicability for information retrieval systems by evaluating our approach on Web pages retrieved by a Web crawler.
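    A pipeline like this can be sketched with standard text-classification tools. The following is a minimal illustration, not the authors' implementation: the example segments, the "comment"/"other" labels, and the choice of TF-IDF word n-grams with a logistic regression classifier are all assumptions (the paper compares several feature types and classifiers).

        # Minimal sketch of a genre classifier for Web text segments,
        # assuming labeled training segments are available.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        # Hypothetical training data: text segments cut out of Web pages.
        segments = [
            "Great article, thanks for sharing!",
            "Copyright 2014 Example Corp. All rights reserved.",
            "I disagree with the author's second point.",
            "Posted in: News, Technology",
        ]
        labels = ["comment", "other", "comment", "other"]

        clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                            LogisticRegression())
        clf.fit(segments, labels)

        print(clf.predict(["Nice post, I learned a lot."]))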

    Aplikasi Web Crawler Untuk Web Content Pada Mobile Phone

    Crawling is the process behind a search engine that traverses the World Wide Web in a structured manner and according to certain ethics. The application that runs the crawling process is called a Web crawler, also known as a web spider or web robot. The growth of mobile search service providers has been followed by the growth of web crawlers that can browse web pages of the mobile content type. The Web crawler application can be accessed by mobile devices, and only web pages of the Mobile Content type are explored by the Web crawler, whose duty is to collect a number of Mobile Content pages. A mobile application functions as a search application that uses the results from the Web crawler. The Web crawler server consists of a Servlet, a Mobile Content Filter, and a datastore. The Servlet is the connection gateway between the client and the server. The datastore is the storage medium for the crawling results. The Mobile Content Filter selects web pages: only web pages appropriate for mobile devices, i.e., with mobile content, are forwarded.
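    The Mobile Content Filter's role can be illustrated with a small page check. This is only a sketch under assumed heuristics: the signals below (a WML doctype, the XHTML Mobile Profile doctype, a viewport meta tag) are this sketch's assumptions, not the filter rules from the paper.

        # Sketch of a Mobile Content Filter: decide whether a fetched
        # page is suitable for mobile devices, so the crawler forwards
        # only mobile content to the datastore. The heuristics are
        # assumed, not taken from the paper.
        import re

        MOBILE_SIGNALS = [
            r"<!DOCTYPE\s+wml",                    # WML pages
            r"XHTML\s+Mobile\s+Profile",           # XHTML-MP doctype
            r'<meta[^>]+name=["\']viewport["\']',  # mobile viewport tag
        ]

        def is_mobile_content(html: str) -> bool:
            return any(re.search(p, html, re.IGNORECASE)
                       for p in MOBILE_SIGNALS)

        page = ('<html><head><meta name="viewport" '
                'content="width=device-width"></head></html>')
        print(is_mobile_content(page))  # True -> forward to datastore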

    Improving Data Collection on Article Clustering by Using Distributed Focused Crawler

    Collecting or harvesting data from the Internet is often done using a web crawler. A general web crawler can be developed to focus on a certain topic; this type of web crawler is called a focused crawler. Although a focused crawler makes efficient use of network bandwidth and storage capacity, creating one is not enough to improve data-collection performance. This research proposes a distributed focused crawler in order to improve web crawler performance while remaining efficient in network bandwidth and storage capacity. The distributed focused crawler implements crawl scheduling, site ordering to determine the URL queue, and focused crawling using Naïve Bayes. This research also tests web crawling performance with multiple threads, observing CPU and memory utilization. The conclusion is that web crawling performance decreases when too many threads are used: CPU and memory utilization become very high, while the performance of the distributed focused crawler becomes low.
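    The core of such a focused crawler (a prioritized URL queue plus a Naïve Bayes relevance test) can be sketched as below. This is a single-process illustration under assumed training data and site scores; the paper's crawl scheduling, site ordering, and distribution across workers are omitted, and fetch() is a stub standing in for a real HTTP request. In a multithreaded setting, several worker threads would run this loop over a shared frontier, which is exactly where too many threads start to hurt CPU and memory utilization.

        # Sketch of a focused crawler's core loop: a priority queue
        # orders URLs (lower score = crawled earlier), and a Naive
        # Bayes classifier keeps only on-topic pages. Training data,
        # scores, and URLs are hypothetical.
        import heapq
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline

        train_texts = ["article clustering with machine learning",
                       "celebrity gossip and lifestyle news"]
        train_labels = ["relevant", "irrelevant"]
        topic_clf = make_pipeline(CountVectorizer(), MultinomialNB())
        topic_clf.fit(train_texts, train_labels)

        frontier = [(0.1, "http://example.org/clustering"),
                    (0.9, "http://example.org/gossip")]
        heapq.heapify(frontier)

        def fetch(url):
            # Stub for an HTTP fetch (urllib, requests, ...).
            return "a survey of article clustering with machine learning"

        while frontier:
            score, url = heapq.heappop(frontier)
            if topic_clf.predict([fetch(url)])[0] == "relevant":
                print("store:", url)  # would go to the datastore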

    Web crawler research methodology

    In economic and social sciences it is crucial to test theoretical models against reliable and sufficiently large databases. The general research challenge is to build a well-structured database that suits the given research question well and is cost-efficient at the same time. In this paper we focus on crawler programs, which have proved to be an effective tool for database building in very different problem settings. First we explain how crawler programs work and illustrate a complex research process mapping business relationships using social media information sources; in this case we show how search robots can be used to collect data for mapping complex network relationships that characterize business relationships in a well-defined environment. We then extend the case and present a framework of three structurally different research models where crawler programs can be applied successfully: exploration, classification, and time series analysis. In the case of exploration, we present findings about the Hungarian web agency industry, for which no previous statistical data were available. For classification, we show how the top visited Hungarian web domains can be divided into predefined categories of e-business models. In the third study we used a crawler to gather the values of concrete pre-defined records containing ticket prices of low-cost airlines from one single site. Based on these experiences we highlight some conceptual conclusions and opportunities of crawler-based research in e-business. Keywords: e-business research, web search, web crawler, Hungarian web, social network analysis
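    The third research model (time-series collection of pre-defined records) can be illustrated with a short scraping loop. Everything site-specific below is a placeholder: the abstract does not disclose the airline site, its URL, or its markup, so the URL and the ".price" selector are hypothetical.

        # Sketch of time-series data collection: fetch one page and
        # append its pre-defined records (here, ticket prices) to a
        # CSV file with a timestamp. URL and selector are placeholders.
        import csv
        from datetime import datetime, timezone

        import requests
        from bs4 import BeautifulSoup

        URL = "https://airline.example/fares"  # hypothetical site

        def collect_prices():
            html = requests.get(URL, timeout=10).text
            soup = BeautifulSoup(html, "html.parser")
            # Assumed markup: one element per fare, class "price".
            return [el.get_text(strip=True)
                    for el in soup.select(".price")]

        with open("fares.csv", "a", newline="") as f:
            writer = csv.writer(f)
            stamp = datetime.now(timezone.utc).isoformat()
            for price in collect_prices():
                writer.writerow([stamp, price])
        # Repeating this run on a schedule (e.g., cron) builds the
        # time series of prices.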

    Crawling on the World Wide Web

    As the World Wide Web grows rapidly, a web search engine is needed for people to search through the Web. The crawler is an important module of a web search engine, and its quality directly affects the searching quality of the search engine. Given some seed URLs, the crawler should retrieve the web pages of those URLs, parse the HTML files, add the new URLs into its buffer, and go back to the first phase of this cycle. The crawler can also retrieve other information from the HTML files while parsing them for new URLs. This paper describes the design, implementation, and some considerations of a new crawler, programmed as a learning exercise and for possible use in experimental studies.
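    The crawl cycle described above (take a URL from the buffer, fetch and parse the page, enqueue newly found URLs, repeat) can be written down in a few lines. This is a generic sketch, not the paper's crawler; the seed URL and the page budget are arbitrary.

        # Sketch of the basic crawl cycle: pop a URL, fetch the page,
        # parse out links, push unseen absolute links back into the
        # buffer, and repeat until the budget is spent.
        from collections import deque
        from html.parser import HTMLParser
        from urllib.parse import urljoin
        from urllib.request import urlopen

        class LinkParser(HTMLParser):
            def __init__(self):
                super().__init__()
                self.links = []
            def handle_starttag(self, tag, attrs):
                if tag == "a":
                    self.links += [v for k, v in attrs
                                   if k == "href" and v]

        frontier = deque(["https://example.org/"])  # arbitrary seed
        seen = set(frontier)
        budget = 10                                 # arbitrary limit

        while frontier and budget > 0:
            url = frontier.popleft()
            try:
                html = urlopen(url, timeout=10).read().decode(
                    "utf-8", "replace")
            except OSError:
                continue                            # skip unreachable pages
            budget -= 1
            parser = LinkParser()
            parser.feed(html)
            for href in parser.links:
                absolute = urljoin(url, href)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
            print("crawled:", url)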

    Aplikasi Web Crawler Berdasarkan Breadth First Search dan Back-Link

    A Web crawler, also often known as a Web spider or Web robot, is one of the important components of a modern search engine. The main function of a Web crawler is to browse and retrieve the Web pages available on the Internet. This paper presents an experimental comparison of crawler traversal algorithms using Breadth First Search and backlink count. The tests are based on the sites www.dmoz.org and dir.yahoo.com.
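    The two strategies differ only in how the frontier is ordered: Breadth First Search uses a FIFO queue, while the backlink-count variant prefers pages that many already-crawled pages link to. The sketch below contrasts the two orderings on a toy link graph, since fetching www.dmoz.org and dir.yahoo.com live is out of scope here.

        # Sketch contrasting the two frontier orderings compared in
        # the paper: FIFO (Breadth First Search) versus ordering by
        # backlink count accumulated during the crawl. The link graph
        # is a toy stand-in for real fetched pages.
        from collections import deque

        graph = {"A": ["B", "C"], "B": ["C", "D"],
                 "C": ["D"], "D": []}

        def bfs_order(seed):
            frontier, seen, order = deque([seed]), {seed}, []
            while frontier:
                page = frontier.popleft()
                order.append(page)
                for nxt in graph[page]:
                    if nxt not in seen:
                        seen.add(nxt)
                        frontier.append(nxt)
            return order

        def backlink_order(seed):
            backlinks = {p: 0 for p in graph}
            frontier, visited, order = {seed}, set(), []
            while frontier:
                # Crawl the frontier page with most known backlinks.
                page = max(frontier, key=lambda p: backlinks[p])
                frontier.remove(page)
                visited.add(page)
                order.append(page)
                for nxt in graph[page]:
                    backlinks[nxt] += 1
                    if nxt not in visited:
                        frontier.add(nxt)
            return order

        print(bfs_order("A"))       # FIFO: A, B, C, D
        print(backlink_order("A"))  # backlink count decides order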

    Web Crawler

    This bachelor's thesis focuses on the design and implementation of an application that automates the browsing of web pages. The main point of the thesis is a detailed design of a real application. The implementation focuses on the use of frameworks and object-oriented programming. The conclusion evaluates the implementation and suggests further extensions of the application.