
9 research outputs found

Scuttling Web Opportunities By Application Cramming

Author: Alahari Hanumat Prasad
Dhulipalla Vijaya Sree
Publication venue: Kakinada Institute of Engineering and Technology for Women
Publication date: 28/10/2014

The web holds a vast amount of data spread across innumerable websites, which are monitored by a tool or program known as a crawler. This paper focuses on web forum crawling techniques: it discusses the various techniques used by web forum crawlers and the challenges of crawling, and gives an overview of web crawling and web forums. The internet is growing exponentially, and retrieving relevant information from it has become complicated. This rapid growth poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper, we present a novel method, Forum Crawler under Supervision (FoCUS), a supervised internet-scale forum crawler. The intention of FoCUS is to crawl relevant forum information from the internet with minimal overhead. The crawler selectively seeks out pages that are pertinent to a predefined set of topics, rather than collecting and indexing all accessible web documents in order to answer all possible ad hoc questions. FoCUS crawls the internet continuously and detects pages that have been added to or removed from it. Because of the growing and vibrant activity of the internet, it has become more challenging to navigate all the URLs in web documents and to handle them. We take one seed URL as input and search with a keyword; the search result is based on the keyword, and the crawler fetches the internet pages where it finds that keywor

International Journal of Science Engineering and Advance Technology (IJSEAT)
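
The seed-and-keyword procedure mentioned at the end of the abstract above can be illustrated with a minimal sketch. This is not the FoCUS implementation: it assumes a plain breadth-first traversal, and the function name `keyword_crawl`, the `fetch` callback, and the in-memory `PAGES` table (used here in place of real HTTP requests) are all hypothetical.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would download and parse pages instead.
PAGES = {
    "http://forum.example/": (
        "Welcome to the board",
        ["http://forum.example/t/1", "http://forum.example/t/2"],
    ),
    "http://forum.example/t/1": (
        "Tips for a focused crawler",
        ["http://forum.example/t/3"],
    ),
    "http://forum.example/t/2": ("Off-topic chat", []),
    "http://forum.example/t/3": (
        "Crawler politeness and robots.txt",
        ["http://forum.example/"],
    ),
}


def keyword_crawl(seed, keyword, fetch, max_pages=100):
    """Breadth-first crawl from `seed`; return URLs whose text contains `keyword`."""
    seen = {seed}            # URLs already queued, so no page is fetched twice
    queue = deque([seed])
    matches = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        page = fetch(url)    # assumed to return (text, links) or None
        visited += 1
        if page is None:
            continue
        text, links = page
        if keyword.lower() in text.lower():
            matches.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return matches


print(keyword_crawl("http://forum.example/", "crawler", PAGES.get))
```

Passing `fetch` as a parameter keeps the traversal logic separate from page retrieval, so the same loop could be pointed at a real downloader.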

Semantic URL Analytics to Support Efficient Annotation of Large Scale Web Archives

Author: Cardoso Jorge
Demidova Elena
Gossen Gerhard
Guerra Francesco
Holzmann Helge
Houben Geert-Jan
Pinto Alexandre Miguel
Risse Thomas
Souza Tarcisio
Szymanski Julian
Velegrakis Yannis
Publication venue: Berlin ; Heidelberg : Springer
Publication date: 01/01/2016

Long-term Web archives comprise Web documents gathered over longer time periods and can easily reach hundreds of terabytes in size. Semantic annotations such as named entities can facilitate intelligent access to the Web archive data. However, the annotation of the entire archive content on this scale is often infeasible. The most efficient way to access the documents within Web archives is provided through their URLs, which are typically stored in dedicated index files. The URLs of the archived Web documents can contain semantic information and can offer an efficient way to obtain initial semantic annotations for the archived documents. In this paper, we analyse the applicability of semantic analysis techniques such as named entity extraction to the URLs in a Web archive. We evaluate the precision of the named entity extraction from the URLs in the Popular German Web dataset and analyse the proportion of the archived URLs from 1,444 popular domains in the time interval from 2000 to 2012 to which these techniques are applicable. Our results demonstrate that named entity recognition can be successfully applied to a large number of URLs in our Web archive and provide a good starting point to efficiently annotate large scale collections of Web documents

Institutionelles Repositorium der Leibniz Universität Hannover
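
As a rough illustration of why URLs can carry semantic information, the sketch below splits a URL into host and path terms that a named entity recognizer could then inspect. It covers only this tokenization step and is not the authors' pipeline; the function name `url_terms` and the example URL are assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit, unquote


def url_terms(url):
    """Split a URL into (host terms, path terms) as candidate inputs for NER."""
    parts = urlsplit(url)
    # Host name labels, e.g. "www.example.de" -> ["www", "example", "de"].
    host_terms = [t for t in (parts.hostname or "").split(".") if t]
    # Decode percent-escapes, then split the path on any run of
    # non-word characters or underscores.
    path = unquote(parts.path)
    path_terms = [t for t in re.split(r"[\W_]+", path) if t]
    return host_terms, path_terms


host, path = url_terms("http://www.example.de/politik/angela-merkel-in-berlin.html")
print(host)
print(path)
```

Terms such as `angela`, `merkel`, and `berlin` in the path are the kind of tokens an entity extractor could pick up without downloading the archived page itself.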

CALA: Classifying Links Automatically based on their URL

Author: Corchuelo Gil Rafael
Hernández Salmerón Inmaculada Concepción
Rivero Carlos R.
Ruiz Cortés David
Publication venue: 'Elsevier BV'
Publication date: 01/01/2016

Web page classification refers to the problem of automatically assigning a web page to one or more classes after analysing its features. Automated web page classifiers have many applications, and many researchers have proposed techniques and tools to perform web page classification. Unfortunately, the existing tools have a number of drawbacks that make them unappealing for real-world scenarios, namely: they require a previous extensive crawling, they are supervised, they need to download a page before classifying it, or they are site-, language-, or domain-dependent. In this article, we propose CALA, a tool for URL-based web page classification. The strongest features of our tool are that it does not require a previous extensive crawling to achieve good classification results, it is unsupervised, it is based exclusively on URL features, which means that pages can be classified without downloading them, and it is site-, language-, and domain-independent, which makes it generally applicable. We have validated our tool with 22 real-world web sites from multiple domains and languages, and our conclusion is that CALA is very effective and efficient in practice.

Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Ciencia e Innovación TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-E; Ministerio de Economía y Competitividad TIN2013-40848-

Crossref

idUS. Depósito de Investigación Universidad de Sevilla

Crawling deep web entity pages

Author: Dong Xin
Nirav Shah
Sriram Rajaraman
Venkatesh Ganti
Yeye He
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2013

Deep-web crawl is concerned with the problem of surfacing hidden content behind search interfaces on the Web. While many deep-web sites maintain document-oriented textual content (e.g., Wikipedia, PubMed, Twitter, etc.), which has traditionally been the focus of the deep-web literature, we observe that a significant portion of deep-web sites, including almost all online shopping sites, curate structured entities as opposed to text documents. Although crawling such entity-oriented content is clearly useful for a variety of purposes, existing crawling techniques optimized for document-oriented content are not best suited for entity-oriented sites. In this work, we describe a prototype system we have built that specializes in crawling entity-oriented deep-web sites. We propose techniques tailored to tackle important subproblems including query generation, empty page filtering and URL deduplication in the specific context of entity-oriented deep-web sites. These techniques are experimentally evaluated and shown to be effective

CiteSeerX

Crossref

Performance improvement of user-generated data retrieval from the Web, based on adaptive intelligent methods

Author: Pavković Miloš
Publication venue: Универзитет у Београду, Електротехнички факултет
Publication date: 18/02/2021

User-generated content on Web forums is added much more often than it is deleted or changed, so targeting it during incremental crawling differs from crawling ordinary Web site pages. Adding new content to a forum can move existing content to new or existing pages. Incremental forum crawling is not a trivial task, because ignoring the way in which content is presented, distributed and sorted can lead to the transfer of posts that were already indexed in previous crawl cycles. On the other hand, there is a wide spectrum of forum technologies that offer different navigational paths to their latest posts, as well as different ways of presenting and sorting user-generated content. This thesis presents the Structure-driven Incremental Forum crawler (SInFo), which specializes in targeting the latest content in incremental forum crawling using advanced optimization techniques and machine learning. The main goal of the presented system is to avoid already indexed content in new crawl cycles, regardless of the forum technology. To achieve this, the following Web forum features are used: (1) the sort method on the index and thread pages and (2) the available navigation paths between pages that the current Web forum technology offers. Because the date of content creation plays an important role in determining the type of sort, detecting and normalizing dates is not a trivial task. Machine learning models were used for this task, because the generated dates can be in different formats and in different languages. The detection of navigational paths, in turn, is achieved by interpreting the format of URLs and scanning the pages they target. It has been shown that, when targeting pages with the latest content, the proposed methods and techniques minimize the number of duplicate content downloads and maximize the utilization of the navigational structure and paths of the current forum technology. The experiments were performed on a wide range of existing popular forum technologies as well as on individual stand-alone forum technologies. SInFo demonstrated high precision and a minimal number of duplicate content transfers in each new crawl cycle. Most of the duplicates that the proposed system encountered came from pages that had to be visited in order to correctly determine the navigational path or to find the appropriate URL. Additionally, the machine learning models, although complex, achieved good performance while crawling and have high accuracy in date detection and normalization, reaching an F1-measure of 99%

National Repository of Dissertations in Serbia (NaRDuS)

Nardus

Learning URL patterns for webpage de-duplication

Author
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2010

Crossref

Cross-domain Recommendations based on semantically-enhanced User Web Behavior

Author: Hoxha Julia
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2014

Information seeking on the Web can be facilitated by recommender systems that guide users in a personalized manner to relevant resources in the large space of possible options on the Web. This work investigates how to model people's Web behavior at multiple sites and learn to predict future preferences, in order to generate relevant cross-domain recommendations. This thesis contributes novel techniques for building cross-domain recommender systems in an open Web setting

KITopen