9 research outputs found

    Scuttling Web Opportunities By Application Cramming

    The web holds a vast amount of data spread across innumerable websites, which are traversed by a tool or program known as a crawler. This paper focuses on web forum crawling techniques: it discusses the various techniques used by web forum crawlers and the challenges of crawling, and it gives an overview of web crawling and web forums. The internet is growing exponentially, and it has become difficult to retrieve relevant information from it. This rapid growth poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper, we present a novel method, Forum Crawler under Supervision (FoCUS), a supervised internet-scale forum crawler. FoCUS is intended to crawl relevant forum information from the internet with minimal overhead: it selectively seeks out pages that are pertinent to a predefined set of topics, rather than collecting and indexing all accessible web documents in order to answer all possible ad-hoc questions. FoCUS crawls the internet continuously and finds pages that have been added to it or removed from it. Because of the growing and vibrant activity of the internet, it has become more challenging to navigate and handle all the URLs in web documents. We take one seed URL as input and search with a keyword; the search result is based on the keyword, and it will fetch the internet pages where it will find that keywor

    Semantic URL Analytics to Support Efficient Annotation of Large Scale Web Archives

    Long-term Web archives comprise Web documents gathered over longer time periods and can easily reach hundreds of terabytes in size. Semantic annotations such as named entities can facilitate intelligent access to the Web archive data. However, the annotation of the entire archive content on this scale is often infeasible. The most efficient way to access the documents within Web archives is provided through their URLs, which are typically stored in dedicated index files. The URLs of the archived Web documents can contain semantic information and can offer an efficient way to obtain initial semantic annotations for the archived documents. In this paper, we analyse the applicability of semantic analysis techniques such as named entity extraction to the URLs in a Web archive. We evaluate the precision of the named entity extraction from the URLs in the Popular German Web dataset and analyse the proportion of the archived URLs from 1,444 popular domains in the time interval from 2000 to 2012 to which these techniques are applicable. Our results demonstrate that named entity recognition can be successfully applied to a large number of URLs in our Web archive and provide a good starting point to efficiently annotate large scale collections of Web documents
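
    The idea of mining URLs for semantic hints can be illustrated with a minimal Python sketch. It only covers a plausible preprocessing step, splitting a URL into word-like tokens that a named entity recogniser could then inspect; it is not the method evaluated in the paper. The function name `url_tokens`, the `STOP_TOKENS` set and the example URL are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit, unquote

# Hypothetical list of technical tokens that carry no semantic content.
STOP_TOKENS = {"www", "html", "htm", "php"}


def url_tokens(url: str) -> list[str]:
    """Split a URL's host, path and query into lowercase word-like tokens."""
    parts = urlsplit(url)
    text = " ".join([parts.netloc, unquote(parts.path), unquote(parts.query)])
    # Keep runs of letters (including non-ASCII letters); digits,
    # underscores and punctuation act as separators.
    tokens = [t.lower() for t in re.findall(r"[^\W\d_]+", text)]
    return [t for t in tokens if t not in STOP_TOKENS]


if __name__ == "__main__":
    print(url_tokens("http://www.example.de/politik/angela-merkel-in-berlin_2012.html"))
```

    The resulting tokens would still need to be passed to an entity recogniser of the reader's choice; the sketch makes no claim about which one the authors used.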

    CALA: Classifying Links Automatically based on their URL

    Web page classification refers to the problem of automatically assigning a web page to one or more classes after analysing its features. Automated web page classifiers have many applications, and many researchers have proposed techniques and tools to perform web page classification. Unfortunately, the existing tools have a number of drawbacks that make them unappealing for real-world scenarios, namely: they require a previous extensive crawling, they are supervised, they need to download a page before classifying it, or they are site-, language-, or domain-dependent. In this article, we propose CALA, a tool for URL-based web page classification. The strongest features of our tool are that it does not require a previous extensive crawling to achieve good classification results, it is unsupervised, it is based exclusively on URL features, which means that pages can be classified without downloading them, and it is site-, language-, and domain-independent, which makes it generally applicable. We have validated our tool with 22 real-world web sites from multiple domains and languages, and our conclusion is that CALA is very effective and efficient in practice. Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Ciencia e Innovación TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-E; Ministerio de Economía y Competitividad TIN2013-40848-

    Crawling deep web entity pages

    Deep-web crawl is concerned with the problem of surfacing hidden content behind search interfaces on the Web. While many deep-web sites maintain document-oriented textual content (e.g., Wikipedia, PubMed, Twitter, etc.), which has traditionally been the focus of the deep-web literature, we observe that a significant portion of deep-web sites, including almost all online shopping sites, curate structured entities as opposed to text documents. Although crawling such entity-oriented content is clearly useful for a variety of purposes, existing crawling techniques optimized for document-oriented content are not best suited for entity-oriented sites. In this work, we describe a prototype system we have built that specializes in crawling entity-oriented deep-web sites. We propose techniques tailored to tackle important subproblems including query generation, empty page filtering and URL deduplication in the specific context of entity-oriented deep-web sites. These techniques are experimentally evaluated and shown to be effective
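
    URL deduplication, one of the subproblems named above, can be illustrated with a small Python sketch. It uses a generic normalisation approach (lowercasing the host, sorting query parameters, dropping parameters assumed to be irrelevant) rather than the technique proposed in the paper, which the abstract does not describe. The parameter names in `IGNORED_PARAMS` and the example URLs are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters assumed not to change the entity a page shows.
IGNORED_PARAMS = {"sessionid", "ref", "utm_source"}


def canonical_url(url: str) -> str:
    """Return a normalised form of the URL to use as a deduplication key."""
    parts = urlsplit(url)
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    # Drop the fragment, lowercase scheme and host, keep the path unchanged.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(query), "")
    )


def deduplicate(urls: list[str]) -> list[str]:
    """Keep the first URL seen for each canonical form, preserving order."""
    seen: set[str] = set()
    unique: list[str] = []
    for url in urls:
        key = canonical_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


if __name__ == "__main__":
    print(deduplicate([
        "http://Shop.Example.com/item?id=7&ref=home",
        "http://shop.example.com/item?sessionid=abc&id=7",
    ]))
```

    Which parameters are safe to ignore depends on the site, so in practice that set would have to be learned or configured per site.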

    Performance improvement of user-generated data retrieval from the Web, based on adaptive intelligent methods

    User-generated content on Web forums is added much more often than it is deleted or changed, so targeting it during incremental crawling differs from crawling ordinary Web site pages. Adding new content to a forum can move existing content to new or existing pages. Incremental forum crawling is not a trivial task, because ignoring the way in which content is presented, distributed and sorted can lead to the transfer of posts that have already been indexed in previous crawl cycles. On the other hand, there is a wide spectrum of forum technologies that offer different navigational paths to their latest posts, as well as different ways of presenting and sorting user-generated content. This thesis presents the Structure-driven Incremental Forum crawler (SInFo), which specializes in targeting the latest content in incremental forum crawling using advanced optimization techniques and machine learning. The main goal of the presented crawler is to avoid already indexed content in new crawl cycles, regardless of the forum's technology. To achieve this, the following Web forum features are used: (1) the sort method on the index and thread pages and (2) the available navigation paths between pages that the current Web forum technology offers. The date of content creation plays an important role in determining the type of sort, but detecting and normalizing dates is not a trivial task. Machine learning models were used for this task, because the generated dates can be in different formats and in different languages. The detection of navigational paths, in turn, is achieved by interpreting the URL format and scanning the pages the URLs point to. It has been shown that using the proposed methods and techniques while targeting pages with the latest content minimizes the number of duplicate content downloads and maximizes the utilization of the navigational structure and paths of the current forum technology. The experiments were performed on a wide range of existing popular forum technologies as well as on individual stand-alone forum technologies. SInFo demonstrated high precision and a minimal number of duplicate content transfers in each new crawl cycle. Most of the duplicates that the proposed crawler encountered came from pages that had to be visited in order to correctly determine the navigational path or to find the appropriate URL. Additionally, the machine learning models, although complex, achieve good performance while crawling and have high accuracy in date detection and normalization, reaching an F1-measure of 99%
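
    The benefit of knowing the sort order can be illustrated with a minimal Python sketch. It assumes a forum whose pages list posts newest first and whose creation dates have already been detected and normalised; under those assumptions a crawler can stop at the first post that is not newer than the previous crawl. This is a simplified illustration, not SInFo itself, and the `Post` type and the `new_posts` function are hypothetical names.

```python
from datetime import datetime
from typing import Iterable, NamedTuple


class Post(NamedTuple):
    post_id: str
    created: datetime  # assumed to be already normalised


def new_posts(pages: Iterable[list[Post]], last_crawl: datetime) -> list[Post]:
    """Collect posts created after last_crawl from pages sorted newest first.

    Iteration stops at the first already-seen post, so later pages are
    never requested when `pages` is a lazy iterator.
    """
    fresh: list[Post] = []
    for page in pages:
        for post in page:
            if post.created <= last_crawl:
                return fresh  # everything after this point is older
            fresh.append(post)
    return fresh


if __name__ == "__main__":
    pages = [
        [Post("c", datetime(2024, 1, 3)), Post("b", datetime(2024, 1, 2))],
        [Post("a", datetime(2024, 1, 1))],
    ]
    print([p.post_id for p in new_posts(pages, datetime(2024, 1, 1, 12))])
```

    If a forum sorts oldest first instead, the same idea would apply from the last page backwards, which is why detecting the sort type matters.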

    Learning URL patterns for webpage de-duplication

    No full text

    Cross-domain Recommendations based on semantically-enhanced User Web Behavior

    Information seeking on the Web can be facilitated by recommender systems that guide users in a personalized manner to relevant resources in the large space of possible options on the Web. This work investigates how to model people's Web behavior at multiple sites and learn to predict future preferences, in order to generate relevant cross-domain recommendations. This thesis contributes novel techniques for building cross-domain recommender systems in an open Web setting