
    The Viuva Negra crawler

    This report discusses architectural aspects of web crawlers and details the design, implementation and evaluation of the Viuva Negra (VN) crawler. VN has been used for 4 years, feeding a search engine and an archive of the Portuguese web. In our experiments it crawled over 2 million documents per day, corresponding to 63 GB of data. We describe hazardous situations to crawling found on the web and the solutions adopted to mitigate their effects. The gathered information was integrated into a web warehouse that supports its automatic processing by text mining applications.
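    As an illustration of the kind of safeguards such a crawler needs, the following is a minimal sketch of a crawl loop with guards against common hazards (spider traps, duplicate content, oversized responses). It is not VN's actual architecture; the `fetch`, `extract_links` and `store` callables and all limits are hypothetical.

```python
# Minimal sketch of a hazard-aware crawl loop. Illustrative only; the injected
# callables (fetch, extract_links, store) and the limits are assumptions.
import hashlib
import urllib.parse
from collections import deque

MAX_DEPTH = 16            # guard against infinitely deep URL spaces (spider traps)
MAX_PAGE_BYTES = 5 << 20  # skip abnormally large responses
seen_urls = set()
seen_checksums = set()    # detect duplicate content served under different URLs

def crawl(seed_url, fetch, extract_links, store):
    """fetch(url) -> bytes or None; extract_links(url, body) -> iterable of hrefs."""
    frontier = deque([(seed_url, 0)])
    while frontier:
        url, depth = frontier.popleft()
        if url in seen_urls or depth > MAX_DEPTH:
            continue
        seen_urls.add(url)
        body = fetch(url)
        if body is None or len(body) > MAX_PAGE_BYTES:
            continue
        checksum = hashlib.sha1(body).hexdigest()
        if checksum in seen_checksums:      # same page under another URL
            continue
        seen_checksums.add(checksum)
        store(url, body)                    # hand off to the web warehouse loader
        for link in extract_links(url, body):
            frontier.append((urllib.parse.urljoin(url, link), depth + 1))
```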

    Evaluation of linkage-based web discovery systems

    In recent years, the widespread use of the WWW has brought information retrieval systems into the homes of many millions of people. Today, we have access to many billions of documents (web pages) and have free-of-charge access to powerful, fast and highly efficient search facilities over these documents provided by search engines such as Google. The "first generation" of web search engines addressed the engineering problems of web spidering and efficient searching for large numbers of both users and documents, but they did not innovate much in the approaches taken to searching. Recently, however, linkage analysis has been incorporated into search engine ranking strategies. Anecdotally, linkage analysis appears to have improved the retrieval effectiveness of web search, yet there is surprisingly little scientific evidence in support of the claims for better quality retrieval. Participants in the three most recent TREC conferences (1999, 2000 and 2001) were invited to benchmark information retrieval systems on web data and had the option of using linkage information as part of their retrieval strategies. The general consensus from their experiments is that linkage information has not yet been successfully incorporated into conventional retrieval strategies. In this thesis, we present our research into the field of linkage-based retrieval of web documents. We illustrate that moderate improvements in retrieval performance are possible if the underlying test collection contains a higher link density than the test collections used in the three most recent TREC conferences. We examine the linkage structure of live data from the WWW and, coupled with our findings from crawling sections of the WWW, we present a list of five requirements for a test collection that is to faithfully support experiments into linkage-based retrieval of documents from the WWW. We also present some of our own new variants on linkage-based web retrieval and evaluate their performance in comparison to the approaches of others.
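    One generic way linkage evidence can be folded into a ranking, sketched below, is to mix a content score with a normalised in-link count. This is only an illustration of the idea, not the specific linkage-based variants evaluated in the thesis; `content_scores`, `in_links` and `alpha` are hypothetical inputs.

```python
# Minimal sketch: final score = weighted mix of content score and log-scaled
# in-degree. Not the thesis's method; all inputs are assumptions.
import math

def rerank(content_scores, in_links, alpha=0.8):
    """content_scores: {doc_id: BM25/tf-idf score}; in_links: {doc_id: in-degree}."""
    max_content = max(content_scores.values()) or 1.0
    max_link = math.log1p(max(in_links.values(), default=0)) or 1.0
    combined = {}
    for doc, score in content_scores.items():
        link_evidence = math.log1p(in_links.get(doc, 0)) / max_link
        combined[doc] = alpha * (score / max_content) + (1 - alpha) * link_evidence
    return sorted(combined, key=combined.get, reverse=True)

# Example: two documents with equal content scores; the better-linked one ranks first.
print(rerank({"a": 2.0, "b": 2.0}, {"a": 120, "b": 3}))
```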

    Web modelling for web warehouse design

    Doctoral thesis in Informatics (Informatics Engineering), presented to the Universidade de Lisboa through the Faculdade de Ciências, 2007.
    Users require applications that help them obtain knowledge from the web. However, the specific characteristics of web data make it difficult to create these applications. One possible solution is to extract information from the web and transform and load it into a Web Warehouse, which provides uniform access methods for automatic processing of the data. Web Warehousing is conceptually similar to the Data Warehousing approaches used to integrate relational information from databases. However, the structure of the web is very dynamic and cannot be controlled by the warehouse designers, and web models frequently do not reflect the current state of the web. As a result, Web Warehouses must be redesigned at a late stage of development; these changes have high costs and may jeopardize entire projects. This thesis addresses the problem of modelling the web and its influence on the design of Web Warehouses. A model of a portion of the web was derived and, based on it, a Web Warehouse prototype was designed. The prototype was validated in several real-usage scenarios. The results show that web modelling is a fundamental step of the web data integration process.
    Fundação para Computação Científica Nacional (FCCN); LaSIGE-Laboratório de Sistemas Informáticos de Grande Escala; Fundação para a Ciência e Tecnologia (FCT), SFRH/BD/11062/2002.
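    The extract-transform-load step behind a Web Warehouse can be pictured as normalising heterogeneous crawled pages into one uniform record that text-mining applications can process regardless of source format. The sketch below illustrates that idea only; the record schema and the injected helpers are assumptions, not the prototype described in the thesis.

```python
# Minimal sketch of a web ETL step: raw pages become uniform warehouse records.
# Field names and helper callables are hypothetical.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class WarehouseRecord:
    url: str
    fetched_at: str
    media_type: str
    language: str
    text: str          # markup-free text, ready for text mining

def transform(url, raw_bytes, media_type, detect_language, extract_text):
    """detect_language and extract_text are injected, format-aware callables."""
    text = extract_text(raw_bytes, media_type)
    return WarehouseRecord(
        url=url,
        fetched_at=datetime.now(timezone.utc).isoformat(),
        media_type=media_type,
        language=detect_language(text),
        text=text,
    )

def load(record, out):
    out.write(json.dumps(asdict(record)) + "\n")   # one JSON line per document
```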

    Using the Web Infrastructure for Real Time Recovery of Missing Web Pages

    Given the dynamic nature of the World Wide Web, missing web pages, or "404 Page not Found" responses, are part of our web browsing experience. It is our intuition that information on the web is rarely completely lost; it is just missing. In whole or in part, content often moves from one URI to another and hence just needs to be (re-)discovered. We evaluate several methods for a just-in-time approach to web page preservation. We investigate the suitability of lexical signatures and web page titles to rediscover missing content. It is understood that web pages change over time, which implies that the performance of these two methods depends on the age of the content. We therefore conduct a temporal study of the decay of lexical signatures and titles and estimate their half-life. We further propose the use of tags that users have created to annotate pages, as well as the most salient terms derived from a page's link neighborhood. We utilize the Memento framework to discover previous versions of web pages and to execute the above methods. We provide a workflow, including a set of parameters, that is most promising for the (re-)discovery of missing web pages. We introduce Synchronicity, a web browser add-on that implements this workflow. It works while the user is browsing and detects the occurrence of 404 errors automatically. When activated by the user, Synchronicity offers a total of six methods to either rediscover the missing page at its new URI or discover an alternative page that satisfies the user's information need. Synchronicity depends on user interaction, which enables it to provide results in real time.
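    A lexical signature of the kind evaluated here is typically a handful of the most distinctive terms from the page's last known copy, reused as a search query. The sketch below shows one plausible TF-IDF-based construction; the tokenisation and document-frequency source are assumptions, and this is not the Synchronicity implementation itself.

```python
# Minimal sketch of building a lexical signature (top-k TF-IDF terms) from the
# last known copy of a page. Inputs and weighting details are assumptions.
import math
import re
from collections import Counter

def lexical_signature(page_text, doc_freq, num_docs, k=7):
    """doc_freq: {term: number of documents containing term} from a reference corpus."""
    terms = re.findall(r"[a-z]{3,}", page_text.lower())
    tf = Counter(terms)
    def tfidf(term):
        return tf[term] * math.log(num_docs / (1 + doc_freq.get(term, 0)))
    return sorted(tf, key=tfidf, reverse=True)[:k]

# The resulting terms would be joined into a query string and submitted to a
# search engine; a top result is a candidate replacement for the 404 page.
```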

    Term-driven E-Commerce

    This thesis addresses the textual dimension of e-commerce. Its underlying hypothesis is that information and transactions in electronic commerce are bound to text. Wherever products and services are offered, sought, perceived and evaluated, natural-language expressions come into play. From this follows, on the one hand, how important it is to capture the variance of textual descriptions in e-commerce; on the other hand, the extensive textual resources produced by e-commerce interactions can be drawn upon to gain a better understanding of natural language.

    Human exploration of complex knowledge spaces

    Driven by need or curiosity, as humans we constantly act as information seekers. Whenever we work, study or play, we naturally look for information in spaces where pieces of our knowledge and culture are linked through semantic and logical relations. Nowadays, far from being just an abstraction, these information spaces are complex structures that are widespread and easily accessible via techno-systems: from the whole World Wide Web to the paramount example of Wikipedia. They are all information networks. How we move on these networks, and how our learning experience could be made more efficient while exploring them, are the key questions investigated in the present thesis. To this end, concepts, tools and models from graph theory and complex systems analysis are borrowed to combine empirical observations of the real behaviour of users in knowledge spaces with theoretical findings of cognitive science research. We investigate how the structure of a knowledge space can affect its own exploration in learning-type tasks, and how users typically explore information networks when looking for information or following learning paths. The research approach is exploratory and moves along three main lines of research. Extending previous work in algorithmic education, the first contribution focuses on the topological properties of the information network and how they affect the efficiency of a simulated learning exploration. To this end, a general class of algorithms is introduced that, building on well-established findings on educational scheduling, captures some of the behaviours of an individual moving in a knowledge space while learning. In exploring this space, learners move along connections, periodically revisiting some concepts and sometimes jumping to very distant ones. To investigate the effect of networked information structures on the dynamics, both synthetic and real-world graphs are considered, such as subsections of Wikipedia and word-association graphs. The existence of optimal topological structures for the defined learning dynamics is revealed: they feature small-world and scale-free properties, with a balance between the number of hubs and of the least connected items. Surprisingly, the real-world networks analysed turn out to be close to optimality. To uncover the role of the semantic content of the pieces of information to be learned in information-seeking tasks, empirical data on user traffic logs in the Wikipedia system are then considered. From these, and by means of first-order Markov chain models, user paths over the encyclopaedia can be simulated and treated as proxies for the real paths. They are then analysed at an abstract semantic level by mapping the individual pages into points of a reduced semantic space. Recurrent patterns along the walks emerge, even more evident when contrasted with paths originated in goal-oriented information-seeking games, thus providing some hints about the unconstrained navigation of users while seeking information. Still, different systems need to be considered to evaluate longer, more constrained and structured learning dynamics. This is the focus of the third line of investigation, in which learning paths are extracted from advanced scientific textbooks and treated as if they were walks suggested by their authors through an underlying knowledge space. Strategies to extract the paths from the textbooks are proposed, and some preliminary results on their statistical properties are discussed. Moreover, by taking advantage of the Wikipedia information network, Kauffman's theory of the adjacent possible is formalised in a learning context, thus introducing the "adjacent learnable" to refer to the part of the knowledge space explorable by the reader as she learns new concepts by following the suggested learning path. From this perspective, the paths are analysed as particular realisations of knowledge space explorations, thus allowing different approaches to education to be quantitatively contrasted.
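    The first-order Markov chain modelling of user navigation can be pictured as follows: estimate page-to-page transition probabilities from observed clicks, then sample simulated walks from them as proxies for real paths. The sketch below assumes a simplified log format and toy data rather than the thesis's Wikipedia dataset.

```python
# Minimal sketch of a first-order Markov model of navigation: estimate
# transition probabilities from click pairs, then sample walks. Toy inputs only.
import random
from collections import Counter, defaultdict

def estimate_transitions(click_pairs):
    """click_pairs: iterable of (from_page, to_page) observed in traffic logs."""
    counts = defaultdict(Counter)
    for src, dst in click_pairs:
        counts[src][dst] += 1
    return {src: {dst: n / sum(c.values()) for dst, n in c.items()}
            for src, c in counts.items()}

def sample_walk(transitions, start, length, rng=random.Random(0)):
    walk = [start]
    for _ in range(length - 1):
        nxt = transitions.get(walk[-1])
        if not nxt:
            break            # dead end: no observed outgoing clicks
        pages, probs = zip(*nxt.items())
        walk.append(rng.choices(pages, weights=probs, k=1)[0])
    return walk

# Example with toy click data.
print(sample_walk(estimate_transitions([("A", "B"), ("A", "C"), ("B", "C")]), "A", 5))
```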

    Acquisition des contenus intelligents dans l’archivage du Web (Intelligent content acquisition in Web archiving)

    Web sites are dynamic by nature, with content and structure changing over time; many pages on the Web are produced by content management systems (CMSs). Tools currently used by Web archivists to preserve the content of the Web blindly crawl and store Web pages, disregarding the CMS the site is based on and whatever structured content is contained in its pages. We first present an application-aware helper (AAH) that fits into an archiving crawl processing chain to perform intelligent and adaptive crawling of Web applications, given a knowledge base of common CMSs. The AAH has been integrated into two Web crawlers in the framework of the ARCOMEM project: the proprietary crawler of the Internet Memory Foundation and a customized version of Heritrix. We then propose ACEBot (Adaptive Crawler Bot for data Extraction), an efficient unsupervised, structure-driven crawler that exploits the inner structure of pages and guides the crawling process based on the importance of their content. ACEBot works in two phases: in the offline phase, it constructs a dynamic site map (limiting the number of URLs retrieved) and learns a traversal strategy based on the importance of navigation patterns, selecting those leading to valuable content; in the online phase, ACEBot performs massive downloading following the chosen navigation patterns. The AAH and ACEBot make 7 and 5 times fewer HTTP requests, respectively, than a generic crawler, without compromising effectiveness. We finally propose OWET (Open Web Extraction Toolkit), a free platform for semi-supervised data extraction. OWET allows a user to extract the data hidden behind Web forms.
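    The two-phase, structure-driven idea can be sketched as scoring navigation patterns offline by how much valuable content their target pages yield in a small sample, then restricting the online crawl to the best patterns. The pattern extraction and content-value measure below are simplified assumptions, not the actual ACEBot algorithm.

```python
# Minimal sketch of a two-phase, structure-driven crawl. pattern_of and
# content_value are hypothetical injected helpers.
from collections import defaultdict

def score_patterns(sampled_pages, pattern_of, content_value):
    """Offline phase. sampled_pages: [(url, html)]; pattern_of maps a URL to its
    navigation pattern (e.g. a generalised path template); content_value
    estimates the amount of useful text on a page."""
    totals, counts = defaultdict(float), defaultdict(int)
    for url, html in sampled_pages:
        p = pattern_of(url)
        totals[p] += content_value(html)
        counts[p] += 1
    ranked = sorted(totals, key=lambda p: totals[p] / counts[p], reverse=True)
    return ranked[:3]   # keep only the most valuable navigation patterns

def online_crawl(frontier, fetch, pattern_of, allowed_patterns):
    """Online phase: download only URLs whose pattern was ranked as valuable."""
    for url in frontier:
        if pattern_of(url) in allowed_patterns:
            yield url, fetch(url)
```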
