
    The Viuva Negra crawler

    This report discusses architectural aspects of web crawlers and details the design, implementation and evaluation of the Viuva Negra (VN) crawler. VN has been used for 4 years, feeding a search engine and an archive of the Portuguese web. In our experiments it crawled over 2 million documents per day, corresponding to 63 GB of data. We describe hazardous situations to crawling found on the web and the solutions adopted to mitigate their effects. The gathered information was integrated into a web warehouse that supports its automatic processing by text mining applications.
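    As an illustration of the kind of safeguards such a crawler needs, the following is a minimal sketch of a crawl loop with guards against common hazards (spider traps, duplicate content, oversized responses). It is not VN's actual architecture; the `fetch`, `extract_links` and `store` callables and all limits are hypothetical.

```python
# Minimal sketch of a hazard-aware crawl loop. Illustrative only; the injected
# callables (fetch, extract_links, store) and the limits are assumptions.
import hashlib
import urllib.parse
from collections import deque

MAX_DEPTH = 16            # guard against infinitely deep URL spaces (spider traps)
MAX_PAGE_BYTES = 5 << 20  # skip abnormally large responses
seen_urls = set()
seen_checksums = set()    # detect duplicate content served under different URLs

def crawl(seed_url, fetch, extract_links, store):
    """fetch(url) -> bytes or None; extract_links(url, body) -> iterable of hrefs."""
    frontier = deque([(seed_url, 0)])
    while frontier:
        url, depth = frontier.popleft()
        if url in seen_urls or depth > MAX_DEPTH:
            continue
        seen_urls.add(url)
        body = fetch(url)
        if body is None or len(body) > MAX_PAGE_BYTES:
            continue
        checksum = hashlib.sha1(body).hexdigest()
        if checksum in seen_checksums:      # same page under another URL
            continue
        seen_checksums.add(checksum)
        store(url, body)                    # hand off to the web warehouse loader
        for link in extract_links(url, body):
            frontier.append((urllib.parse.urljoin(url, link), depth + 1))
```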

    Evaluation of linkage-based web discovery systems

    In recent years, the widespread use of the WWW has brought information retrieval systems into the homes of many millions of people. Today, we have access to many billions of documents (web pages) and have free-of-charge access to powerful, fast and highly efficient search facilities over these documents provided by search engines such as Google. The "first generation" of web search engines addressed the engineering problems of web spidering and efficient searching for large numbers of both users and documents, but they did not innovate much in the approaches taken to searching. Recently, however, linkage analysis has been incorporated into search engine ranking strategies. Anecdotally, linkage analysis appears to have improved the retrieval effectiveness of web search, yet there is surprisingly little scientific evidence in support of the claims for better quality retrieval. Participants in the three most recent TREC conferences (1999, 2000 and 2001) were invited to benchmark information retrieval systems on web data and had the option of using linkage information as part of their retrieval strategies. The general consensus from their experiments is that linkage information has not yet been successfully incorporated into conventional retrieval strategies. In this thesis, we present our research into the field of linkage-based retrieval of web documents. We illustrate that moderate improvements in retrieval performance are possible if the underlying test collection contains a higher link density than the test collections used in the three most recent TREC conferences. We examine the linkage structure of live data from the WWW and, coupled with our findings from crawling sections of the WWW, we present a list of five requirements for a test collection that is to faithfully support experiments into linkage-based retrieval of documents from the WWW. We also present some of our own new variants on linkage-based web retrieval and evaluate their performance in comparison to the approaches of others.
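    One generic way linkage evidence can be folded into a ranking, sketched below, is to mix a content score with a normalised in-link count. This is only an illustration of the idea, not the specific linkage-based variants evaluated in the thesis; `content_scores`, `in_links` and `alpha` are hypothetical inputs.

```python
# Minimal sketch: final score = weighted mix of content score and log-scaled
# in-degree. Not the thesis's method; all inputs are assumptions.
import math

def rerank(content_scores, in_links, alpha=0.8):
    """content_scores: {doc_id: BM25/tf-idf score}; in_links: {doc_id: in-degree}."""
    max_content = max(content_scores.values()) or 1.0
    max_link = math.log1p(max(in_links.values(), default=0)) or 1.0
    combined = {}
    for doc, score in content_scores.items():
        link_evidence = math.log1p(in_links.get(doc, 0)) / max_link
        combined[doc] = alpha * (score / max_content) + (1 - alpha) * link_evidence
    return sorted(combined, key=combined.get, reverse=True)

# Example: two documents with equal content scores; the better-linked one ranks first.
print(rerank({"a": 2.0, "b": 2.0}, {"a": 120, "b": 3}))
```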

    Web modelling for web warehouse design

    Doctoral thesis in Informatics (Informatics Engineering), presented to the Universidade de Lisboa through the Faculdade de Ciências, 2007.
    Users require applications that help them obtain knowledge from the web. However, the specific characteristics of web data make it difficult to create these applications. One possible solution is to extract information from the web and transform and load it into a Web Warehouse, which provides uniform access methods for automatic processing of the data. Web Warehousing is conceptually similar to the Data Warehousing approaches used to integrate relational information from databases. However, the structure of the web is very dynamic and cannot be controlled by the warehouse designers, and web models frequently do not reflect the current state of the web. As a result, Web Warehouses must be redesigned at a late stage of development; these changes have high costs and may jeopardize entire projects. This thesis addresses the problem of modelling the web and its influence on the design of Web Warehouses. A model of a portion of the web was derived and, based on it, a Web Warehouse prototype was designed. The prototype was validated in several real-usage scenarios. The results show that web modelling is a fundamental step of the web data integration process.
    Fundação para Computação Científica Nacional (FCCN); LaSIGE-Laboratório de Sistemas Informáticos de Grande Escala; Fundação para a Ciência e Tecnologia (FCT), SFRH/BD/11062/2002.
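    The extract-transform-load step behind a Web Warehouse can be pictured as normalising heterogeneous crawled pages into one uniform record that text-mining applications can process regardless of source format. The sketch below illustrates that idea only; the record schema and the injected helpers are assumptions, not the prototype described in the thesis.

```python
# Minimal sketch of a web ETL step: raw pages become uniform warehouse records.
# Field names and helper callables are hypothetical.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class WarehouseRecord:
    url: str
    fetched_at: str
    media_type: str
    language: str
    text: str          # markup-free text, ready for text mining

def transform(url, raw_bytes, media_type, detect_language, extract_text):
    """detect_language and extract_text are injected, format-aware callables."""
    text = extract_text(raw_bytes, media_type)
    return WarehouseRecord(
        url=url,
        fetched_at=datetime.now(timezone.utc).isoformat(),
        media_type=media_type,
        language=detect_language(text),
        text=text,
    )

def load(record, out):
    out.write(json.dumps(asdict(record)) + "\n")   # one JSON line per document
```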

    Using the Web Infrastructure for Real Time Recovery of Missing Web Pages

    Given the dynamic nature of the World Wide Web, missing web pages, or "404 Page not Found" responses, are part of our web browsing experience. It is our intuition that information on the web is rarely completely lost; it is just missing. In whole or in part, content often moves from one URI to another and hence just needs to be (re-)discovered. We evaluate several methods for a just-in-time approach to web page preservation. We investigate the suitability of lexical signatures and web page titles to rediscover missing content. It is understood that web pages change over time, which implies that the performance of these two methods depends on the age of the content. We therefore conduct a temporal study of the decay of lexical signatures and titles and estimate their half-life. We further propose the use of tags that users have created to annotate pages, as well as the most salient terms derived from a page's link neighborhood. We utilize the Memento framework to discover previous versions of web pages and to execute the above methods. We provide a workflow, including a set of parameters, that is most promising for the (re-)discovery of missing web pages. We introduce Synchronicity, a web browser add-on that implements this workflow. It works while the user is browsing and detects the occurrence of 404 errors automatically. When activated by the user, Synchronicity offers a total of six methods to either rediscover the missing page at its new URI or discover an alternative page that satisfies the user's information need. Synchronicity depends on user interaction, which enables it to provide results in real time.
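    A lexical signature of the kind evaluated here is typically a handful of the most distinctive terms from the page's last known copy, reused as a search query. The sketch below shows one plausible TF-IDF-based construction; the tokenisation and document-frequency source are assumptions, and this is not the Synchronicity implementation itself.

```python
# Minimal sketch of building a lexical signature (top-k TF-IDF terms) from the
# last known copy of a page. Inputs and weighting details are assumptions.
import math
import re
from collections import Counter

def lexical_signature(page_text, doc_freq, num_docs, k=7):
    """doc_freq: {term: number of documents containing term} from a reference corpus."""
    terms = re.findall(r"[a-z]{3,}", page_text.lower())
    tf = Counter(terms)
    def tfidf(term):
        return tf[term] * math.log(num_docs / (1 + doc_freq.get(term, 0)))
    return sorted(tf, key=tfidf, reverse=True)[:k]

# The resulting terms would be joined into a query string and submitted to a
# search engine; a top result is a candidate replacement for the 404 page.
```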

    Term-driven E-Commerce

    This thesis addresses the textual dimension of e-commerce. Its underlying hypothesis is that information and transactions in electronic commerce are bound to text. Wherever products and services are offered, sought, perceived and evaluated, natural-language expressions come into play. From this follows, on the one hand, how important it is to capture the variance of textual descriptions in e-commerce; on the other hand, the extensive textual resources produced by e-commerce interactions can be drawn upon to gain a better understanding of natural language.

    Human exploration of complex knowledge spaces

    Driven by need or curiosity, as humans we constantly act as information seekers. Whenever we work, study or play, we naturally look for information in spaces where pieces of our knowledge and culture are linked through semantic and logical relations. Nowadays, far from being just an abstraction, these information spaces are complex structures that are widespread and easily accessible via techno-systems: from the whole World Wide Web to the paramount example of Wikipedia. They are all information networks. How we move on these networks, and how our learning experience could be made more efficient while exploring them, are the key questions investigated in the present thesis. To this end, concepts, tools and models from graph theory and complex systems analysis are borrowed to combine empirical observations of the real behaviour of users in knowledge spaces with theoretical findings of cognitive science research. We investigate how the structure of a knowledge space can affect its own exploration in learning-type tasks, and how users typically explore information networks when looking for information or following learning paths. The research approach is exploratory and moves along three main lines of research. Extending previous work in algorithmic education, the first contribution focuses on the topological properties of the information network and how they affect the efficiency of a simulated learning exploration. To this end, a general class of algorithms is introduced that, building on well-established findings on educational scheduling, captures some of the behaviours of an individual moving in a knowledge space while learning. In exploring this space, learners move along connections, periodically revisiting some concepts and sometimes jumping to very distant ones. To investigate the effect of networked information structures on the dynamics, both synthetic and real-world graphs are considered, such as subsections of Wikipedia and word-association graphs. The existence of optimal topological structures for the defined learning dynamics is revealed: they feature small-world and scale-free properties, with a balance between the number of hubs and of the least connected items. Surprisingly, the real-world networks analysed turn out to be close to optimality. To uncover the role of the semantic content of the pieces of information to be learned in information-seeking tasks, empirical data on user traffic logs in the Wikipedia system are then considered. From these, and by means of first-order Markov chain models, user paths over the encyclopaedia can be simulated and treated as proxies for the real paths. They are then analysed at an abstract semantic level by mapping the individual pages into points of a reduced semantic space. Recurrent patterns along the walks emerge, even more evident when contrasted with paths originated in goal-oriented information-seeking games, thus providing some hints about the unconstrained navigation of users while seeking information. Still, different systems need to be considered to evaluate longer, more constrained and structured learning dynamics. This is the focus of the third line of investigation, in which learning paths are extracted from advanced scientific textbooks and treated as if they were walks suggested by their authors through an underlying knowledge space. Strategies to extract the paths from the textbooks are proposed, and some preliminary results on their statistical properties are discussed. Moreover, by taking advantage of the Wikipedia information network, Kauffman's theory of the adjacent possible is formalised in a learning context, thus introducing the "adjacent learnable" to refer to the part of the knowledge space explorable by the reader as she learns new concepts by following the suggested learning path. From this perspective, the paths are analysed as particular realisations of knowledge space explorations, thus allowing different approaches to education to be quantitatively contrasted.
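    The first-order Markov chain modelling of user navigation can be pictured as follows: estimate page-to-page transition probabilities from observed clicks, then sample simulated walks from them as proxies for real paths. The sketch below assumes a simplified log format and toy data rather than the thesis's Wikipedia dataset.

```python
# Minimal sketch of a first-order Markov model of navigation: estimate
# transition probabilities from click pairs, then sample walks. Toy inputs only.
import random
from collections import Counter, defaultdict

def estimate_transitions(click_pairs):
    """click_pairs: iterable of (from_page, to_page) observed in traffic logs."""
    counts = defaultdict(Counter)
    for src, dst in click_pairs:
        counts[src][dst] += 1
    return {src: {dst: n / sum(c.values()) for dst, n in c.items()}
            for src, c in counts.items()}

def sample_walk(transitions, start, length, rng=random.Random(0)):
    walk = [start]
    for _ in range(length - 1):
        nxt = transitions.get(walk[-1])
        if not nxt:
            break            # dead end: no observed outgoing clicks
        pages, probs = zip(*nxt.items())
        walk.append(rng.choices(pages, weights=probs, k=1)[0])
    return walk

# Example with toy click data.
print(sample_walk(estimate_transitions([("A", "B"), ("A", "C"), ("B", "C")]), "A", 5))
```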

    Acquisition des contenus intelligents dans l’archivage du Web (Intelligent content acquisition in Web archiving)

    Web sites are dynamic by nature, with content and structure changing over time; many pages on the Web are produced by content management systems (CMSs). Tools currently used by Web archivists to preserve the content of the Web blindly crawl and store Web pages, disregarding the CMS the site is based on and whatever structured content is contained in its pages. We first present an application-aware helper (AAH) that fits into an archiving crawl processing chain to perform intelligent and adaptive crawling of Web applications, given a knowledge base of common CMSs. The AAH has been integrated into two Web crawlers in the framework of the ARCOMEM project: the proprietary crawler of the Internet Memory Foundation and a customized version of Heritrix. We then propose ACEBot (Adaptive Crawler Bot for data Extraction), an efficient unsupervised, structure-driven crawler that exploits the inner structure of pages and guides the crawling process based on the importance of their content. ACEBot works in two phases: in the offline phase, it constructs a dynamic site map (limiting the number of URLs retrieved) and learns a traversal strategy based on the importance of navigation patterns, selecting those leading to valuable content; in the online phase, ACEBot performs massive downloading following the chosen navigation patterns. The AAH and ACEBot make 7 and 5 times fewer HTTP requests, respectively, than a generic crawler, without compromising effectiveness. We finally propose OWET (Open Web Extraction Toolkit), a free platform for semi-supervised data extraction. OWET allows a user to extract the data hidden behind Web forms.
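    The two-phase, structure-driven idea can be sketched as scoring navigation patterns offline by how much valuable content their target pages yield in a small sample, then restricting the online crawl to the best patterns. The pattern extraction and content-value measure below are simplified assumptions, not the actual ACEBot algorithm.

```python
# Minimal sketch of a two-phase, structure-driven crawl. pattern_of and
# content_value are hypothetical injected helpers.
from collections import defaultdict

def score_patterns(sampled_pages, pattern_of, content_value):
    """Offline phase. sampled_pages: [(url, html)]; pattern_of maps a URL to its
    navigation pattern (e.g. a generalised path template); content_value
    estimates the amount of useful text on a page."""
    totals, counts = defaultdict(float), defaultdict(int)
    for url, html in sampled_pages:
        p = pattern_of(url)
        totals[p] += content_value(html)
        counts[p] += 1
    ranked = sorted(totals, key=lambda p: totals[p] / counts[p], reverse=True)
    return ranked[:3]   # keep only the most valuable navigation patterns

def online_crawl(frontier, fetch, pattern_of, allowed_patterns):
    """Online phase: download only URLs whose pattern was ranked as valuable."""
    for url in frontier:
        if pattern_of(url) in allowed_patterns:
            yield url, fetch(url)
```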
