
    Evaluation of linkage-based web discovery systems

    In recent years, the widespread use of the WWW has brought information retrieval systems into the homes of many millions of people. Today, we have access to many billions of documents (web pages) and have free-of-charge access to powerful, fast and highly efficient search facilities over these documents, provided by search engines such as Google. The "first generation" of web search engines addressed the engineering problems of web spidering and of efficient searching for large numbers of both users and documents, but they did not innovate much in the approaches taken to searching. Recently, however, linkage analysis has been incorporated into search engine ranking strategies. Anecdotally, linkage analysis appears to have improved the retrieval effectiveness of web search, yet there is surprisingly little scientific evidence in support of the claims for better quality retrieval. Participants in the three most recent TREC conferences (1999, 2000 and 2001) have been invited to benchmark information retrieval systems on web data and have had the option of using linkage information as part of their retrieval strategies. The general consensus from these experiments is that linkage information has not yet been successfully incorporated into conventional retrieval strategies. In this thesis, we present our research into the field of linkage-based retrieval of web documents. We illustrate that (moderate) improvements in retrieval performance are possible if the underlying test collection contains a higher link density than the test collections used in the three most recent TREC conferences. We examine the linkage structure of live data from the WWW and, coupled with our findings from crawling sections of the WWW, we present a list of five requirements for a test collection that is to faithfully support experiments into linkage-based retrieval of documents from the WWW. We also present some of our own, new variants on linkage-based web retrieval and evaluate their performance in comparison to the approaches of others.
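
    The abstract does not spell out the linkage-based variants evaluated in the thesis, but the general family of techniques it refers to can be illustrated with a PageRank-style power iteration. The following is a minimal sketch under that assumption; the toy graph, damping factor and function name are illustrative and are not taken from the thesis.

```python
# Minimal PageRank-style power iteration over a toy link graph.
# Illustrative only: the graph, damping factor and iteration count are
# assumptions, not the linkage-based variants evaluated in the thesis.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:  # dangling page: spread its rank uniformly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

if __name__ == "__main__":
    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(toy_graph).items()):
        print(page, round(score, 4))
```

    A link-density requirement such as the one argued for in the thesis matters for scores of this kind because, on a sparsely linked collection, the iteration stays close to the uniform prior and contributes little to ranking.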

    Around the web in six weeks: Documenting a large-scale crawl


    Analysis, Modeling, and Algorithms for Scalable Web Crawling

    This dissertation presents a modeling framework for the intermediate data generated by external-memory sorting algorithms (e.g., merge sort, bucket sort, hash sort, replacement selection) that are well-known, yet lack accurate models of the volume of data they produce. The motivation comes from the IRLbot crawl experience in June 2007, where a collection of scalable and high-performance external sorting methods was used to handle such problems as URL uniqueness checking, real-time frontier ranking, budget allocation, and spam avoidance, all monumental tasks, especially when limited to the resources of a single machine. We discuss this crawl experience in detail, use novel algorithms to collect data from the crawl image, and then advance to a broader problem: sorting arbitrarily large-scale data using limited resources while accurately capturing the required cost (e.g., time and disk usage). To solve these problems, we present an accurate model of uniqueness probability (the probability of encountering previously unseen data) and use it to analyze the amount of intermediate data generated by the above-mentioned sorting methods. We also demonstrate how the intermediate data volume and runtime vary based on the input properties (e.g., frequency distribution), hardware configuration (e.g., main memory size, CPU and disk speed) and the choice of sorting method, and that our proposed models accurately capture such variation. Furthermore, we propose a novel hash-based method for replacement selection sort, together with its model for the case of duplicate data, where the existing literature is limited to random or mostly-unique data. Note that the classic replacement selection method has the ability to increase the length of sorted runs and reduce their number, both of which directly benefit the merge step of external sorting. But because of a priority-queue-assisted sort operation that is inherently slow, the application of replacement selection has been limited. Our hash-based design solves this problem by making the sort phase significantly faster than existing methods, making it a preferred choice. The presented models also enable exact analysis of Least-Recently-Used (LRU) and Random Replacement caches (i.e., their hit rate), which are used as part of the algorithms presented here. These cache models are more accurate than those in the existing literature, since the existing ones mostly assume an infinite stream of data, while our models also work accurately on finite streams (e.g., sampled web graphs, click streams). In addition, we present accurate models for various crawl characteristics of random graphs, which can forecast a number of aspects of the crawl experience based on the graph properties (e.g., degree distribution). All these models are presented under a unified umbrella to analyze a set of large-scale information processing algorithms that are streamlined for high performance and scalability.
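
    For reference, the classic replacement selection mentioned above works as follows: a small in-memory priority queue emits records in sorted order, and an incoming record that is smaller than the last emitted one is tagged for the next run, which is how run lengths grow beyond the buffer size. The sketch below covers only this classic heap-based variant; the dissertation's hash-based design and its duplicate-data model are not reproduced here, and the buffer size and input are illustrative.

```python
import heapq
import random

def replacement_selection(stream, memory_size=4):
    """Classic heap-based replacement selection: yields sorted runs whose
    average length tends toward twice the buffer size on random input.
    Illustrative sketch; not the dissertation's hash-based variant."""
    it = iter(stream)
    heap = []                      # entries are (run_id, key)
    run_id = 0
    for key in it:                 # fill the in-memory buffer
        heapq.heappush(heap, (run_id, key))
        if len(heap) >= memory_size:
            break
    current_run = []
    while heap:
        rid, key = heapq.heappop(heap)
        if rid != run_id:          # smallest record belongs to the next run
            yield current_run
            current_run, run_id = [], rid
        current_run.append(key)
        nxt = next(it, None)
        if nxt is not None:
            # A record smaller than the one just emitted must wait for the
            # next run, otherwise the current run would become unsorted.
            heapq.heappush(heap, (run_id if nxt >= key else run_id + 1, nxt))
    if current_run:
        yield current_run

if __name__ == "__main__":
    data = [random.randint(0, 99) for _ in range(20)]
    for run in replacement_selection(data):
        print(run)
```

    The per-record priority-queue operations are the inherently slow sort phase the abstract refers to; the proposed hash-based design is described as removing that bottleneck.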

    The Impact of Novel Computing Architectures on Large-Scale Distributed Web Information Retrieval Systems

    Web search engines are the most popular means of interaction with the Web. Realizing a search engine that scales to the Web presents many challenges. Fast crawling technology is needed to gather the web documents. Indexing has to process hundreds of gigabytes of data efficiently. Queries have to be handled quickly, at a rate of thousands per second. As a solution, within a datacenter, services are built up from clusters of common homogeneous PCs. However, Information Retrieval (IR) has to face issues raised by the growing amount of Web data, as well as the growing number of users. In response to these issues, cost-effective specialized hardware is available nowadays. In our opinion, this hardware is ideal for migrating distributed IR systems to computer clusters comprising heterogeneous processors in order to meet their need for computing power. Toward this end, we introduce K-model, a computational model to properly evaluate algorithms designed for such hardware. We study the impact of K-model rules on algorithm design. To evaluate the benefits of using K-model in evaluating algorithms, we compare the complexity of a solution built using our properly designed techniques with that of existing ones. Although the competitors are in theory more efficient than ours, K-model proves its worth empirically: our solutions have been shown to be faster than the state-of-the-art implementations.

    Website boundary detection via machine learning

    This thesis describes research undertaken in the field of web data mining. More specifically, this research investigates solutions to the Website Boundary Detection (WBD) problem. WBD is the open problem of identifying the collection of all web pages that are part of a single website. Potential solutions to WBD can be beneficial with respect to tasks such as archiving web content and the automated construction of web directories. A prerequisite to any WBD approach is a definition of a website. This thesis commences with a discussion of previous definitions of a website, and subsequently proposes the definition of a website that is used with respect to the WBD solution approaches presented later in this thesis. The WBD problem may be addressed in either the static or the dynamic context; both are considered in this thesis. Static approaches require all web page data to be available a priori in order to decide which pages are within a website boundary, while dynamic approaches make decisions on portions of the web data and incrementally build a representation of the pages within a website boundary. Three main approaches to the WBD problem are presented in this thesis; the first two are static approaches and the final one is a dynamic approach. The first static approach concentrates on the types of features that can be used to represent web pages. It presents a practical solution to the WBD problem by applying clustering algorithms to various combinations of features, and further analysis investigates the "best" combination of features in terms of WBD performance. The second static approach investigates graph partitioning techniques based on the structural properties of the web graph in order to produce WBD solutions. Two variations of the approach are considered: a hierarchical graph partitioning technique, and a method based on minimum cuts of flow networks. The final approach presented in this research considers the dynamic context. The proposed dynamic approach uses both structural properties and various feature representations of web pages in order to incrementally build a website boundary as the pages of the web graph are traversed. The evaluation of the approaches presented in this thesis was conducted using web graphs from four academic departments hosted by the University of Liverpool. Both the static and dynamic approaches produce appropriate WBD solutions; however, the reported evaluation suggests that the dynamic approach offers additional benefits over a static approach due to the lower resource cost of gathering and processing typically smaller amounts of web data.
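
    The minimum-cut variant mentioned above can be illustrated, in heavily simplified form, with a small flow network over hyperlinks: a minimum s-t cut separates a seed page of the candidate website from a page known to be external, and the source side of the cut is read as the boundary. The toy graph, unit capacities and seed/sink choice below are assumptions for illustration, not the thesis's actual WBD formulation.

```python
import networkx as nx

# Toy hyperlink graph; edges carry unit capacity. Illustrative only:
# the graph, capacities and seed/sink choice are not the thesis's model.
G = nx.DiGraph()
internal_links = [("home", "staff"), ("home", "courses"),
                  ("staff", "home"), ("courses", "staff")]
external_links = [("courses", "external-portal")]
for u, v in internal_links + external_links:
    G.add_edge(u, v, capacity=1)

# Separate the seed page from a known external page; the source side of
# the minimum cut is taken as the candidate website boundary.
cut_value, (inside, outside) = nx.minimum_cut(G, "home", "external-portal")
print("cut capacity:", cut_value)
print("pages inside the boundary:", sorted(inside))
print("pages outside the boundary:", sorted(outside))
```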

    Web modelling for web warehouse design

    Doctoral thesis in Informatics (Informatics Engineering), presented to the Universidade de Lisboa through the Faculdade de Ciências, 2007. Users require applications to help them obtain knowledge from the web. However, the specific characteristics of web data make it difficult to create these applications. One possible solution to facilitate this task is to extract information from the web, transform it and load it into a Web Warehouse, which provides uniform access methods for automatic processing of the data. Web Warehousing is conceptually similar to the Data Warehousing approaches used to integrate relational information from databases. However, the structure of the web is very dynamic and cannot be controlled by the Warehouse designers. Web models frequently do not reflect the current state of the web; thus, Web Warehouses must be redesigned at a late stage of development. These changes have high costs and may jeopardize entire projects. This thesis addresses the problem of modelling the web and its influence on the design of Web Warehouses. A model of a web portion was derived and, based on it, a Web Warehouse prototype was designed. The prototype was validated in several real-usage scenarios. The obtained results show that web modelling is a fundamental step of the web data integration process. Funding: Fundação para Computação Científica Nacional (FCCN); LaSIGE-Laboratório de Sistemas Informáticos de Grande Escala; Fundação para a Ciência e Tecnologia (FCT), SFRH/BD/11062/2002.
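
    The extract-transform-load step described above can be sketched schematically: fetch a page, reduce it to a few uniform fields, and load those fields into a warehouse table that supports automatic processing. The URL, the chosen fields and the SQLite schema below are illustrative assumptions; the prototype's actual design is not given in the abstract.

```python
import sqlite3
import urllib.request
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Extracts the <title> element as one uniform field."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data

def etl(url, db_path="web_warehouse.db"):
    # Extract: fetch the raw page.
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    # Transform: reduce the page to uniform fields (here: url, title, size).
    parser = TitleParser()
    parser.feed(html)
    record = (url, parser.title.strip(), len(html))
    # Load: store the record in a warehouse table with a fixed schema.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS pages "
                "(url TEXT PRIMARY KEY, title TEXT, size_bytes INTEGER)")
    con.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)", record)
    con.commit()
    con.close()

if __name__ == "__main__":
    etl("https://example.org/")  # illustrative URL, not from the thesis
```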

    Decentralized link analysis in peer-to-peer web search networks

    Analyzing the authority or reputation of entities that are connected by a graph structure, and ranking these entities, is an important issue that arises in the Web, in Web 2.0 communities, and in other applications. The problem is typically addressed by computing the dominant eigenvector of a matrix that is suitably derived from the underlying graph, or by performing a full spectral decomposition of the matrix. Although such analyses could be performed by a centralized server, there are good reasons, such as scalability, privacy, and resistance to censorship, to run these computations in a decentralized manner across many peers. There exist a number of approaches for speeding up the analysis by partitioning the graph into disjoint fragments. However, such methods are not suitable for a peer-to-peer network, where overlap among the fragments might occur. In addition, peer-to-peer approaches need to consider network characteristics, such as peers being unaware of other peers' contents, susceptibility to malicious attacks, and network dynamics (so-called churn). In this thesis we make the following major contributions. We present JXP, a decentralized algorithm for computing authority scores of entities distributed in a peer-to-peer (P2P) network that allows peers to have overlapping content and requires no a priori knowledge of other peers' content. We also show the benefits of JXP in the Minerva distributed Web search engine. We present an extension of JXP, coined TrustJXP, that contains a reputation model in order to deal with misbehaving peers. We present another extension of JXP that handles dynamics in peer-to-peer networks, as well as an algorithm for estimating the current number of entities in the network. This thesis also presents novel methods for embedding JXP in peer-to-peer networks and applications. We present an approach for creating links among peers, forming semantic overlay networks, where peers are free to decide which connections they create and which they want to avoid, based on various usefulness estimators. We show how peer-to-peer applications, like the JXP algorithm, can greatly benefit from these additional semantic relations.
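
    The JXP algorithm itself is not described in this abstract, so the sketch below only illustrates the general decentralized idea it belongs to: each peer runs a local power-iteration step over its own, possibly overlapping fragment of the graph, and peers that meet average their score estimates for the entities they both know. The class names, merge rule and toy fragments are assumptions for illustration, not the JXP method.

```python
import random

# Simplified sketch of decentralized authority scoring over overlapping
# fragments; this is NOT the JXP algorithm, only an illustration of peers
# iterating locally and merging estimates for shared entities.

class Peer:
    def __init__(self, local_links):
        """local_links: dict mapping entity -> list of locally known out-links."""
        self.links = local_links
        nodes = set(local_links) | {t for v in local_links.values() for t in v}
        self.scores = {n: 1.0 for n in nodes}

    def local_iteration(self, damping=0.85):
        """One power-iteration step restricted to the peer's local fragment."""
        new = {n: 1.0 - damping for n in self.scores}
        for src, targets in self.links.items():
            if targets:
                share = damping * self.scores[src] / len(targets)
                for t in targets:
                    new[t] += share
        self.scores = new

    def meet(self, other):
        """Peers exchange and average estimates for entities they both know."""
        for n in self.scores.keys() & other.scores.keys():
            avg = (self.scores[n] + other.scores[n]) / 2.0
            self.scores[n] = other.scores[n] = avg

if __name__ == "__main__":
    p1 = Peer({"a": ["b"], "b": ["c"], "c": ["a"]})
    p2 = Peer({"c": ["d"], "d": ["a"], "a": ["c"]})   # overlaps on a and c
    for _ in range(30):
        p1.local_iteration()
        p2.local_iteration()
        if random.random() < 0.5:   # peers meet at random times
            p1.meet(p2)
    print("peer 1:", {k: round(v, 3) for k, v in sorted(p1.scores.items())})
    print("peer 2:", {k: round(v, 3) for k, v in sorted(p2.scores.items())})
```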